COPO

Count-based Online Preference Optimization - augments preference optimization with count-based exploration bonuses for exploration-heavy tasks.

Overview

COPO adds count-based intrinsic motivation to encourage exploration of under-visited regions of the response space. It is well suited to tasks where the optimal policy is far from the initial supervised fine-tuned (SFT) model.
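
The docs here do not spell out COPO's bonus formula; the sketch below is only an illustration of generic count-based reward shaping, where each response's reward-model score is boosted by coeff / sqrt(N). The response key and the shaped_reward helper are hypothetical and not part of the ThinkRL API.

import math
from collections import Counter

# Count how often each response (bucketed by some key, e.g. a hash of its
# tokens) has been produced so far during training.
visit_counts = Counter()

def shaped_reward(raw_reward: float, response_key: str, coeff: float = 0.1) -> float:
    """Add a count-based bonus coeff / sqrt(N) to the raw reward-model score."""
    visit_counts[response_key] += 1
    return raw_reward + coeff / math.sqrt(visit_counts[response_key])

# A repeated response earns a shrinking bonus; a novel one gets the full bonus.
print(shaped_reward(0.8, "response-a"))  # 0.8 + 0.1
print(shaped_reward(0.8, "response-a"))  # 0.8 + 0.1 / sqrt(2)
print(shaped_reward(0.5, "response-b"))  # 0.5 + 0.1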

Usage

from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",  # policy model to fine-tune
    reward_model_path="./rm_checkpoint",         # reward model checkpoint used for scoring
    algorithm="copo",                            # select Count-based Online Preference Optimization
    copo_exploration_coeff=0.1,                  # weight of the count-based exploration bonus
    learning_rate=1e-6,
)

trainer = RLHFTrainer(config)
trainer.train()
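
With count-based shaping of this kind, a larger copo_exploration_coeff pushes the policy harder toward rarely visited responses, while values near zero recover the unshaped preference objective; treat the 0.1 above as a starting point rather than a tuned value.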