Algorithms
COPO
Count-based Online Preference Optimization - uses count-based exploration bonuses to handle exploration-heavy tasks.
Overview
COPO adds count-based intrinsic motivation to encourage exploration of under-visited regions of the response space. It is well suited to tasks where the optimal policy lies far from the initial supervised fine-tuned (SFT) model.
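The idea behind a count-based bonus can be illustrated with a minimal sketch. This is not ThinkRL's internal implementation: the CountBonus class, the hashing scheme, and the 1/sqrt(count) bonus shape are illustrative assumptions. Responses are hashed into buckets, visit counts are tracked, and a bonus that shrinks with the count is added to the reward-model score, so rarely seen responses are favored.

# Illustrative sketch of a count-based exploration bonus (assumptions noted above).
import hashlib
from collections import defaultdict

class CountBonus:
    def __init__(self, coeff: float = 0.1, num_buckets: int = 2**20):
        self.coeff = coeff            # scales the bonus relative to the extrinsic reward
        self.num_buckets = num_buckets
        self.counts = defaultdict(int)

    def _bucket(self, response: str) -> int:
        # Hash the response text into a fixed number of buckets.
        digest = hashlib.sha256(response.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def __call__(self, response: str, extrinsic_reward: float) -> float:
        bucket = self._bucket(response)
        self.counts[bucket] += 1
        # Bonus decays as 1/sqrt(visit count), so novel responses get a larger boost.
        bonus = self.coeff / self.counts[bucket] ** 0.5
        return extrinsic_reward + bonus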
Usage
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",  # path to the trained reward model checkpoint
    algorithm="copo",
    copo_exploration_coeff=0.1,  # strength of the count-based exploration bonus
    learning_rate=1e-6,
)

trainer = RLHFTrainer(config)
trainer.train()
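As a rule of thumb, copo_exploration_coeff trades off exploration against exploitation: larger values weight the exploration bonus more heavily relative to the reward-model score, while smaller values keep training closer to standard preference optimization. The value 0.1 above is an example setting, not a documented recommended default.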