Algorithms
PPO
Proximal Policy Optimization - The proven baseline algorithm optimized for RLHF scenarios.
Overview
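PPO is the foundational algorithm for RLHF. It keeps policy updates stable by optimizing a clipped surrogate objective, which limits how far a single update can move the policy away from the one that generated the training samples. ThinkRL's implementation includes optimizations specific to language model training.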
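To make the clipped objective concrete, here is a minimal pure-Python sketch of the standard PPO clipped surrogate loss. It is a generic illustration, not ThinkRL's internal code: the function name `clipped_surrogate_loss` and its arguments are hypothetical, and a real implementation would operate on batched tensors of per-token log-probabilities.

```python
import math


def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate loss (to be minimized) over a batch.

    Hypothetical helper for illustration only; not part of ThinkRL.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated policy and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_range, 1 + clip_range].
        clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the unclipped and clipped objectives.
        total += min(ratio * adv, clipped_ratio * adv)
    # Negate because optimizers minimize, while PPO maximizes the objective.
    return -total / len(advantages)
```

With `clip_range=0.2`, a sample with positive advantage stops contributing extra objective once its probability ratio exceeds 1.2, so there is no incentive to push the policy further on that sample within the same update.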
Usage
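from thinkrl import RLHFTrainer, ModelConfig

# Configure the policy model, the reward model, and the PPO hyperparameters.
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",  # policy model to fine-tune
    reward_model_path="./rm_checkpoint",         # reward model checkpoint
    algorithm="ppo",
    ppo_epochs=4,
    clip_range=0.2,                              # clipping range for the PPO objective
    learning_rate=1e-6,
)

# Build the trainer from the config and start training.
trainer = RLHFTrainer(config)
trainer.train()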
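In most PPO implementations, a setting like `ppo_epochs` controls how many optimization passes are made over each batch of collected rollouts before new samples are generated. The sketch below illustrates that common pattern in plain Python. It assumes ThinkRL follows the same convention, which the text above does not confirm, and the helper `iter_ppo_minibatches` and its `minibatch_size` parameter are hypothetical.

```python
import random


def iter_ppo_minibatches(num_samples, ppo_epochs=4, minibatch_size=2, seed=0):
    """Yield (epoch, minibatch_indices) pairs over one fixed rollout batch.

    Hypothetical helper for illustration only; not part of ThinkRL.
    """
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(ppo_epochs):
        # Reshuffle so each pass sees the same samples in a different order.
        rng.shuffle(indices)
        for start in range(0, num_samples, minibatch_size):
            yield epoch, indices[start:start + minibatch_size]


# Example: 6 rollout samples, 4 passes, minibatches of 2 -> 12 update steps.
steps = list(iter_ppo_minibatches(6, ppo_epochs=4, minibatch_size=2))
```

Because the same samples are reused across passes, the policy drifts away from the one that produced them, which is why the clipping range matters more as the number of passes grows.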