Algorithms

PPO

Proximal Policy Optimization - the established baseline algorithm for RLHF training.

Overview

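PPO is the foundational algorithm for RLHF. It keeps policy updates stable by optimizing a clipped surrogate objective, which limits how far a single update can move the policy away from the one that generated the training samples. ThinkRL's implementation includes optimizations specific to language model training.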
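To make the clipping idea concrete, here is a minimal sketch of the standard PPO clipped surrogate loss in plain Python. It is illustrative only and is not ThinkRL's implementation: the function name `ppo_clip_loss` and its arguments are hypothetical, and it assumes per-sample log-probabilities and advantage estimates are already available.

```python
import math


def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    """Clipped surrogate loss (to be minimized), averaged over samples.

    Hypothetical helper for illustration; not part of the ThinkRL API.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio between the updated policy and the sampling policy.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_range, 1 + clip_range].
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate because the surrogate objective is maximized.
    return -total / len(advantages)
```

With a positive advantage, the objective stops rewarding the update once the ratio exceeds `1 + clip_range`; with a negative advantage, it stops once the ratio falls below `1 - clip_range`. This is what keeps each update close to the sampling policy.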

Usage

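from thinkrl import RLHFTrainer, ModelConfig

# Configure PPO training against a previously trained reward model.
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",  # policy model to fine-tune
    reward_model_path="./rm_checkpoint",         # reward model checkpoint
    algorithm="ppo",
    ppo_epochs=4,        # optimization epochs per batch of rollouts
    clip_range=0.2,      # clipping range for the policy ratio
    learning_rate=1e-6,
)

# Build the trainer and run the training loop.
trainer = RLHFTrainer(config)
trainer.train()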
