REINFORCE
Classic policy gradient with variance reduction techniques and baseline estimation.
Overview
REINFORCE is the simplest policy gradient algorithm. It is useful for understanding RLHF fundamentals and for simpler training scenarios that do not need the added complexity of PPO.
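To make the idea concrete, here is a minimal sketch of the REINFORCE update with a running-average baseline, applied to a toy multi-armed bandit. It is for illustration only and is not thinkrl's implementation: the function names (`softmax`, `reinforce_bandit`), the bandit setup, and the hyperparameters are all assumptions made for this example. It uses only the Python standard library.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, use_baseline=True, seed=0):
    """Hypothetical toy example: learn a softmax policy over bandit arms.

    arm_rewards: mean reward of each arm (assumed values for the demo).
    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(arm_rewards)
    baseline = 0.0  # running average of observed rewards

    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        # Noisy reward around the chosen arm's mean.
        reward = arm_rewards[action] + rng.gauss(0.0, 0.1)

        # Subtracting a baseline lowers variance without biasing the gradient.
        advantage = reward - baseline if use_baseline else reward

        # For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - probs[i].
        for i in range(len(logits)):
            grad_logp = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_logp  # gradient ascent step

        # Update the running-average baseline.
        baseline += 0.05 * (reward - baseline)

    return softmax(logits)


if __name__ == "__main__":
    print(reinforce_bandit([0.2, 1.0]))
```

In this sketch the policy should shift most of its probability onto the higher-reward arm. Setting `use_baseline=False` applies the raw reward as the weight instead, which gives the same expected gradient but noisier updates.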
Usage
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="reinforce",
    use_baseline=True,
    learning_rate=1e-6,
)

trainer = RLHFTrainer(config)
trainer.train()