Algorithms

REINFORCE

Classic policy gradient with variance reduction techniques and baseline estimation.

Overview

REINFORCE is the simplest policy gradient algorithm. It is useful for understanding RLHF fundamentals and for simpler training scenarios where PPO's added complexity isn't needed.
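To make the idea concrete, here is a minimal sketch of the REINFORCE objective with a baseline, written in plain Python rather than against thinkrl. The function name `reinforce_loss`, its arguments, and the choice of a batch-mean baseline are illustrative assumptions, not the library's implementation.

```python
from typing import Sequence


def reinforce_loss(
    log_probs: Sequence[float],
    rewards: Sequence[float],
    use_baseline: bool = True,
) -> float:
    """Surrogate loss whose gradient is the REINFORCE estimate.

    log_probs[i]: summed log-probability of sampled response i under the policy.
    rewards[i]:   scalar reward assigned to that response.
    """
    if len(log_probs) != len(rewards) or not rewards:
        raise ValueError("log_probs and rewards must be non-empty and equal length")

    # Assumption: use the batch-mean reward as the baseline. Learned value
    # functions or leave-one-out means are other common choices.
    baseline = sum(rewards) / len(rewards) if use_baseline else 0.0

    # Advantage = reward minus baseline; centering the rewards is what
    # reduces the variance of the gradient estimate.
    advantages = [r - baseline for r in rewards]

    # Minimizing -mean(advantage * log_prob) raises the probability of
    # responses that scored above the baseline and lowers the rest.
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)


if __name__ == "__main__":
    print(reinforce_loss([-1.0, -2.0], [1.0, 3.0]))                      # with baseline
    print(reinforce_loss([-1.0, -2.0], [1.0, 3.0], use_baseline=False))  # raw rewards
```

In an actual trainer the log-probabilities would be differentiable tensors and the advantages would be treated as constants, so that gradients flow only through the policy's log-probabilities.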

Usage

from thinkrl import RLHFTrainer, ModelConfig

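config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",  # policy model to fine-tune
    reward_model_path="./rm_checkpoint",         # reward model used to score responses
    algorithm="reinforce",
    use_baseline=True,                           # subtract a baseline to reduce gradient variance
    learning_rate=1e-6,
)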

trainer = RLHFTrainer(config)
trainer.train()

PPO