Algorithms
DAPO
Decoupled Advantage Policy Optimization: separates value and policy learning to improve generalization across tasks.
Overview
DAPO decouples value-function learning from policy optimization, training each in its own learning stream. Keeping the two streams separate helps prevent the policy from overfitting to the reward signal while still leaving it flexible, which improves generalization across tasks.
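The decoupling can be pictured as two optimizers stepping on two separate losses instead of one combined objective. The sketch below is illustrative only and is not the thinkrl internals: the toy policy and value heads, the synthetic advantages and returns, and the interpretation of decouple_ratio as a weight on the value stream are all assumptions made for the example.

# Illustrative sketch of decoupled updates (not thinkrl internals).
# Assumptions: tiny linear policy/value heads, synthetic advantage and
# return targets, and decouple_ratio as a weight on the value stream.
import torch
import torch.nn as nn

policy = nn.Linear(16, 4)   # toy policy head: state -> action logits
value = nn.Linear(16, 1)    # toy value head, trained in its own stream

policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-6)
value_opt = torch.optim.Adam(value.parameters(), lr=1e-5)

decouple_ratio = 0.5        # assumed meaning: weight on the value stream

states = torch.randn(8, 16)
actions = torch.randint(0, 4, (8,))
advantages = torch.randn(8)  # synthetic advantages for illustration
returns = torch.randn(8)     # synthetic reward-to-go targets

# Policy stream: policy-gradient loss on advantages, updated on its own.
logits = policy(states)
log_probs = torch.log_softmax(logits, dim=-1)
chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
policy_loss = -(chosen * advantages).mean()
policy_opt.zero_grad()
policy_loss.backward()
policy_opt.step()

# Value stream: regression to returns, updated separately from the policy.
value_loss = decouple_ratio * nn.functional.mse_loss(
    value(states).squeeze(-1), returns
)
value_opt.zero_grad()
value_loss.backward()
value_opt.step()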
Usage
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",  # base policy model
    reward_model_path="./rm_checkpoint",         # trained reward model checkpoint
    algorithm="dapo",                            # select the DAPO algorithm
    dapo_decouple_ratio=0.5,                     # strength of the value/policy decoupling
    learning_rate=1e-6,
)
trainer = RLHFTrainer(config)
trainer.train()
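To gauge how sensitive training is to the decoupling strength, one option is to sweep dapo_decouple_ratio while holding the other settings fixed. This sketch reuses only the config fields shown above; any additional keyword arguments would need to be checked against the ModelConfig API.

from thinkrl import RLHFTrainer, ModelConfig

# Sweep the decoupling strength using only the fields from the example above.
for ratio in (0.25, 0.5, 0.75):
    config = ModelConfig(
        model_name_or_path="meta-llama/Llama-3-8b",
        reward_model_path="./rm_checkpoint",
        algorithm="dapo",
        dapo_decouple_ratio=ratio,
        learning_rate=1e-6,
    )
    RLHFTrainer(config).train()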