
DAPO

Decoupled Advantage Policy Optimization - separates value and policy learning for improved generalization across tasks.

Overview

DAPO splits value-function learning and policy optimization into two separate learning streams. Keeping the streams decoupled prevents overfitting to the reward signal while maintaining policy flexibility, which yields better generalization across tasks.
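
For intuition, here is a minimal PyTorch sketch of what two decoupled learning streams can look like. The networks, optimizer settings, and loss shapes are illustrative assumptions for a toy discrete-action setup, not ThinkRL's internal implementation; RLHFTrainer drives the actual updates from the ModelConfig shown under Usage.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in modules for illustration; the real trainer derives these from the configured model.
policy_net = nn.Linear(16, 4)  # maps a state/hidden vector to action logits
value_net = nn.Linear(16, 1)   # maps the same vector to a scalar value estimate

# Separate optimizers keep the two learning streams apart:
# value-function gradients never reach the policy, and vice versa.
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=1e-6)
value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-6)

def decoupled_step(states, actions, returns, advantages):
    # Value stream: regress predicted values toward observed returns.
    values = value_net(states).squeeze(-1)
    value_loss = F.mse_loss(values, returns)
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()

    # Policy stream: a plain policy-gradient loss on detached advantages,
    # so the policy update does not backpropagate into the value network.
    log_probs = F.log_softmax(policy_net(states), dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    policy_loss = -(advantages.detach() * chosen).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    return value_loss.item(), policy_loss.item()

# Example call with dummy data:
# states = torch.randn(8, 16); actions = torch.randint(0, 4, (8,))
# returns = torch.randn(8); advantages = torch.randn(8)
# decoupled_step(states, actions, returns, advantages)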

Usage

from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Meta-Llama-3-8B",  # base policy model to fine-tune
    reward_model_path="./rm_checkpoint",              # trained reward model checkpoint
    algorithm="dapo",                                 # select the DAPO algorithm
    dapo_decouple_ratio=0.5,                          # DAPO-specific hyperparameter
    learning_rate=1e-6,
)

trainer = RLHFTrainer(config)
trainer.train()
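
The hyperparameter values above are illustrative starting points. Because the best dapo_decouple_ratio is likely task-dependent, one simple (hypothetical) workflow is to sweep it with the same API and compare the resulting checkpoints:

for ratio in (0.25, 0.5, 0.75):
    sweep_config = ModelConfig(
        model_name_or_path="meta-llama/Meta-Llama-3-8B",
        reward_model_path="./rm_checkpoint",
        algorithm="dapo",
        dapo_decouple_ratio=ratio,
        learning_rate=1e-6,
    )
    # Each run reuses the RLHFTrainer entry point from the example above.
    RLHFTrainer(sweep_config).train()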