Training Guides
DPO / Online DPO
Direct Preference Optimization (DPO) - train directly on preference pairs, without fitting a separate reward model.
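For reference, this is the standard DPO objective from Rafailov et al. (2023); the beta field in the configs below plays the role of β here, controlling how strongly the policy is kept close to the frozen reference (SFT) model:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

where y_w and y_l are the chosen and rejected completions for prompt x. The reward model is implicit: the β-scaled log-probability ratio against the reference acts as the reward.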
DPO Training
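The trainer consumes a dataset of preference pairs. A minimal sketch of one record, assuming a chosen/rejected schema; the field names "prompt", "chosen", and "rejected" are an illustration, not confirmed by this guide:

# One preference-pair record (hypothetical field names):
pair = {
    "prompt": "Explain DPO in one sentence.",
    "chosen": "DPO fine-tunes a policy directly on preference pairs.",  # preferred completion
    "rejected": "DPO trains a separate reward model first.",            # dispreferred completion
}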
from thinkrl.training import DPOTrainer, DPOConfig
config = DPOConfig(
    model_name_or_path="./sft_checkpoint",      # start from an SFT model
    dataset_name="your-org/preference-pairs",   # dataset of chosen/rejected pairs
    output_dir="./dpo_checkpoint",
    beta=0.1,           # strength of the implicit KL penalty; 0.1 is the DPO paper's default
    learning_rate=5e-7, # DPO typically uses much lower learning rates than SFT
)
trainer = DPOTrainer(config)
trainer.train()
Online DPO
Online DPO generates fresh preference pairs from the current policy during training, which can improve alignment over training on a static preference dataset (see the sketch after the example below).
config = DPOConfig(
    model_name_or_path="./sft_checkpoint",
    online_dpo=True,    # sample new pairs from the current policy during training
    num_generations=4,  # candidate completions sampled per prompt
    use_vllm=True,      # serve generation through vLLM for faster sampling
    beta=0.1,
)
trainer = DPOTrainer(config)
trainer.train()
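Conceptually, each online step samples several completions per prompt, ranks them with some preference signal, and keeps the best and worst as a fresh chosen/rejected pair. A minimal sketch of that loop; every function here is an illustrative stub, not a thinkrl API:

import random

def generate(prompt, n):
    # Stub for policy sampling (in practice, e.g. batched generation via vLLM).
    return [f"{prompt} [completion {i}]" for i in range(n)]

def score(prompt, completion):
    # Stub for a preference signal (judge model, reward model, or heuristic).
    return random.random()

def make_online_pair(prompt, num_generations=4):
    # Sample candidates from the current policy, rank them, and turn the
    # best/worst into a chosen/rejected pair for the next DPO update.
    candidates = generate(prompt, num_generations)
    ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

pair = make_online_pair("Explain DPO in one sentence.")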