Training Guides
Comprehensive guides for training models with ThinkRL, from supervised fine-tuning to advanced reasoning capabilities.
SFT & CoT Training
Supervised fine-tuning and Chain-of-Thought reasoning training
Learn how to prepare your model with SFT and add reasoning capabilities with Chain-of-Thought training.
DPO / Online DPO
Direct Preference Optimization training
Train models directly on preference data without requiring a separate reward model.
Process Reward Models
Step-by-step verification training
Train models to verify reasoning steps and provide process-level rewards.
Multimodal PAPO
Vision-language model training
Train multimodal models with Perception-Aware Policy Optimization (PAPO).
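
To make the DPO entry above more concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It is not ThinkRL's implementation: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model, which is why no separate reward model is needed.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative, hypothetical helper).

    Each argument is the summed log-probability of a full response.
    beta scales how strongly the policy is pushed away from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))

    # -log(sigmoid(beta * margin)) equals softplus(-beta * margin);
    # the form below avoids overflow for large |x|.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy favors the chosen response more than the reference does:
    # the loss drops below log(2).
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In a real training loop the same expression would be computed over batches of tensors and averaged; the scalar version here only shows the shape of the objective.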