API Reference
Algorithms
ThinkRL implements state-of-the-art reinforcement learning algorithms for training with human feedback (RLHF), ranging from recent research methods to proven baselines.
Algorithm Comparison
| Algorithm | Best For | Sample Efficiency | Stability |
|---|---|---|---|
| VAPO | General RLHF | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| DAPO | Transfer learning | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| GRPO | Preference learning | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| COPO | Exploration tasks | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| PAPO | Multimodal | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| PPO | Baseline | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| REINFORCE | Simple tasks | ⭐⭐ | ⭐⭐⭐ |
Available Algorithms
VAPO (Value-Augmented Policy Optimization)
Enhanced PPO with improved value-function estimation for better sample efficiency.
Highlights: sample efficient, stable training.
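A central ingredient of value-based PPO variants like this is an advantage estimate built from a learned value function. Below is a minimal NumPy sketch of Generalized Advantage Estimation (GAE); it illustrates the idea only and is not ThinkRL's VAPO implementation (function and argument names are illustrative).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: shape (T,), per-step rewards.
    values:  shape (T + 1,), value estimates including a bootstrap value
             for the state after the final step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step TD error from the learned value function.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# Example: a 3-step trajectory with a bootstrap value of 0 at the end.
print(gae_advantages(np.array([1.0, 0.0, 1.0]), np.array([0.5, 0.4, 0.6, 0.0])))
```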
DAPO (Decoupled Advantage Policy Optimization)
Separates value-function and policy learning for improved generalization.
Highlights: better generalization, decoupled updates.
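To illustrate what decoupled updates mean in practice, here is a minimal PyTorch sketch in which the policy and the value function are separate modules with separate optimizers, so gradients from one objective never update the other's parameters. This is an illustrative sketch under assumed shapes and hyperparameters, not ThinkRL's DAPO code.

```python
import torch
import torch.nn as nn

# Two separate modules: value-function gradients never touch the policy.
policy = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 8))
value_fn = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 1))

policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
value_opt = torch.optim.Adam(value_fn.parameters(), lr=1e-3)

def decoupled_update(obs, actions, advantages, returns, old_log_probs, clip_eps=0.2):
    # Policy step: clipped surrogate objective on fixed (precomputed) advantages.
    log_probs = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    policy_loss = -surrogate.mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # Value step: plain regression to the returns, with its own optimizer.
    value_loss = (value_fn(obs).squeeze(-1) - returns).pow(2).mean()
    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
    return policy_loss.item(), value_loss.item()
```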
GRPO (Group Relative Policy Optimization)
Leverages group-wise comparisons for better preference modeling.
Highlights: group comparisons, enhanced alignment.
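The core idea in group-relative methods is to sample several responses per prompt and score each one against the others in its group, which removes the need for a learned critic. A minimal sketch of that advantage computation (illustrative only, not ThinkRL's API):

```python
import numpy as np

def group_relative_advantages(group_rewards, eps=1e-8):
    """Advantages for one prompt's group of sampled responses.

    group_rewards: shape (G,), one scalar reward per response. Each response
    is scored relative to its own group, so no learned critic is needed.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses sampled for the same prompt.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))
```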
COPO (Count-based Online Preference Optimization)
Uses a count-based exploration bonus for exploration-heavy tasks.
Highlights: exploration, online learning.
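A count-based bonus typically rewards novelty by scaling inversely with how often a state (or hashed state feature) has been visited. The sketch below shows the general shape of such a bonus; the keying, scaling factor, and decay are assumptions for illustration, not ThinkRL's implementation.

```python
import math
from collections import Counter

visit_counts = Counter()

def shaped_reward(state_key, extrinsic_reward, beta=0.1):
    """Add a count-based novelty bonus to the task reward.

    state_key: any hashable representation of the state (e.g. a feature hash).
    """
    visit_counts[state_key] += 1
    # Bonus decays as 1 / sqrt(visit count), so rarely seen states earn more.
    bonus = beta / math.sqrt(visit_counts[state_key])
    return extrinsic_reward + bonus

# Example: the second visit to the same state earns a smaller bonus.
print(shaped_reward("s0", 1.0), shaped_reward("s0", 1.0))
```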
PAPO (Perception-Aware Policy Optimization)
Optimized for multimodal reasoning with vision-language models.
Highlights: multimodal, vision-language.
PPO (Proximal Policy Optimization)
The proven baseline algorithm, optimized for RLHF scenarios.
Highlights: proven stability, RLHF optimized.
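PPO's defining component is the clipped surrogate objective, which keeps each policy update close to the behavior policy that collected the data. A minimal PyTorch sketch of that loss (illustrative, not ThinkRL's PPO class):

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss from PPO (Schulman et al., 2017).

    log_probs / old_log_probs: log pi(a|s) under the current and the
    data-collecting policy. advantages: advantage estimates (e.g. from GAE).
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign: optimizers minimize, while PPO maximizes the surrogate.
    return -torch.min(unclipped, clipped).mean()
```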
REINFORCE (with baseline)
Classic policy gradient with variance-reduction techniques.
Highlights: policy gradients, baseline estimation.
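The variance reduction comes from subtracting a baseline from the Monte Carlo return before it multiplies the log-probability; because the baseline does not depend on the action, the gradient estimate stays unbiased. A minimal PyTorch sketch (illustrative names, not ThinkRL's API):

```python
import torch

def reinforce_baseline_loss(log_probs, returns, baselines):
    """REINFORCE loss with a baseline subtracted for variance reduction.

    log_probs: log pi(a_t | s_t) for each sampled action.
    returns:   Monte Carlo returns G_t.
    baselines: b(s_t), e.g. a running average return or a learned value;
               subtracting it lowers variance without biasing the gradient.
    """
    advantages = (returns - baselines).detach()
    return -(log_probs * advantages).mean()
```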