v1.0
ThinkRL Documentation
A comprehensive, state-of-the-art reinforcement learning from human feedback (RLHF) library designed to democratize advanced AI training.
Get Started
Install ThinkRL and run your first RLHF training job (see the quick-start sketch after these links)
Algorithms
VAPO, DAPO, GRPO, COPO, PAPO, PPO, and REINFORCE
Training Guides
SFT, CoT, DPO, PRM, and multimodal training
Features
Reasoning, optimization, and integrations
Advanced
Custom rewards, LoRA merging, vLLM integration
Source Code
View the source code on GitHub
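
For orientation, here is a minimal quick-start sketch. The `thinkrl` package name, the `GRPOTrainer` class, and every argument below are assumptions made for illustration, not the library's confirmed API; follow the Get Started guide for the real entry points.

```python
# Hypothetical quick-start sketch. The thinkrl package name, the
# GRPOTrainer class, and all arguments are illustrative assumptions;
# see the Get Started guide for the actual API.
from thinkrl import GRPOTrainer  # hypothetical import

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",  # base policy to fine-tune
    reward_fn=lambda prompt, response: float(len(response) < 512),  # toy reward
    batch_size=8,
)

trainer.train(dataset="path/to/prompts.jsonl", total_steps=1_000)
```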
Architecture Foundation
ThinkRL is built on a high-performance stack designed for scale, combining vLLM for fast inference with PyTorch and DeepSpeed for efficient training.
vLLM Integration
High-throughput inference with PagedAttention and continuous batching for 80% faster experience collection during RLHF.
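
The sketch below shows the kind of rollout collection this enables, assuming the current policy checkpoint is served through a standard vLLM engine. `LLM`, `SamplingParams`, and `generate` are vLLM's public API; the model name, prompts, and surrounding loop are illustrative, not ThinkRL's internal implementation.

```python
# Illustrative only: using a vLLM engine to collect rollouts for an
# RLHF experience buffer. The vLLM calls are its public API; the
# rollout handling around them is a sketch.
from vllm import LLM, SamplingParams

# Load the current policy checkpoint into a vLLM engine.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

# Sampling settings used for exploration during experience collection.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

prompts = [
    "Explain why the sky is blue.",
    "Summarize the plot of Hamlet in two sentences.",
]

# PagedAttention and continuous batching happen inside generate();
# all prompts are scheduled together for high throughput.
outputs = llm.generate(prompts, sampling_params)

# Collect (prompt, response) pairs for the RLHF update step.
rollouts = [(o.prompt, o.outputs[0].text) for o in outputs]
```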
DeepSpeed Support
Native integration with ZeRO-2/3 enables training 70B+ parameter models on commodity hardware.
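
As a rough sketch of what this builds on, the snippet below initializes a ZeRO-3 engine directly through DeepSpeed. The config keys and `deepspeed.initialize` are DeepSpeed's public API; the placeholder model, batch sizes, and offload settings are assumptions, and ThinkRL's own launch path may differ.

```python
# Illustrative only: a minimal DeepSpeed ZeRO-3 setup of the kind the
# integration builds on. The config schema and deepspeed.initialize are
# DeepSpeed's public API; the toy model is a placeholder.
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "offload_param": {"device": "cpu"},      # optional CPU offload for very large models
        "offload_optimizer": {"device": "cpu"},
    },
}

model = torch.nn.Linear(4096, 4096)  # placeholder for a policy model

# DeepSpeed wraps the model and returns a ZeRO-aware engine and optimizer.
# In practice this is launched across GPUs with the `deepspeed` CLI.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```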
What's New
- [2026/01] ThinkRL 1.0: Full support for STaR (Self-Taught Reasoner) and Process Reward Models (PRM)
- [2026/01] Integrated PAPO (Perception-Aware Policy Optimization) for multimodal reasoning
- [2025/12] Added COPO (Count-based Online Preference Optimization) for exploration-heavy tasks
- [2025/11] VAPO and DAPO algorithms merged into core
- [2025/10] Complete vLLM Integration for 10x generation speedup during RLHF