
ThinkRL Documentation

A comprehensive, state-of-the-art reinforcement learning from human feedback (RLHF) library designed to democratize advanced AI training.

Architecture Foundation

ThinkRL is built on a high-performance stack designed for scale, combining vLLM for fast inference with PyTorch and DeepSpeed for efficient training.

vLLM Integration

High-throughput inference with PagedAttention and continuous batching for 80% faster experience collection during RLHF.
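As a rough illustration of the rollout path, here is a minimal sketch using the plain vLLM API (this is generic vLLM usage, not a ThinkRL-specific interface; the prompts and model id are placeholders):

```python
from vllm import LLM, SamplingParams

# Placeholder prompts to roll out during experience collection.
prompts = [
    "Explain reinforcement learning in one sentence.",
    "What is a process reward model?",
]

# Sampling settings for rollouts; tune temperature/top_p for your task.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# vLLM schedules these prompts with PagedAttention and continuous batching,
# which is what makes batched rollout generation fast.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model id
outputs = llm.generate(prompts, sampling_params)

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```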

DeepSpeed Support

Native integration with ZeRO-2/3 enables training 70B+ parameter models on commodity hardware.
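For reference, a minimal ZeRO stage-3 setup with the standard DeepSpeed API looks like the sketch below (generic DeepSpeed usage under assumed defaults, not a ThinkRL interface; the toy model, batch size, and learning rate are placeholders, and the script is meant to be launched with the `deepspeed` launcher):

```python
import deepspeed
import torch

# Toy module standing in for a large policy network.
model = torch.nn.Linear(1024, 1024)

# Minimal ZeRO-3 config: parameters, gradients, and optimizer states are
# partitioned across data-parallel ranks, cutting per-GPU memory.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

# deepspeed.initialize wraps the model and optimizer with ZeRO sharding.
engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)

# A training step goes through the engine: forward, backward, optimizer step.
x = torch.randn(1, 1024, device=engine.device, dtype=torch.bfloat16)
loss = engine(x).float().pow(2).mean()
engine.backward(loss)
engine.step()
```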

What's New

  • [2026/01] ThinkRL 1.0: Full support for STaR (Self-Taught Reasoner) and Process Reward Models (PRM)
  • [2026/01] Integrated PAPO (Perception-Aware Policy Optimization) for multimodal reasoning
  • [2025/12] Added COPO (Count-based Online Preference Optimization) for exploration-heavy tasks
  • [2025/11] VAPO and DAPO algorithms merged into core
  • [2025/10] Complete vLLM integration for a 10x generation speedup during RLHF