Getting Started
Quick Start
Train your first RLHF model in just a few lines of code.
Basic RLHF Training
Here's a minimal example to get started with RLHF training using ThinkRL:
train.py
from thinkrl import RLHFTrainer, ModelConfig
# Configure your model
config = ModelConfig(
    model_name_or_path="microsoft/DialoGPT-small",
    algorithm="vapo",  # or "ppo", "reinforce", "grpo", "dapo"
    learning_rate=1e-5,
    batch_size=32,
)
# Initialize trainer
trainer = RLHFTrainer(config)
# Start training
trainer.train()
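Before launching a full run, it can help to confirm that the base model referenced in ModelConfig actually resolves and loads. The snippet below is a minimal sanity check using Hugging Face transformers directly; it is independent of ThinkRL and assumes the transformers package is installed.
check_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sanity check: verify the base model from ModelConfig can be downloaded and loaded
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
print(f"Loaded {model.config.model_type} with {model.num_parameters():,} parameters")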
Chain-of-Thought Training
Train models with reasoning capabilities using the CoT trainer:
train_cot.py
from thinkrl.training import CoTTrainer, CoTConfig
config = CoTConfig(
    model_name_or_path="Qwen/Qwen2.5-7B",
    reasoning_type="cot",
    max_reasoning_steps=10,
)
trainer = CoTTrainer(config)
trainer.train()
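The exact dataset schema that CoTTrainer expects is not shown here, so the record layout below (prompt, reasoning, and answer fields in a JSONL file) is only an illustrative sketch of step-by-step supervision data, not ThinkRL's confirmed format.
make_cot_data.py
import json

# Hypothetical chain-of-thought record layout (field names are assumptions,
# not ThinkRL's confirmed schema): one prompt, intermediate steps, final answer.
records = [
    {
        "prompt": "What is 12 * 13?",
        "reasoning": ["12 * 13 = 12 * 10 + 12 * 3", "= 120 + 36"],
        "answer": "156",
    }
]

with open("cot_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")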
Command-Line Training
You can also launch training directly from the command line:
# Train with VAPO algorithm
python -m thinkrl.cli.train_rl \
--algo vapo \
--model_name_or_path meta-llama/Llama-3-8b \
--reward_model_path ./rm_checkpoint \
--use_vllm True
# Train with CoT
thinkrl cot --model gpt --algorithm vapo
# Train with REINFORCE
thinkrl train --algorithm reinforce
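If you prefer to drive the CLI from Python, for example to sweep over algorithms, a minimal sketch using the standard library is shown below. It reuses only the flags from the commands above and assumes those commands work in your environment.
import subprocess

# Launch one CLI training run per algorithm, reusing the flags shown above
for algo in ["vapo", "grpo", "reinforce"]:
    subprocess.run(
        [
            "python", "-m", "thinkrl.cli.train_rl",
            "--algo", algo,
            "--model_name_or_path", "meta-llama/Llama-3-8b",
            "--reward_model_path", "./rm_checkpoint",
        ],
        check=True,
    )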
Using vLLM for Fast Generation
Enable vLLM for up to 10x faster experience collection (rollout generation) during RLHF:
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    algorithm="grpo",
    use_vllm=True,  # Enable vLLM acceleration
    vllm_tensor_parallel_size=2,  # For multi-GPU
)
trainer = RLHFTrainer(config)
trainer.train()
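A common way to set vllm_tensor_parallel_size is to match the number of GPUs you want vLLM to shard the model across. The sketch below derives it from the visible GPU count with PyTorch; only the fields from the example above are reused, and the GPU-count logic is standard PyTorch rather than ThinkRL API.
import torch
from thinkrl import RLHFTrainer, ModelConfig

# Shard vLLM generation across all visible GPUs (falls back to 1 if none are visible)
num_gpus = max(torch.cuda.device_count(), 1)

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    algorithm="grpo",
    use_vllm=True,
    vllm_tensor_parallel_size=num_gpus,
)
trainer = RLHFTrainer(config)
trainer.train()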
Next Steps
- Learn about the typical RLHF workflow
- Explore available algorithms (VAPO, DAPO, GRPO, etc.)
- Read the training guides for advanced configurations
- Check out features like PRM, LoRA, and integrations