Getting Started
Quick Start
Train your first RLHF model in just a few lines of code.
Basic RLHF Training
Here's a minimal example to get started with RLHF training using ThinkRL:
train.py
from thinkrl import RLHFTrainer, ModelConfig
# Configure your model
config = ModelConfig(
    model_name_or_path="microsoft/DialoGPT-small",
    algorithm="vapo",  # or "ppo", "reinforce", "grpo", "dapo"
    learning_rate=1e-5,
    batch_size=32,
)
# Initialize trainer
trainer = RLHFTrainer(config)
# Start training
trainer.train()
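Before launching a full run, it can help to confirm that the base model referenced in ModelConfig actually resolves and loads. The snippet below is a minimal sanity check using Hugging Face transformers directly; it is independent of ThinkRL and assumes the transformers package is installed.
check_model.py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sanity check: verify the base model from ModelConfig can be downloaded and loaded
tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")
print(f"Loaded {model.config.model_type} with {model.num_parameters():,} parameters")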
Chain-of-Thought Training
Train models with reasoning capabilities using the CoT trainer:
train_cot.py
from thinkrl.training import CoTTrainer, CoTConfig
config = CoTConfig(
    model_name_or_path="Qwen/Qwen2.5-7B",
    reasoning_type="cot",
    max_reasoning_steps=10,
)
trainer = CoTTrainer(config)
trainer.train()
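The exact dataset schema that CoTTrainer expects is not shown here, so the record layout below (prompt, reasoning, and answer fields in a JSONL file) is only an illustrative sketch of step-by-step supervision data, not ThinkRL's confirmed format.
make_cot_data.py
import json

# Hypothetical chain-of-thought record layout (field names are assumptions,
# not ThinkRL's confirmed schema): one prompt, intermediate steps, final answer.
records = [
    {
        "prompt": "What is 12 * 13?",
        "reasoning": ["12 * 13 = 12 * 10 + 12 * 3", "= 120 + 36"],
        "answer": "156",
    }
]

with open("cot_data.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")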
Command-Line Training
You can also launch training directly from the command line:
# Train with VAPO algorithm
python -m thinkrl.cli.train_rl \
--algo vapo \
--model_name_or_path meta-llama/Llama-3-8b \
--reward_model_path ./rm_checkpoint \
--use_vllm True
# Train with CoT
thinkrl cot --model gpt --algorithm vapo
# Train with REINFORCE
thinkrl train --algorithm reinforce
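If you prefer to drive the CLI from Python, for example to sweep over algorithms, a minimal sketch using the standard library is shown below. It reuses only the flags from the commands above and assumes those commands work in your environment.
import subprocess

# Launch one CLI training run per algorithm, reusing the flags shown above
for algo in ["vapo", "grpo", "reinforce"]:
    subprocess.run(
        [
            "python", "-m", "thinkrl.cli.train_rl",
            "--algo", algo,
            "--model_name_or_path", "meta-llama/Llama-3-8b",
            "--reward_model_path", "./rm_checkpoint",
        ],
        check=True,
    )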
Using vLLM for Fast Generation
Enable vLLM for up to 10x faster experience collection (rollout generation) during RLHF:
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    algorithm="grpo",
    use_vllm=True,  # Enable vLLM acceleration
    vllm_tensor_parallel_size=2,  # For multi-GPU
)
trainer = RLHFTrainer(config)
trainer.train()
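A common way to set vllm_tensor_parallel_size is to match the number of GPUs you want vLLM to shard the model across. The sketch below derives it from the visible GPU count with PyTorch; only the fields from the example above are reused, and the GPU-count logic is standard PyTorch rather than ThinkRL API.
import torch
from thinkrl import RLHFTrainer, ModelConfig

# Shard vLLM generation across all visible GPUs (falls back to 1 if none are visible)
num_gpus = max(torch.cuda.device_count(), 1)

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    algorithm="grpo",
    use_vllm=True,
    vllm_tensor_parallel_size=num_gpus,
)
trainer = RLHFTrainer(config)
trainer.train()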
Next Steps
- Learn about the typical RLHF workflow
- Explore available algorithms (VAPO, DAPO, GRPO, etc.)
- Read the training guides for advanced configurations
- Check out features like PRM, LoRA, and integrations