Getting Started
Typical Workflow
A complete RLHF training pipeline from data preparation to deployment.
Workflow Overview
A typical ThinkRL workflow consists of these stages:
1. SFT: Supervised Fine-Tuning
2. RM: Reward Model Training
3. RLHF: Policy Optimization
4. Deploy: Model Export & Serving
Step 1: Supervised Fine-Tuning (SFT)
Start with supervised fine-tuning on your instruction dataset, which pairs each prompt with a target response.
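A rough sketch of a single training record, assuming a simple prompt/response schema (the field names here are illustrative, not a ThinkRL requirement; use whatever your dataset provides):

# Illustrative instruction-tuning record (hypothetical schema).
record = {
    "prompt": "Explain what gradient accumulation does.",
    "response": "It accumulates gradients over several micro-batches before each optimizer step, simulating a larger batch size.",
}

With the dataset in place, configure and launch the trainer: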
step1_sft.py
from thinkrl.training import SFTTrainer, SFTConfig
config = SFTConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    dataset_name="your-org/instruction-dataset",
    output_dir="./sft_checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
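    # With the settings above, the effective batch size is 4 x 4 = 16 per device.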
    learning_rate=2e-5,
)
trainer = SFTTrainer(config)
trainer.train()

Step 2: Reward Model Training
Train a reward model on human preference data, where each example pairs a prompt with a preferred (chosen) response and a less-preferred (rejected) one.
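A rough sketch of a single preference record, again with illustrative field names (adapt to your dataset's schema):

# Illustrative preference record (hypothetical schema).
record = {
    "prompt": "Write a short apology for missing a meeting.",
    "chosen": "I'm sorry I missed our meeting this morning; it was my mistake, and I'd like to reschedule at your convenience.",
    "rejected": "Meetings happen. Let's just skip it.",
}

Then configure and run the reward trainer: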
step2_rm.py
from thinkrl.training import RewardTrainer, RewardConfig
config = RewardConfig(
    model_name_or_path="./sft_checkpoint",
    dataset_name="your-org/preference-dataset",
    output_dir="./rm_checkpoint",
    num_train_epochs=1,
    learning_rate=1e-5,
)
trainer = RewardTrainer(config)
trainer.train()

Step 3: RLHF Training
Run policy optimization with your trained reward model:
step3_rlhf.py
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="./sft_checkpoint",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",  # Choose: vapo, dapo, grpo, ppo
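    # vLLM accelerates response generation (rollouts), typically the slowest part of RLHF.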
    use_vllm=True,
    output_dir="./rlhf_checkpoint",
    num_train_epochs=1,
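    # KL penalty weight: keeps the policy from drifting too far from the SFT reference model.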
    kl_coeff=0.1,
)
trainer = RLHFTrainer(config)
trainer.train()

Step 4: Model Export
Export your trained model for deployment:
# Merge LoRA weights (if using LoRA)
python -m thinkrl.cli.merge_lora \
--base_model meta-llama/Llama-3-8b \
--lora_path ./rlhf_checkpoint \
--output_path ./exported_model
# Or export directly
python -m thinkrl.cli.export \
--checkpoint_path ./rlhf_checkpoint \
--output_path ./exported_model \
--format safetensors
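The final stage in the overview is serving. As a minimal sketch, assuming the exported directory is a standard Hugging Face-format checkpoint and vLLM is installed, you could sanity-check it with vLLM's offline inference API (vLLM also provides an OpenAI-compatible HTTP server if you prefer online serving):

from vllm import LLM, SamplingParams

# Load the exported checkpoint (assumes a standard HF-format model directory).
llm = LLM(model="./exported_model")

# Sample one response from the aligned policy.
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain RLHF in one paragraph."], params)
print(outputs[0].outputs[0].text)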