Getting Started
Typical Workflow
A complete RLHF training pipeline from data preparation to deployment.
Workflow Overview
A typical ThinkRL workflow consists of these stages:
1. SFT: Supervised Fine-Tuning
2. RM: Reward Model Training
3. RLHF: Policy Optimization
4. Deploy: Model Export & Serving
Step 1: Supervised Fine-Tuning (SFT)
Start with supervised fine-tuning on your instruction dataset, which pairs each prompt with a target response.
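A rough sketch of a single training record, assuming a simple prompt/response schema (the field names here are illustrative, not a ThinkRL requirement; use whatever your dataset provides):

# Illustrative instruction-tuning record (hypothetical schema).
record = {
    "prompt": "Explain what gradient accumulation does.",
    "response": "It accumulates gradients over several micro-batches before each optimizer step, simulating a larger batch size.",
}

With the dataset in place, configure and launch the trainer: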
step1_sft.py
from thinkrl.training import SFTTrainer, SFTConfig
config = SFTConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    dataset_name="your-org/instruction-dataset",
    output_dir="./sft_checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
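    # With the settings above, the effective batch size is 4 x 4 = 16 per device.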
    learning_rate=2e-5,
)
trainer = SFTTrainer(config)
trainer.train()

Step 2: Reward Model Training
Train a reward model on human preference data, where each example pairs a prompt with a preferred (chosen) response and a less-preferred (rejected) one.
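A rough sketch of a single preference record, again with illustrative field names (adapt to your dataset's schema):

# Illustrative preference record (hypothetical schema).
record = {
    "prompt": "Write a short apology for missing a meeting.",
    "chosen": "I'm sorry I missed our meeting this morning; it was my mistake, and I'd like to reschedule at your convenience.",
    "rejected": "Meetings happen. Let's just skip it.",
}

Then configure and run the reward trainer: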
step2_rm.py
from thinkrl.training import RewardTrainer, RewardConfig
config = RewardConfig(
    model_name_or_path="./sft_checkpoint",
    dataset_name="your-org/preference-dataset",
    output_dir="./rm_checkpoint",
    num_train_epochs=1,
    learning_rate=1e-5,
)
trainer = RewardTrainer(config)
trainer.train()

Step 3: RLHF Training
Run policy optimization with your trained reward model:
step3_rlhf.py
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="./sft_checkpoint",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",  # Choose: vapo, dapo, grpo, ppo
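    # vLLM accelerates response generation (rollouts), typically the slowest part of RLHF.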
    use_vllm=True,
    output_dir="./rlhf_checkpoint",
    num_train_epochs=1,
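    # KL penalty weight: keeps the policy from drifting too far from the SFT reference model.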
    kl_coeff=0.1,
)
trainer = RLHFTrainer(config)
trainer.train()

Step 4: Model Export
Export your trained model for deployment:
# Merge LoRA weights (if using LoRA)
python -m thinkrl.cli.merge_lora \
--base_model meta-llama/Llama-3-8b \
--lora_path ./rlhf_checkpoint \
--output_path ./exported_model
# Or export directly
python -m thinkrl.cli.export \
--checkpoint_path ./rlhf_checkpoint \
--output_path ./exported_model \
--format safetensors
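The final stage in the overview is serving. As a minimal sketch, assuming the exported directory is a standard Hugging Face-format checkpoint and vLLM is installed, you could sanity-check it with vLLM's offline inference API (vLLM also provides an OpenAI-compatible HTTP server if you prefer online serving):

from vllm import LLM, SamplingParams

# Load the exported checkpoint (assumes a standard HF-format model directory).
llm = LLM(model="./exported_model")

# Sample one response from the aligned policy.
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain RLHF in one paragraph."], params)
print(outputs[0].outputs[0].text)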