SFT & CoT Training

Supervised Fine-Tuning and Chain-of-Thought training for building foundational capabilities.

Supervised Fine-Tuning (SFT)

SFT is the first stage of the RLHF pipeline: the model is fine-tuned on instruction-response pairs so that it learns to follow instructions before any reward-based training.

from thinkrl.training import SFTTrainer, SFTConfig

config = SFTConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    dataset_name="tatsu-lab/alpaca",
    output_dir="./sft_checkpoint",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size per device: 4 x 4 = 16
    learning_rate=2e-5,
    max_seq_length=2048,
)

trainer = SFTTrainer(config)
trainer.train()
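
After training, the checkpoint written to output_dir can be sanity-checked with a quick generation. The snippet below assumes the trainer saves the model and tokenizer in the standard Hugging Face format under ./sft_checkpoint; if your setup differs, load the checkpoint through whatever mechanism ThinkRL provides instead.

# Quick sanity check of the SFT checkpoint (assumes a Hugging Face-format save)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./sft_checkpoint")
model = AutoModelForCausalLM.from_pretrained("./sft_checkpoint")

prompt = "Explain supervised fine-tuning in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))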

Chain-of-Thought Training

Train models to reason step-by-step before producing final answers.

from thinkrl.training import CoTTrainer, CoTConfig

config = CoTConfig(
    model_name_or_path="Qwen/Qwen2.5-7B",
    reasoning_type="cot",           # or "tot" for Tree-of-Thought
    max_reasoning_steps=10,
    thought_separator="<think>",    # marker that opens the reasoning trace in responses
    dataset_name="your-org/cot-dataset",
)

trainer = CoTTrainer(config)
trainer.train()
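
The thought_separator marks where the reasoning trace begins in each training example. The exact schema depends on the dataset, but a CoT record typically pairs a prompt with a delimited reasoning trace followed by the final answer. The record below is a hypothetical illustration of that layout, not a format required by ThinkRL.

# Hypothetical CoT training record; field names and the closing </think> tag are illustrative.
example = {
    "prompt": "A train covers 60 km in 45 minutes. What is its speed in km/h?",
    "response": (
        "<think>45 minutes is 0.75 hours. Speed = 60 km / 0.75 h = 80 km/h.</think>"
        " The train's speed is 80 km/h."
    ),
}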

Command Line Training

# SFT Training
python -m thinkrl.cli.train_sft \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset_name tatsu-lab/alpaca \
    --output_dir ./sft_checkpoint

# CoT Training
python -m thinkrl.cli.train_cot \
    --model_name_or_path ./sft_checkpoint \
    --dataset_name your-org/cot-dataset \
    --reasoning_type cot \
    --max_reasoning_steps 10
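
The two commands compose into the same two-stage pipeline as the Python examples above: the CoT run is initialized from the SFT checkpoint. For reference, here is a minimal sketch of that pipeline in code, using only the classes introduced earlier and assuming the remaining config fields fall back to reasonable defaults; paths and dataset names are the same placeholders as above.

from thinkrl.training import SFTTrainer, SFTConfig, CoTTrainer, CoTConfig

# Stage 1: instruction tuning
sft_config = SFTConfig(
    model_name_or_path="meta-llama/Meta-Llama-3-8B",
    dataset_name="tatsu-lab/alpaca",
    output_dir="./sft_checkpoint",
)
SFTTrainer(sft_config).train()

# Stage 2: chain-of-thought training, starting from the SFT checkpoint
cot_config = CoTConfig(
    model_name_or_path="./sft_checkpoint",
    reasoning_type="cot",
    max_reasoning_steps=10,
    thought_separator="<think>",
    dataset_name="your-org/cot-dataset",
)
CoTTrainer(cot_config).train()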