# VAPO

Value-Augmented Policy Optimization: enhanced PPO with improved value function estimation for better sample efficiency and training stability.

## Overview

VAPO (Value-Augmented Policy Optimization) is ThinkRL's flagship algorithm that enhances standard PPO with advanced value function approximation techniques. It significantly improves sample efficiency and training stability in RLHF scenarios through:
- Improved value function estimation with auxiliary objectives
- Better handling of preference data through reward modeling
- Reduced variance in policy gradient estimates
- Adaptive KL penalty scheduling (see the controller sketch after this list)
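
The adaptive KL schedule itself is not spelled out on this page. As a point of reference, the sketch below shows the standard proportional KL controller from the RLHF literature (Ziegler et al., 2019), which is one common way such scheduling is implemented; the class and parameter names are illustrative, not part of the ThinkRL API.

```python
# Illustrative adaptive KL controller (standard proportional rule from the RLHF
# literature, e.g. Ziegler et al. 2019). NOT ThinkRL's internal implementation.
class AdaptiveKLController:
    def __init__(self, init_kl_coeff: float = 0.1, target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coeff = init_kl_coeff  # starts at the configured kl_coeff
        self.target_kl = target_kl     # desired KL between policy and reference model
        self.horizon = horizon         # smoothing horizon, in update steps

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] to avoid abrupt coefficient jumps.
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff


# After each batch, feed the measured KL back in to get the next penalty coefficient.
controller = AdaptiveKLController(init_kl_coeff=0.1)
next_coeff = controller.update(observed_kl=8.5, n_steps=256)
```
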
## Key Features

### Sample Efficiency

Achieves performance comparable to standard PPO with 40% fewer samples.
### Training Stability

Robust to hyperparameter choices, with improved convergence guarantees.
### Preference Modeling

Native support for Bradley-Terry preference modeling with custom reward functions (see the sketch below).
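
For reference, Bradley-Terry reward modeling fits the reward model so that the chosen response outscores the rejected one, via a log-sigmoid loss over the score difference. The snippet below is a generic illustration of that loss, not ThinkRL's reward-model code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected); minimize its negative log-likelihood.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: reward-model scores for three preference pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.4, 2.0]), torch.tensor([0.3, 0.9, 1.1]))
```
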
### vLLM Integration

Optimized for vLLM-accelerated generation during experience collection (a standalone example follows).
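
For context, generation with vLLM during experience collection typically looks like the standalone snippet below, which uses vLLM's public `LLM`/`SamplingParams` API; how ThinkRL wires this in internally may differ.

```python
# Standalone vLLM rollout generation (public vLLM API); ThinkRL's internal
# integration may differ from this sketch.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8b")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = ["Explain reinforcement learning in one sentence."]
for request in llm.generate(prompts, sampling):
    print(request.outputs[0].text)  # sampled completion used as rollout experience
```
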
## Usage

### Python API

`train_vapo.py`:

```python
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",
    # VAPO-specific parameters
    vapo_value_coeff=0.5,     # Value loss coefficient
    vapo_aux_coeff=0.1,       # Auxiliary objective coefficient
    vapo_entropy_coeff=0.01,  # Entropy bonus
    # General training parameters
    learning_rate=1e-6,
    kl_coeff=0.1,
    use_vllm=True,
)

trainer = RLHFTrainer(config)
trainer.train()
```

### Command Line
```bash
python -m thinkrl.cli.train_rl \
    --algo vapo \
    --model_name_or_path meta-llama/Llama-3-8b \
    --reward_model_path ./rm_checkpoint \
    --vapo_value_coeff 0.5 \
    --vapo_aux_coeff 0.1 \
    --use_vllm True
```

## Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| vapo_value_coeff | 0.5 | Coefficient for value function loss |
| vapo_aux_coeff | 0.1 | Coefficient for auxiliary objectives |
| vapo_entropy_coeff | 0.01 | Entropy bonus coefficient |
| kl_coeff | 0.1 | KL divergence penalty coefficient |
| clip_range | 0.2 | PPO clipping range |
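
To make the role of these coefficients concrete, the sketch below shows how they would typically enter a PPO-style objective. The exact form of VAPO's auxiliary value objective is not documented on this page, so it appears only as a placeholder term.

```python
import torch
import torch.nn.functional as F

def vapo_style_loss(log_probs, old_log_probs, advantages, values, returns,
                    entropy, kl_to_ref, aux_loss,
                    value_coeff=0.5, aux_coeff=0.1, entropy_coeff=0.01,
                    kl_coeff=0.1, clip_range=0.2):
    # Clipped PPO surrogate objective on the policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value-function regression toward the empirical returns.
    value_loss = F.mse_loss(values, returns)

    return (policy_loss
            + value_coeff * value_loss
            + aux_coeff * aux_loss            # placeholder for VAPO's auxiliary objective
            - entropy_coeff * entropy.mean()  # entropy bonus encourages exploration
            + kl_coeff * kl_to_ref.mean())    # penalty for drifting from the reference policy
```
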
## Best Practices

- Start with default hyperparameters and tune kl_coeff first
- Use vLLM for generation when training with 7B+ models
- Monitor value function loss - it should decrease steadily
- Use gradient checkpointing for models larger than 13B
- Consider LoRA for memory-constrained environments (a hedged config sketch of the last two points follows)
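
A combined sketch of the last two recommendations is below. The gradient-checkpointing and LoRA field names are assumptions for illustration only; they are not confirmed anywhere on this page, so check the `ModelConfig` of your ThinkRL version for the actual options.

```python
# Hypothetical memory-saving configuration. The gradient_checkpointing / LoRA
# fields below are ASSUMED names, not confirmed by this page.
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",
    use_vllm=True,
    gradient_checkpointing=True,  # assumed flag: trade recompute for activation memory
    use_lora=True,                # assumed flag: train low-rank adapters instead of full weights
    lora_rank=16,                 # assumed field: adapter rank
)

RLHFTrainer(config).train()
```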