
VAPO

Value-Augmented Policy Optimization: an enhanced PPO variant with improved value function estimation for better sample efficiency and training stability.

Overview

VAPO (Value-Augmented Policy Optimization) is ThinkRL's flagship algorithm that enhances standard PPO with advanced value function approximation techniques. It significantly improves sample efficiency and training stability in RLHF scenarios through:

  • Improved value function estimation with auxiliary objectives
  • Better handling of preference data through reward modeling
  • Reduced variance in policy gradient estimates
  • Adaptive KL penalty scheduling
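
Concretely, the resulting objective has the shape of a standard PPO clipped policy loss augmented with a value loss, an auxiliary value term, and an entropy bonus. The sketch below is illustrative only and is not ThinkRL's internal implementation; the coefficient arguments mirror the vapo_* hyperparameters documented later on this page, while the auxiliary value head and its targets are assumptions made for the sake of the example.

# Illustrative sketch of a VAPO-style combined loss (not ThinkRL's internals).
# Assumes advantages, returns, and old log-probs were precomputed from a rollout buffer.
import torch
import torch.nn.functional as F

def vapo_style_loss(
    new_logprobs: torch.Tensor,   # log pi_theta(a|s) for sampled actions
    old_logprobs: torch.Tensor,   # log pi_theta_old(a|s), detached
    advantages: torch.Tensor,     # advantage estimates (e.g. GAE)
    values: torch.Tensor,         # current value head predictions
    returns: torch.Tensor,        # value regression targets
    aux_values: torch.Tensor,     # auxiliary value head predictions (assumed)
    aux_targets: torch.Tensor,    # auxiliary objective targets (assumed)
    entropy: torch.Tensor,        # per-token policy entropy
    clip_range: float = 0.2,
    value_coeff: float = 0.5,     # mirrors vapo_value_coeff
    aux_coeff: float = 0.1,       # mirrors vapo_aux_coeff
    entropy_coeff: float = 0.01,  # mirrors vapo_entropy_coeff
) -> torch.Tensor:
    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value regression plus an auxiliary value objective.
    value_loss = F.mse_loss(values, returns)
    aux_loss = F.mse_loss(aux_values, aux_targets)

    # Entropy bonus encourages exploration, so it is subtracted from the loss.
    return (
        policy_loss
        + value_coeff * value_loss
        + aux_coeff * aux_loss
        - entropy_coeff * entropy.mean()
    )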

Key Features

Sample Efficiency

Achieves performance comparable to standard PPO while using roughly 40% fewer samples.

Training Stability

Robust to hyperparameter choices with improved convergence guarantees.

Preference Modeling

Native support for Bradley-Terry preference modeling with custom reward functions.
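
For reference, Bradley-Terry modeling trains the reward model so that the preferred (chosen) response is scored above the rejected one; the loss is the negative log-likelihood of that ordering. The snippet below is a generic sketch of this loss, not ThinkRL's reward-modeling code.

# Generic Bradley-Terry preference loss sketch (not ThinkRL-specific).
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected);
    # maximizing its log-likelihood gives the loss below.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()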

vLLM Integration

Optimized for vLLM-accelerated generation during experience collection.
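
When use_vllm=True, experience collection conceptually looks like the standalone snippet below, which uses vLLM's public LLM and SamplingParams API directly. ThinkRL wraps this step internally, so the exact integration details are assumptions; the snippet only shows what vLLM-accelerated generation involves.

# Conceptual sketch of vLLM-accelerated rollout generation
# (uses vLLM's public API directly; ThinkRL wraps this internally).
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8b")
sampling = SamplingParams(temperature=1.0, top_p=0.9, max_tokens=512)

prompts = ["Explain reinforcement learning in one sentence."]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)  # completions are then scored by the reward model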

Usage

Python API

train_vapo.py
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",
    
    # VAPO-specific parameters
    vapo_value_coeff=0.5,        # Value loss coefficient
    vapo_aux_coeff=0.1,          # Auxiliary objective coefficient
    vapo_entropy_coeff=0.01,     # Entropy bonus
    
    # General training parameters
    learning_rate=1e-6,
    kl_coeff=0.1,
    use_vllm=True,
)

trainer = RLHFTrainer(config)
trainer.train()

Command Line

python -m thinkrl.cli.train_rl \
    --algo vapo \
    --model_name_or_path meta-llama/Llama-3-8b \
    --reward_model_path ./rm_checkpoint \
    --vapo_value_coeff 0.5 \
    --vapo_aux_coeff 0.1 \
    --use_vllm True

Hyperparameters

Parameter            Default   Description
vapo_value_coeff     0.5       Coefficient for the value function loss
vapo_aux_coeff       0.1       Coefficient for auxiliary objectives
vapo_entropy_coeff   0.01      Entropy bonus coefficient
kl_coeff             0.1       KL divergence penalty coefficient
clip_range           0.2       PPO clipping range
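
The kl_coeff value above is the initial coefficient; the adaptive KL penalty scheduling mentioned in the overview typically scales it up or down based on the measured KL divergence against a target, in the style of the proportional controller from Ziegler et al. The class below is an illustrative sketch under that assumption, not ThinkRL's exact scheduler.

# Illustrative adaptive KL controller (assumption: ThinkRL uses a similar
# proportional scheme; this is not its exact implementation).
class AdaptiveKLController:
    def __init__(self, init_kl_coeff: float = 0.1,
                 target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coeff = init_kl_coeff
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Scale the coefficient up when the policy drifts past the target KL,
        # and down when it stays well below it.
        error = min(max(observed_kl / self.target_kl - 1.0, -0.2), 0.2)
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff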

Best Practices

  • Start with default hyperparameters and tune kl_coeff first
  • Use vLLM for generation when training with 7B+ models
  • Monitor the value function loss; it should decrease steadily
  • Use gradient checkpointing for models larger than 13B
  • Consider LoRA for memory-constrained environments (both are sketched below)
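
For the last two points, a minimal sketch using the Hugging Face transformers and peft libraries is shown below. Whether ThinkRL exposes equivalent options through ModelConfig is not specified here, so the snippet operates on the underlying model directly.

# Minimal sketch: gradient checkpointing + LoRA on the policy model
# (uses transformers/peft directly; ThinkRL's own config flags may differ).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8b")
model.gradient_checkpointing_enable()  # trades compute for activation memory

lora_config = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable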