# VAPO

Value-Augmented Policy Optimization: enhanced PPO with improved value function estimation for better sample efficiency and training stability.

## Overview

VAPO (Value-Augmented Policy Optimization) is ThinkRL's flagship algorithm that enhances standard PPO with advanced value function approximation techniques. It significantly improves sample efficiency and training stability in RLHF scenarios through:
- Improved value function estimation with auxiliary objectives
- Better handling of preference data through reward modeling
- Reduced variance in policy gradient estimates
- Adaptive KL penalty scheduling (see the controller sketch after this list)
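
The adaptive KL schedule itself is not spelled out on this page. As a point of reference, the sketch below shows the standard proportional KL controller from the RLHF literature (Ziegler et al., 2019), which is one common way such scheduling is implemented; the class and parameter names are illustrative, not part of the ThinkRL API.

```python
# Illustrative adaptive KL controller (standard proportional rule from the RLHF
# literature, e.g. Ziegler et al. 2019). NOT ThinkRL's internal implementation.
class AdaptiveKLController:
    def __init__(self, init_kl_coeff: float = 0.1, target_kl: float = 6.0, horizon: int = 10_000):
        self.kl_coeff = init_kl_coeff  # starts at the configured kl_coeff
        self.target_kl = target_kl     # desired KL between policy and reference model
        self.horizon = horizon         # smoothing horizon, in update steps

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Proportional error, clipped to [-0.2, 0.2] to avoid abrupt coefficient jumps.
        error = max(min(observed_kl / self.target_kl - 1.0, 0.2), -0.2)
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff


# After each batch, feed the measured KL back in to get the next penalty coefficient.
controller = AdaptiveKLController(init_kl_coeff=0.1)
next_coeff = controller.update(observed_kl=8.5, n_steps=256)
```
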
## Key Features

### Sample Efficiency

Achieves performance comparable to standard PPO with 40% fewer samples.
### Training Stability

Robust to hyperparameter choices, with improved convergence guarantees.
### Preference Modeling

Native support for Bradley-Terry preference modeling with custom reward functions (see the sketch below).
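
For reference, Bradley-Terry reward modeling fits the reward model so that the chosen response outscores the rejected one, via a log-sigmoid loss over the score difference. The snippet below is a generic illustration of that loss, not ThinkRL's reward-model code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected); minimize its negative log-likelihood.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example: reward-model scores for three preference pairs.
loss = bradley_terry_loss(torch.tensor([1.2, 0.4, 2.0]), torch.tensor([0.3, 0.9, 1.1]))
```
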
### vLLM Integration

Optimized for vLLM-accelerated generation during experience collection (a standalone example follows).
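
For context, generation with vLLM during experience collection typically looks like the standalone snippet below, which uses vLLM's public `LLM`/`SamplingParams` API; how ThinkRL wires this in internally may differ.

```python
# Standalone vLLM rollout generation (public vLLM API); ThinkRL's internal
# integration may differ from this sketch.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3-8b")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

prompts = ["Explain reinforcement learning in one sentence."]
for request in llm.generate(prompts, sampling):
    print(request.outputs[0].text)  # sampled completion used as rollout experience
```
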
## Usage

### Python API

`train_vapo.py`:

```python
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",
    # VAPO-specific parameters
    vapo_value_coeff=0.5,     # Value loss coefficient
    vapo_aux_coeff=0.1,       # Auxiliary objective coefficient
    vapo_entropy_coeff=0.01,  # Entropy bonus
    # General training parameters
    learning_rate=1e-6,
    kl_coeff=0.1,
    use_vllm=True,
)

trainer = RLHFTrainer(config)
trainer.train()
```

### Command Line
```bash
python -m thinkrl.cli.train_rl \
    --algo vapo \
    --model_name_or_path meta-llama/Llama-3-8b \
    --reward_model_path ./rm_checkpoint \
    --vapo_value_coeff 0.5 \
    --vapo_aux_coeff 0.1 \
    --use_vllm True
```

## Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| vapo_value_coeff | 0.5 | Coefficient for value function loss |
| vapo_aux_coeff | 0.1 | Coefficient for auxiliary objectives |
| vapo_entropy_coeff | 0.01 | Entropy bonus coefficient |
| kl_coeff | 0.1 | KL divergence penalty coefficient |
| clip_range | 0.2 | PPO clipping range |
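
To make the role of these coefficients concrete, the sketch below shows how they would typically enter a PPO-style objective. The exact form of VAPO's auxiliary value objective is not documented on this page, so it appears only as a placeholder term.

```python
import torch
import torch.nn.functional as F

def vapo_style_loss(log_probs, old_log_probs, advantages, values, returns,
                    entropy, kl_to_ref, aux_loss,
                    value_coeff=0.5, aux_coeff=0.1, entropy_coeff=0.01,
                    kl_coeff=0.1, clip_range=0.2):
    # Clipped PPO surrogate objective on the policy.
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # Value-function regression toward the empirical returns.
    value_loss = F.mse_loss(values, returns)

    return (policy_loss
            + value_coeff * value_loss
            + aux_coeff * aux_loss            # placeholder for VAPO's auxiliary objective
            - entropy_coeff * entropy.mean()  # entropy bonus encourages exploration
            + kl_coeff * kl_to_ref.mean())    # penalty for drifting from the reference policy
```
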
## Best Practices

- Start with default hyperparameters and tune kl_coeff first
- Use vLLM for generation when training with 7B+ models
- Monitor value function loss - it should decrease steadily
- Use gradient checkpointing for models larger than 13B
- Consider LoRA for memory-constrained environments (a hedged config sketch of the last two points follows)
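
A combined sketch of the last two recommendations is below. The gradient-checkpointing and LoRA field names are assumptions for illustration only; they are not confirmed anywhere on this page, so check the `ModelConfig` of your ThinkRL version for the actual options.

```python
# Hypothetical memory-saving configuration. The gradient_checkpointing / LoRA
# fields below are ASSUMED names, not confirmed by this page.
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="vapo",
    use_vllm=True,
    gradient_checkpointing=True,  # assumed flag: trade recompute for activation memory
    use_lora=True,                # assumed flag: train low-rank adapters instead of full weights
    lora_rank=16,                 # assumed field: adapter rank
)

RLHFTrainer(config).train()
```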