Multimodal PAPO Training
Train vision-language models with Perception-Aware Policy Optimization.
Setup
pip install -e ".[multimodal]"
PAPO Training
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="llava-hf/llava-1.5-7b-hf",
    reward_model_path="./multimodal_rm",
    algorithm="papo",
    papo_vision_weight=0.3,
    use_vllm=True,
)

trainer = RLHFTrainer(config)
trainer.train()
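To give a rough intuition for what a vision weight such as papo_vision_weight might control, the sketch below adds a weighted perception term to a policy objective. The term is the mean per-token KL divergence between the model's next-token distribution given the full image and the one given a masked image. This is an illustration only, not thinkrl's implementation: the function names (softmax, kl_divergence, perception_term, combined_objective) are hypothetical, and the assumption that papo_vision_weight scales a term of this shape is not confirmed by this guide.

```python
import math

def softmax(logits):
    """Convert one token position's logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def perception_term(logits_full, logits_masked):
    """Mean per-token KL between full-image and masked-image predictions.

    A larger value means the output depends more strongly on the image.
    """
    kls = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_full, logits_masked)
    ]
    return sum(kls) / len(kls)

def combined_objective(policy_objective, logits_full, logits_masked,
                       vision_weight=0.3):
    """Hypothetical combination: policy objective plus weighted perception term."""
    return policy_objective + vision_weight * perception_term(
        logits_full, logits_masked
    )

# Toy example: two token positions over a three-token vocabulary.
full = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
masked = [[0.7, 0.6, 0.5], [0.4, 0.5, 0.6]]
print(combined_objective(1.0, full, masked, vision_weight=0.3))
```

In this sketch, a vision weight of 0 reduces the result to the plain policy objective, while larger values give more credit to outputs that change when the image is masked.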