Algorithms
PAPO
Perception-Aware Policy Optimization - Designed for multimodal reasoning with vision-language models.
Overview
PAPO extends RLHF to multimodal settings by incorporating perception-aware reward signals that account for visual grounding and cross-modal alignment. It is well suited to training vision-language models with human feedback, since the reward reflects not only text quality but also how faithfully a response attends to the image.
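Conceptually, this amounts to blending a text-quality reward with a visual-grounding reward before the policy update. The sketch below is illustrative only, assuming a simple convex combination; the function name combined_reward is hypothetical, and the actual PAPO formulation may differ. The weight mirrors the papo_vision_weight setting shown in Usage.

def combined_reward(text_reward: float, vision_reward: float,
                    vision_weight: float = 0.3) -> float:
    """Blend a text-quality reward with a visual-grounding reward
    (illustrative sketch, not the library's internal formulation)."""
    return (1.0 - vision_weight) * text_reward + vision_weight * vision_reward

# A response that reads well (0.8) but is weakly grounded in the image (0.2)
# is pulled down by the perception-aware term:
print(combined_reward(0.8, 0.2))  # 0.62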
Usage
from thinkrl import RLHFTrainer, ModelConfig

config = ModelConfig(
    model_name_or_path="llava-hf/llava-1.5-7b-hf",  # vision-language policy model
    reward_model_path="./multimodal_rm",            # multimodal reward model
    algorithm="papo",                               # select the PAPO algorithm
    papo_vision_weight=0.3,                         # weight of the perception-aware reward term
    learning_rate=1e-6,
)

trainer = RLHFTrainer(config)
trainer.train()
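Here papo_vision_weight=0.3 keeps the text-quality reward dominant while still penalizing responses that ignore the image; raising it presumably shifts optimization toward visual grounding at some cost to fluency.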