Algorithms
GRPO
Group Relative Policy Optimization - Leverages group-wise comparisons among responses to the same prompt for more robust preference modeling and alignment.
Overview
GRPO samples a group of responses for each prompt and scores every response relative to the others in its group, rather than against an absolute reward scale or a learned value function. Normalizing rewards within the group yields a lower-variance, scale-invariant learning signal and mirrors the comparative way human preference data is usually collected.
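For intuition, the group-relative advantage can be computed by subtracting the group's mean reward from each response's reward and dividing by the group's standard deviation. The snippet below is a minimal sketch of that computation in plain PyTorch; the function name and tensor shapes are illustrative and it is not ThinkRL's internal implementation.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, group_size) reward-model scores, one per response.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each response is scored relative to the other responses for the same prompt.
    return (rewards - mean) / (std + eps)

# Two prompts with a group of four responses each.
rewards = torch.tensor([[0.1, 0.5, 0.9, 0.3],
                        [2.0, 2.1, 1.9, 2.2]])
advantages = group_relative_advantages(rewards)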
Usage
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="grpo",
    grpo_group_size=4,       # Number of responses per group
    grpo_temperature=1.0,    # Softmax temperature
    learning_rate=1e-6,
)
trainer = RLHFTrainer(config)
trainer.train()
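The grpo_group_size setting determines how many responses are sampled per prompt to form each comparison group; larger groups give a lower-variance group baseline at the cost of more generation per training step.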