Algorithms
GRPO
Group Relative Policy Optimization - Leverages group-wise comparisons among responses to the same prompt for more robust preference modeling and alignment.
Overview
GRPO samples a group of responses for each prompt and scores every response relative to the others in its group, rather than against an absolute reward scale or a learned value function. Normalizing rewards within the group yields a lower-variance, scale-invariant learning signal and mirrors the comparative way human preference data is usually collected.
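For intuition, the group-relative advantage can be computed by subtracting the group's mean reward from each response's reward and dividing by the group's standard deviation. The snippet below is a minimal sketch of that computation in plain PyTorch; the function name and tensor shapes are illustrative and it is not ThinkRL's internal implementation.

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: (num_prompts, group_size) reward-model scores, one per response.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    # Each response is scored relative to the other responses for the same prompt.
    return (rewards - mean) / (std + eps)

# Two prompts with a group of four responses each.
rewards = torch.tensor([[0.1, 0.5, 0.9, 0.3],
                        [2.0, 2.1, 1.9, 2.2]])
advantages = group_relative_advantages(rewards)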
Usage
from thinkrl import RLHFTrainer, ModelConfig
config = ModelConfig(
    model_name_or_path="meta-llama/Llama-3-8b",
    reward_model_path="./rm_checkpoint",
    algorithm="grpo",
    grpo_group_size=4,       # Number of responses per group
    grpo_temperature=1.0,    # Softmax temperature
    learning_rate=1e-6,
)
trainer = RLHFTrainer(config)
trainer.train()
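The grpo_group_size setting determines how many responses are sampled per prompt to form each comparison group; larger groups give a lower-variance group baseline at the cost of more generation per training step.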