Personalized Group Relative Policy Optimization for Heterogeneous Preference Alignment
Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, such as Reinforcement Learning from Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all …
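To ground the group-based normalization the abstract refers to: GRPO samples a group of completions per prompt and scores each completion's reward relative to the group's mean and standard deviation, rather than learning a separate value function. The sketch below shows only that advantage-normalization step under the assumption of scalar rewards; the function name and the epsilon stabilizer are illustrative, and the full algorithm additionally uses a clipped PPO-style ratio and a KL penalty.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Sketch of GRPO's group-relative advantage estimate.

    Each completion's reward is standardized against its own group's
    statistics, so the advantage measures quality *relative to the group*.
    Note the implicit assumption: one shared reward scale ranks every
    completion in the group, regardless of who is judging.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of rewards like `[1.0, 2.0, 3.0]`, the advantages are symmetric around zero, with the above-average completion pushed up and the below-average one pushed down.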