R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback
cs.CL
cs.AI
Released Date: November 13, 2024
Authors: Jiahui Li¹, Tai-wei Chang², Fengda Zhang¹, Kun Kuang¹, Long Chen³
Affiliations: ¹Zhejiang University; ²Ant Group; ³HKUST

| Method | Win | Tie | Lose |
|---|---|---|---|
| PPO-RLHF | 37.5% | 24.5% | 38.0% |
| PPO-R3HF | 59.5% | 20.0% | 20.5% |
| DPO | 38.5% | 27.0% | 34.5% |