Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models
Subjects: cs.CL, cs.CV
Release Date: November 25, 2024
Authors: Hao Yi¹, Qingyang Li², Yulan Hu¹, Fuzheng Zhang², Di Zhang², Yong Liu³
Affiliations: ¹Kuaishou Technology, Beijing, China, and Gaoling School of Artificial Intelligence, Renmin University of China, Beijing; ²Kuaishou Technology, Beijing, China; ³Gaoling School of Artificial Intelligence, Renmin University of China, Beijing

In-domain evaluation results on WebVid, VIDAL, and ActivityNet (Score and Ratio, %):

| Method | Model | WebVid Score | WebVid Ratio | VIDAL Score | VIDAL Ratio | ActivityNet Score | ActivityNet Ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Pretrained | VideoChatGPT [6] | 3.93 | 79.26 | 3.52 | 70.66 | 3.75 | 76.60 |
| Pretrained | LLaMA-VID [26] | 3.89 | 79.81 | 3.57 | 73.19 | 3.88 | 80.52 |
| Pretrained | Video-LLaVA [7] | 4.21 | 86.88 | 3.75 | 77.00 | 3.73 | 75.11 |
| Pretrained | LLaVA-Next-Video-Base [27] | 4.28 | 89.00 | 3.88 | 80.84 | 4.13 | 86.21 |
| Pretrained | Video-LLaMA2-Base [28] | 4.34 | 90.87 | 3.86 | 81.59 | 3.98 | 84.24 |
| SFT | LLaVA-Hound-SFT [9] | 4.50 | 93.52 | 4.27 | 90.21 | 4.38 | 92.71 |
| SFT | Video-LLaMA2-Chat [28] | 3.82 | 77.14 | 3.62 | 74.23 | 4.13 | 86.49 |
| RLHF/RLAIF | VLM-RLAIF [21] | 4.14 | 88.58 | 3.80 | 80.14 | 4.04 | 85.68 |
| RLHF/RLAIF | LLaVA-Next-Video-DPO [27] | 4.47 | 93.17 | 4.10 | 86.77 | 4.30 | 90.50 |
| RLHF/RLAIF | LLaVA-Hound-DPO [9] | 4.55 | 95.12 | 4.34 | 92.42 | 4.41 | 94.20 |
| RLHF/RLAIF | LLaVA-Hound-DPO | 4.56 | 94.92 | 4.36 | 92.81 | 4.44 | 94.62 |
| RLHF/RLAIF | Iter-W2S-RLAIF (Ours) | 4.78 | 97.67 | 4.52 | 93.66 | 4.64 | 96.13 |
| | Δ vs. LLaVA-Hound-SFT | +0.28 | +4.15 | +0.25 | +3.45 | +0.26 | +3.42 |
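
The final row was unlabeled in the extracted table; its values match, column for column, the gap between Iter-W2S-RLAIF (Ours) and LLaVA-Hound-SFT, so it is read here as the gain over the SFT baseline. A minimal check in Python, with all values copied from the table above (the reading of the "+" row is an assumption verified only by this arithmetic):

```python
# Assumption: the "+" row reports Iter-W2S-RLAIF (Ours) minus LLaVA-Hound-SFT.
# Column order: WebVid Score, WebVid Ratio, VIDAL Score, VIDAL Ratio,
#               ActivityNet Score, ActivityNet Ratio (values from the table above).
iter_w2s_rlaif  = [4.78, 97.67, 4.52, 93.66, 4.64, 96.13]
llava_hound_sft = [4.50, 93.52, 4.27, 90.21, 4.38, 92.71]

gains = [round(ours - sft, 2) for ours, sft in zip(iter_w2s_rlaif, llava_hound_sft)]
print(gains)  # [0.28, 4.15, 0.25, 3.45, 0.26, 3.42] -- matches the "+0.28 / +4.15 / ..." row
```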