AIGV-Assessor: Benchmarking and Evaluating the Perceptual Quality of Text-to-Video Generation with LMM
cs.CV
Released Date: November 26, 2024
Authors: Jiarui Wang¹, Huiyu Duan², Guangtao Zhai², Juntong Wang¹, Xiongkuo Min¹
Affiliations: ¹ Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; ² Institute of Image Communication and Network Engineering, Shanghai Jiao Tong University, Shanghai, China; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

| Dimension | Static Quality | | | | Temporal Smoothness | | | | Dynamic Degree | | | | TV Correspondence | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Methods / Metrics | Pair Acc | SRCC | PLCC | KRCC | Pair Acc | SRCC | PLCC | KRCC | Pair Acc | SRCC | PLCC | KRCC | Pair Acc | SRCC | PLCC | KRCC |
| NIQE [48] | 54.32% | 0.0867 | 0.1626 | 0.0615 | 52.67% | 0.0641 | 0.1152 | 0.0451 | 45.64% | 0.1765 | 0.2448 | 0.1194 | 46.99% | 0.1771 | 0.2231 | 0.1193 |
| QAC [79] | 49.96% | 0.1022 | 0.1363 | 0.0680 | 54.90% | 0.1633 | 0.2039 | 0.1105 | 54.72% | 0.0448 | 0.0427 | 0.0295 | 54.48% | 0.0303 | 0.0197 | 0.2233 |
| BRISQUE [47] | 59.98% | 0.2909 | 0.2443 | 0.1969 | 55.67% | 0.2325 | 0.1569 | 0.1553 | 44.60% | 0.1351 | 0.0959 | 0.0893 | 51.02% | 0.1294 | 0.1017 | 0.0869 |
| BPRI [45] | 52.28% | 0.2181 | 0.1723 | 0.1398 | 47.26% | 0.1766 | 0.0880 | 0.1138 | 46.83% | 0.1956 | 0.1688 | 0.1329 | 49.13% | 0.1569 | 0.1548 | 0.1052 |
| HOSA [76] | 61.54% | 0.2420 | 0.2106 | 0.1643 | 57.31% | 0.2311 | 0.1757 | 0.1559 | 44.97% | 0.0755 | 0.0449 | 0.0496 | 52.23% | 0.1645 | 0.1324 | 0.1097 |
| BMPRI [46] | 53.71% | 0.1690 | 0.1481 | 0.1075 | 49.31% | 0.1434 | 0.0844 | 0.0894 | 45.07% | 0.1153 | 0.0925 | 0.0777 | 48.43% | 0.1567 | 0.1500 | 0.1041 |
| V-Dynamic [24] | 51.34% | 0.0768 | 0.0792 | 0.0494 | 31.91% | 0.3713 | 0.4871 | 0.2557 | 53.11% | 0.1466 | 0.0253 | 0.0988 | 46.96% | 0.0405 | 0.0576 | 0.0223 |
| V-Smoothness [24] | 61.63% | 0.6748 | 0.4506 | 0.4590 | 76.59% | 0.8526 | 0.8313 | 0.6533 | 47.63% | 0.2446 | 0.2328 | 0.1580 | 61.28% | 0.3188 | 0.3073 | 0.2214 |
| CLIPScore [20] | 47.09% | 0.0731 | 0.0816 | 0.0473 | 46.33% | 0.0423 | 0.0334 | 0.0271 | 52.99% | 0.0675 | 0.0835 | 0.0439 | 55.62% | 0.1519 | 0.1731 | 0.1014 |
| BLIPScore [36] | 53.24% | 0.0492 | 0.0421 | 0.0330 | 53.07% | 0.0659 | 0.0487 | 0.0437 | 53.03% | 0.1786 | 0.1904 | 0.1205 | 61.53% | 0.1813 | 0.1896 | 0.1219 |
| AestheticScore [54] | 70.24% | 0.6713 | 0.6959 | 0.4784 | 54.82% | 0.5154 | 0.4946 | 0.3484 | 52.96% | 0.2295 | 0.2322 | 0.1527 | 59.64% | 0.2381 | 0.2440 | 0.1602 |
| ImageReward [77] | 56.69% | 0.2606 | 0.2646 | 0.1749 | 54.09% | 0.2382 | 0.2305 | 0.1600 | 53.90% | 0.1840 | 0.1836 | 0.1237 | 63.97% | 0.2311 | 0.2450 | 0.1568 |
| UMTScore [42] | 48.93% | 0.0168 | 0.0199 | 0.0117 | 49.93% | 0.0302 | 0.0370 | 0.0207 | 52.69% | 0.0168 | 0.0198 | 0.0117 | 53.82% | 0.0172 | 0.0065 | 0.0108 |
| Video-LLaVA [39] | 50.90% | 0.0384 | 0.0513 | 0.0297 | 50.36% | 0.0431 | 0.0281 | 0.0347 | 50.34% | 0.1561 | 0.1436 | 0.1176 | 50.54% | 0.1364 | 0.1051 | 0.1009 |
| Video-ChatGPT [44] | 51.20% | 0.1242 | 0.1587 | 0.0940 | 50.16% | 0.0580 | 0.0533 | 0.0453 | 50.47% | 0.0724 | 0.0436 | 0.0563 | 50.07% | 0.0357 | 0.0124 | 0.0274 |
| LLaVA-NeXT [35] | 52.85% | 0.1239 | 0.1625 | 0.0954 | 52.41% | 0.4021 | 0.3722 | 0.3052 | 51.84% | 0.1767 | 0.1655 | 0.1328 | 59.20% | 0.4116 | 0.3428 | 0.3261 |
| VideoLLaMA2 [12] | 52.73% | 0.2643 | 0.3271 | 0.1928 | 52.27% | 0.3608 | 0.2450 | 0.2696 | 50.78% | 0.1900 | 0.1561 | 0.1379 | 54.25% | 0.1656 | 0.1633 | 0.1210 |
| Qwen2-VL [65] | 56.50% | 0.4922 | 0.5291 | 0.3838 | 49.12% | 0.1681 | 0.4219 | 0.1233 | 52.08% | 0.1122 | 0.1335 | 0.0849 | 53.30% | 0.3111 | 0.2775 | 0.2306 |
| HyperIQA [56] | 68.30% | 0.7931 | 0.8093 | 0.5969 | 54.65% | 0.7426 | 0.6630 | 0.5407 | 53.32% | 0.2103 | 0.2100 | 0.1384 | 57.54% | 0.6226 | 0.6250 | 0.4432 |
| MUSIQ [25] | 66.46% | 0.7880 | 0.8044 | 0.5773 | 55.16% | 0.7199 | 0.6920 | 0.5034 | 52.85% | 0.5206 | 0.4846 | 0.3521 | 58.46% | 0.4125 | 0.4093 | 0.2844 |
| LIQE [83] | 63.86% | 0.8776 | 0.8691 | 0.7008 | 55.84% | 0.7935 | 0.7720 | 0.6084 | 49.02% | 0.5303 | 0.5840 | 0.3837 | 55.10% | 0.3862 | 0.3639 | 0.2640 |
| VSFA [34] | 46.43% | 0.3365 | 0.3421 | 0.2268 | 50.95% | 0.3317 | 0.3273 | 0.2202 | 51.46% | 0.1201 | 0.1362 | 0.0815 | 48.07% | 0.1024 | 0.1064 | 0.0666 |
| BVQA [32] | 29.98% | 0.4594 | 0.4701 | 0.3268 | 37.65% | 0.3704 | 0.3819 | 0.2507 | 55.08% | 0.4594 | 0.4701 | 0.3268 | 42.32% | 0.3720 | 0.3978 | 0.2559 |
| simpleVQA [57] | 68.12% | 0.8355 | 0.6438 | 0.8489 | 54.14% | 0.7082 | 0.7008 | 0.4978 | 53.08% | 0.4671 | 0.3160 | 0.3994 | 58.20% | 0.4643 | 0.5440 | 0.3163 |
| FAST-VQA [69] | 70.64% | 0.8738 | 0.8644 | 0.6860 | 62.93% | 0.9036 | 0.9134 | 0.7166 | 54.34% | 0.5603 | 0.5703 | 0.3895 | 65.05% | 0.6875 | 0.6704 | 0.4978 |
| DOVER [70] | 72.92% | 0.8907 | 0.8895 | 0.7004 | 58.83% | 0.9063 | 0.9195 | 0.7187 | 53.16% | 0.5549 | 0.5489 | 0.3800 | 62.35% | 0.6783 | 0.6802 | 0.4969 |
| Q-Align [71] | 71.86% | 0.8516 | 0.8383 | 0.6641 | 57.95% | 0.8116 | 0.7025 | 0.6195 | 53.71% | 0.5655 | 0.5012 | 0.3950 | 62.91% | 0.5542 | 0.5647 | 0.3870 |
| Ours | 79.83% | 0.9162 | 0.9190 | 0.7576 | 76.60% | 0.9232 | 0.9216 | 0.8038 | 60.30% | 0.6093 | 0.6082 | 0.4435 | 70.32% | 0.7500 | 0.7697 | 0.5591 |
| Improvement | +6.9% | +2.7% | +3.0% | +5.7% | +13.7% | +1.7% | +0.2% | +8.5% | +5.2% | +4.4% | +3.8% | +4.4% | +5.3% | +6.3% | +9.9% | +6.13% |
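
The four metrics in the table (Pair Acc, SRCC, PLCC, KRCC) can be illustrated with a small pure-Python sketch. This is not the authors' evaluation code. The function names are hypothetical, the toy scores are invented for illustration, and `pair_accuracy` assumes that the preferred video in each pair is the one with the higher ground-truth score, which may differ from the pairwise protocol used in the paper. `krcc` uses the simple tau-a form without tie correction.

```python
from itertools import combinations
from math import sqrt


def plcc(x, y):
    """Pearson linear correlation coefficient between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    score = 0
    pairs = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        score += (s > 0) - (s < 0)
        pairs += 1
    return score / pairs


def pair_accuracy(pred, truth):
    """Share of pairs that pred orders the same way as truth.

    Assumption: pairs with equal ground-truth scores are skipped.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(truth)), 2):
        if truth[i] == truth[j]:
            continue
        total += 1
        if (pred[i] - pred[j]) * (truth[i] - truth[j]) > 0:
            correct += 1
    return correct / total


# Toy data (invented): ground-truth scores and one model's predictions.
truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.2, 1.9, 3.5, 3.1, 4.8]

print("Pair Acc:", pair_accuracy(pred, truth))
print("SRCC:", round(srcc(pred, truth), 4))
print("PLCC:", round(plcc(pred, truth), 4))
print("KRCC:", krcc(pred, truth))
```

In this toy input exactly one of the ten pairs is misordered (the third and fourth items), so the pairwise accuracy is 9/10 and the Kendall coefficient is (9 - 1)/10. The rank-based metrics depend only on ordering, while PLCC also reflects how linearly the predictions track the ground-truth scores.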