GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
cs.CV
Released Date: December 5, 2024
Authors: Kaiyi Huang¹, Yukun Huang¹, Xuefei Ning², Zinan Lin³, Yu Wang², Xihui Liu¹
Affiliations: ¹The University of Hong Kong; ²Tsinghua University; ³Microsoft Research
![[Uncaptioned image]](https://arxiv.org/html/2412.04440v1/x2.png)
| Model | Consist-attr (Grid-LLaVA) | Dynamic-attr (D-LLaVA) | Spatial (G-Dino) | Motion (DOT) | Action (Grid-LLaVA) | Interaction (Grid-LLaVA) | Numeracy (G-Dino) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ModelScope [54] | 0.5483 | 0.1654 | 0.4220 | 0.2552 | 0.4880 | 0.7075 | 0.2066 |
| ZeroScope [1] | 0.4495 | 0.1086 | 0.4073 | 0.2319 | 0.4620 | 0.5550 | 0.2378 |
| Latte [34] | 0.5325 | 0.1598 | 0.4476 | 0.2187 | 0.5200 | 0.6625 | 0.2187 |
| Show-1 [72] | 0.6388 | 0.1828 | 0.4649 | 0.2316 | 0.4940 | 0.7700 | 0.1644 |
| VideoCrafter2 [8] | 0.6750 | 0.1850 | 0.4891 | 0.2233 | 0.5800 | 0.7600 | 0.2041 |
| Open-Sora 1.1 [21] | 0.6370 | 0.1762 | 0.5671 | 0.2317 | 0.5480 | 0.7625 | 0.2363 |
| Open-Sora 1.2 [21] | 0.6600 | 0.1714 | 0.5406 | 0.2388 | 0.5717 | 0.7400 | 0.2556 |
| Open-Sora-Plan v1.0.0 [26] | 0.5088 | 0.1562 | 0.4481 | 0.2147 | 0.5120 | 0.6275 | 0.1650 |
| Open-Sora-Plan v1.1.0 [26] | 0.7413 | 0.1770 | 0.5587 | 0.2187 | 0.6780 | 0.7275 | 0.2928 |
| CogVideoX-5B [66] | 0.7220 | 0.2334 | 0.5461 | 0.2943 | 0.5960 | 0.7950 | 0.2603 |
| AnimateDiff [15] | 0.4883 | 0.1764 | 0.3883 | 0.2236 | 0.4140 | 0.6550 | 0.0884 |
| VideoTetris [51] | 0.7125 | 0.2066 | 0.5148 | 0.2204 | 0.5280 | 0.7600 | 0.2609 |
| Vico [65] | 0.7025 | 0.2376 | 0.4952 | 0.2225 | 0.5480 | 0.7775 | 0.2116 |
| LVD [29] | 0.5595 | 0.1499 | 0.5469 | 0.2699 | 0.4960 | 0.6100 | 0.0991 |
| MagicTime [70] | - | 0.1834 | - | - | - | - | - |
| Pika [2] (Commercial) | 0.6513 | 0.1744 | 0.5043 | 0.2221 | 0.5380 | 0.6625 | 0.2613 |
| Gen-3 [42] (Commercial) | 0.7045 | 0.2078 | 0.5533 | 0.3111 | 0.6280 | 0.7900 | 0.2169 |
| GenMAC (Ours) | 0.7875 | 0.2498 | 0.7461 | 0.3623 | 0.7273 | 0.8250 | 0.5166 |
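To read the table at a glance, it can help to compute, for each category, the strongest baseline and GenMAC's absolute margin over it. The following is a minimal sketch of that arithmetic, not part of the paper: the scores are copied from the table above (citation numbers and the "(Commercial)" tags are dropped from the model names), and the helper name `best_baseline_margins` is hypothetical.

```python
# Scores copied from the table above; None marks entries reported as "-".
COLUMNS = ["Consist-attr", "Dynamic-attr", "Spatial", "Motion",
           "Action", "Interaction", "Numeracy"]

BASELINES = {
    "ModelScope":            [0.5483, 0.1654, 0.4220, 0.2552, 0.4880, 0.7075, 0.2066],
    "ZeroScope":             [0.4495, 0.1086, 0.4073, 0.2319, 0.4620, 0.5550, 0.2378],
    "Latte":                 [0.5325, 0.1598, 0.4476, 0.2187, 0.5200, 0.6625, 0.2187],
    "Show-1":                [0.6388, 0.1828, 0.4649, 0.2316, 0.4940, 0.7700, 0.1644],
    "VideoCrafter2":         [0.6750, 0.1850, 0.4891, 0.2233, 0.5800, 0.7600, 0.2041],
    "Open-Sora 1.1":         [0.6370, 0.1762, 0.5671, 0.2317, 0.5480, 0.7625, 0.2363],
    "Open-Sora 1.2":         [0.6600, 0.1714, 0.5406, 0.2388, 0.5717, 0.7400, 0.2556],
    "Open-Sora-Plan v1.0.0": [0.5088, 0.1562, 0.4481, 0.2147, 0.5120, 0.6275, 0.1650],
    "Open-Sora-Plan v1.1.0": [0.7413, 0.1770, 0.5587, 0.2187, 0.6780, 0.7275, 0.2928],
    "CogVideoX-5B":          [0.7220, 0.2334, 0.5461, 0.2943, 0.5960, 0.7950, 0.2603],
    "AnimateDiff":           [0.4883, 0.1764, 0.3883, 0.2236, 0.4140, 0.6550, 0.0884],
    "VideoTetris":           [0.7125, 0.2066, 0.5148, 0.2204, 0.5280, 0.7600, 0.2609],
    "Vico":                  [0.7025, 0.2376, 0.4952, 0.2225, 0.5480, 0.7775, 0.2116],
    "LVD":                   [0.5595, 0.1499, 0.5469, 0.2699, 0.4960, 0.6100, 0.0991],
    "MagicTime":             [None, 0.1834, None, None, None, None, None],
    "Pika":                  [0.6513, 0.1744, 0.5043, 0.2221, 0.5380, 0.6625, 0.2613],
    "Gen-3":                 [0.7045, 0.2078, 0.5533, 0.3111, 0.6280, 0.7900, 0.2169],
}

GENMAC = [0.7875, 0.2498, 0.7461, 0.3623, 0.7273, 0.8250, 0.5166]


def best_baseline_margins(baselines, ours, columns):
    """Return {column: (best_baseline_name, best_score, absolute_margin)}.

    Hypothetical helper: skips missing entries and rounds the margin to
    four decimals to match the precision of the table.
    """
    result = {}
    for i, col in enumerate(columns):
        scored = [(row[i], name) for name, row in baselines.items()
                  if row[i] is not None]
        best_score, best_name = max(scored)
        result[col] = (best_name, best_score, round(ours[i] - best_score, 4))
    return result


if __name__ == "__main__":
    margins = best_baseline_margins(BASELINES, GENMAC, COLUMNS)
    for col, (name, score, margin) in margins.items():
        print(f"{col:<13} best baseline: {name:<22} {score:.4f}  margin: {margin:+.4f}")
```

The margins are absolute score differences. Each column uses its own metric, so they are best compared within a column rather than across columns.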