notesum.ai
Published on December 4
Mimir: Improving Video Diffusion Models for Precise Text Understanding
cs.CV
Release Date: December 4, 2024
Authors: Shuai Tan¹, Biao Gong¹, Yutong Feng², Kecheng Zheng¹, Dandan Zheng¹, Shuwei Shi¹, Yujun Shen¹, Jingdong Chen¹, Ming Yang¹
Affiliations: ¹Ant Group; ²Tsinghua University

| Method | Background Consistency | Aesthetic Quality | Imaging Quality | Object Class | Multiple Objects | Color Consistency | Spatial Relationship | Temporal Style |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ModelscopeT2V [30] | 92.00% | 37.14% | 55.85% | 31.17% | 1.52% | 63.20% | 8.26% | 14.52% |
| OpenSora [60] | 97.20% | 58.57% | 63.38% | 90.79% | 64.81% | 84.67% | 76.63% | 25.51% |
| OpenSoraPlan [28] | 97.50% | 59.40% | 57.79% | 67.39% | 26.98% | 83.38% | 38.69% | 21.86% |
| CogVideoX-2B [53] | 94.71% | 60.27% | 60.52% | 84.86% | 65.70% | 86.21% | 70.49% | 25.10% |
| CogVideoX-5B [53] | 95.60% | 60.62% | 61.35% | 87.82% | 65.70% | 84.17% | 64.86% | 25.86% |
| Mimir | 97.68% | 62.92% | 63.91% | 92.87% | 85.29% | 86.50% | 78.67% | 26.22% |