notesum.ai

Published on December 4

Mimir: Improving Video Diffusion Models for Precise Text Understanding

cs.CV

Released Date: December 4, 2024

Authors: Shuai Tan¹, Biao Gong¹, Yutong Feng², Kecheng Zheng¹, Dandan Zheng¹, Shuwei Shi¹, Yujun Shen¹, Jingdong Chen¹, Ming Yang¹

Aff.: ¹Ant Group; ²Tsinghua University

arXiv: http://arxiv.org/pdf/2412.03085v1

Method	Background Consistency	Aesthetic Quality	Imaging Quality	Object Class	Multiple Objects	Color Consistency	Spatial Relationship	Temporal Style
ModelscopeT2V [30]	92.00%	37.14%	55.85%	31.17%	1.52%	63.20%	8.26%	14.52%
OpenSora [60]	97.20%	58.57%	63.38%	90.79%	64.81%	84.67%	76.63%	25.51%
OpenSoraPlan [28]	97.50%	59.40%	57.79%	67.39%	26.98%	83.38%	38.69%	21.86%
CogVideoX-2B [53]	94.71%	60.27%	60.52%	84.86%	65.70%	86.21%	70.49%	25.10%
CogVideoX-5B [53]	95.60%	60.62%	61.35%	87.82%	65.70%	84.17%	64.86%	25.86%
Mimir	97.68%	62.92%	63.91%	92.87%	85.29%	86.50%	78.67%	26.22%