notesum.ai

Published on October 30

Aligning Audio-Visual Joint Representations with an Agentic Workflow

cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS

Release Date: October 30, 2024

Authors: Shentong Mo¹, Yibing Song²

Aff.: ¹Carnegie Mellon University, MBZUAI; ²Alibaba Group, Hupan Lab

arXiv: http://arxiv.org/abs/2410.23230v1


True Pairs	False Pairs	T-Alignment (%, $\uparrow$)
50k	0	78.23
0	50k	42.05
50k	50k	52.29
50k	100k	45.65
100k	50k	63.71