Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization
Categories: cs.MM, cs.AI, cs.CL, cs.HC, cs.IR
Release Date: November 5, 2024
Authors: Zhibin Wen¹, Bin Li
Affiliation: ¹Systems Engineering Institute, Xi'an Jiaotong University

| Method | Publication | Src. | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
| --- | --- | --- | --- | --- | --- | --- |
| Random Pick | - | - | 5.71 | 4.65 | 3.58 | 3.97 |
| DEBUG (Lu et al., 2019) | EMNLP | V | 25.85 | 13.08 | 6.74 | 16.32 |
| GDP (Chen et al., 2020a) | AAAI | V | 27.34 | 15.21 | 6.82 | 16.87 |
| ACRM (Tang et al., 2021a) | TMM | V | 23.65 | 14.38 | 5.26 | 15.83 |
| MutualSL (Weng and Li, 2023) | ICASSP | VT | 40.55 | 29.11 | 14.54 | 28.98 |
| FMALG (Cheng et al., 2023) | NLPCC | VT | 40.99 | 28.63 | 15.44 | 29.77 |
| OCR-LLM (Zhang et al., 2024) | NLPCC | VT | 50.88 | 35.42 | 20.54 | 36.37 |
| VPTSL (Li et al., 2024b) | TPAMI | VT | 51.74 | 34.03 | 17.01 | 36.32 |
| PMI-LOC (Chen et al., 2020b) | ECCV | AV | 32.05 | 14.28 | 6.78 | 21.84 |
| ADPN (Chen et al., 2023a) | ACM MM | AV | 29.97 | 15.22 | 7.85 | 22.22 |
| ADPN (Chen et al., 2023a) | ACM MM | AVT | 32.64 | 17.36 | 9.38 | 24.43 |
| TR-DETR (Sun et al., 2024) | AAAI | AVT | 31.86 | 19.94 | 10.25 | 26.37 |
| AVTSL | - | AVT | 58.08 | 41.02 | 29.04 | 41.75 |
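
The IoU=0.3/0.5/0.7 and mIoU columns above are not defined on this page. As a rough illustration only, the sketch below assumes the convention commonly used in temporal localization: each prediction and each ground-truth answer is a (start, end) span in seconds, "IoU=t" is the percentage of samples whose temporal IoU reaches the threshold t, and mIoU is the mean IoU over all samples, scaled to a percentage. The function names `temporal_iou` and `summarize`, and the choice of `>=` at the threshold, are assumptions made for this example and are not taken from the paper.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) time spans in seconds."""
    pred_start, pred_end = pred
    gold_start, gold_end = gold
    # Overlap is clipped at zero when the spans do not intersect.
    inter = max(0.0, min(pred_end, gold_end) - max(pred_start, gold_start))
    union = (pred_end - pred_start) + (gold_end - gold_start) - inter
    return inter / union if union > 0 else 0.0


def summarize(preds, golds, thresholds=(0.3, 0.5, 0.7)):
    """Return threshold hit rates and mean IoU, all as percentages.

    Assumption: a sample counts as a hit when its IoU is >= the threshold.
    """
    if not preds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    n = len(ious)
    scores = {
        f"IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    scores["mIoU"] = 100.0 * sum(ious) / n
    return scores


if __name__ == "__main__":
    # Made-up spans for illustration; these are not data from the paper.
    preds = [(0.0, 10.0), (5.0, 15.0), (0.0, 4.0)]
    golds = [(0.0, 10.0), (10.0, 20.0), (2.0, 10.0)]
    for name, value in summarize(preds, golds).items():
        print(f"{name}: {value:.2f}")
```

Under this reading, a row such as AVTSL's 41.02 at IoU=0.5 would mean that roughly 41% of predicted spans overlap their ground-truth span with IoU of at least 0.5. Some codebases compute the union as the extent from the earliest start to the latest end, which gives the same value whenever the spans overlap.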