Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models
Subjects: cs.AI, cs.CL, cs.SD, eess.AS
Release Date: December 6, 2024
Authors: Kuofeng Gao¹, Shu-Tao Xia², Ke Xu¹, Philip Torr³, Jindong Gu³
Affiliations: ¹Tsinghua University; ²Peng Cheng Laboratory; ³University of Oxford

| Models | Size | ADU-General | ADU-Skill | ADU-Multilingual | ADU-Ambiguity | Average |
| --- | --- | --- | --- | --- | --- | --- |
| PandaGPT | 7B | 1.02 | 0.98 | 0.98 | 0.50 | 0.87 |
| NExT-GPT | 7B | 1.07 | 1.03 | 1.02 | 0.52 | 0.91 |
| Qwen-Audio | 7B | 1.32 | 1.08 | 1.07 | 0.61 | 1.02 |
| Mini-Omni | 0.5B | 2.31 | 1.96 | 1.55 | 1.67 | 1.87 |
| SALMONN | 7B | 2.47 | 2.01 | 1.83 | 1.73 | 2.01 |
| Qwen-Audio-Chat | 7B | 2.34 | 2.46 | 1.58 | 1.93 | 2.08 |
| SpeechGPT | 7B | 3.99 | 3.56 | 1.42 | 2.25 | 2.81 |
| SALMONN | 13B | 4.07 | 3.12 | 3.25 | 1.86 | 3.08 |
| BLSP | 7B | 4.66 | 4.49 | 2.89 | 3.37 | 3.85 |
| Whisper+LLaMA-2 | 7B | 6.30 | 6.26 | 4.92 | 4.39 | 5.47 |
| Whisper+LLaMA-3 | 8B | 6.94 | 7.88 | 6.27 | 4.92 | 6.50 |
| Whisper+LLaMA-3 | 70B | 7.26 | 8.03 | 6.12 | 5.13 | 6.64 |
| Whisper+GPT-4 | - | 8.42 | 8.62 | 8.07 | 5.54 | 7.66 |
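The Average column is consistent with an unweighted mean of the four ADU-Bench sub-scores, rounded to two decimals. A minimal sketch (model names and scores copied from the table above; the aggregation rule is inferred from the numbers, not stated in this page):

```python
def row_average(scores):
    """Unweighted mean of the four sub-scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

# Two rows from the table: (General, Skill, Multilingual, Ambiguity)
rows = {
    "PandaGPT": (1.02, 0.98, 0.98, 0.50),   # reported Average: 0.87
    "SALMONN-7B": (2.47, 2.01, 1.83, 1.73),  # reported Average: 2.01
}

for model, scores in rows.items():
    print(model, row_average(scores))
```

Running this over every row reproduces the reported Average values, which supports reading the column as a simple per-model mean across the four subsets.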