Published on November 27
GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
cs.CV
Released Date: November 27, 2024
Authors: Pengfei Zhou, Xiaopeng Peng, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang

| Method | Human: FDT | Human: w/o Tie | Human: w/ Tie (0) | Human: w/ Tie (.5) | GPT: FDT | GPT: w/o Tie | GPT: w/ Tie (0) | GPT: w/ Tie (.5) | IntJudge: FDT | IntJudge: w/o Tie | IntJudge: w/ Tie (0) | IntJudge: w/ Tie (.5) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | 83.28% | 86.03% | 68.17% | 78.55% | 82.49% | 82.69% | 82.03% | 82.43% | 87.46% | 91.49% | 75.49% | 84.23% |
| GPT-4o+DALL-E3 | 78.42% | 81.39% | 65.21% | 75.15% | 85.70% | 85.99% | 85.58% | 85.82% | 85.02% | 86.92% | 72.22% | 80.68% |
| Gemini1.5+Flux | 65.57% | 65.82% | 49.31% | 61.85% | 71.75% | 71.76% | 71.12% | 71.56% | 68.30% | 69.73% | 54.47% | 65.41% |
| SEED-X | 51.98% | 49.49% | 34.70% | 49.65% | 54.82% | 55.12% | 54.11% | 55.03% | 49.86% | 49.58% | 33.57% | 49.72% |
| Anole | 51.90% | 52.17% | 36.46% | 51.52% | 53.36% | 53.13% | 52.58% | 53.10% | 53.42% | 52.04% | 33.92% | 51.33% |
| SEED-LLaMA | 44.30% | 42.12% | 29.11% | 44.56% | 40.96% | 40.87% | 40.46% | 40.96% | 50.13% | 47.71% | 31.57% | 48.48% |
| Emu2 | 40.89% | 37.07% | 23.42% | 41.84% | 41.72% | 41.63% | 40.58% | 41.85% | 36.28% | 33.79% | 21.87% | 39.51% |
| Show-o | 36.28% | 34.02% | 21.63% | 39.84% | 30.77% | 30.22% | 29.61% | 30.62% | 31.49% | 21.08% | 12.48% | 32.87% |
| NExT-GPT | 33.67% | 26.93% | 17.09% | 35.36% | 22.61% | 22.39% | 22.11% | 22.74% | 30.96% | 21.70% | 13.36% | 32.58% |
| MiniGPT-5 | 30.69% | 26.72% | 17.11% | 35.09% | 28.64% | 28.37% | 28.02% | 28.64% | 24.47% | 15.46% | 9.91% | 27.85% |
| GILL | 25.80% | 19.57% | 12.71% | 30.23% | 30.55% | 30.24% | 29.65% | 30.62% | 24.87% | 19.72% | 12.82% | 30.32% |
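
The column names in the table are not defined in this excerpt. A minimal sketch of one plausible reading, in which each score is a pairwise win rate and the three tie variants differ only in how tied comparisons are counted, might look like the following. The function name `win_rates` and the example counts are hypothetical, and the formulas are an assumption, although they agree with the reported numbers (for instance, the Human row under human evaluation: 68.17% and 78.55% imply a tie share of about 20.8%, which gives 68.17 / 79.24 ≈ 86.03% once ties are excluded). FDT is left out because the excerpt gives no rule for how ties are resolved under it.

```python
import math


def win_rates(wins: int, ties: int, losses: int) -> dict[str, float]:
    """Compute win rates under three assumed tie-handling conventions.

    Assumed definitions (not stated in the excerpt):
      w/o Tie     : ties are dropped, wins / (wins + losses)
      w/ Tie (0)  : a tie counts as 0 wins, wins / total
      w/ Tie (.5) : a tie counts as half a win, (wins + 0.5 * ties) / total
    """
    total = wins + ties + losses
    if total <= 0:
        raise ValueError("at least one comparison is required")
    decided = wins + losses
    return {
        # Undefined when every comparison ended in a tie.
        "w/o Tie": wins / decided if decided else math.nan,
        "w/ Tie (0)": wins / total,
        "w/ Tie (.5)": (wins + 0.5 * ties) / total,
    }


# Hypothetical counts for one model across 100 pairwise comparisons.
rates = win_rates(wins=60, ties=20, losses=20)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Under this reading, "w/ Tie (0)" can never exceed "w/ Tie (.5)", and the gap between them is half the tie share, which is why the spread between those two columns is much wider for the human and IntJudge evaluations than for the GPT evaluation, where ties appear to be rare.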