Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models
cs.CV
Released Date: November 27, 2024
Authors: Jingming Liu¹, Yumeng Li¹, Boyuan Xiao¹, Yichang Jian¹, Ziang Qin¹, Tianjia Shao¹, Yao-Xiang Ding¹, Kun Zhou¹
Affiliation: ¹State Key Laboratory of CAD&CG, Zhejiang University

| Task | Setting | GPT-4o | VCoT | Ours (cursor-only) | GPT-4o Sampling | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| Dense Counting | Success Rate | 39.8% | 15.5% | 39.0%/0% | - | 85.3% |
| Dense Counting | Mean Error | 0.82 | 2.90 | 1.02/intractable | - | 0.17 |
| Dense Counting | Variance | 1.49 | 15.57 | 2.54/intractable | - | 0.22 |
| Simple Jigsaw Puzzle | 4-Piece Missing | 29.5% | 9.1% | 27.3% | 43.2% | 68.2% |
| Simple Jigsaw Puzzle | 6-Piece Missing | 9.1% | 3.3% | 24.2% | 30.3% | 51.5% |
| Object Placement | Locating | 10.9% | 10.4% | 69.4% | - | 69.4% |
| Object Placement | Placement | 3.6% | 1.5% | 27.8% | 17.3% | 37.3% |