CompCap: Improving Multimodal Large Language Models with Composite Captions
cs.CV
cs.AI
cs.LG
Released Date: December 6, 2024
Authors: Xiaohui Chen^1, Satya Narayan Shukla^2, Mahmoud Azab^2, Aashu Singh^2, Qifan Wang^2, David Yang^2, ShengYun Peng^3, Hanchao Yu^2, Shen Yan^2, Xuewen Zhang^2, Baosheng He^2
Affiliations: ^1 Meta, Tufts University; ^2 Meta; ^3 Meta, Georgia Tech

| Category | Metadata | Image Simulator(s) | Caption Composition | # Samples | Avg. Char. |
| --- | --- | --- | --- | --- | --- |
| Collage | Image-Caption & Layout | OpenCV (Bradski, 2000) / PIL (Clark, 2015) | LGC* | 50K | 913 |
| Image-Text | Image-Caption & Text & Layout | OpenCV / PIL / Augraphy (Groleau et al., 2023) | Text + LGC / Text | 37K | 221 |
| Chart | (Geo) Tabular data | Plotly (Inc., 2015) | LGC | 22K | 1468 |
| Diagram | Mermaid diagram code | Mermaid (Knsv, 2024) & Selenium | LGC | 3K | 2044 |
| Code | Code snippet | Carbon (Carbon, 2024) & Selenium | Code snippet + LGC | 2K | 1106 |
| Table | Tabular data | Matplotlib (Hunter, 2007) | Markdown table + LGC | 4K | 928 |
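
The Table row pairs tabular metadata with a caption composed as "Markdown table + LGC". As a rough illustration of that composition step only, the sketch below turns a header and rows into a Markdown table and appends a free-text description. The function names `to_markdown_table` and `compose_table_caption` and the `description` argument are hypothetical and not taken from the paper; rendering the image with Matplotlib and producing the LGC text are both out of scope here, so the description is passed in as a plain string.

```python
def to_markdown_table(header, rows):
    """Render a header and rows as a pipe-delimited Markdown table."""
    def fmt(cells):
        # Escape pipes so cell text cannot break the column structure.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    if any(len(r) != len(header) for r in rows):
        raise ValueError("every row must have as many cells as the header")

    lines = [fmt(header), fmt(["---"] * len(header))]
    lines.extend(fmt(r) for r in rows)
    return "\n".join(lines)


def compose_table_caption(header, rows, description=""):
    """Hypothetical composition: Markdown table, then an optional description.

    `description` stands in for the LGC part of the caption; how that text
    is produced is not modeled in this sketch.
    """
    table = to_markdown_table(header, rows)
    return table if not description else table + "\n\n" + description
```

For example, `compose_table_caption(["City", "Population"], [["Oslo", 700000]], "One city is listed.")` yields a three-line Markdown table followed by a blank line and the description. Keeping the table serialization separate from the description makes it easy to swap in whatever produces the descriptive text.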