Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
cs.CV
cs.AI
Released Date: November 11, 2024
Authors: Taihang Hu¹, Linxuan Li¹, Joost van de Weijer², Hongcheng Gao¹, Fahad Shahbaz Khan³, Jian Yang¹, Ming-Ming Cheng¹, Kai Wang², Yaxing Wang¹
Affiliations: ¹VCIP, College of Computer Science, Nankai University; ²Computer Vision Center, Universitat Autònoma de Barcelona; ³Mohamed bin Zayed University of AI

| Method | Base Model | Train | BLIP-VQA Color | BLIP-VQA Texture | BLIP-VQA Shape | Human-preference Color | Human-preference Texture | Human-preference Shape | GPT-4o |
|---|---|---|---|---|---|---|---|---|---|
| SDXL [podell2023sdxl] | - | ✓ | 0.6369 | 0.5637 | 0.5408 | 0.7798 | 0.5140 | 0.4029 | 0.4907 |
| PlayG-v2 [playground-v2] | - | ✓ | 0.6208 | 0.6125 | 0.5087 | - | - | - | 0.5417 |
| Ranni [feng2023ranni] | SD1.5 | ✓ | 0.2414 | 0.3029 | 0.2857 | -0.8554 | -0.6853 | -0.8051 | 0.4166 |
| ELLA [hu2024ella] | SD1.5 | ✓ | 0.6911 | 0.6308 | 0.4938 | 0.6586 | 0.2963 | 0.0565 | 0.6481 |
| SynGen [rassin2024linguistic_binding] | SD1.5 | ✗ | 0.6619 | 0.6451 | 0.4661 | 0.4326 | 0.5072 | 0.0426 | 0.5545 |
| CoMat [jiang2024comat] | SD1.5 | ✓ | 0.6561 | 0.6190 | 0.4975 | - | - | - | - |
| Ranni [feng2023ranni] | SDXL | ✓ | 0.6893 | 0.6325 | 0.4934 | - | - | - | - |
| ELLA [hu2024ella] | SDXL | ✓ | 0.7260 | 0.6686 | 0.5634 | - | - | - | - |
| SynGen [rassin2024linguistic_binding] | SDXL | ✗ | 0.7010 | 0.6044 | 0.5069 | 1.016 | 0.7867 | 0.4016 | 0.6458 |
| CoMat [jiang2024comat] | SDXL | ✓ | 0.7774 | 0.6591 | 0.5262 | - | - | - | - |
| ToMe (Ours) | SDXL | ✗ | 0.7656 | 0.6894 | 0.6051 | 1.074 | 0.9281 | 0.5916 | 0.9549 |