VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
cs.CV
cs.AI
Released Date: November 27, 2024
Authors: Donggoo Kang¹, Dasol Jeong¹, Hyunmin Lee¹, Sangwoo Park¹, Hasil Park¹, Sunkyu Kwon¹, Yeongjoon Kim¹, Joonki Paik¹
Affiliation: ¹Chung-Ang University, Seoul, Korea

| Method | Default (Full) | Default (Rare) | Default (Non-Rare) | Known Object (Full) | Known Object (Rare) | Known Object (Non-Rare) |
|---|---|---|---|---|---|---|
| IDN[24] | 24.58 | 20.33 | 25.86 | 27.89 | 23.64 | 29.16 |
| HOTR[16] | 25.10 | 17.34 | 27.42 | - | - | - |
| HOI-Trans[57] | 26.61 | 19.15 | 28.84 | 29.13 | 20.98 | 31.57 |
| QPIC[37] | 29.07 | 21.85 | 31.23 | 31.68 | 24.14 | 33.93 |
| MSTR[17] | 31.17 | 25.31 | 32.92 | 34.02 | 28.83 | 35.57 |
| CDN[49] | 32.07 | 27.19 | 33.53 | 34.79 | 29.48 | 36.38 |
| STIP[53] | 32.22 | 28.15 | 33.43 | 35.29 | 31.43 | 36.45 |
| UPT[51] | 32.62 | 28.62 | 33.81 | 36.08 | 31.41 | 37.47 |
| MUREN[18] | 32.87 | 28.67 | 34.12 | 35.52 | 30.88 | 36.91 |
| GEN-VLKT[26] | 33.75 | 29.25 | 35.10 | 36.78 | 32.75 | 37.99 |
| VLM-HOI | 34.25 | 30.22 | 35.20 | 36.88 | 33.30 | 37.75 |