SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration
Categories: cs.AI, cs.CV, cs.NE, cs.PF
Released Date: November 17, 2024
Authors: Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen
Affiliation: Tsinghua University

| Model | Attention | WikiText (Ppl.) | Lambda (Acc.) | MMLU (Acc.) | LongBench |
|---|---|---|---|---|---|
| Llama2 | Full-Precision | 5.823 | 0.886 | 0.439 | - |
| Llama2 | HadmdAttn | 6.706 | 0.865 | 0.355 | - |
| Llama2 | SmoothAttn | 6.690 | 0.871 | 0.395 | - |
| Llama2 | SageAttn2-4b | 6.018 | 0.886 | 0.436 | - |
| Llama2 | SageAttn2-mix | 5.883 | 0.883 | 0.431 | - |
| Llama3.1 | Full-Precision | 6.013 | 0.815 | 0.635 | 49.40 |
| Llama3.1 | HadmdAttn | 7.661 | 0.756 | 0.502 | 44.62 |
| Llama3.1 | SmoothAttn | 7.087 | 0.788 | 0.551 | 43.76 |
| Llama3.1 | SageAttn2-4b | 6.219 | 0.808 | 0.617 | 48.61 |
| Llama3.1 | SageAttn2-mix | 6.131 | 0.816 | 0.629 | 49.01 |
| GLM4 | Full-Precision | 7.241 | 0.432 | 0.743 | 49.78 |
| GLM4 | HadmdAttn | 7.932 | 0.435 | 0.676 | 46.27 |
| GLM4 | SmoothAttn | 8.841 | 0.442 | 0.599 | 43.10 |
| GLM4 | SageAttn2-4b | 7.341 | 0.435 | 0.732 | 49.06 |
| GLM4 | SageAttn2-mix | 7.303 | 0.434 | 0.737 | 49.77 |
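The "plug-and-play" claim in the title means the quantized kernel is meant to stand in for a standard attention call at inference time, without retraining, which is how the quantized rows in the table above are produced relative to the Full-Precision baseline. A minimal sketch of such a swap is below, assuming the authors' public `sageattention` package and its `sageattn` entry point; the exact argument names and availability may differ by release, so treat this as an illustration rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

# Assumption: the authors' released package exposes a fused quantized
# attention kernel named `sageattn`; fall back to full precision otherwise.
try:
    from sageattention import sageattn
    HAVE_SAGE = True
except ImportError:
    HAVE_SAGE = False


def attention(q, k, v, is_causal=False):
    """q, k, v: (batch, heads, seq_len, head_dim) half-precision CUDA tensors."""
    if HAVE_SAGE and q.is_cuda:
        # Quantized path: drop-in replacement for the attention call,
        # corresponding to the SageAttn2 rows in the table above.
        return sageattn(q, k, v, tensor_layout="HND", is_causal=is_causal)
    # Full-precision baseline (the "Full-Precision" rows in the table above).
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)


if __name__ == "__main__":
    q = torch.randn(1, 32, 1024, 128, dtype=torch.float16, device="cuda")
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = attention(q, k, v, is_causal=True)
    print(out.shape)
```

Because the kernel is swapped only at the attention call site, the same checkpoint and evaluation harness can be reused for every row of a given model, so the metric gaps reflect only the attention quantization scheme.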