Recycled Attention: Efficient inference for long-context language models
cs.CL
Released Date: November 8, 2024
Authors: Fangyuan Xu¹, Tanya Goyal², Eunsol Choi³
Affiliations: ¹The University of Texas at Austin; ²Cornell University; ³New York University

| Method | stride | K | Llama-3.1 32K Acc | Llama-3.1 32K time (s) | Llama-3.1 64K Acc | Llama-3.1 64K time (s) | Qwen-2 32K Acc | Qwen-2 32K time (s) | Qwen-2 64K Acc | Qwen-2 64K time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | – | – | 90 | 1.71 | 82 | 2.40 | 79 | 2.55 | 57 | 4.93 |
| H2O | – | 4096 | 21 | 2.15 | 11 | 2.29 | 16 | 1.94 | 11 | 1.94 |
| StreamingLLM | – | 4096 | 22 | 1.23 | 17 | 1.21 | 21 | 1.17 | 11 | 1.19 |
| StreamingLLM++ | 50 | 4096 | 22 | 1.25 | 17 | 1.33 | 21 | 1.21 | 11 | 1.29 |
| Recycled Attention | 50 | 4096 | 63 | 1.27 | 50 | 1.38 | 32 | 1.21 | 20 | 1.29 |