Recycled Attention: Efficient inference for long-context language models
cs.CL
Released Date: November 8, 2024
Authors: Fangyuan Xu¹, Tanya Goyal², Eunsol Choi³
Affiliations: ¹The University of Texas at Austin; ²Cornell University; ³New York University

| Method | stride | K | Llama-3.1 32K Acc | Llama-3.1 32K time (s) | Llama-3.1 64K Acc | Llama-3.1 64K time (s) | Qwen-2 32K Acc | Qwen-2 32K time (s) | Qwen-2 64K Acc | Qwen-2 64K time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | – | – | 90 | 1.71 | 82 | 2.40 | 79 | 2.55 | 57 | 4.93 |
| H2O | – | 4096 | 21 | 2.15 | 11 | 2.29 | 16 | 1.94 | 11 | 1.94 |
| StreamingLLM | – | 4096 | 22 | 1.23 | 17 | 1.21 | 21 | 1.17 | 11 | 1.19 |
| StreamingLLM++ | 50 | 4096 | 22 | 1.25 | 17 | 1.33 | 21 | 1.21 | 11 | 1.29 |
| Recycled Attention | 50 | 4096 | 63 | 1.27 | 50 | 1.38 | 32 | 1.21 | 20 | 1.29 |