
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity

cs.CL

Release Date: December 3, 2024

Authors: Da Ma¹, Lu Chen¹, Situo Zhang¹, Yuxun Miao¹, Su Zhu², Zhi Chen³, Hongshen Xu¹, Hanqi Li¹, Shuai Fan², Lei Pan², Kai Yu¹

Affiliations: ¹X-LANCE Lab, Shanghai Jiao Tong University; ²AISpeech Co., Ltd.; ³ByteDance

arXiv: http://arxiv.org/pdf/2412.02252v1