A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness
Subjects: cs.CL; cs.AI; cs.LG
MSC classes: 68T50 (Primary), 68T07 (Secondary)
ACM classes: I.2.7
Release Date: November 4, 2024
Authors: Fali Wang¹, Zhiwei Zhang¹, Xianren Zhang¹, Zongyu Wu¹, Tzuhao Mo², Qiuhao Lu³, Wanjing Wang³, Rui Li³, Junjie Xu¹, Xianfeng Tang⁴, Qi He⁴, Yao Ma⁵, Ming Huang³, Suhang Wang¹
Affiliations: ¹The Pennsylvania State University, University Park, USA; ²University of Pennsylvania, Philadelphia, USA; ³UTHealth Houston, Houston, USA; ⁴Amazon, Palo Alto, USA; ⁵Rensselaer Polytechnic Institute, Troy, USA

| Criteria | Pruning | Knowledge Distillation | Quantization | Low-Rank Techniques |
| --- | --- | --- | --- | --- |
| Definition | Removes unneeded parameters | Transfers knowledge from a larger to a smaller model | Lowers the precision of parameters | Uses low-rank decomposition on weights |
| Goal | Reduces size and computation | Shrinks model while retaining performance | Decreases size and speeds up processing | Reduces parameters and computation |
| Method | Cuts weights or layers based on importance | Smaller model mimics larger model’s output | Converts parameters to lower precision | Decomposes matrices into smaller components |
| Advantages | Reduces size and computation significantly | Preserves performance in smaller models | Speeds up inference, less storage | Efficient, mostly preserves performance |
| Disadvantages | May reduce accuracy, irregular memory access | Resource-intensive, requires large teacher model | Potential accuracy loss, may need specific hardware | Effectiveness varies, requires rank selection |
| Model Size Impact | High reduction | Significant reduction through knowledge transfer | High, tied to precision reduction | Moderate, reduces redundant parameters |
| Performance Impact | Possible degradation if over-pruned | Maintains if well distilled | Minor to moderate loss, depends on method | Mostly retains performance, may need tuning |
| Complexity | Moderate, needs parameter evaluation | High, involves dual-model training | Moderate, varies with precision level | Moderate, requires matrix factorization |
| Use Cases | Resource-limited settings | Efficient model creation for limited resources | Fast inference needs, edge devices | When weight matrices are redundant |
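To make the four techniques in the table concrete, here is a minimal NumPy sketch (not from the paper; all shapes, thresholds, and hyperparameters below are illustrative choices) showing magnitude pruning, per-tensor 8-bit quantization, SVD-based low-rank decomposition of a toy weight matrix, and the temperature-scaled distillation loss a student would minimize:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)  # toy weight matrix

# Pruning: zero out the smallest-magnitude 50% of weights.
threshold = np.quantile(np.abs(W), 0.5)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Quantization: round weights to int8 using a single per-tensor scale.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_dequant = W_q.astype(np.float32) * scale  # approximate reconstruction

# Low-rank: keep only the top-r singular components of W.
r = 16
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_lowrank = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
# Storage: 64*r + r + r*64 values instead of 64*64 for the dense matrix.

# Knowledge distillation: KL divergence between teacher and student
# soft labels at temperature T (Hinton-style soft-target loss).
def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

T = 2.0
teacher_logits = rng.normal(size=10)
student_logits = rng.normal(size=10)
p_t = softmax(teacher_logits, T)
p_s = softmax(student_logits, T)
kd_loss = float(np.sum(p_t * (np.log(p_t) - np.log(p_s)))) * T * T

print("pruned sparsity:", float((W_pruned == 0).mean()))
print("quantization max abs error:", float(np.abs(W - W_dequant).max()))
print("distillation loss:", kd_loss)
```

Note how the sketch mirrors the table's trade-offs: pruning and quantization shrink storage directly, the low-rank factors trade a small reconstruction error for fewer parameters, and distillation needs both a teacher and a student forward pass, which is why the table marks its complexity as high.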