Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques
Categories: cs.LG, cs.AI, cs.CL
Release Date: November 9, 2024
| Bit-Width Configuration | Inference Latency (ms) | Energy Consumption (J) | Cost Reduction (%) |
|---|---|---|---|
| FP32 (Baseline) | 100 | 50 | 0 |
| INT8 | 60 | 30 | 40 |
| INT4 | 35 | 20 | 65 |
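To make the table easier to compare, the following minimal sketch computes percentage reductions and speedup relative to the FP32 baseline. The numbers are copied from the table; the dictionary layout and the helper names `relative_reduction` and `summarize` are illustrative assumptions, not part of the source.

```python
# Values copied from the table above; FP32 is the baseline row.
# The dictionary keys and function names are hypothetical, chosen for illustration.
results = {
    "FP32": {"latency_ms": 100, "energy_j": 50},
    "INT8": {"latency_ms": 60, "energy_j": 30},
    "INT4": {"latency_ms": 35, "energy_j": 20},
}


def relative_reduction(baseline: float, value: float) -> float:
    """Return the percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


def summarize(rows: dict, baseline_key: str = "FP32") -> dict:
    """Compute latency/energy reductions and latency speedup per configuration."""
    base = rows[baseline_key]
    summary = {}
    for name, row in rows.items():
        summary[name] = {
            "latency_reduction_pct": relative_reduction(base["latency_ms"], row["latency_ms"]),
            "energy_reduction_pct": relative_reduction(base["energy_j"], row["energy_j"]),
            "speedup": base["latency_ms"] / row["latency_ms"],
        }
    return summary


summary = summarize(results)
for name, stats in summary.items():
    print(
        f"{name}: latency -{stats['latency_reduction_pct']:.0f}%, "
        f"energy -{stats['energy_reduction_pct']:.0f}%, "
        f"speedup {stats['speedup']:.2f}x"
    )
```

Under these figures, INT8 cuts both latency and energy by 40%, while INT4 cuts latency by 65% and energy by 60%. The reported cost reductions (40% and 65%) coincide with the latency reductions in these rows; whether the source derives cost from latency is not stated here.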