Communication Compression for Tensor Parallel LLM Inference
Categories: cs.LG, cs.AI, cs.CL
Published: November 14, 2024
Authors: Jan Hansen-Palmus¹, Michael Truong-Le, Oliver Hausdörfer², Alok Verma¹
Affiliations: ¹Recogni; ²Technical University of Munich
| Model | Sub-variant | Value Dtype | Block Size | Bits | FP16 Perplexity | Perplexity Increase |
|---|---|---|---|---|---|---|
| Llama 3.1 | 8B | FP4 | 8 | 4.6 | 7.22 | 3.22% |
| Llama 3.1 | 70B | FP5 | 32 | 5.2 | 3.86 | 1.68% |
| Gemma 2 | 2B | FP5 | 32 | 5.2 | 14.27 | 1.39% |
| Gemma 2 | 9B | FP4 | 32 | 4.2 | 10.40 | 1.83% |
| Mistral | 7B | FP4 | 32 | 4.2 | 5.23 | 1.18% |
| Mistral | 22B | FP4 | 8 | 4.6 | 4.02 | 1.62% |
| Mistral | 123B | FP5 | 32 | 5.2 | 2.65 | 0.48% |
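The "Bits" column is consistent with block-wise quantization in which each block of values shares one scale field, so the per-value cost is the value width plus the amortized shared scale. A minimal sketch of that accounting, assuming a 5-bit shared scale (the scale width is an assumption, not stated in the table):

```python
def effective_bits(value_bits: int, block_size: int, scale_bits: int = 5) -> float:
    """Per-value storage cost under block quantization: each value stores
    its own mantissa/exponent bits, and one scale field of `scale_bits`
    is shared across `block_size` values (5-bit scale is an assumption)."""
    return value_bits + scale_bits / block_size

# Matches the "Bits" column above when rounded to one decimal:
print(round(effective_bits(4, 8), 1))   # FP4, block size 8  -> 4.6
print(round(effective_bits(5, 32), 1))  # FP5, block size 32 -> 5.2
print(round(effective_bits(4, 32), 1))  # FP4, block size 32 -> 4.2
```

Under this reading, larger blocks amortize the shared scale more aggressively (4.2 vs. 4.6 bits for FP4), trading coarser scaling granularity for a smaller payload per communicated activation.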