MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
cs.DC
cs.AI
Released Date: November 18, 2024
Authors: Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica

| System | Synthetic Reasoning, S1: Throughput | | | Synthetic Reasoning, S2: Throughput | | | Summarization, S1: Throughput | | | Summarization, S2: Throughput | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FlexGen(c) | 16.903 | 32 | 61 | 20.015 | 64 | 33 | 2.614 | 3 | 92 | 4.307 | 8 | 36 |
| FlexGen | 22.691 | 32 | 61 | 50.138 | 64 | 33 | 3.868 | 3 | 92 | 7.14 | 8 | 36 |
| DeepSpeed | 11.832 | 102 | 1 | 18.589 | 156 | 1 | 0.965 | 8 | 1 | 1.447 | 12 | 1 |
| MoE-Lightning(p) | 26.349 | 36 | 26 | 105.29 | 100 | 15 | 4.52 | 4 | 19 | 12.393 | 8 | 36 |