Flash Communication: Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference
cs.AI
Released Date: December 6, 2024
Authors: Qingyuan Li¹, Bo Zhang, Liang Ye, Yifan Zhang, Wei Wu, Yerui Sun, Lin Ma, Yuchen Xie
Aff.: ¹Meituan

| Method | Ring All-Reduce | Flash All-Reduce |
|---|---|---|
| Total Volume | | |
| Reduce Step | | |
| Reduce-Scatter | | |
| Gather Step | | |
| All-Gather | | |
| QDQ Step | | |
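
For orientation, the Ring All-Reduce column refers to the standard two-phase collective: a reduce-scatter followed by an all-gather. The sketch below is a minimal single-process simulation of that textbook algorithm, not the paper's Flash All-Reduce, and it does not reconstruct the table values missing above. The function name `ring_all_reduce` and its return tuple are hypothetical, chosen only for illustration. It assumes the buffer length is divisible by the number of devices.

```python
def ring_all_reduce(buffers):
    """Simulate a textbook ring all-reduce (sum) over per-device buffers.

    `buffers` is a list of equal-length lists, one per simulated device.
    Returns (reduced_buffers, steps, elements_sent_per_device).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into n chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies
    steps = 0
    sent = 0  # elements each device sends in total

    def sl(c):
        # Slice covering chunk index c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1, reduce-scatter: n - 1 steps. In step s, rank r sends chunk
    # (r - s) mod n to rank r + 1, which adds it into its own copy.
    for s in range(n - 1):
        outgoing = [data[r][sl((r - s) % n)] for r in range(n)]  # snapshot
        for r in range(n):
            dst, c = (r + 1) % n, (r - s) % n
            data[dst][sl(c)] = [a + b for a, b in zip(data[dst][sl(c)], outgoing[r])]
        steps += 1
        sent += chunk
    # Rank r now holds the fully reduced chunk (r + 1) mod n.

    # Phase 2, all-gather: n - 1 steps. In step s, rank r forwards chunk
    # (r + 1 - s) mod n to rank r + 1, which overwrites its own copy.
    for s in range(n - 1):
        outgoing = [data[r][sl((r + 1 - s) % n)] for r in range(n)]  # snapshot
        for r in range(n):
            dst, c = (r + 1) % n, (r + 1 - s) % n
            data[dst][sl(c)] = outgoing[r]
        steps += 1
        sent += chunk

    return data, steps, sent


# Usage: 4 simulated devices, 8 elements each.
result, steps, sent = ring_all_reduce([[r + 1] * 8 for r in range(4)])
```

In this simulation each device sends one chunk per step, so for N devices and a buffer of M elements the per-device traffic is 2(N − 1)·M/N elements over 2(N − 1) steps.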