Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset
cs.CL
Released Date: December 3, 2024
Authors: Dan Su, Kezhi Kong, Ying Lin, Joseph Jennings, Brandon Norick, Markus Kliegl, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Affiliation: NVIDIA

| Dataset | ARC-E | ARC-C | HellaSwag | Winogrande | RACE | PIQA | SIQA | CSQA | OBQA | MMLU | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FineWebEdu-2 | 71.9 | 44.7 | 75.4 | 67.0 | 36.8 | 79.5 | 45.2 | 25.5 | 43.8 | 42.4 | 53.2 |
| FineWebEdu | 73.6 | 48.0 | 70.7 | 64.6 | 38.0 | 76.4 | 43.5 | 30.0 | 44.4 | 42.9 | 53.2 |
| DCLM | 74.7 | 47.0 | 76.3 | 69.1 | 36.5 | 79.7 | 45.6 | 44.1 | 44.0 | 53.4 | 57.0 |
| Nemotron-CC | 75.3 | 50.7 | 75.9 | 67.8 | 37.9 | 80.5 | 45.1 | 47.7 | 44.2 | 53.0 | 57.8 |
| Nemotron-CC-HQ | 78.8 | 52.9 | 76.6 | 69.4 | 36.4 | 80.1 | 46.6 | 55.8 | 45.4 | 59.0 | 60.1 |
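
The Avg column appears to be the unweighted mean of the ten benchmark scores in each row. The sketch below checks that reading against the table; the names `scores`, `reported_avg`, and `mean_score` are illustrative and not taken from the paper, and equal weighting of the benchmarks is an assumption.

```python
# Scores copied from the table above, one list per dataset,
# covering the ten benchmark columns in table order (Avg excluded).
scores = {
    "FineWebEdu-2":   [71.9, 44.7, 75.4, 67.0, 36.8, 79.5, 45.2, 25.5, 43.8, 42.4],
    "FineWebEdu":     [73.6, 48.0, 70.7, 64.6, 38.0, 76.4, 43.5, 30.0, 44.4, 42.9],
    "DCLM":           [74.7, 47.0, 76.3, 69.1, 36.5, 79.7, 45.6, 44.1, 44.0, 53.4],
    "Nemotron-CC":    [75.3, 50.7, 75.9, 67.8, 37.9, 80.5, 45.1, 47.7, 44.2, 53.0],
    "Nemotron-CC-HQ": [78.8, 52.9, 76.6, 69.4, 36.4, 80.1, 46.6, 55.8, 45.4, 59.0],
}

# Avg column as reported in the table.
reported_avg = {
    "FineWebEdu-2": 53.2,
    "FineWebEdu": 53.2,
    "DCLM": 57.0,
    "Nemotron-CC": 57.8,
    "Nemotron-CC-HQ": 60.1,
}


def mean_score(values):
    """Unweighted arithmetic mean (assumed aggregation, equal weight per benchmark)."""
    return sum(values) / len(values)


for name, values in scores.items():
    computed = mean_score(values)
    print(f"{name}: computed {computed:.1f}, reported {reported_avg[name]}")
```

Under this reading, every row's computed mean matches the reported Avg to one decimal place. From the table values, Nemotron-CC-HQ leads DCLM by 3.1 points on average and by 5.6 points on MMLU.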