Published on December 4
RedStone: Curating General, Code, Math, and QA Data for Large Language Models
cs.CL
Released Date: December 4, 2024
Authors: Yaoyao Chang¹, Lei Cui, Li Dong, Shaohan Huang, Yangyu Huang, Yupan Huang, Scarlett Li, Tengchao Lv, Shuming Ma, Qinzheng Sun, Wenhui Wang, Furu Wei, Ying Xin, Mao Yang, Qiufeng Yin, Xingxing Zhang
Affiliation: ¹Microsoft Research

| Datasets | ARC-C | ARC-E | HellaSwag | OpenBookQA | PIQA | Winogrande | AVERAGE |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RedPajama | 0.2270 | 0.4386 | 0.3171 | 0.1900 | 0.5968 | 0.5296 | 0.3832 |
| FineWeb | 0.1928 | 0.4428 | 0.3506 | 0.1740 | 0.6681 | 0.5288 | 0.3929 |
| RefinedWeb | 0.2125 | 0.4369 | 0.3380 | 0.2100 | 0.6491 | 0.5264 | 0.3955 |
| DCLM | 0.2159 | 0.4848 | 0.3614 | 0.1760 | 0.6615 | 0.5082 | 0.4013 |
| FineWeb-Edu | 0.2722 | 0.5648 | 0.3637 | 0.1940 | 0.6676 | 0.5051 | 0.4279 |
| RedStone-Web | 0.2662 | 0.5181 | 0.3722 | 0.2340 | 0.6795 | 0.5162 | 0.4310 |
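
The AVERAGE column appears to be the unweighted mean of the six benchmark scores in each row. The sketch below recomputes it from the values in the table as a sanity check; the names `SCORES`, `check_averages`, and `best_dataset` are illustrative and not from the source, and the tolerance is an assumption chosen to absorb four-decimal rounding.

```python
# Scores copied from the table above: six benchmark accuracies per dataset
# (ARC-C, ARC-E, HellaSwag, OpenBookQA, PIQA, Winogrande), followed by the
# AVERAGE value as reported in the table.
SCORES = {
    "RedPajama":    ([0.2270, 0.4386, 0.3171, 0.1900, 0.5968, 0.5296], 0.3832),
    "FineWeb":      ([0.1928, 0.4428, 0.3506, 0.1740, 0.6681, 0.5288], 0.3929),
    "RefinedWeb":   ([0.2125, 0.4369, 0.3380, 0.2100, 0.6491, 0.5264], 0.3955),
    "DCLM":         ([0.2159, 0.4848, 0.3614, 0.1760, 0.6615, 0.5082], 0.4013),
    "FineWeb-Edu":  ([0.2722, 0.5648, 0.3637, 0.1940, 0.6676, 0.5051], 0.4279),
    "RedStone-Web": ([0.2662, 0.5181, 0.3722, 0.2340, 0.6795, 0.5162], 0.4310),
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def check_averages(scores, tol=1e-4):
    """Return {dataset: (recomputed_mean, reported_mean, matches)}.

    `tol` is an assumed tolerance for rounding to four decimals.
    """
    result = {}
    for name, (per_task, reported) in scores.items():
        recomputed = mean(per_task)
        result[name] = (recomputed, reported, abs(recomputed - reported) < tol)
    return result


def best_dataset(scores):
    """Name of the dataset with the highest recomputed mean."""
    return max(scores, key=lambda name: mean(scores[name][0]))


if __name__ == "__main__":
    for name, (recomputed, reported, ok) in check_averages(SCORES).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:<13} recomputed={recomputed:.4f} reported={reported:.4f} {status}")
    print("highest average:", best_dataset(SCORES))
```

Under this reading, every reported average agrees with the recomputed mean to within rounding, and RedStone-Web has the highest average even though FineWeb-Edu leads on ARC-C and ARC-E.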