Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models
Categories: cs.CL, cs.AI
Released Date: November 10, 2024
Authors: Sultan Alrashed (1), Dmitrii Khizbullin (2), David R. Pugh
Affiliations: (1) Saudi Data & Artificial Intelligence Authority (SDAIA); (2) King Abdullah University of Science & Technology (KAUST)

| Parameter | Value |
|---|---|
| Translation model | facebook/nllb-200-distilled-600M |
| Total passages | 189,405,457 (189M) |
| Number of shards per language | 1813 |
| Total JSONL size, Ar | 937 GB |
| Total JSONL size, En | 844 GB |
| Total zip size, Ar | 267 GB |
| Total zip size, En | 311 GB |
| GPT-2 tokens, Ar | 551,094,680,322 (551B) |
| GPT-2 tokens, En | 192,121,988,036 (192B) |
| NLLB tokenizer tokens, Ar | 202,379,768,558 (202B) |
| NLLB tokenizer tokens, En | 210,765,476,860 (210B) |
| GPU resources | 24 A100 GPUs for 20 days (480 GPU-days) |
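
The figures in the table support a few derived quantities that make the tokenizer and compute costs easier to compare. The following is a minimal sketch of that arithmetic, not part of the paper: the `STATS` dictionary and the `derived_metrics` helper are hypothetical names introduced here, and every number is copied directly from the table.

```python
# Figures copied from the table above; names are illustrative only.
STATS = {
    "passages": 189_405_457,
    "shards_per_language": 1813,
    "jsonl_gb": {"ar": 937, "en": 844},
    "zip_gb": {"ar": 267, "en": 311},
    "gpt2_tokens": {"ar": 551_094_680_322, "en": 192_121_988_036},
    "nllb_tokens": {"ar": 202_379_768_558, "en": 210_765_476_860},
    "gpus": 24,
    "days": 20,
}


def derived_metrics(stats: dict) -> dict:
    """Compute simple ratios from the published dataset statistics."""
    gpu_days = stats["gpus"] * stats["days"]
    return {
        # Total compute budget in GPU-days.
        "gpu_days": gpu_days,
        # How many more tokens Arabic needs than English, per tokenizer.
        "gpt2_ar_en_ratio": stats["gpt2_tokens"]["ar"] / stats["gpt2_tokens"]["en"],
        "nllb_ar_en_ratio": stats["nllb_tokens"]["ar"] / stats["nllb_tokens"]["en"],
        # Average passage length in tokens, per tokenizer and language.
        "gpt2_tokens_per_passage": {
            lang: n / stats["passages"] for lang, n in stats["gpt2_tokens"].items()
        },
        "nllb_tokens_per_passage": {
            lang: n / stats["passages"] for lang, n in stats["nllb_tokens"].items()
        },
        # Average shard size and translation throughput.
        "passages_per_shard": stats["passages"] / stats["shards_per_language"],
        "passages_per_gpu_day": stats["passages"] / gpu_days,
        # JSONL-to-zip size ratio for each language.
        "compression_ratio": {
            lang: stats["jsonl_gb"][lang] / stats["zip_gb"][lang]
            for lang in stats["jsonl_gb"]
        },
    }


if __name__ == "__main__":
    for name, value in derived_metrics(STATS).items():
        print(f"{name}: {value}")
```

Under these numbers, the GPT-2 tokenizer spends roughly 2.87 times as many tokens on the Arabic text as on the English text, while the NLLB tokenizer yields about 0.96 Arabic tokens per English token. This is consistent with GPT-2's vocabulary being English-centric and NLLB's being multilingual.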