Published on October 31

GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

cs.CL
cs.AI

Release Date: October 31, 2024

Authors: Amir Hossein Kargaran¹, François Yvon², Hinrich Schütze¹

Affiliations: ¹LMU Munich & Munich Center for Machine Learning, Munich, Germany; ²Sorbonne Université & CNRS, ISIR, Paris, France

arXiv: http://arxiv.org/abs/2410.23825v1