GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages
cs.CL
cs.AI
Released Date: October 31, 2024
Authors: Amir Hossein Kargaran¹, François Yvon², Hinrich Schütze¹
Affiliations: ¹LMU Munich & Munich Center for Machine Learning, Munich, Germany; ²Sorbonne Université & CNRS, ISIR, Paris, France
![[Uncaptioned image]](https://arxiv.org/html/2410.23825v1/x4.png)
| Argument | Description | Value |
| --- | --- | --- |
| -minCount | Minimal number of word occurrences | 1000 |
| -minCountLabel | Minimal number of label occurrences | 0 |
| -wordNgrams | Max length of word ngram | 1 |
| -bucket | Number of buckets | 10^6 |
| -minn | Min length of char ngram | 2 |
| -maxn | Max length of char ngram | 5 |
| -loss | Loss function | softmax |
| -dim | Size of word vectors | 256 |
| -epoch | Number of epochs | 1 |
| -lr | Learning rate | 0.8 |
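
The argument names in the table match the options of the fastText `supervised` command-line tool. Assuming that is the tool being configured, the following minimal sketch shows how the listed values could be assembled into a training command. The file names `train.txt` and `lid_model` and the helper `build_command` are hypothetical, and the bucket count is taken to be 10^6 (one million).

```python
# Hyperparameters copied from the table above.
# Assumption: the bucket count is 10^6 (one million).
HYPERPARAMS = {
    "minCount": 1000,      # minimal number of word occurrences
    "minCountLabel": 0,    # minimal number of label occurrences
    "wordNgrams": 1,       # max length of word n-gram
    "bucket": 10**6,       # number of hash buckets
    "minn": 2,             # min length of char n-gram
    "maxn": 5,             # max length of char n-gram
    "loss": "softmax",     # loss function
    "dim": 256,            # size of word vectors
    "epoch": 1,            # number of epochs
    "lr": 0.8,             # learning rate
}


def build_command(train_path, model_prefix, params=HYPERPARAMS):
    """Return an argument list for a `fasttext supervised` invocation.

    `train_path` and `model_prefix` are placeholders chosen by the caller.
    """
    cmd = ["fasttext", "supervised", "-input", train_path, "-output", model_prefix]
    for name, value in params.items():
        cmd += [f"-{name}", str(value)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no
    # dependency on a local fastText installation.
    print(" ".join(build_command("train.txt", "lid_model")))
```

Building the command as a list rather than a single string makes it straightforward to pass to `subprocess.run` without shell quoting issues, should the command actually be executed.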