From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages
Categories: cs.LG, cs.AI, cs.NE
Release date: October 24, 2024
Authors: Artur Kiulian¹, Anton Polishko¹, Mykola Khandoga¹, Yevhen Kostiuk², Guillermo Gabrielli¹, Łukasz Gagała, Fadi Zaraket³, Qusai Abu Obaida⁴, Hrishikesh Garud⁵, Wendy Wing Yee Mak⁶, Dmytro Chaplynskyi¹, Selma Belhadj Amor⁶, Grigol Peradze⁶
Affiliations: ¹ OpenBabylon; ² OpenBabylon, ARG-Tech, University of Dundee, UK; ³ Doha Institute for Graduate Studies; ⁴ Arab Center for Research and Policy Studies; ⁵ Google; ⁶ PolyAgent

| Model | GCS | NEWR | CSWR |
|---|---|---|---|
| Ukrainian | | | |
| Vanilla | 0.264 | 0.089 | 0.515 |
| Tuned | 0.388 | 0.032 | 0.002 |
| Ours | 0.503 | 0.030 | 0.001 |
| Arabic | | | |
| Vanilla | 0.040 | 0.863 | 0.450 |
| Tuned | 0.238 | 0.079 | 0.004 |
| Ours | 0.548 | 0.050 | 0.002 |