Training Compute-Optimal Protein Language Models
cs.LG
cs.AI
q-bio.QM
Released Date: November 4, 2024
Authors: Xingyi Cheng¹, Bo Chen², Pan Li¹, Jing Gong¹, Jie Tang², Le Song³
Affiliations: ¹BioMap Research; ²Tsinghua University; ³BioMap Research and MBZUAI

| Dataset | Protein sequences | Tokens (amino acids) | Sampling proportion |
| --- | --- | --- | --- |
| Uniref50/S | 54M | 15.2B | 8.5% |
| Uniref90/50 | 102M | 37.8B | 19.5% |
| ColabFoldDBc | 208M | 37.7B | 19.5% |
| ColabFoldDBm | 575M | 103B | 52.5% |
| Total | 939M | 194B | - |
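
The totals in the table can be checked with a few lines of arithmetic. The sketch below is only a sanity check on the listed figures, not code from the paper. The dictionary keys (`seqs_m`, `tokens_b`, `samp_prop`) and the helper names `totals` and `token_share` are hypothetical, chosen for illustration.

```python
# Values copied from the table above: sequences in millions,
# tokens (amino acids) in billions, sampling proportion as a fraction.
# Key names are hypothetical; they are not taken from the paper.
DATASETS = {
    "Uniref50/S":   {"seqs_m": 54,  "tokens_b": 15.2,  "samp_prop": 0.085},
    "Uniref90/50":  {"seqs_m": 102, "tokens_b": 37.8,  "samp_prop": 0.195},
    "ColabFoldDBc": {"seqs_m": 208, "tokens_b": 37.7,  "samp_prop": 0.195},
    "ColabFoldDBm": {"seqs_m": 575, "tokens_b": 103.0, "samp_prop": 0.525},
}


def totals(datasets):
    """Return summed sequences (M), tokens (B), and sampling proportions."""
    seqs = sum(d["seqs_m"] for d in datasets.values())
    tokens = sum(d["tokens_b"] for d in datasets.values())
    prop = sum(d["samp_prop"] for d in datasets.values())
    return seqs, tokens, prop


def token_share(datasets):
    """Return each dataset's fraction of the total token count."""
    total = sum(d["tokens_b"] for d in datasets.values())
    return {name: d["tokens_b"] / total for name, d in datasets.items()}


if __name__ == "__main__":
    seqs, tokens, prop = totals(DATASETS)
    print(f"{seqs}M sequences, {tokens:.1f}B tokens, proportions sum to {prop:.3f}")
    for name, share in token_share(DATASETS).items():
        listed = DATASETS[name]["samp_prop"]
        print(f"{name}: token share {share:.1%}, listed proportion {listed:.1%}")
```

Run as a script, this compares each dataset's share of the token count with its listed sampling proportion. The two are close but not identical, so the listed proportions should be read as the stated sampling weights rather than as exact token fractions.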