Byte BPE Tokenization as an Inverse string Homomorphism
cs.CL
Release Date: December 4, 2024
Authors: Saibo Geng¹, Sankalp Gambhir, Chris Wendler, Robert West
Aff.: ¹EPFL, Switzerland

| Depth | String | Tokenization | Token legend (token = ID) |
| --- | --- | --- | --- |
| 0 | "" | [ 1 ] | BOS = 1 |
| 1 | "[]" | [ 1, 5159 ] | ␣[ = 518 |
| 2 | "[[]]" | [ 1, 518, 2636, 29962 ] | [] = 2636 |
| 3 | "[[[]]]" | [ 1, 5519, 2636, 5262 ] | ␣[[ = 5519 |
| 4 | "[[[[]]]]" | [ 1, 5519, 29961, 2636, 5262, 29962 ] | [ = 29961 |
| 5 | "[[[[[]]]]]" | [ 1, 5519, 8999, 2636, 5262, 5262 ] | [[ = 8999 |
| 6 | "[[[[[[]]]]]]" | [ 1, 5519, 8999, 29961, 2636, 5262, 5262, 29962 ] | ] = 29962 |
| 7 | "[[[[[[[]]]]]]]" | [ 1, 5519, 8999, 8999, 2636, 5262, 5262, 5262 ] | ]] = 5262 |
| 8 | "[[[[[[[[]]]]]]]]" | [ 1, 5519, 8999, 8999, 29961, 2636, 5262, 5262, 5262, 29962 ] | ␣[] = 5159 |
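
The table can be checked mechanically by running it in the reverse direction: map each token ID to its string and concatenate. The sketch below is illustrative only and is not the authors' code. It assumes that ID 29961 stands for a single "[" and ID 8999 for "[[", which is the only reading under which every tokenization decodes to a balanced string. It also assumes that "␣" marks a leading space added by the tokenizer and that BOS contributes no characters. The names `VOCAB`, `ROWS`, `h`, `detokenize`, and `nested` are hypothetical helpers introduced for this example.

```python
# Token legend from the table. "␣" (leading-space marker) is written as " ".
# Assumption: 29961 is "[" and 8999 is "[[", so that every row decodes to a balanced string.
VOCAB = {
    1: "",  # BOS: assumed to contribute no characters
    518: " [",
    5519: " [[",
    5159: " []",
    2636: "[]",
    29961: "[",
    8999: "[[",
    29962: "]",
    5262: "]]",
}

# Depth -> token IDs, copied from the "Tokenization" column.
ROWS = {
    0: [1],
    1: [1, 5159],
    2: [1, 518, 2636, 29962],
    3: [1, 5519, 2636, 5262],
    4: [1, 5519, 29961, 2636, 5262, 29962],
    5: [1, 5519, 8999, 2636, 5262, 5262],
    6: [1, 5519, 8999, 29961, 2636, 5262, 5262, 29962],
    7: [1, 5519, 8999, 8999, 2636, 5262, 5262, 5262],
    8: [1, 5519, 8999, 8999, 29961, 2636, 5262, 5262, 5262, 29962],
}


def h(ids):
    """Map each ID to its string and concatenate, so h(x + y) == h(x) + h(y)."""
    return "".join(VOCAB[i] for i in ids)


def detokenize(ids):
    """Apply h, then drop the single leading space (assumed tokenizer prefix)."""
    text = h(ids)
    return text[1:] if text.startswith(" ") else text


def nested(depth):
    """Return the fully nested bracket string of the given depth."""
    return "[" * depth + "]" * depth


# Every row should decode back to the nested string of its depth.
for depth, ids in ROWS.items():
    assert detokenize(ids) == nested(depth), (depth, detokenize(ids))

# Concatenating token sequences concatenates their strings.
assert h(ROWS[3][:2] + ROWS[3][2:]) == h(ROWS[3][:2]) + h(ROWS[3][2:])
```

Because `h` is plain concatenation over a fixed ID-to-string map, it is a string homomorphism from token sequences to text, and tokenization picks one sequence from the inverse image of the input string. The depth-4 row, for example, decodes as "[[" + "[" + "[]" + "]]" + "]".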