Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition
Categories: eess.AS, cs.LG
Released Date: November 26, 2024
Authors: Hyeonseung Lee, Ji Won Yoon, Sungsoo Kim, Nam Soo Kim

| Transducer model | Attention chunk size | FoCCE network hyperparam. | FoCCE # conv. module stacks | FoCCE conv. module dim. | # param. (train) | LibriSpeech WER [%], dev clean | LibriSpeech WER [%], dev other | LibriSpeech WER [%], test clean | LibriSpeech WER [%], test other |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zipformer (non-streaming) | full | - | - | - | 25.6M | 2.42 | 5.96 | 2.54 | 6.00 |
| Zipformer (streaming) | 8 (160 ms) | - | - | - | 25.6M | 3.27 | 9.41 | 3.53 | 9.17 |
| + FoCCE (proposed) | 8 (160 ms) | 0.01 | 8 | 320 | 29.1M | 3.20 | 9.31 | 3.47 | 9.06 |
| + FoCCE (proposed) | 8 (160 ms) | 0.05 | 8 | 320 | 29.1M | 3.13 | 8.95 | 3.27 | 8.76 |
| + FoCCE (proposed) | 8 (160 ms) | 0.25 | 8 | 320 | 29.1M | 3.32 | 9.40 | 3.60 | 9.20 |
| + FoCCE (proposed) | 8 (160 ms) | 0.05 | 4 | 256 | 27.3M | 3.26 | 9.25 | 3.41 | 9.00 |
| + FoCCE (proposed) | 8 (160 ms) | 0.05 | 8 | 512 | 33.2M | 3.14 | 8.90 | 3.34 | 8.78 |

| Transducer model | Attention chunk size | | # conv. module stacks | conv. module dim. | # param. (train) | TED-LIUM3 WER [%], dev | TED-LIUM3 WER [%], test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Zipformer (non-streaming) | full | - | - | - | 25.6M | 6.46 | 5.91 |
| Zipformer (streaming) | 8 (160 ms) | - | - | - | 25.6M | 9.43 | 8.57 |
| + FoCCE (proposed) | 8 (160 ms) | 0.01 | 8 | 320 | 29.1M | 9.28 | 8.41 |
| + FoCCE (proposed) | 8 (160 ms) | 0.05 | 8 | 320 | 29.1M | 9.06 | 8.10 |
| + FoCCE (proposed) | 8 (160 ms) | 0.25 | 8 | 320 | 29.1M | 9.35 | 8.51 |
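
One way to read the tables above is in terms of relative WER reduction over the streaming Zipformer baseline. The following is a minimal sketch, not part of the paper: the helper name `relative_wer_reduction` and the choice of rows to compare (the FoCCE row with value 0.05, 8 conv. module stacks, and module dim. 320) are assumptions made here for illustration. The WER values themselves are copied from the tables.

```python
# WER values [%] copied from the tables above.
# Baseline: Zipformer (streaming), attention chunk size 8.
# FoCCE: the row with value 0.05, 8 conv. module stacks, module dim. 320.
baseline = {
    "LibriSpeech test clean": 3.53,
    "LibriSpeech test other": 9.17,
    "TED-LIUM3 test": 8.57,
}
focce = {
    "LibriSpeech test clean": 3.27,
    "LibriSpeech test other": 8.76,
    "TED-LIUM3 test": 8.10,
}


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Hypothetical helper: 100 * (baseline - new) / baseline, in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Relative reduction per test set, keyed by the same names as above.
reductions = {
    name: relative_wer_reduction(baseline[name], focce[name])
    for name in baseline
}

for name, value in reductions.items():
    print(f"{name}: {value:.1f}% relative WER reduction")
```

The same helper can be applied to any other pair of rows, for example to compare the streaming models against the non-streaming Zipformer.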