
Published on November 26

Towards Maximum Likelihood Training for Transducer-based Streaming Speech Recognition

Categories: eess.AS, cs.LG

Released Date: November 26, 2024

Authors: Hyeonseung Lee, Ji Won Yoon, Sungsoo Kim, Nam Soo Kim

arXiv: http://arxiv.org/abs/2411.17537v1


Transducer model	Attention chunk size	FoCCE $\lambda_{\gamma}$	$\mathrm{CausalEncoder}_{\omega}^{\chi}(\cdot)$ conv. module: # module stacks	$\mathrm{CausalEncoder}_{\omega}^{\chi}(\cdot)$ conv. module: module dim.	# param. (train)	LibriSpeech WER [%]: dev clean	LibriSpeech WER [%]: dev other	LibriSpeech WER [%]: test clean	LibriSpeech WER [%]: test other
Zipformer (non-streaming)	full	-	-	-	25.6M	2.42	5.96	2.54	6.00
Zipformer (streaming)	8 (160 ms)	-	-	-	25.6M	3.27	9.41	3.53	9.17
+ FoCCE (proposed)	8 (160 ms)	0.01	8	320	29.1M	3.20	9.31	3.47	9.06
+ FoCCE (proposed)	8 (160 ms)	0.05	8	320	29.1M	3.13	8.95	3.27	8.76
+ FoCCE (proposed)	8 (160 ms)	0.25	8	320	29.1M	3.32	9.40	3.60	9.20
+ FoCCE (proposed)	8 (160 ms)	0.05	4	256	27.3M	3.26	9.25	3.41	9.00
+ FoCCE (proposed)	8 (160 ms)	0.05	8	512	33.2M	3.14	8.90	3.34	8.78

Transducer model	Attention chunk size	$\lambda_{r}$	$\mathrm{CausalEncoder}_{\omega}^{\chi}(\cdot)$ conv. module: # module stacks	$\mathrm{CausalEncoder}_{\omega}^{\chi}(\cdot)$ conv. module: module dim.	# param. (train)	TED-LIUM3 WER [%]: dev	TED-LIUM3 WER [%]: test
Zipformer (non-streaming)	full	-	-	-	25.6M	6.46	5.91
Zipformer (streaming)	8 (160 ms)	-	-	-	25.6M	9.43	8.57
+ FoCCE (proposed)	8 (160 ms)	0.01	8	320	29.1M	9.28	8.41
+ FoCCE (proposed)	8 (160 ms)	0.05	8	320	29.1M	9.06	8.10
+ FoCCE (proposed)	8 (160 ms)	0.25	8	320	29.1M	9.35	8.51