Published on December 9

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

Categories: cs.CL, cs.AI, cs.LG, cs.MM, cs.SD, eess.AS

Released Date: December 9, 2024

Authors: Tianxin Xie¹, Yan Rong¹, Pengfei Zhang¹, Li Liu¹

Aff.: ¹Hong Kong University of Science and Technology (Guangzhou)

arXiv: http://arxiv.org/pdf/2412.06602v1

Method	Modeling	Code	Year
VQ-Wav2Vec [164]	SSCP	https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec#vq-wav2vec	2019
Wav2Vec 2.0 [165]	SSCP	https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec	2020
HuBERT [166]	SSCP	https://github.com/facebookresearch/fairseq/tree/main/examples/hubert	2021
W2v-BERT 2.0 [167]	SSCP	https://huggingface.co/facebook/w2v-bert-2.0	2023
SoundStream [168]	RVQGAN	https://github.com/wesbz/SoundStream	2021
EnCodec [169]	RVQGAN	https://github.com/facebookresearch/encodec	2022
HiFi-Codec [170]	RVQGAN	https://github.com/yangdongchao/AcademiCodec	2023
SpeechTokenizer [171]	RVQGAN	https://github.com/ZhangXInFD/SpeechTokenizer	2023
Descript Audio Codec [172]	RVQGAN	https://github.com/descriptinc/descript-audio-codec	2023
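
The RVQGAN entries in the table build on residual vector quantization (RVQ): a vector is quantized by a stack of codebooks, and each stage encodes the residual left over by the stage before it. The following is a minimal sketch of that one step only, assuming tiny hand-written codebooks. The names `nearest`, `rvq_encode`, `rvq_decode`, and `CODEBOOKS` are hypothetical and do not come from any of the repositories listed above; real codecs learn their codebooks jointly with an encoder, a decoder, and adversarial losses.

```python
def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual of the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next stage only sees what is still unexplained.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy two-stage codebooks (illustrative values, not learned): a coarse stage and a fine stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)
reconstruction = rvq_decode(codes, CODEBOOKS)
print(codes, reconstruction)
```

Decoding with only the first code gives a coarser reconstruction than decoding with both, which is why codecs of this kind can trade bitrate for quality by varying the number of codebooks used.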