Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining
cs.SD
cs.MM
eess.AS
Released Date: December 10, 2024
Authors: Rui Zhou, Akinori Ito, Takashi Nose

| System | BLEU (ES-EN) | BLEU (FR-EN) | MOS (ES-EN) | MOS (FR-EN) | Similarity (ES-EN) | Similarity (FR-EN) | Inference time, FR-EN (s/utt) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade System | | | | | | | |
| S2UT [21] | 18.01 | 24.02 | | | / | / | 0.796 |
| S2UT + FreeVC [17] | 17.68 | 23.53 | | | 0.581 | 0.592 | 1.574 |
| ASR + MT + SpeakerTTS [3] | 21.65 | 20.18 | | | 0.652 | 0.664 | 2.956 |
| End-to-End System | | | | | | | |
| SC-S2UT [32] | 16.10 | 21.68 | | | 0.609 | 0.611 | 0.813 |
| Style-S2UT [5] | 16.30 | 22.00 | | | * | 0.73 | ** |
| Ours | | | | | | | |
| Embedding SC-S2UT | | | | | | | |
| ES | 16.93 | 22.41 | | | 0.667 | 0.671 | 0.864 |
| ES + Enhance | 16.84 | 21.73 | | | 0.655 | 0.663 | 0.911 |
| FR | 16.86 | 22.37 | | | 0.670 | 0.682 | 0.864 |
| FR + Enhance | 16.79 | 21.31 | | | 0.661 | 0.677 | 0.911 |
| Pretrain SC-S2UT | | | | | | | |
| ES | 17.24 | 22.43 | | | 0.629 | 0.613 | 0.813 |
| ES + Enhance | 17.12 | 22.15 | | | 0.615 | 0.595 | 0.859 |
| FR | 17.16 | 22.82 | | | 0.574 | 0.621 | 0.813 |
| FR + Enhance | 16.75 | 22.44 | | | 0.575 | 0.619 | 0.859 |
| Ground Truth | 88.64 | 80.29 | | | 0.677 | 0.687 | 6.432 |