Voice synthesis has come a long way in the last decade with the emergence of neural networks and end-to-end modeling. Cascaded text-to-speech (TTS) systems now commonly pair an acoustic model with a vocoder, using mel spectrograms as the intermediate representation. These systems can generate high-quality, recognizably human speech from a single speaker or an ensemble of voices, but they still require clean, studio-quality recordings. Training instead on large amounts of internet-crawled audio tends to degrade performance because the data is noisy, so usable training sets remain small. This poses a challenge in the zero-shot scenario, where speaker resemblance and speech naturalness drop significantly for unseen speakers.
To address this issue, researchers have introduced VALL-E, the first language-model-based TTS framework, which leverages the massive, varied, multi-speaker voice data available on the internet to transfer the success of large text language models to speech synthesis. Conditioned on a 3-second enrolled recording and a phoneme prompt, VALL-E generates discrete acoustic tokens and can synthesize personalized speech, including in the zero-shot setting. A neural codec decoder then reconstructs the final waveform from the acoustic tokens, so VALL-E can be thought of as doing for speech what generative models like GPT-3 do for text.
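To make the flow concrete, here is a minimal, hypothetical Python sketch of that zero-shot inference pipeline. The names `g2p`, `codec`, and `lm` stand in for a grapheme-to-phoneme converter, a neural audio codec, and the trained language model; none of them refer to an official VALL-E release or API.

```python
# Hypothetical sketch of VALL-E-style zero-shot inference; all names are illustrative.

def synthesize_zero_shot(text, enrolled_audio_3s, g2p, codec, lm):
    # 1. Convert the input text into a phoneme prompt.
    phoneme_prompt = g2p(text)
    # 2. Encode the 3-second enrolled recording into discrete acoustic tokens.
    acoustic_prompt = codec.encode(enrolled_audio_3s)
    # 3. Autoregressively generate acoustic tokens conditioned on both prompts,
    #    analogous to prompting a text language model.
    acoustic_tokens = lm.generate(phoneme_prompt, acoustic_prompt)
    # 4. Decode the generated tokens back into a waveform.
    return codec.decode(acoustic_tokens)
```

Because the speaker's identity enters only through the short acoustic prompt, the same model can, in principle, imitate voices it never saw during training.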
VALL-E was trained on LibriLight, a corpus of 60K hours of English speech from over 7,000 speakers. Because the corpus is audio-only, transcriptions were generated with a speech recognition model, so the training data contains noisier speech and more inaccurate transcriptions than earlier TTS training datasets.
Evaluations on LibriSpeech and VCTK showed that VALL-E produces realistic speech with high speaker similarity and naturalness. On LibriSpeech, VALL-E outperformed the most advanced zero-shot TTS system, with improvements of +0.12 in comparative mean opinion score (CMOS) and +0.93 in similarity mean opinion score (SMOS). On VCTK, VALL-E improved over the baseline by +0.11 SMOS and +0.23 CMOS. For unseen speakers on VCTK, the synthesized speech even achieved a +0.04 CMOS against the ground truth, indicating it is perceived as being as natural as human recordings. Additionally, VALL-E was shown to preserve the acoustic environment and the emotion of the speaker in the prompt.
Overall, VALL-E is a promising language-model-based TTS framework that offers strong in-context learning capabilities, prompt-based zero-shot TTS, and the ability to generate diverse outputs from the same input text while preserving the acoustic environment and the speaker's emotion from the acoustic prompt.