CMU Researchers Unveil AI System for Human-like Text-to-Speech Training with Diverse Speech

Are you looking for a way to make your Artificial Intelligence (AI) system sound more human-like? In this blog post, we will discuss a new technology called MQTTS (multi-codebook vector quantized TTS) that may be the answer you’re looking for. This research paper shows that MQTTS can produce a more natural sound than traditional text-to-speech systems.

Recent developments in deep learning have significantly improved the quality of synthesized speech produced by neural Text-to-Speech (TTS) systems. However, most standard corpora used for training TTS systems consist of read or acted speech recorded under controlled conditions. Humans, on the other hand, speak spontaneously, with varied prosody that conveys paralinguistic information such as subtle emotion. This skill comes from exposure to many hours of real-world speech.

The Use of Real-World Speech

Systems trained effectively on real-world speech can draw on the virtually limitless supply of utterances in the wild, which suggests that TTS systems trained on real-world data are a step toward human-level AI. In this study, the researchers investigate using real-world speech gathered from YouTube and podcasts for TTS. Although the ultimate objective is to transcribe real-world speech with an ASR system, here they simplify the setting by using a corpus of already-transcribed speech and concentrating on TTS.


They introduce a technology called MQTTS (multi-codebook vector quantized TTS). To determine the characteristics required for real-world speech synthesis, they compare it against mel-spectrogram-based systems in Section 5 and conduct ablation analyses. They further contrast MQTTS with non-autoregressive approaches, demonstrating that their autoregressive MQTTS achieves better intelligibility and speaker transferability, significantly greater prosody diversity, and somewhat higher naturalness. Non-autoregressive models, however, remain superior in computation speed and robustness.
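To make the "multi-codebook" idea concrete, here is a minimal NumPy sketch of quantizing one speech-feature frame with several codebooks: the frame is split into groups, and each group is replaced by its nearest code from that group's codebook. All sizes here are toy values chosen for illustration, not the configuration used in the MQTTS paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 codebooks, each holding 160 codes of dimension 64.
# (Illustrative sizes only, not the paper's actual configuration.)
n_codebooks, codebook_size, code_dim = 4, 160, 64
codebooks = rng.normal(size=(n_codebooks, codebook_size, code_dim))

def quantize(frame):
    """Split a feature frame into groups and quantize each group with its
    own codebook; return the code indices and the reconstructed frame."""
    groups = frame.reshape(n_codebooks, code_dim)
    indices, quantized = [], []
    for cb, g in zip(codebooks, groups):
        # Nearest-neighbour search: pick the code closest to this group.
        dists = np.linalg.norm(cb - g, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        quantized.append(cb[idx])
    return indices, np.concatenate(quantized)

frame = rng.normal(size=n_codebooks * code_dim)  # one 256-dim feature frame
indices, q_frame = quantize(frame)
print(indices)        # one code index per codebook
print(q_frame.shape)  # (256,)
```

A sequence model can then predict these discrete indices autoregressively, which is what gives a multi-codebook TTS its compact, discrete representation of speech.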


In conclusion, MQTTS is a great way to make your AI system sound more natural and human-like. It produces more natural speech than traditional text-to-speech systems and, given a clean, quiet prompt, achieves a significantly better signal-to-noise ratio (SNR). Check out the paper and GitHub for more details. Don’t forget to join our 14k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
