Voicebox

Text-Guided Multilingual Universal Speech Generation at Scale

About Voicebox

Meta AI researchers have developed Voicebox, a generative AI model for speech that generalizes to speech-generation tasks without task-specific training or specially prepared data. Voicebox can synthesize speech in six languages, and it can also clean up audio clips, edit content, transform speaking styles, and generate varied samples.

Before Voicebox, generative AI for speech had to be trained separately for each task on data tailored to that task. Voicebox, by contrast, requires only raw audio and its transcript. And unlike autoregressive models, which can only extend a clip from its end, Voicebox can modify any part of an audio clip.

Voicebox is based on a method called Flow Matching, a non-autoregressive generative modeling approach that has been shown to improve upon diffusion models.
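
To give a rough sense of what Flow Matching trains a model to do, here is a minimal sketch of the conditional flow matching objective with a straight-line (optimal transport) path from noise to data. It is an illustration under stated assumptions, not Voicebox's implementation. The function names `ot_path` and `cfm_loss`, the value of `SIGMA_MIN`, and the 4 x 80 feature shape are all hypothetical choices made for this example. Voicebox's actual network, its text and audio-context conditioning, and its masking scheme are not shown.

```python
import numpy as np

SIGMA_MIN = 1e-4  # assumed small constant; the width of the path around the data point


def ot_path(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return the point at time t on a straight path from noise x0 toward data x1,
    together with the (constant) velocity of that path."""
    xt = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    velocity = x1 - (1.0 - sigma_min) * x0
    return xt, velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between a model's predicted velocity and the path velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 80))    # stand-in for a few frames of audio features (assumed shape)
x0 = rng.standard_normal(x1.shape)   # noise sample with the same shape
t = rng.uniform()                    # random time in [0, 1)

xt, target = ot_path(x0, x1, t)

# Placeholder "model" that predicts zero velocity everywhere; a real model would be
# a neural network taking xt, t, and any conditioning as input.
loss = cfm_loss(np.zeros_like(xt), target)
print(loss)
```

During training, a model would be fit to minimize this loss over many random draws of `x0`, `x1`, and `t`. At generation time, the learned velocity field is integrated from noise at t = 0 to a sample at t = 1 with an ODE solver.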