Microsoft's VALL-E 2 AI Voice Mimicry Is Now Indistinguishable from Human Speech

Microsoft has announced VALL-E 2, the second version of its text-to-speech AI tool. However, the company has decided not to make it available to the public because of its potential for misuse.

In January 2023, Microsoft introduced VALL-E, a text-to-speech AI tool that could imitate a voice after hearing only a short sample. The newly announced VALL-E 2 goes further, mimicking voices with much higher quality. Because the results are so convincing, Microsoft decided not to release VALL-E 2 to the public.

“Microsoft VALL-E 2 is frightening”

Text-to-speech (TTS) AI tools are not new, but VALL-E 2 is the first of its kind to match human performance in benchmark comparisons, which means the model can create very realistic voice imitations. This is precisely why Microsoft has chosen not to release it publicly. Sample recordings are available on Microsoft’s own website.

Microsoft reports that VALL-E 2 reaches human-level performance when given only a single audio sample of the target voice. The model also stays stable on sentences that are traditionally challenging because of their complexity or repetitive phrases. VALL-E 2 builds on the first model and adds two significant improvements: “Repetition Aware Sampling” and “Grouped Code Modeling.”

The first improvement, “Repetition Aware Sampling,” tracks how often a token has been repeated during decoding and adjusts the sampling step when repetition becomes excessive, which prevents the model from getting stuck in infinite loops of sounds or words. As a result, the generated speech is more stable and sounds smoother and more natural.
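
For readers who want a feel for how such a repetition check might work, here is a minimal Python sketch. It is not Microsoft's code. It assumes the decoder first picks a token with nucleus (top-p) sampling and, if that token already dominates the recent history, falls back to sampling from the full distribution. The function names, the window size and the thresholds are all hypothetical values chosen for illustration.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            max_ratio=0.5, rng=random):
    """Pick the next token, breaking out of repetition when it gets excessive.

    window and max_ratio are illustrative assumptions, not published settings.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = list(history)[-window:]
    if recent:
        # Share of the recent window already occupied by the candidate token.
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            # Too repetitive: resample from the full distribution instead.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.9, 0.05, 0.05]  # toy distribution where token 0 dominates
    history = []
    for _ in range(30):
        history.append(repetition_aware_sample(probs, history, rng=rng))
    print(history)
```

With this toy distribution, plain top-p sampling would return token 0 every time, which is the kind of loop the article describes. The fallback gives the less likely tokens a chance to appear once token 0 fills more than half of the recent window.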
