Microsoft’s Unreleased AI Voice Generator Has Achieved Human Parity

Microsoft has created an advanced AI speech generator named VALL-E 2, which convincingly mimics human speech. The technology, though, is considered too dangerous for public release due to potential misuse.

What’s Happening & Why This Matters

Microsoft’s VALL-E 2 is a text-to-speech (TTS) generator that can reproduce a human voice from just a few seconds of audio. According to a paper published on 17 June on the pre-print server arXiv, VALL-E 2 achieves human parity. What does that mean? On the benchmarks the researchers used, VALL-E 2’s generated speech matched or exceeded the quality of real human speech.

Researchers tested VALL-E 2 using audio samples from the LibriSpeech and VCTK datasets and evaluated its performance with the ELLA-V framework. The results showed that VALL-E 2 surpasses previous TTS systems in speech robustness, naturalness, and speaker similarity, making it the first AI system to achieve human parity on these benchmarks.

Key Features:

Repetition Aware Sampling: This feature keeps track of tokens (the small units the model generates) that repeat during decoding, which stabilizes the conversion of text into speech and produces more natural-sounding output.
Grouped Code Modeling: This organizes the model’s codec codes into groups, shortening the sequence the model has to process and improving efficiency and speed.
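
For readers who want a feel for how these two ideas could work, here is a minimal Python sketch. It is not Microsoft's code, which has not been released. The function names, the default values for top_p, window and threshold, and the choice to fall back to sampling from the full distribution when a token repeats too often are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest set of tokens whose total mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(
    probs: Sequence[float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.9,      # hypothetical default
    window: int = 10,        # hypothetical default
    threshold: float = 0.5,  # hypothetical default
) -> int:
    """Nucleus-sample a token, then resample if it dominates the recent history."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Assumed fallback: draw from the full distribution so the
            # decoder has a chance to break out of a repetition loop.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes: Sequence[int], group_size: int) -> List[List[int]]:
    """Split a code sequence into consecutive groups, shortening it by group_size."""
    if len(codes) % group_size != 0:
        raise ValueError("sequence length must be a multiple of group_size")
    return [list(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]
```

With a group size of 2, a sequence of six codes becomes three groups, so a model working on groups would take three steps instead of six.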

Despite its impressive capabilities, VALL-E 2 will not be publicly released due to the risk of misuse, such as voice spoofing and impersonation. Other AI companies, such as OpenAI, have placed similar restrictions on their voice technologies.

Potential Applications:

Educational Tools: Enhancing learning experiences with personalized voice interactions.
Entertainment: Creating realistic voiceovers for media.
Accessibility Features: Assisting individuals with speech impairments.
Interactive Systems: Improving user experience in customer service and translation services.

The researchers emphasize the importance of ethical considerations and protocols to ensure the safe use of AI-generated speech, suggesting that any practical applications will need robust safeguards.

Deepfake technology is being used in fraud, phishing, and cybersecurity scams. Credit: keepnet

TF Summary: What’s Next

Microsoft’s development of VALL-E 2 represents a remarkable advance in AI speech generation. However, due to the potential for misuse, the technology remains confined to research. Any AI speech application requires stringent ethical protocols and consent mechanisms to ensure responsible use. As AI development flourishes, balancing innovation with safety is paramount to harnessing its full potential.