Microsoft’s Unreleased AI Voice Generator Has Achieved Human Parity

Z Patel

Microsoft has created an advanced AI speech generator named VALL-E 2 that convincingly mimics human speech. The technology, however, is considered too dangerous for public release because of its potential for misuse.

What’s Happening & Why This Matters

Microsoft’s VALL-E 2 is a text-to-speech (TTS) generator capable of reproducing a human voice from just a few seconds of audio. According to a paper published on 17 June on the pre-print server arXiv, VALL-E 2 achieves human parity. What does that mean? In the researchers’ benchmarks, VALL-E 2’s generated speech is indistinguishable from real human speech.

Researchers tested VALL-E 2 using audio samples from the LibriSpeech and VCTK datasets and evaluated its performance with the ELLA-V framework. The results showed that VALL-E 2 surpasses previous TTS systems in speech robustness, naturalness, and speaker similarity, making it the first AI system to achieve human parity on these benchmarks.

  • Repetition Aware Sampling: This feature refines how the model picks each next unit of audio by taking into account units it has recently repeated, which stabilizes decoding and produces more natural-sounding speech.
  • Grouped Code Modeling: This organizes the audio codes into groups so the model works through a shorter sequence, improving efficiency and speed.
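
To make the two ideas above more concrete, here is a minimal Python sketch. It is not Microsoft's code: the function names, the default values for `top_p`, `window`, and `threshold`, and the use of plain integer tokens are all assumptions chosen for illustration. The sampling function draws a token with ordinary nucleus (top-p) sampling, then falls back to sampling from the full distribution if that token already dominates the recent history. The grouping function simply chunks a code sequence so that a model would take fewer steps.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample an index from the smallest set of top tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


def repetition_aware_sample(probs, history, top_p=0.8, window=10,
                            threshold=0.1, rng=None):
    """Illustrative sketch: nucleus sampling with a repetition check.

    The defaults are hypothetical values, not settings from the paper.
    """
    rng = rng or random.Random()
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        # Share of the recent window already occupied by the candidate token.
        ratio = recent.count(token) / len(recent)
        if ratio > threshold:
            # Too repetitive: resample from the full distribution instead.
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size):
    """Chunk a flat code sequence into groups, shortening the sequence length."""
    return [codes[i:i + group_size] for i in range(0, len(codes), group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.6, 0.4]
    print(repetition_aware_sample(probs, history=[0] * 10, top_p=0.5, rng=rng))
    print(group_codes([11, 12, 13, 14, 15], group_size=2))
```

In this toy setup, a history full of token 0 triggers the fallback, which gives token 1 a chance of being picked even though nucleus sampling alone would never select it. Grouping five codes in pairs yields three groups, so a model stepping through groups would need three steps instead of five.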

Despite its impressive capabilities, VALL-E 2 will not be publicly released because of the risk of misuse, such as voice spoofing and impersonation. Other AI companies, such as OpenAI, have imposed similar restrictions on their voice technologies.

Potential Applications:

  • Educational Tools: Enhancing learning experiences with personalized voice interactions.
  • Entertainment: Creating realistic voiceovers for media.
  • Accessibility Features: Assisting individuals with speech impairments.
  • Interactive Systems: Improving user experience in customer service and translation services.

The researchers emphasize the importance of ethical considerations and protocols to ensure the safe use of AI-generated speech, suggesting that any practical applications will need robust safeguards.

Deepfake technology is being used in fraud, phishing, and cybersecurity scams. Credit: keepnet

TF Summary: What’s Next

Microsoft’s development of VALL-E 2 marks a remarkable advance in AI speech generation. However, because of the potential for misuse, the technology remains confined to research. Any AI speech application requires stringent ethical protocols and consent mechanisms to ensure responsible use. As AI development flourishes, balancing innovation with safety is paramount to harnessing its full potential.

By Z Patel “TF AI Specialist”
Zara ‘Z’ Patel stands as a beacon of expertise in the field of digital innovation and Artificial Intelligence. Holding a Ph.D. in Computer Science with a specialization in Machine Learning, Z has worked extensively in AI research and development. Her career includes tenure at leading tech firms where she contributed to breakthrough innovations in AI applications. Z is passionate about the ethical and practical implications of AI in everyday life and is an advocate for responsible and innovative AI use.