Microsoft’s New AI Can Simulate Anyone’s Voice From a 3-Second Sample

Helene Schmid


Microsoft researchers have announced a new application that uses artificial intelligence to ape a person’s voice with just seconds of training. The model of the voice can then be used for text-to-speech applications.

The application, called VALL-E, can synthesize high-quality personalized speech using only a three-second enrollment recording of a speaker as an acoustic prompt, the researchers wrote in a paper published online on arXiv, a free, open-access archive for scholarly articles.

Existing programs can already cut and paste speech into an audio stream, converting typed text into speech in a speaker’s voice. However, such a program must first be trained to emulate the person’s voice, which can take an hour or more.

“One of the standout things about this model is it does that in a matter of seconds. That’s very impressive,” Ross Rubin, the principal analyst at Reticle Research, a consumer technology advisory firm in New York City, told Businsiders.

According to the researchers, VALL-E significantly outperforms existing state-of-the-art text-to-speech (TTS) systems in both speech naturalness and speaker similarity.

Moreover, VALL-E can preserve a speaker’s emotions and acoustic environment. So if a speech sample were recorded over a phone, for example, speech synthesized in that voice would sound as if it were being read over a phone.

‘Super Impressive’

VALL-E is a noticeable improvement over previous state-of-the-art systems, such as YourTTS, released in early 2022, said Giacomo Miceli, a computer scientist and creator of a website with an AI-generated, never-ending discussion featuring the synthetic speech of Werner Herzog and Slavoj Žižek.

“What is interesting about VALL-E is not just the fact that it needs only three seconds of audio to clone a voice, but also how closely it can match that voice, the emotional timbre, and any background noise,” Miceli told Businsiders.

Ritu Jyoti, group vice president for AI and automation at IDC, a global market research company, called VALL-E “significant and super impressive.”

“This is a significant improvement over previous models, which require a much longer training period to generate a new voice,” Jyoti told Businsiders.

“It is still the early days for this technology, and more improvements are expected to have it sound more human-like,” she added.

Emotion Emulation Questioned

Unlike OpenAI, which has made ChatGPT publicly available, Microsoft hasn’t opened VALL-E to the public, so questions remain about its performance. For example, are there factors that could degrade the speech produced by the application?

“The longer the audio snippet generated, the higher the chances that a human would hear things that sound a little bit off,” Miceli observed. “Words may be unclear, missed, or duplicated in speech synthesis.”

“It’s also possible that switching between emotional registers would sound unnatural,” he added.

The application’s ability to emulate a speaker’s emotions also has skeptics. “It will be interesting to see how robust that capability is,” said Mark N. Vena, president and principal analyst at SmartTech Research in San Jose, Calif.

“The fact that they claim it can do that with simply a few seconds of audio is difficult to believe,” he continued, “given the current limitations of AI algorithms, which require much longer voice samples.”

Ethical Concerns

Experts see beneficial applications for VALL-E, as well as some not-so-beneficial ones. Jyoti cited speech editing and replacing voice actors. Miceli noted the technology could be used to create editing tools for podcasters and to customize the voice of smart speakers, and that it could be incorporated into messaging systems, chat rooms, video games, and even navigation systems.

“The other side of the coin is that a malicious user could clone the voice of, say, a politician and have them say things that sound preposterous or inflammatory, or in general to spread out false information or propaganda,” Miceli added.

Vena sees enormous abuse potential in the technology if it’s as good as Microsoft claims. “At the financial services and security level, it’s not difficult to conjure up use cases by nefarious actors that could do really damaging things,” he said.

Jyoti, too, sees ethical concerns bubbling around VALL-E. “As the technology advances, the voices generated by VALL-E and similar technologies will become more convincing,” she explained. “That would open the door to realistic spam calls replicating the voices of real people that a potential victim knows.”

“Politicians and other public figures could also be impersonated,” she added.

“There could be potential security concerns,” she continued. “For example, some banks allow voice passwords, which raises concerns about misuse. We could expect an arms race escalation between AI-generated content and AI-detecting software to stop abuse.”

“It is important to note that VALL-E is currently not available,” Jyoti added. “Overall, regulating AI is critical. We’ll have to see what measures Microsoft puts in place to regulate the use of VALL-E.”

Enter the Lawyers

Legal issues may also arise around the technology. “Unfortunately, there may not be current, sufficient legal tools in place to directly tackle such issues, and instead, a hodgepodge of laws that cover how the technology is abused may be used to curtail such abuse,” said Michael L. Teich, a principal in Harness IP, a national intellectual property law firm.

“For example,” he continued, “voice cloning may result in a deepfake of a real person’s voice that may be used to trick a listener to succumb to a scam or may even be used to mimic the voice of an electoral candidate. While such abuses would likely raise legal issues in the fields of fraud, defamation, or election misinformation laws, there is a lack of specific AI laws that would tackle the use of the technology itself.”

“Further, depending on how the initial voice sample was obtained, there may be implications under the federal Wiretap Act and state wiretap laws if the voice sample was obtained over, for example, a telephone line,” he added.

“Lastly,” Teich noted, “in limited circumstances, there may be First Amendment concerns if such voice cloning was to be used by a governmental actor to silence, delegitimize or dilute legitimate voices from exercising their free speech rights.”

“As these technologies mature, there may be a need for specific laws to directly address the technology and prevent its abuse as the technology advances and becomes more accessible,” he said.

Making Smart Investments

In recent weeks, Microsoft has been making AI headlines. It’s expected to incorporate ChatGPT technology into its Bing search engine this year and possibly into its Office apps. It’s also reportedly planning to invest $10 billion in OpenAI. And now comes VALL-E.

“I think they’re making a lot of smart investments,” said Bob O’Donnell, founder and chief analyst of Technalysis Research, a technology market research and consulting firm in Foster City, Calif.

“They jumped on the OpenAI bandwagon several years ago, so they’ve been behind the scenes on this for quite a while. Now it’s coming out in a big way,” O’Donnell told Businsiders.

“They’ve had to play catch-up with Google, who’s known for its AI, but Microsoft is making some aggressive moves to come to the forefront,” he continued. “They’re jumping on the popularity and the incredible coverage that all these things have been getting.”

Rubin added, “Microsoft, having been the leader in productivity in the last 30 years or so, wants to preserve and extend that lead. AI could hold the key to that.”

BusInsiders