Do you think you could tell the difference between a human and a machine speaking? If you’re familiar with the voices of old-school text-to-speech software (like Microsoft’s Sam, Mike, and Mary) or even those of Siri and Alexa, you’re bound to answer with a resounding yes. But if you’ve heard Google’s latest text-to-speech AI, Tacotron 2, you might not feel so confident. Google unveiled the system in late December 2017, along with a paper comparing Tacotron 2’s voice to that of a real human. And according to the paper’s authors, it’s almost impossible to distinguish between the two. To find out for yourself, make sure you check out the Tacotron 2 sound samples here before you keep reading.
TACOTRON 2: THE LATEST IN TEXT-TO-SPEECH AI
Now that you’ve heard the samples of Google’s Tacotron 2, you’re probably astounded by just how realistic they sound. The system, developed by Google’s in-house engineers, consists of two deep neural networks that together translate text into speech. The first network turns the text into a mel spectrogram, a visual representation of how the speech should sound over time. That spectrogram is then fed into the second network, WaveNet, which reads it and produces the corresponding audio.
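To make that two-stage design concrete, here is a minimal sketch of the pipeline’s shape in Python. The class names, frame counts, and placeholder outputs are all invented for illustration; this is not Google’s implementation or API, just the flow of data between the two stages:

```python
import numpy as np

class SpectrogramPredictor:
    """Stand-in for Tacotron 2's first network: text -> mel spectrogram."""
    def __call__(self, text: str) -> np.ndarray:
        n_frames = max(1, len(text))      # one frame per character, purely for illustration
        return np.zeros((80, n_frames))   # 80 mel bands is a common choice

class Vocoder:
    """Stand-in for WaveNet: mel spectrogram -> raw audio samples."""
    def __call__(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        # Silence here; a real vocoder predicts actual waveform samples.
        return np.zeros(mel.shape[1] * hop_length)

text = "Hello from Tacotron 2."
mel = SpectrogramPredictor()(text)   # stage 1: text to a visual/spectral representation
audio = Vocoder()(mel)               # stage 2: spectrogram to waveform
print(mel.shape, audio.shape)
```

The key design point is the division of labor: the first network only has to decide *what* the speech should look like spectrally, while the vocoder handles the much harder job of producing realistic raw audio.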
DEVELOPMENTS IN SPEECH SYNTHESIS
While speech recognition has come a long way in recent years (just look at Google Voice Search or Apple’s Siri), text-to-speech technology has lagged behind. For years, text-to-speech relied on so-called concatenative systems: a library of short speech fragments recorded from a real human speaker, which were then stitched together to form sentences (a toy sketch of this approach follows below). While these systems worked, they made it very hard to replicate the intricacies of human speech, such as emphasis or emotion; to capture those details, the entire sound library would have to be re-recorded from scratch. For a long time, the only alternative was parametric text-to-speech. Parametric systems made it possible to control the contents and characteristics of speech using specific inputs, but they tended to sound far less natural. WaveNet, the system behind Google’s Tacotron 2, takes a fundamentally different approach to synthesizing speech.
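Here is the toy illustration of the concatenative idea mentioned above: look up pre-recorded fragments and splice them end to end. The fragment names and arrays are invented stand-ins for real recordings:

```python
import numpy as np

# In a real system: thousands of recorded speech units from one speaker.
fragment_library = {
    "hello": np.ones(8000),          # placeholder "audio" arrays
    "world": np.ones(8000) * 0.5,
}

def concatenate(words):
    # Splice fragments end-to-end. Real systems also smooth the joins,
    # which is why changing emphasis or emotion means re-recording the
    # whole library rather than tweaking a parameter.
    return np.concatenate([fragment_library[w] for w in words])

audio = concatenate(["hello", "world"])
print(audio.shape)
```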
WAVENET: REVOLUTIONIZING TEXT-TO-SPEECH AI
WaveNet was developed by DeepMind, an AI company based in the UK, and the science behind it is genuinely complex. According to DeepMind, WaveNets are first trained on sound waveforms recorded from real human speakers. Once trained, the system can generate new, synthetic utterances: it uses what it has learned to predict each next step in the waveform, ultimately producing rich, natural-sounding audio. Using Google’s existing text-to-speech datasets, researchers at DeepMind tested WaveNet against Google’s best existing speech synthesis systems, both parametric and concatenative. The results were expressed as Mean Opinion Scores (MOS), a standard measurement in audio testing in which listeners rate samples on a scale of 1 to 5 (a quick sketch of the calculation follows the results below). When synthesizing US English, the systems scored as follows:
- Human speech: 4.55
- WaveNet: 4.21
- Concatenative: 3.86
- Parametric: 3.67
The researchers at DeepMind conducted the same tests in Mandarin Chinese, producing the following results:
- Human speech: 4.21
- WaveNet: 4.08
- Parametric: 3.79
- Concatenative: 3.47
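For reference, a Mean Opinion Score is simply the arithmetic mean of all listener ratings on that 1-to-5 scale. Here is a minimal sketch of the computation; the ratings below are invented purely to illustrate it, not taken from the study:

```python
# MOS = arithmetic mean of listener ratings on a 1-5 scale.
ratings = {
    "system_a": [4, 5, 4, 4, 5, 4],   # hypothetical per-listener scores
    "system_b": [4, 4, 3, 4, 4, 4],
}
for system, scores in ratings.items():
    print(f"{system}: MOS = {sum(scores) / len(scores):.2f}")
```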
WaveNet differs from other speech synthesis systems in several ways. To know what to say, WaveNet has to be given text that has been transformed into a sequence of linguistic and phonetic features describing the syllables, words, and other sounds it is supposed to produce. Without this information the system still works, but it has to make up what to say, and it usually produces a stream of random, babble-like sounds with the occasional real word thrown in. Because the system models raw audio directly, WaveNet can also reproduce natural non-speech sounds like breathing and mouth movements. Interestingly, WaveNet can be taught to replicate all kinds of audio, not just speech. For example, researchers at DeepMind trained the system on classical piano music rather than a human speaker. The result? Fascinating samples of AI-improvised piano. You can read more about WaveNet on DeepMind’s website.
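WaveNet’s published design is built around stacks of dilated causal convolutions that predict the audio one sample at a time. The PyTorch sketch below shows that core idea in heavily simplified form; the layer sizes, channel counts, and greedy sampling step are illustrative choices, not DeepMind’s actual architecture:

```python
import torch
import torch.nn as nn

class TinyWaveNet(nn.Module):
    def __init__(self, channels=32, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.input = nn.Conv1d(1, channels, kernel_size=1)
        # Dilated convolutions let the receptive field grow exponentially
        # with depth, so each prediction can "see" far into the past.
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations
        )
        self.dilations = dilations
        # 256 output classes: 8-bit mu-law quantization of the next sample.
        self.output = nn.Conv1d(channels, 256, kernel_size=1)

    def forward(self, x):
        h = self.input(x)
        for conv, d in zip(self.layers, self.dilations):
            # Left-pad only: "causal" means each output depends solely on
            # past samples, never future ones.
            h = torch.relu(conv(nn.functional.pad(h, (d, 0))))
        return self.output(h)  # logits over the next sample's quantized value

model = TinyWaveNet()
waveform = torch.zeros(1, 1, 1024)        # a silent "seed" signal
logits = model(waveform)                  # one distribution per time step
next_sample = logits[..., -1].argmax(-1)  # greedy pick of the next audio sample
print(logits.shape, next_sample.item())
```

Generation repeats this loop: append the predicted sample to the waveform and predict again, which is exactly why raw-audio models can capture details like breaths that fragment-based systems miss.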
TAKE THE TEST: CAN YOU SPOT THE BOT?
Now that you know how Google’s Tacotron 2 works, it’s time to take the test: do you think you can tell Tacotron 2 apart from a real human speaker? To take the test, follow this link and scroll to the last set of audio samples, titled “Tacotron 2 or Human?” You’ll find a total of 8 samples: 4 recorded by a human speaker and 4 generated by Tacotron 2. Can you spot the bot? Once you’ve listened, scroll down for the answer to which samples were produced by Tacotron 2.
ANSWERS
So, which of the above samples came from a human? Well, Google hasn’t said. However, they’ve left a big clue: if you download the files, you’ll notice that some of the file names contain the term “gen” while others contain the code “gt,” which in machine-learning parlance usually stands for “ground truth.” While we can’t be certain, Google’s paper suggests that the files labelled “gen” were generated by Tacotron 2, while those labelled “gt” came from a human. Assuming that’s correct, here are the answers to the above test (a short script for checking the filenames yourself follows the answers):
“That girl did a video about Star Wars lipstick.”
- Sample 1: Real human
- Sample 2: Tacotron 2
“She earned a doctorate in sociology at Columbia University.”
- Sample 1: Tacotron 2
- Sample 2: Real human
“George Washington was the first President of the United States.”
- Sample 1: Tacotron 2
- Sample 2: Real human
“I’m too busy for romance.”
- Sample 1: Real human
- Sample 2: Tacotron 2
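If you’d rather check the filenames yourself, here is a quick script that sorts downloaded samples by their tags. It assumes the “gen”/“gt” naming convention described above and a hypothetical download folder named samples:

```python
from pathlib import Path

def classify(path: Path) -> str:
    # "gt" likely = ground truth (human); "gen" likely = generated (Tacotron 2).
    name = path.name.lower()
    if "gen" in name:
        return "Tacotron 2 (generated)"
    if "gt" in name:
        return "Real human (ground truth)"
    return "unknown"

for wav in sorted(Path("samples").glob("*.wav")):  # hypothetical folder name
    print(wav.name, "->", classify(wav))
```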