Virtual voices: Azure’s neural text-to-speech service


The days of the keyboard and screen as our sole method of interacting with a computer are long gone. Now we’re surrounded by more natural user interfaces, adding touch and speech recognition to our repertoire of interactions. The same goes for how computers respond to us, using haptics and speech synthesis.

Speech is increasingly important, as it provides a hands-free and at-a-distance way of working with devices. It’s not necessary to touch them or look at them — all that’s needed are a handful of trigger words and a good speech recognition system. We’re perhaps most familiar with digital assistants like Cortana, Alexa, Siri, and Google Assistant, but speech technologies are appearing in assistive systems, in in-car applications, and in other environments where manual operations are difficult, distracting or downright dangerous.

Artificial voices for our code

The other side of the speech recognition story is, of course, speech synthesis. Computers are good at displaying text, but not very good at reading it to us. What’s needed is an easy way of taking text content and turning it into recognisable human-quality speech, not the eerie monotone of a sci-fi robot. We’re all familiar with the speech synthesis tools in automated telephony systems or in GPS apps that fail basic pronunciation tests, getting names and addresses amusingly wrong.
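To see why those systems stumble, it helps to picture the simplest possible text-to-phoneme front end: look each word up in a pronunciation lexicon, and fall back to crude letter-to-sound rules when the word is missing. The sketch below is purely illustrative; the lexicon, the `LETTER_SOUNDS` table and the `to_phonemes` function are invented for this example and don't reflect any real engine's data or API.

```python
# Toy grapheme-to-phoneme lookup (ARPAbet-style symbols), loosely
# modelling why naive TTS front ends mangle unfamiliar names.

# Tiny hand-built pronunciation lexicon -- invented for illustration.
LEXICON = {
    "the": ["DH", "AH"],
    "map": ["M", "AE", "P"],
}

# Naive one-letter-one-sound fallback for out-of-vocabulary words.
LETTER_SOUNDS = {
    "a": "AE", "b": "B", "d": "D", "e": "EH", "h": "HH", "i": "IH",
    "m": "M", "n": "N", "o": "AO", "p": "P", "s": "S", "t": "T",
}

def to_phonemes(word: str) -> list[str]:
    """Look the word up; if it's unknown, guess letter by letter."""
    word = word.lower()
    if word in LEXICON:
        return LEXICON[word]
    # This branch is where names and addresses go wrong: one sound per
    # letter, with no notion of silent letters, digraphs or stress.
    return [LETTER_SOUNDS[ch] for ch in word if ch in LETTER_SOUNDS]

if __name__ == "__main__":
    print(to_phonemes("the"))      # known word: clean lexicon entry
    print(to_phonemes("siobhan"))  # unknown name: letter-by-letter guess
```

Running it on a name like "Siobhan" shows the failure mode: the fallback spells out the letters phonetically instead of producing anything like the real pronunciation, which is exactly the class of error GPS apps and phone trees make.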

High-quality speech synthesis isn’t easy. If you take the standard approach, mapping text to strings of phonemes, the result is often stilted and prone to mispronunciation. What’s more disconcerting is that…