In 2013, Warner Bros. Pictures released a movie called Her. It tells the story of a man who falls in love with a new operating system. The system is named Samantha, and it is present as nothing more than an earpiece in his ear. The only way to interact with it is by voice. But how can a person fall in love with a voice, knowing that it belongs to an artificial speech assistant?
It’s all about voice …
One of the reasons may lie in the way humans perceive voice. Voice perception works on different layers (linguistic and paralinguistic) and is influenced by a number of factors. The linguistic layer carries the semantic information and the phonetic representation of speech. Let’s take a closer look at the paralinguistic side. The acoustic signal contains markers that help us build an image of the person we are speaking to. In most cases, just by hearing somebody’s voice we can detect the person’s gender, mood and emotions. Scientists even say that we are able to judge physical characteristics such as the height, weight and racial group of the person we are speaking to [1]. Interestingly, we can extract all of this information without knowing the language the person is speaking, relying only on the vocal side of speech.
Adequately modelling the vocal side of human speech is a great challenge for the speech synthesis industry. To sound natural, artificial voice interfaces should be able to convey the different shades of vocal information that speech carries.
Convolutional Neural Networks (CNNs) are one possible way to approach the problem. The raw audio signal is taken as input, and the output is synthesized one sample at a time. A number of hidden layers filter each element of the input. The system generates 16,000 samples per second, and the prediction for each sample is influenced by all previous samples.
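The sample-by-sample idea can be illustrated with a toy sketch. It is not the actual network: `predict_next` is a hypothetical stand-in that replaces the trained layers with a fixed weighted sum, and the weights, seed values and dilation pattern (1, 2, 4, ..., 512 with a kernel size of 2) are assumptions chosen only for illustration. What it shows is the feedback loop, where every new sample is computed from samples that already exist, and how a stack of dilated causal layers widens the window of past samples that a prediction can see.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, as stated in the text


def receptive_field(dilations, kernel_size=2):
    """Number of past samples visible to a stack of dilated causal layers."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


def predict_next(history, weights):
    """Toy stand-in for the network: a weighted sum of the most recent samples.

    Only samples that already exist are read, which is what makes it causal.
    """
    recent = history[-len(weights):]
    # Align the newest sample with the last weight.
    used = weights[-len(recent):]
    value = sum(w * x for w, x in zip(used, recent))
    return math.tanh(value)  # keep the output in [-1, 1]


def generate(seed, weights, n_samples):
    """Autoregressive loop: each output is appended and fed back as input."""
    signal = list(seed)
    for _ in range(n_samples):
        signal.append(predict_next(signal, weights))
    return signal[len(seed):]


if __name__ == "__main__":
    dilations = [2 ** i for i in range(10)]  # assumed pattern: 1, 2, 4, ..., 512
    print(receptive_field(dilations))        # past samples seen per prediction
    one_second = generate(
        seed=[0.1, -0.2, 0.3],
        weights=[0.5, -0.3, 0.8],
        n_samples=SAMPLE_RATE,
    )
    print(len(one_second))                   # one sequential step per sample
```

Because each step has to wait for the previous one, one second of audio requires 16,000 sequential predictions, which gives a feeling for why generation of this kind is computationally expensive.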
The proposed technology, DeepMind’s WaveNet, has the advantage that the voices sound more natural and pleasant. At the beginning, WaveNet was a bit slow and could only be used in research projects. Later, however, the DeepMind team improved the technology and made it 1,000x faster. The new WaveNet is used in the Google Assistant, and according to Google Research the gap with human performance has been reduced by more than 70%. For now, Google Cloud Text-to-Speech provides access to more than 50 WaveNet voices.
[1] Kreiman, J. (1997) Listening to voices: theory and practice in voice perception research. In Talker Variability in Speech Research (Johnson, K. and Mullenix, J., eds), pp. 85–108, Academic Press.