Neural Network to Draw People's Faces by Voice

Speech2Face, created at MIT, was trained on short clips from the AVSpeech dataset, which contains about a million videos featuring roughly 100,000 people
27 May 2019

Scientists from MIT created the ML model Speech2Face, which generates a portrait from the spectrogram of a person's speech. It recognizes gender, age and, from the accent, ethnicity.

Real images of people, reconstructed images, and voice-based images

The model is based on the AVSpeech dataset of short clips, in which the audio and video tracks are already separated. The collection contains about a million such files, featuring roughly a hundred thousand people.

Given a short video as input, one part of the algorithm reconstructs the person's face from the frames so that it is frontal, with a neutral expression. Another part of the algorithm works with the audio track: it computes the spectrogram, recognizes the voice, and generates a portrait using a parallel neural network.
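Conceptually, the voice branch can be pictured as two networks joined by a shared face-feature vector: an encoder maps the spectrogram to that vector, and a decoder turns the vector into a frontal portrait. Below is a minimal sketch of the idea in PyTorch; all layer sizes, names, and shapes are assumptions for illustration, not MIT's actual Speech2Face architecture.

```python
# Hypothetical sketch of the two-branch idea, not MIT's actual code:
# a voice encoder maps a spectrogram to a face-feature vector, and a
# decoder turns that vector into a small frontal portrait.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    def __init__(self, feature_dim: int = 4096):
        super().__init__()
        # Convolutions over the (frequency, time) spectrogram,
        # pooled down to a fixed-size face-feature vector.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, feature_dim)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, freq_bins, time_frames)
        x = self.conv(spectrogram).flatten(1)
        return self.proj(x)

class FaceDecoder(nn.Module):
    def __init__(self, feature_dim: int = 4096):
        super().__init__()
        # Upsample the feature vector into a toy-sized RGB portrait.
        self.fc = nn.Linear(feature_dim, 128 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face_feature: torch.Tensor) -> torch.Tensor:
        x = self.fc(face_feature).view(-1, 128, 4, 4)
        return self.deconv(x)  # (batch, 3, 16, 16)

# Usage: a random tensor stands in for a real spectrogram.
spec = torch.randn(1, 1, 257, 400)
face = FaceDecoder()(VoiceEncoder()(spec))
```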

The quality check showed that the model determines gender well, but cannot yet estimate age even to within ten years. In addition, a problem with race was discovered: the algorithm performed best when drawing the faces of people of European or Asian origin. According to the researchers, this is due to the uneven distribution of races in the training set.

MelNet Algorithm to Simulate a Person's Voice

It analyzes spectrograms of audio tracks from ordinary TED Talks, captures the speaker's vocal characteristics, and reproduces short utterances
11 June 2019

The Facebook AI Research team has developed MelNet, an algorithm that synthesizes speech with the characteristics of a particular person. For example, it has learned to imitate the voice of Bill Gates.

MelNet analyzes spectrograms of audio tracks from ordinary TED Talks, captures the speaker's vocal characteristics, and reproduces short utterances.
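For a sense of the kind of input such a model works from, here is how a mel-scale spectrogram of a speech clip can be computed with librosa. The file name and parameters are illustrative assumptions, not Facebook's actual settings.

```python
# A mel-scale spectrogram of a speech recording, computed with librosa.
# "ted_talk_clip.wav" and all parameters are illustrative placeholders.
import librosa

audio, sr = librosa.load("ted_talk_clip.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)  # log scale, the usual representation
print(log_mel.shape)  # (80 mel bands, time frames)
```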

The algorithm's capabilities are limited mainly by the length of those utterances. It reproduces short phrases very close to the original. However, a person's intonation changes when speaking on different topics, in different moods, at different pitches; the algorithm cannot yet imitate this, so long sentences sound artificial.
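The length limitation is easier to see with a toy autoregressive sketch: each spectrogram frame is predicted conditioned on the frames before it, so small deviations accumulate and long utterances drift from natural intonation. The model below is a deliberately simplified stand-in, far smaller than MelNet's multi-scale architecture.

```python
# Toy autoregressive spectrogram generation (illustrative only):
# each new time frame is predicted from the hidden state summarizing
# all previously generated frames, so errors compound over long outputs.
import torch
import torch.nn as nn

frame_dim, hidden = 80, 256            # 80 mel bands per frame (assumed)
rnn = nn.GRU(frame_dim, hidden, batch_first=True)
to_frame = nn.Linear(hidden, frame_dim)

frames = [torch.zeros(1, 1, frame_dim)]  # start frame: silence
h = None
for _ in range(200):                     # generate ~200 frames
    out, h = rnn(frames[-1], h)          # condition on history via h
    frames.append(to_frame(out))         # predict the next frame

spectrogram = torch.cat(frames[1:], dim=1)  # (1, 200, 80)
```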

MIT Technology Review notes that even such an algorithm could significantly affect services like voice bots, where all communication comes down to an exchange of short remarks.

A similar approach, analyzing speech spectrograms, was used by Google AI researchers in the Translatotron system, which translates phrases from one language to another while preserving the characteristics of the speaker's voice.