Voice Assistant to Recognize Voiceless Commands

The technology, based on a neural network, can be used in public places without the risk of disturbing others
22 October 2018

Researchers at Tsinghua University have developed a voice assistant for smartphones that recognizes commands from the user's lip movements. The technology can be used in public places without the risk of disturbing others.

Yuanchun Shi and colleagues presented a paper at the UIST 2018 conference describing the technology for recognizing lip movements and translating them into text. The voice assistant uses the front camera and a convolutional neural network. The algorithm tracks 20 control points that accurately describe the shape of the lips and also determines how open the user's mouth is, which lets the system detect where a command begins and ends. A second algorithm then decodes this data into text. For now, all of the computation runs separately on a powerful PC.
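The paper does not ship code with this article, but the segmentation idea can be illustrated. Below is a minimal sketch, assuming lip landmarks arrive as 20 (x, y) points per video frame; the openness ratio and the threshold value are hypothetical choices for illustration, not parameters from the published system.

```python
import numpy as np

# Assumed threshold on the mouth height-to-width ratio; illustrative only.
OPENNESS_THRESHOLD = 0.25

def mouth_openness(landmarks: np.ndarray) -> float:
    """landmarks: array of shape (20, 2) with lip control points."""
    width = landmarks[:, 0].max() - landmarks[:, 0].min()
    height = landmarks[:, 1].max() - landmarks[:, 1].min()
    return height / width if width > 0 else 0.0

def segment_command(frames: list[np.ndarray]) -> tuple[int, int] | None:
    """Return (start_frame, end_frame) of the first detected command."""
    open_flags = [mouth_openness(f) > OPENNESS_THRESHOLD for f in frames]
    if not any(open_flags):
        return None
    start = open_flags.index(True)
    end = len(open_flags) - 1 - open_flags[::-1].index(True)
    return start, end
```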

Recognition is limited to a fixed set of 44 commands, which cover both individual applications and specific functions, such as turning Wi-Fi on and off. System-wide tasks are also supported, such as replying to a message or highlighting text.
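A fixed command set like this maps naturally onto a lookup table on the phone side. The sketch below is a guess at that structure; the command labels and handlers are invented for illustration, since the article does not list the actual 44 commands.

```python
from typing import Callable

# Hypothetical labels produced by the recognizer, mapped to actions.
COMMANDS: dict[str, Callable[[], None]] = {
    "wifi_on":  lambda: print("Turning Wi-Fi on"),
    "wifi_off": lambda: print("Turning Wi-Fi off"),
    "reply":    lambda: print("Opening reply box for the last message"),
    "select":   lambda: print("Highlighting the current text"),
}

def dispatch(label: str) -> None:
    """Run the action for a recognized command label, if it is known."""
    action = COMMANDS.get(label)
    if action is None:
        print(f"Unrecognized command: {label!r}")
    else:
        action()

dispatch("wifi_on")
```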

The developers claim an average recognition accuracy of 95.5%, based on training and testing with the speech of 21 people. Trials conducted in the Beijing subway showed that users found this input method more comfortable.

So far, the developers have not said when the application will be released. However, as long as recognition requires a powerful computer, a release is unlikely to happen soon, or the system will need a permanent network connection.

MelNet Algorithm to Simulate Person's Voice

It analyzes the spectrograms of the audio tracks of ordinary TED Talks, notes the speaker's speech characteristics, and reproduces short utterances
11 June 2019

The Facebook AI Research team has developed MelNet, an algorithm that synthesizes speech with the characteristics of a particular person's voice. For example, it has learned to imitate the voice of Bill Gates.

MelNet analyzes the spectrograms of the audio tracks of ordinary TED Talks, notes the speaker's speech characteristics, and reproduces short utterances.
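MelNet itself models spectrograms; the sketch below only shows how a mel spectrogram of a talk recording might be computed with the librosa library. The file name and parameter values are illustrative assumptions, not MelNet's actual preprocessing settings.

```python
import librosa
import numpy as np

# Load a (hypothetical) recording of a talk at a fixed sample rate.
y, sr = librosa.load("ted_talk.wav", sr=22050)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=2048,      # window size of the short-time Fourier transform
    hop_length=512,  # step between successive analysis frames
    n_mels=80,       # number of mel frequency bands
)
mel_db = librosa.power_to_db(mel, ref=np.max)  # convert power to decibels

print(mel_db.shape)  # (n_mels, n_frames)
```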

The main limitation of the algorithm is the length of those utterances. It reproduces short phrases very close to the original, but a person's intonation changes when speaking on different topics, in different moods, and at different pitches. The algorithm cannot yet imitate this, so long sentences sound artificial.

MIT Technology Review notes that even such an algorithm could significantly affect services like voice bots, where all communication is reduced to an exchange of short remarks.

A similar approach, analyzing speech spectrograms, was used by Google AI researchers working on the Translatotron algorithm, an AI able to translate phrases from one language to another while preserving the characteristics of the speaker's voice.