Mozilla has released version 0.6 of its DeepSpeech speech recognition engine, which implements the speech recognition architecture of the same name proposed by researchers at Baidu. The implementation is written in Python using the TensorFlow machine learning platform and is distributed under the free MPL 2.0 license. It runs on Linux, Android, macOS and Windows, and is fast enough to be used on LePotato, Raspberry Pi 3 and Raspberry Pi 4 boards.
The release also includes trained models, sample audio files, and command-line recognition tools. To embed speech recognition in their own programs, developers can use ready-made modules for Python, NodeJS, C++ and .NET (third-party developers have separately prepared modules for Rust and Go). A pre-trained model is supplied only for English, but for other languages you can train the system yourself by following the accompanying instructions and using voice data collected by the Common Voice project.
DeepSpeech is much simpler than traditional systems while providing higher recognition quality in the presence of extraneous noise. It does not use traditional acoustic models or the concept of phonemes; instead, it relies on a well-optimized machine learning system based on a neural network, which eliminates the need to develop separate components for modeling various deviations such as noise, echo, and speech characteristics.
The downside of this approach is that, to train the neural network and achieve high-quality recognition, DeepSpeech requires a large amount of heterogeneous data dictated in real-world conditions by different voices and in the presence of natural noise. Such data is collected by the Common Voice project, created by Mozilla. It provides a validated data set with 780 hours of English, 325 hours of German, 173 hours of French and 27 hours of Russian.
The ultimate goal of the Common Voice project is to accumulate 10 thousand hours of recordings of typical phrases of human speech in various pronunciations, which should make it possible to reach an acceptable recognition error rate. So far, project participants have dictated a total of 4.3 thousand hours, of which 3.5 thousand have been validated. The final English model for DeepSpeech was trained on 3816 hours of speech, which in addition to Common Voice includes data from the LibriSpeech, Fisher and Switchboard projects, as well as about 1700 hours of transcribed radio show recordings.
With the ready-made English model offered for download, the recognition error rate of DeepSpeech is 7.5% when evaluated on the LibriSpeech test set. For comparison, the error rate of human recognition is estimated at 5.83%.
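To make the error figures above concrete, here is a minimal sketch of how such a rate can be computed, assuming it is measured as word error rate (WER): the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative only and are not taken from the DeepSpeech code base.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not part of DeepSpeech): counts the minimum number
    of word substitutions, insertions and deletions needed to turn the
    hypothesis into the reference.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions align i reference words with an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one dropped word out of six reference words.
    rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {rate:.1%}")
```

Under this definition, a 7.5% rate corresponds to roughly 7 or 8 word-level errors per 100 reference words.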
DeepSpeech consists of two subsystems: an acoustic model and a decoder. The acoustic model uses deep machine learning methods to calculate the probability that particular characters are present in the input audio. The decoder uses a beam search algorithm to convert the character probabilities into text.
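The decoder step can be illustrated with a small sketch. It assumes the acoustic model emits, for each audio frame, a probability for every character plus a special "blank" symbol (the CTC convention), and it runs a plain prefix beam search without a language model. This is a simplified illustration, not the DeepSpeech decoder itself; the function name `ctc_beam_search`, the toy alphabet and the probability table are all hypothetical.

```python
from collections import defaultdict


def ctc_beam_search(probs, alphabet, beam_width=3, blank=0):
    """Turn per-frame character probabilities into text (illustrative sketch).

    probs      -- list of frames, each a list of probabilities per symbol
    alphabet   -- symbols by index; alphabet[blank] is the blank placeholder
    beam_width -- number of candidate prefixes kept after every frame
    """
    # Each prefix maps to (probability of paths ending in blank,
    #                      probability of paths ending in a non-blank symbol).
    beams = {(): (1.0, 0.0)}
    for frame in probs:
        next_beams = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            p_total = p_blank + p_non_blank
            for symbol, p in enumerate(frame):
                if symbol == blank:
                    # A blank leaves the prefix unchanged.
                    next_beams[prefix][0] += p_total * p
                elif prefix and prefix[-1] == symbol:
                    # Same symbol again with no blank in between: merged repeat.
                    next_beams[prefix][1] += p_non_blank * p
                    # Same symbol after a blank: a genuinely new character.
                    next_beams[prefix + (symbol,)][1] += p_blank * p
                else:
                    next_beams[prefix + (symbol,)][1] += p_total * p
        # Keep only the most probable prefixes (the "beam").
        ranked = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = {prefix: tuple(scores) for prefix, scores in ranked[:beam_width]}
    best_prefix = max(beams.items(), key=lambda kv: sum(kv[1]))[0]
    return "".join(alphabet[i] for i in best_prefix)


if __name__ == "__main__":
    toy_alphabet = "-ab"  # index 0 is the blank placeholder (hypothetical)
    toy_probs = [
        [0.1, 0.8, 0.1],  # frame 1: mostly "a"
        [0.1, 0.8, 0.1],  # frame 2: "a" again, merged with frame 1
        [0.8, 0.1, 0.1],  # frame 3: mostly blank
        [0.1, 0.1, 0.8],  # frame 4: mostly "b"
    ]
    print(ctc_beam_search(toy_probs, toy_alphabet))
```

A wider beam explores more candidate transcriptions at the cost of more computation; production decoders typically also combine these scores with a language model, which this sketch omits.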