Mozilla has presented LPCNet, a new speech synthesis system that converts text to speech while significantly reducing resource demands. It achieves this by combining traditional digital signal processing (DSP) techniques with a speech synthesis model based on a recurrent neural network.
The main problem with modern neural-network systems for real-time speech synthesis is their high computational complexity, which prevents their use on smartphones and tablets.
LPCNet uses DSP for linear predictive coding (LPC) filtering and vocal tract modeling. Instead of receiving all raw samples, the neural network works with a prediction of each subsequent sample. This frees the network from modeling the vocal tract itself and leaves it only the task of correcting prediction errors: it refines the prediction rather than generating every sample from scratch in real time.
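To illustrate the DSP half of this division of labor, here is a minimal sketch of linear prediction: each sample is predicted as a weighted sum of the previous ones, and only the small residual (prediction error) is left for the neural network to model. The function names and the Levinson-Durbin estimation shown here are illustrative, not taken from the LPCNet codebase.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Estimate LPC coefficients via autocorrelation + Levinson-Durbin."""
    n = len(x)
    # Autocorrelation up to the requested order.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]  # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this order.
        acc = np.dot(a[:i], r[i:0:-1])
        k = -acc / e
        # Update coefficients: a[j] += k * a[i-j] for j = 1..i.
        a[1:i + 1] += k * a[i - 1::-1]
        e *= (1.0 - k * k)
    return a

def lpc_predict(x, a):
    """Predict each sample from the `order` samples before it."""
    order = len(a) - 1
    pred = np.zeros_like(x)
    for n in range(order, len(x)):
        # pred[n] = -sum_{j=1}^{order} a[j] * x[n-j]
        pred[n] = -np.dot(a[1:], x[n - order:n][::-1])
    return pred

# A slightly noisy sine is almost perfectly linearly predictable,
# so the residual carries far less energy than the signal itself.
rng = np.random.default_rng(0)
t = np.arange(800)
x = np.sin(2 * np.pi * 0.03 * t) + 0.001 * rng.standard_normal(800)

a = lpc_coefficients(x, order=16)
pred = lpc_predict(x, a)
residual = x[16:] - pred[16:]
print(np.mean(residual ** 2) / np.mean(x ** 2))  # tiny ratio: most energy is predicted
```

In LPCNet the roles are split along exactly this line: cheap DSP supplies the prediction, and the recurrent network only has to model the low-energy residual, which is what makes the whole system light enough for real time.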
The technology can also be applied in other areas where the quality of a voice signal needs improving: transmitting speech over low-bandwidth communication channels, noise suppression, data filtering, and recovering speech fragments lost during transmission.
LPCNet is written in C and uses Keras, a high-level framework for building neural networks. A GTX 1080 Ti video card is recommended. Ready-made models are available for download, but the system can also be trained on your own data. LPCNet is distributed under the BSD license.
Mozilla's speech synthesis system is being developed as an alternative to Google's WaveNet, whose code was opened to developers in March 2018.