Scientists from the Massachusetts Institute of Technology's Computer Science and Artificial Intelligence Lab have published a report on a new machine learning model that can match objects in an image with their spoken description. The researchers built on their 2016 work and improved it by teaching the model to associate particular segments of a voice spectrogram with particular regions of pixels. The engineers hope that the model will eventually be useful in simultaneous translation.
The MIT algorithm is based on two convolutional neural networks. The first divides the image into a grid of cells. The second processes a spectrogram of the speech (a visual representation of its frequency spectrum) and breaks it into segments roughly one word long. The system then compares each cell of pixels with each segment of the spectrogram and computes a similarity score. Based on this score, the network determines which "object-word" pairs are correct and which are not.
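The comparison step described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the feature vectors below are random stand-ins for the outputs of the two networks, and the use of cosine similarity, the "best cell per segment" scoring rule, and all function names are assumptions made for illustration.

```python
import numpy as np

def normalize(v):
    """Scale vectors along the last axis to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def similarity_map(image_cells, audio_segments):
    """Cosine similarity between every grid cell and every audio segment.

    image_cells:    array of shape (H, W, D), one feature vector per cell
    audio_segments: array of shape (T, D), one feature vector per segment
    returns:        array of shape (H, W, T)
    """
    return np.einsum("hwd,td->hwt", normalize(image_cells), normalize(audio_segments))

def pair_score(image_cells, audio_segments):
    """Score an image/caption pair: best-matching cell for each segment,
    averaged over segments (an assumed scoring rule)."""
    sim = similarity_map(image_cells, audio_segments)
    return float(sim.max(axis=(0, 1)).mean())

def best_cell_per_segment(image_cells, audio_segments):
    """Return the (row, col) of the most similar cell for each segment."""
    sim = similarity_map(image_cells, audio_segments)
    h, w, t = sim.shape
    flat = sim.reshape(h * w, t).argmax(axis=0)
    return [(int(i // w), int(i % w)) for i in flat]

# Stand-in data: a 4x4 grid of 8-dimensional cell features.
rng = np.random.default_rng(0)
cells = rng.normal(size=(4, 4, 8))

# A "matching" caption whose two segments coincide with two of the cells,
# and an unrelated caption made of random segment features.
matching = np.stack([cells[1, 2], cells[3, 0]])
unrelated = rng.normal(size=(2, 8))

print(best_cell_per_segment(cells, matching))
print(pair_score(cells, matching) > pair_score(cells, unrelated))
```

In this toy setup the matching caption scores higher than the unrelated one, which is the signal a system of this kind would use to decide whether an "object-word" pair is correct.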
"We wanted to do speech recognition in a way that's more natural, leveraging additional signals and information that humans have the benefit of using, but that machine learning algorithms don't typically have access to," the researchers said.
After training on a database of 400,000 images, the system was able to match several hundred words with objects. With each iteration, it refined the matching score so that specific words became associated with specific objects.
The MIT researchers believe that this approach will simplify automatic translation between languages, since it does not require a text description of objects.
Image and speech recognition systems already handle their tasks well, but they require a lot of computing resources to do so. In April 2018, Google announced a competition for developers working on deep networks and computer vision on smartphones. It is intended to find ways of optimizing real-time recognition systems.