Humans are great at picking out and focusing on a single voice in a noisy environment. Computers are getting much better at speech recognition, as the growing number of digital assistants proves, but they still struggle when there are multiple voices or lots of background noise. Google seems to have solved the problem, though, by using both audio and video to train a system to isolate speech.
The phenomenon Google was attempting to copy is known as the cocktail party effect: the brain's ability to selectively focus on one audio source while filtering out all other stimuli. A good example is listening to one person talk in a very noisy room.
Google Research tackled the problem by combining video and audio, identifying who is speaking from their mouth movements and linking that up to the audio being heard. Training a “multi-stream convolutional neural network” to carry out this task required collecting 100,000 high-quality video lectures and talks from YouTube, then extracting clean speech segments from them.
This resulted in 2,000 hours of clean data with which to create “synthetic cocktail parties.” Google achieved that by mixing the videos together so two people were talking simultaneously. Non-speech background noise was also added to make things more realistic (and more difficult).
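The mixing step described above can be sketched in a few lines. This is a minimal illustration, not Google's actual pipeline: the function name, the fixed noise gain, and the use of sine tones as stand-in “voices” are all assumptions for the sake of a runnable example.

```python
import numpy as np

def make_cocktail_mixture(speech_a, speech_b, noise, noise_gain=0.3):
    """Mix two clean speech signals and background noise into one track.

    All inputs are 1-D float arrays at the same sample rate; the shortest
    length wins. Returns the mixture plus the trimmed clean targets the
    network would learn to recover from it.
    """
    n = min(len(speech_a), len(speech_b), len(noise))
    a, b, bg = speech_a[:n], speech_b[:n], noise[:n]
    mixture = a + b + noise_gain * bg
    # Normalize only if the sum would clip at full scale.
    peak = np.max(np.abs(mixture))
    if peak > 1.0:
        mixture = mixture / peak
    return mixture, a, b

# Toy example: two "voices" as sine tones plus white noise, 1 s at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
voice_a = 0.5 * np.sin(2 * np.pi * 220 * t)
voice_b = 0.5 * np.sin(2 * np.pi * 330 * t)
noise = np.random.default_rng(0).normal(scale=0.1, size=16000)
mix, target_a, target_b = make_cocktail_mixture(voice_a, voice_b, noise)
```

Because the clean segments are known before mixing, each synthetic mixture comes with perfect ground-truth targets, which is what makes this kind of self-supervised data generation so cheap at scale.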
As the video above demonstrates, once trained, the system is capable of focusing on a single voice and filtering out everything else. The same is possible when only one person is speaking but the background noise is bad enough that you struggle to hear what is being said.
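The “filtering out everything else” step boils down to the network predicting a time-frequency mask per speaker, which is then applied to the mixture's spectrogram. The sketch below shows only that final masking step; the function name is hypothetical, and how the mask itself is predicted (from combined audio and face-video features) is the hard part the network handles.

```python
import numpy as np

def apply_speaker_mask(mixture_spec, mask):
    """Isolate one speaker by applying a predicted time-frequency mask.

    mixture_spec: complex STFT of the noisy mixture, shape (freq, time).
    mask: per-bin values in [0, 1], near 1 where the target speaker
    dominates. Multiplying element-wise keeps that speaker's energy;
    an inverse STFT of the result gives the isolated waveform.
    """
    mask = np.clip(mask, 0.0, 1.0)
    return mixture_spec * mask

# Toy example: a 4x3 "spectrogram" and a mask that keeps the top two bins.
spec = np.ones((4, 3), dtype=complex)
mask = np.zeros((4, 3))
mask[:2, :] = 1.0
isolated = apply_speaker_mask(spec, mask)
```

One mask per visible speaker lets the same mixture be decomposed into several clean tracks, which is why looking at a different face can switch which voice comes through.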
Here’s a good example of how Google’s system can improve audio using a noisy cafeteria setting:
As you can imagine, there are many situations where this technology could have a positive impact. For pre-recorded video, it should make automatic captioning much more accurate, because each voice can be isolated and transcribed separately. That may take multiple passes, but it's worth it if recognition accuracy improves significantly.
For the hard of hearing, the system could be used as part of a hearing aid and smart glasses combo. The wearer looks at the person they want to listen to in a noisy environment, and because the camera on the glasses is tracking that person's mouth movements, the hearing aid can filter out all but their voice. The same is possible when watching TV, which could benefit from a new “speech focus” setting for audio output. YouTube would probably get such a feature first, though.
Google is already exploring how it can incorporate the technology into its products, and its digital assistant is an obvious focus. Conversing with Google Home devices in a noisy family environment, or instructing Google to do something on your smartphone in any number of noisy public situations, are clear near-future applications of this tech.