Lip-reading with Google’s DeepMind AI: what it means for disabled people, live subtitling and espionage!

Lip-reading is difficult. Many deaf people can do it, but there are situations when it is a struggle... but now, Artificial Intelligence like Google's DeepMind is getting its virtual teeth into the challenge - and doing an even better job than humans. So what does this mean for disabled people, TV subtitling and the shady world of cloak and dagger espionage...?

The biggest TV binge-fest in history

Researchers at Oxford University used Google's DeepMind to watch more than 5,000 hours of TV, including shows such as Newsnight, BBC Breakfast and Question Time, for the 'Lip Reading Sentences in the Wild' study. The AI analysed a total of 118,000 sentences, a much larger sample than in previous research such as the LipNet study, which contained only 51 unique words.

[Image: man eating popcorn in front of a TV, looking surprised]

The sample used in this DeepMind study comprised no fewer than 17,500 unique words, which made it a significantly harder challenge, but ultimately resulted in a much more accurate algorithm.

Tweaking the timing...

Adding to the task, the video and audio in the recordings were often out of sync by up to a second.

To prepare the samples for the machine learning process, DeepMind first had to assume that the majority of clips were in sync, watch them all and learn a basic relationship between mouth shapes and sounds. Using that knowledge, it then rewatched every clip and corrected the audio wherever the lips were out of sync with the speech.

Only then could it go through all 5,000 hours once more for the deep analysis, learning exactly which words related to which mouth shapes and movements.
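For the technically curious, here is a very rough, hypothetical Python sketch of that two-pass idea (it is not the researchers' code): a simple correlation between a made-up per-frame "mouth openness" feature and the audio stands in for the learned mouth-to-sound model, and the best-matching shift is used to roll the audio back into sync before the real training would begin.

import numpy as np

def estimate_offset(mouth_activity, audio_energy, max_offset=25):
    # Try every shift within +/- max_offset frames and keep the one where the
    # audio lines up best with the mouth movements (a stand-in for the learned model).
    best_offset, best_score = 0, -np.inf
    for offset in range(-max_offset, max_offset + 1):
        score = np.corrcoef(mouth_activity, np.roll(audio_energy, offset))[0, 1]
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

def realign(clips):
    # Shift each clip's audio back so it matches the lips, ready for training.
    return [(mouth, np.roll(audio, estimate_offset(mouth, audio))) for mouth, audio in clips]

# Toy data: three "clips" of 100 video frames whose audio lags by a random amount.
rng = np.random.default_rng(0)
clips = []
for _ in range(3):
    mouth = rng.random(100)                                # per-frame mouth-openness feature
    lag = int(rng.integers(-20, 21))
    audio = np.roll(mouth, -lag) + 0.05 * rng.random(100)  # same signal, shifted plus noise
    clips.append((mouth, audio))

aligned_clips = realign(clips)  # now every clip's audio and lips agree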

A deeply impressive result – not just lip-service

The result of this research and development was a system that can interpret human speech across a wide range of speakers found in a variety of lighting and filming environments.

The system successfully deciphered phrases such as “We know there will be hundreds of journalists here as well” and “According to the latest figures from the Office of National Statistics”.

Here is an example of a clip without subtitles: 

[Image: close-up of a woman speaking clearly]

And now the same clip with subtitles created by the DeepMind algorithm:

[Image: close-up of the same woman speaking, with Google DeepMind's live subtitles underneath]


DeepMind significantly outperformed a professional lip-reader and all other automatic systems. Given 200 randomly selected clips from the data set to decipher, the professional translated just 12.4% of words without errors. The AI correctly translated 46.8% - and many of its mistakes were very small, such as missing an 's' off the end of a word.
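As a back-of-the-envelope illustration of what "words translated without errors" means, here is a tiny, hypothetical Python example (the scoring in the study itself is more involved, and the mistaken word below is invented):

reference = "we know there will be hundreds of journalists here as well".split()
hypothesis = "we know there will be hundreds of journalist here as well".split()  # missing 's'

correct = sum(ref == hyp for ref, hyp in zip(reference, hypothesis))
print(f"{100 * correct / len(reference):.1f}% of words exactly right")  # one tiny slip still counts as an error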

So what do technological lip-reading advancements mean for disabled people?

Going mobile with DeepMind’s lip-reading smarts

For people with hearing loss the benefits of such tech are obvious. Voice recognition has been around for a long time and can aid the real-time translation of speech into text – as we see here in this video of someone using Google Glass to subtitle a conversation with a colleague.

This approach, however, relies on someone being able to speak clearly into a microphone (in this case a linked smartphone). But what about a noisy office or hallway? In such a situation, a head-mounted camera (which is unaffected by noise or the distance of the speaker) combined with lip-reading software would give a similar result without those restrictions.
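As a sketch of how such a set-up might decide between the two sources, here is some hypothetical Python; the three recogniser functions are placeholders rather than real APIs, and the noise threshold is invented.

def transcribe_audio(audio_frames):
    # Placeholder for a conventional speech-to-text engine.
    return "sorry, could you say that again?"

def transcribe_lips(video_frames):
    # Placeholder for a lip-reading model running on the head-mounted camera feed.
    return "shall we move to the quiet meeting room?"

def estimate_noise_level(audio_frames):
    # Placeholder for a signal-to-noise estimate of the microphone input.
    return 0.8  # pretend the office is loud today

def live_caption(audio_frames, video_frames, noise_threshold=0.5):
    # Use the microphone when it's quiet enough, otherwise read the lips.
    if estimate_noise_level(audio_frames) > noise_threshold:
        return transcribe_lips(video_frames)
    return transcribe_audio(audio_frames)

print(live_caption(audio_frames=[], video_frames=[]))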

Could DeepMind's lip-reading skills also help blind people and those with sight loss?

As a blind person I’d also find such a set-up extremely useful because, for me, hearing people in a noisy environment is twice as hard as it is for someone who can see the speaker’s lips. Sighted, hearing people lip-read, however subconsciously. If you fit this description and don’t believe me, next time you are in that situation, try closing your eyes and see if you can still hear the person next to you.

The boon of Google Glass synthetic speech at a party

It’s often assumed that a blind person’s hearing must be twice as sharp. While we certainly do pay more attention to the sounds around us, the inability to hear people in a noisy place when everyone else can is an ironic twist to being blind. So Google Glass, feeding me a clearly-spoken synthetic speech interpretation through the bone-conducting speaker behind my ear, would be a boon at noisy parties where I know I’ve got several hours of hard listening ahead.

[Image: brightly coloured lips, cartoon style]

A new world of real-time subtitling on offer

While many programmes are currently subtitled, the broad range of live television doesn’t allow for pre-written subtitling. Instead, professional transcribers have to listen, watch and rapidly transcribe on the fly (using a combination of voice recognition and stenographer-style keyboards), which is costly and often means programmes go unsubtitled.

This new advance will make real-time subtitling much more efficient – helping deaf people but also aiding everyone watching TV in a noisy office, café or bar. And if you’re a spy, you too can of course benefit from this remote and clandestine ability to understand what people are discussing!

What does new lip-reading tech mean for YouTube videos?

This automated approach could also help to subtitle the thousands of hours of video uploaded to YouTube every day – and help keep the audio in sync with the speech. Let’s look forward to the time when every video’s spoken content is readable, and thus searchable, by everyone who would find either option helpful.
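As a small, hypothetical example of that final step, here is some Python that turns an automatically generated transcript into a WebVTT caption file, the subtitle format YouTube accepts; the timings below are invented and the phrases are borrowed from the examples earlier in this article.

transcript = [
    (0.0, 2.5, "We know there will be hundreds of journalists here as well."),
    (2.5, 5.0, "According to the latest figures from the Office of National Statistics."),
]

def to_timestamp(seconds):
    # WebVTT timestamps look like 00:00:02.500 (hours:minutes:seconds.milliseconds).
    minutes, secs = divmod(seconds, 60)
    return f"00:{int(minutes):02d}:{secs:06.3f}"

lines = ["WEBVTT", ""]
for start, end, text in transcript:
    lines.append(f"{to_timestamp(start)} --> {to_timestamp(end)}")
    lines.append(text)
    lines.append("")

with open("captions.vtt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

A file like that can be uploaded alongside the video as its caption track, and the text inside it can then be read, and searched, by anyone.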
