Speech Recognition for Vikings
Robin Christopherson | 15 Nov 2013
The three main areas where language and techno-linguists get excited are broadly: talking to your computer, your computer understanding what you said, and your computer speaking back to you in as human-sounding a way as possible. Robin Christopherson, AbilityNet's Head of Digital Inclusion, reports on the latest developments in all three areas, all of which could have a profound impact on the lives of disabled people.
I’ve just been to Oslo to speak on language technology. The event was hosted at the fantastic National Library in Oslo, where the walls are floor-to-ceiling cases of old leatherbound, goldleaf-embossed books behind glass. The conference, however, was all high-tech, and more specifically about language and how it is used in today’s technology. The three main areas where language and techno-linguists get excited are broadly:
- talking to your computer (for inputting text or to issue commands),
- your computer understanding you with as much intelligence as code can currently muster, and
- your computer speaking back to you in as human-sounding a way as possible.
Let’s deal briefly with each in turn, as each is interesting in its own way and flying ahead as fast as a parrot persistently seeking Scandinavian solace.
Speech recognition
Speaking to your computer (speech or voice recognition) has been around for well over two decades and has been improving slowly with each new iteration. However, in recent years processor power has enabled some very serious number-crunching to be done quickly enough to provide extremely accurate recognition without tedious delays. Today’s speech recognition software uses extensive statistical analysis of both what you are saying (including the words surrounding each word in context) and what you have said in the past (by analysing previous documents to provide the probability of your saying particular words again and in a given context).
This can lead to 98 or 99% accuracy. Unless you live in Norway.
There must be something in the water in Norway that affects the vocal cords, or perhaps it’s the Viking genes, which means that Norwegians are not able to benefit from such advancements in speech recognition.
Or it could just be that Dragon NaturallySpeaking isn’t available in Norwegian!
It is, of course, the latter, and it’s easy for us English-speakers to assume that all the technologies we take for granted are available to everyone. Norway has a population of just over five million, the chap from Nuance attending the conference was chagrined to point out, and this just doesn’t make creating a Norwegian recognition engine an economically viable proposition. The Norwegian government are trying to muster two million pounds of funding to make it viable, but in the meantime the fjords will be free from any Dragon-related activity (which is probably good news for the parrots).
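For the curious, the statistical idea described earlier in this section (using the surrounding words and your own past documents to guess what you probably said) can be illustrated with a toy sketch. This is only a simplified illustration, not how Dragon or any real engine is implemented: real products combine acoustic models with far richer language models. The function names and the sample sentences below are made up for the example. The sketch counts word pairs in past text and uses those counts to choose between two similar-sounding candidates.

```python
from collections import Counter


def train_bigrams(documents):
    """Count single words and adjacent word pairs in previously written text."""
    unigrams, bigrams = Counter(), Counter()
    for doc in documents:
        words = doc.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Estimate P(word | prev) with add-one smoothing so unseen pairs are not zero."""
    vocab_size = max(len(unigrams), 1)  # guard against an empty training set
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def pick_word(prev, candidates, unigrams, bigrams):
    """Choose the candidate that most plausibly follows the previous word."""
    return max(candidates, key=lambda w: bigram_prob(prev, w, unigrams, bigrams))


# Hypothetical "previous documents" written by the user.
past_documents = [
    "please recognise speech when I talk",
    "the software can recognise speech quickly",
    "we walked along a nice beach",
]
unigrams, bigrams = train_bigrams(past_documents)

# The audio is ambiguous between two similar-sounding words; context decides.
print(pick_word("recognise", ["speech", "beach"], unigrams, bigrams))
print(pick_word("nice", ["speech", "beach"], unigrams, bigrams))
```

The more of your own text such a model has seen, the better its guesses about what you are likely to say next, which is one reason a recognition engine needs a large body of examples in each language it supports.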
Why can't apps fill the gap?
There are a number of apps (SayHi is an excellent example that I demonstrated as part of my presentation) that allow you to speak in one language and have it translated into another – including Norwegian. You can speak in Norwegian too and have it speak back the translation in English. The accuracy of these apps is only around 80-90%, which isn’t sufficient for use in a professional environment where productivity is important and mistakes can be costly.
However, these apps don’t actually process the phrases on the device itself but rather send a compressed recording of your words to be crunched on a central server somewhere. This means that we aren’t limited by the power of the device (only the speed of the connection). The lack of accuracy is therefore made up for in large part by sheer numbers: people whose smartphones or tablets aren’t themselves powerful enough to do the statistical gymnastics will nevertheless be able to access these inexpensive and still very useful apps, given a half-decent mobile connection.
Moreover, it is my hope that, as these massive servers receive inputs from thousands or tens of thousands of users, the statistical algorithms can get to work and begin to rapidly close this gap in recognition accuracy. What Nuance would otherwise need to do manually, collating data on how and what Norwegians say to their devices, would then happen automatically through the sheer number of users alone.
So best of luck to Norway (and the hundreds of other countries still pining for pro recognition tools) but the good news is that speech recognition, as a method of interacting with your computer, is well and truly here to stay and advancing apace. And if you live in a Dragon-free country, hopefully it will come to a device near you soon.
Understanding natural language
This is the area where things are possibly moving most rapidly. The ability of devices (well, more accurately the software on those devices) to interpret what you say and act upon it is helping us more and more each month.
For some this is a convenience: being able to issue a command to read out recent emails, dictate a text and say who it should be sent to, or set a reminder about buying more post-it notes when you arrive at work, all while you’re driving on your daily commute, is nice and handy. But for others who are permanently unable to use their hands (or eyes), such an ability could mean the difference between being digitally included and being left out in the cold. If you’re interested in reading more about virtual assistants and how their artificial intelligence is aiding everyone regardless of ability, check out my Guardian article about how AI is helping the disabled or a related rambling on our own blog. Once again, however, we must spare a thought for our Norwegian cousins. Siri is still not available in Norway, which, understandably, has left them feeling as sick as a parrot.
Speech output
The last area – speech output – is one of particular interest to me as a blind person. Synthetic speech needs to be clear so I can work fast and be confident in what I’m hearing, but up till now it hasn’t been anything to write home about – more like a parrot’s parody of speech than a real person. But even in this area things are flying fjorward fast. My computer now <sounds like this>.
Would you also like to have Ava available on your every device? Well, your wish may well be granted, as Nuance has really come up trumps this time with their ‘Vocalizer Expressive’ range of voices, and they’re popping up all over the place. Soon Siri will be sounding like this although, alas, this is one instance where we are currently bobbing in Norway’s boat. The US version of Siri already has the new expressive voice but the UK’s Siri is still without. It was rumoured for iOS 7 but as yet is US-only.
On the whole, though, things are all moving in the right direction and if Norway can deal with their delay in Dragon, then we can be patient and not pine for a sexier Siri.
So to sum up – speech is here to stay. It will soon be as natural to hold a conversation with your computer as it is with your wife or children. Actually a good deal more productive than trying to talk with your kids when they have something electronic in their hands – which is something of an irony. Of all those set to benefit from language technology it is the disabled (either temporarily by environment or permanently by impairment) who are set to gain the most and who are, speaking personally for a moment, the most excited by it all. I had a great time in Norway and would thoroughly recommend you pay a visit, but it’s not the fjords I’m pining for as much as the future that is fast approaching.