Tuesday, September 25, 2012

Technology Understanding Languages: Don't Be Siri!

So you've got a new smartphone and you'd rather tell it what to do whilst it's in your hand than touch the screen. You probably decide to use its speech recognition software. Then, you tell it to make an imaginary appointment in your calendar... and it does!

"I'm sorry, Dave. I'm afraid I can't do that."

How does it understand language? Well... it doesn't. It simulates understanding pretty well, that's all. It works out which phonemes (the individual sounds of speech) have been said and puts them together in the most probable order.

If you speak a language, understanding words is quite simple, because your brain should be many times more powerful than the average smartphone. IBM reportedly simulated the equivalent of 4.5% of the human brain on a supercomputer, and that took 147,456 processors. That's the equivalent of your brain after a night of vodka, and that's still pretty impressive.

It's very difficult to separate individual sounds when you only have one audio input to work with. Because of that, some horrific, mind-numbing mathematics is involved. To put it simply, the software hears audio and then guesses at the most probable phoneme you may have said. It does this by ruling out impossible combinations and very rare occurrences.

First, the hardware on your smartphone converts the analogue information into digital information. Computers like 1s and 0s.
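For the curious, here's a rough sketch of what "analogue to digital" means in practice. On a real phone this is done by a dedicated chip, not by Python, and the sample rate, bit depth and the `digitise` and `tone` names are all assumptions picked for illustration.

```python
import math

SAMPLE_RATE = 16_000  # samples per second (assumed; a common rate for speech)
BIT_DEPTH = 16        # bits per sample (assumed)


def digitise(signal, duration_s, sample_rate=SAMPLE_RATE, bit_depth=BIT_DEPTH):
    """Measure a continuous signal at regular intervals and round each
    measurement to a whole number, which is all a computer can store.

    `signal` is a function of time (in seconds) returning a value
    between -1.0 and 1.0.
    """
    max_level = 2 ** (bit_depth - 1) - 1  # 32767 for 16 bits
    n_samples = round(duration_s * sample_rate)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        value = max(-1.0, min(1.0, signal(t)))  # clip anything too loud
        samples.append(round(value * max_level))
    return samples


def tone(t):
    """A stand-in for your voice: a pure 440 Hz tone."""
    return math.sin(2 * math.pi * 440 * t)


samples = digitise(tone, duration_s=0.01)
```

A hundredth of a second of sound becomes a list of 160 whole numbers, and those whole numbers are what the rest of the software works with.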

The software then cleans up the digital data, filtering out background noise and frequencies beyond our range of hearing. The audio is divided into very short slices (hundredths of a second), and each slice is analysed in order to work out the phonemes.
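The chopping-up step could look something like the sketch below. It assumes a 16,000 samples-per-second recording and slices that don't overlap, which keeps the example short; real recognisers typically use overlapping slices, and the noise filtering is left out entirely. `split_into_frames` is a made-up name.

```python
def split_into_frames(samples, sample_rate=16_000, frame_ms=10):
    """Cut a list of audio samples into slices of `frame_ms` milliseconds.

    Any leftover samples at the end that don't fill a whole slice are dropped.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per slice
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# One second of (fake) audio becomes 100 slices of 160 samples each.
one_second = list(range(16_000))
frames = split_into_frames(one_second)
```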

Can you decipher this? Didn't think so...

The phonemes are processed by means of probability. The most likely phonemes are considered first, but if they're followed by unlikely phonemes or expressions, they are disregarded and replaced with the more likely alternative.
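Here's a toy version of that idea. It's a simplified Viterbi-style search, which is one standard way of finding the most probable sequence, though I'm not claiming it's what any particular phone uses. The phoneme labels and every probability below are invented for the example.

```python
def most_probable_sequence(steps, transitions, unseen=0.01):
    """Pick the most probable phoneme sequence.

    steps:       one dict per time slice, mapping phoneme -> how well it
                 matches the audio (made-up numbers here).
    transitions: dict mapping (previous, next) -> how likely that pair is.
    unseen:      score given to pairs missing from `transitions`.
    """
    # best[p] = (probability of the best path ending in p, that path)
    best = {p: (score, [p]) for p, score in steps[0].items()}
    for candidates in steps[1:]:
        new_best = {}
        for p, acoustic in candidates.items():
            prob, path = max(
                ((prev_prob * transitions.get((prev, p), unseen) * acoustic, prev_path)
                 for prev, (prev_prob, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[p] = (prob, path + [p])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Three time slices; the middle one sounds slightly more like "ih" than "eh".
steps = [{"w": 0.9}, {"eh": 0.4, "ih": 0.6}, {"r": 0.9}]
transitions = {
    ("w", "eh"): 0.5,
    ("w", "ih"): 0.5,
    ("eh", "r"): 0.6,
    ("ih", "r"): 0.1,  # "ih" followed by "r" is rare in this toy model
}

greedy = [max(c, key=c.get) for c in steps]           # best guess per slice
smart = most_probable_sequence(steps, transitions)    # best guess overall
```

Guessing one slice at a time gives `w ih r`, because "ih" was the better match on its own. Looking at the whole sequence gives `w eh r`, because "ih" is so rarely followed by "r" that the likelier-sounding phoneme gets thrown out.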

An example of a stumbling block for speech recognition would be the following:

"Real eyes realise real lies".

Its output could easily be "realise" repeated three times, so a speech recognition program would probably get this wrong. There are plenty of phrases it could trip over like this, so how does it occasionally get it right?

"Where are you?" could be "wear are you?" - we know it couldn't be, but a computer doesn't. The only way to avoid this mistake is to give the software data on which word combinations are likely and which are unlikely. The best method is to pick the most likely option, but that can be difficult if you don't know what any of the words mean.
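A very rough sketch of picking the likelier word combination is below. It scores each candidate sentence by how often its neighbouring word pairs have been seen before. The counts are completely made up, and `BIGRAM_COUNTS`, `score` and `pick_most_likely` are names I've invented; a real system would use counts from an enormous pile of text and turn them into proper probabilities.

```python
# How often each pair of neighbouring words has been seen (invented numbers).
# "<s>" marks the start of a sentence.
BIGRAM_COUNTS = {
    ("<s>", "where"): 50,
    ("where", "are"): 40,
    ("are", "you"): 60,
    ("<s>", "wear"): 5,
    ("wear", "a"): 8,
}


def score(sentence, counts=BIGRAM_COUNTS, unseen=0.1):
    """Multiply together the counts of every neighbouring word pair.

    Pairs that were never seen get the small `unseen` value, which drags
    the whole sentence's score down without making it zero.
    """
    words = ["<s>"] + sentence.lower().split()
    total = 1.0
    for pair in zip(words, words[1:]):
        total *= counts.get(pair, unseen)
    return total


def pick_most_likely(candidates):
    """Return the candidate sentence with the highest score."""
    return max(candidates, key=score)


best = pick_most_likely(["wear are you", "where are you"])
```

The program never learns what "where" means. It only knows that "where are" turns up far more often than "wear are", and that's enough for it to choose "where are you".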

The phone has as much chance of understanding you as any member of the opposite sex, but that doesn't mean you can do those sorts of things with it, even though it does vibrate.