Jack Schofield 

Here comes the stutter cutter

The next version of Lernout & Hauspie's Voice Xpress program, which is expected shortly, is a huge improvement on anything released before, writes Jack Schofield
  
  


The next version of Lernout & Hauspie's Voice Xpress program, which is expected shortly, is a huge improvement on anything released before, the company claims. As well as trying to recognise speech, it tries to recognise things that aren't speech, and to discard them.

Voice Xpress 5.0's new "disfluency filter" is designed to pick up the ums and ahs and ha-ha-has and remove them. These unwanted elements are not only tedious to edit out of recognised text, but they also make it harder for the software to follow sentence structure and use this knowledge to improve accuracy.

Dr Alex Waibel, a professor at both Carnegie Mellon University in the US and the University of Karlsruhe in Germany, has been working on the problem for some time. His company, Interactive Systems Inc, is one of many that Lernout & Hauspie, the fast-growing Belgian language giant, has taken over.

Waibel and his students have been researching the problems involved in recognising conversations, rather than dictation. "Conversational speech includes a lot of stutter, people leave out a lot of things, they mumble a lot of things," he says. "Some of the voice material is simply not there any more."

The solution is to exploit much more of the information that's available. For example, it's possible to extract an image of the lips from the face and get the computer to do "visual lip reading". On its own, the approach doesn't work very well, says Waibel, but it can be used to increase the accuracy of speech recognition systems. Lip reading systems can easily distinguish between b and v, for instance, which sound the same to speech recognition systems. As with humans, visual lip reading also becomes more valuable in noisy environments.

Some of Waibel's students have also developed a "meeting browser" which can recognise emotions: whether people are neutral, angry or sad, for example. The browser makes it possible to move backwards and forwards through a recording of a meeting and identify emotions. Its accuracy is limited, but people aren't always much better at the task.

Waibel says it's important to look at gesture, speech, and eye movements to extract meaning: "Who are they talking to? What are they looking at? The purpose of speech is only known when you know something about the scene. We've just scratched the surface."

 
