Monday, August 2, 2010

Steno Versus Qwerty Versus Automatic SR

First of all, have some lovely Egyptian Plovers from the Bronx Zoo, taken by Abigail on a recent photo jaunt.

Thanks, Abigail!

Second, a low-cost wearable computer rig that'll make you look significantly less dorky than your average gargoyle.

I'm a little concerned about overheating brought on by lack of ventilation, but even so it's a good start. This guy uses a foldable qwerty keyboard. You'll remember from my article on mobile and wearable computing that I think steno is just dying to be turned into an efficient wearable text input system. I think that two flat multitouch panels with keypads for haptic feedback tied snugly around each thigh and connected to the rest of the rig via Bluetooth would be ideal. Still not sure how feasible it would be to manufacture, but as Plover inches ever closer to full keyboard emulation (we're a session or two away from getting it working in Linux, and then hopefully we'll be able to port that bit into the Windows version without much trouble), I'd love to try it out in a mobile context, even if that means sawing open a SideWinder, cutting off the steno-irrelevant sections, snapping the circuitboard in half, and rewiring everything so that it can be attached to a wearable substrate like jeans, a hoodie, or a small bag.

Third, I'm going to check out the exhibit hall of SpeechTek2010 tomorrow. I'll let you know what I think, but I doubt I'll be out of a job just yet. And, anyway, even positing true AI and human-equivalent automated speech recognition, there are many contexts in which text input via fingers is more private and more convenient than dictating, which means that steno will always have a place in the game -- assuming I can get enough people aware of its many benefits.

Speaking of automated speech recognition, I made a short video the other day, and YouTube's autocaption transcript just kicked in today, so I decided to post it.

I stitched together several snippets of timed audio, dictated at speeds between 40 and 185 words per minute, and made screencasts of myself transcribing them first using the qwerty keyboard (top portion) and then the steno keyboard (bottom portion). First notice how much more work I was doing on the qwerty keyboard than on the steno keyboard. In What Is Steno Good For: Raw Speed, I discuss all the wasted energy and effort required to type out each individual letter, backspacing and retyping if I inadvertently type them in the wrong order, and the inherent limitations on speed imposed by this method. My fingers are moving much more quickly and working much harder in the qwerty window than in the steno window. As you can see by the end of the video, I was losing huge chunks of the audio, my accuracy was dismal, and I was positively sweating bullets between 160 and 185 WPM in the qwerty window. When I got to use my steno machine, 185 felt like a stroll in the daisies. I usually don't start feeling like my fingers are really going full blast until I get up to 240 or more on my steno machine, which this video doesn't show; I chose to stop at the point where my qwerty typing completely broke down. The difference in terms of ease, comfort, and ergonomics is hard to miss.

If you want to see how YouTube's automated speech recognition did at the same task, turn the captions on, then select "transcribe audio". You'll notice that even though the audio is crystal clear and each word is perfectly enunciated, it makes so many mistakes that it's almost impossible to understand what's being said by reading the autocaptions alone. Later on the quality improves a bit, though there are still significant errors, because the 185 WPM section of the transcript is a judge's jury charge, which contains lots of boilerplate big words such as "testimony" and "circumstances". Speech recognition has a much easier time with multisyllabic words, especially in dictation where there's a clear space inserted between each word (as in these samples, since they're intended for dictation, though they sounds quite unnatural to most ears) because there are comparatively fewer possibilities to rule out than in words of one or two syllables, which automated SR always has a great deal of trouble with. The problem is that small words are often the most important words in the sentence (especially if they're negatory words such as "no" or "not"), and you can see if you look at the auto captions that even making an error in 1 word out of 20 (a 95% accuracy rate) can cause great confusion.

The most successful and least error-ridden passage in the transcription was the phrase "the convergence of resources and bolstering of partnerships to support a coherent program for science and mathematics for all students". Because there were so many multisyllabic words in that passage, the autocaptioning system got them all right except for the phrase "for all", which it translated as "football". This is an error that a human transcriber would never make, because "football" and "for all" are stressed entirely differently in English, and syntactically it makes no sense to insert the word "football" between "mathematics" and "students". But you can see how that one tiny error brings the rest of the sentence into doubt, and can steer a Deaf or hard of hearing person off on entirely the wrong train of thought. It's this complete lack of ability to understand speech in context and to correct errors semantically that makes automatic speech recognition ultimately unreliable without extensive babysitting by human editors.

No comments: