The Magic Behind the TypeWell Dictionary (Part 1)
TypeWell is celebrating its 15-year anniversary, which means we’re not only thinking about how technology will change in the next fifteen years, but also reflecting on how far the TypeWell system has already come. Where transcribers once lugged around 20-pound laptops and “linked” them with a serial cable, some are now using lightweight tablets while clients view real-time text on their iPhones.
But mobile technology aside, the true foundation of the TypeWell system is the transcribing software with its robust and intuitive abbreviation system.
I interviewed Steve Colwell — TypeWell’s co-founder and chief software architect — about how the abbreviation system was developed, and how the underlying dictionary has evolved over the years.
What makes the TypeWell abbreviation system so unique?
The trick to designing an effective abbreviation system is to come up with a system humans can remember. It's not a matter of packing each word down to the shortest possible abbreviation. Doing so leads to an unpredictable system that causes a lot of mistakes.
Counter to what most people might think, minimizing keystrokes is not the only goal of the TypeWell abbreviation system. We also have to consider the issue of learnability, and how real users function in a classroom transcribing environment.
It sounds like TypeWell was designed to balance efficiency with learnability. Can you elaborate?
In the early days, the TypeWell abbreviation system was different. In some ways it was a more powerful abbreviation system than it is now — but in a bad way. It would expand tiny little 3-letter abbreviations into big long words that were the closest match it could find. It was ecstasy if you typed perfectly, but it was agony when you made any typos at all, because the most outrageous expansions would appear!
A big part of making TypeWell as easy to use as it is now was tuning and refining the balance between the power of fast expansions and the convenience of working well even with real-life typos.
Screenshot from Steve's computer during the dictionary-building process.
A big advancement came when we made a system that “holds together” even with hundreds of thousands of abbreviations. (That’s where most other abbreviation systems begin to break down.)
Version 1 of TypeWell would let you abbreviate anything, very concisely — but it was hard for a transcriber to learn how to steer it. It worked very well for the time, but that's one reason we didn't handle proper names back then — they just overloaded the rule system so that it was too tricky for transcribers to learn.
That was a big advance in Version 2 or 3, where Judy [TypeWell's co-founder] and I figured out a coherent set of rules that were teachable, usable at high speed by the transcriber, had few conflicts, and worked with huge dictionaries. We were happily surprised when it proved possible to do all those things at once! But it is a bit of a balancing act so there's always tuning, tuning, tuning to keep it performing smoothly in all those ways at once.
TypeWell V7 was released in 2013, but it uses essentially the same dictionary as V6. There was a major dictionary expansion when you released V6 back in 2011. How did you go about expanding it from 200,000 to over 500,000 words in V6?
To make the new dictionary we needed a source of fairly accurate text, which used all the most modern words in English. Plus we really wanted something that included every possible proper noun as well, since the old dictionary was weak in the area of proper names.
The answer was Wikipedia. We downloaded the entire contents of Wikipedia, and extracted all the words and how they were used. That worked even better than I'd hoped, because even the imperfections in the text in Wikipedia work well for us — it's important that TypeWell allow transcribers to type the vernacular used by a teacher, even if it's not exactly "correct" usage. And it turns out that's exactly the level of formality that much of Wikipedia has as well.
It was an interesting project because the size of Wikipedia overloaded all the computers I could acquire, and it took a long time for the computer to process it all. Of course, we had to process it many, many times as we improved and tuned our use of Wikipedia. Our computers breathed a sigh of relief when that was all done!
Was the dictionary further expanded in V7?
The dictionary was still so comprehensive from V6 that it wasn't necessary to do a major expansion. Of course, it's true we're always updating the dictionary as new words turn up, and it needed a little work to function well with the other features of V7. So there were some changes. But it's still about 500,000 words.
How did you decide which new abbreviations to include in Turbo 2?
Judy had always planned to make a successor to Turbo 1, so we already had a good idea of what sorts of things Turbo 2 might do. Judy followed the same rules as with Turbo 1, picking the types of abbreviations that would help accelerate typing but would also work consistently as expected. There's no point in making rules so complex that people can't learn them. So we left out several possible abbreviations that had only a 30% success rate, and focused on the ones that work 90% of the time.
In TypeWell, a “dictionary conflict” occurs when multiple words share the same string of consonants. For example, maintain, mention, and mountain could all expand from the abbreviation mntn. When you’re programming the dictionary, how do you determine which words will expand first, second, third, etc.?
Conflicts are the bane of my existence, and yet they’re also what makes TypeWell a unique system, because we handle them so well. Most abbreviation systems "fall apart" after they grow to a few hundred abbreviations, or even fewer — they just start being such a confusion of special rules that it's impossible to guess whether one's abbreviation is going to give the right word. That's a conflict.
The amazing thing about TypeWell, ever since the beginning, is the lack of conflicts. We still have them, but even with 500,000 words, any of which can be abbreviated, it all still makes sense, and the number of conflicts is relatively low.
To choose the order of expansions, we use the frequency of occurrence of the words. The most common word with that abbreviation comes up first, the second most common next, and so on. With the 500,000-word dictionary, we use that same Wikipedia data to determine the frequency of occurrence of the words. The system literally counts every occurrence of each word in all of Wikipedia, so it knows which words occur the most often.
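The frequency-based ordering Steve describes can be illustrated with a minimal Python sketch. It is not TypeWell's actual code: the `skeleton` and `build_expansions` functions are hypothetical, "drop the vowels" is a simplified stand-in for the real abbreviation rules, and a short sample sentence stands in for the Wikipedia text.

```python
import re
from collections import Counter, defaultdict

VOWELS = set("aeiou")


def skeleton(word: str) -> str:
    """Reduce a word to its consonant string (an assumed, simplified rule)."""
    return "".join(ch for ch in word.lower() if ch.isalpha() and ch not in VOWELS)


def build_expansions(text: str) -> dict[str, list[str]]:
    """Map each consonant string to its candidate words, most frequent first."""
    # Count every occurrence of every word in the corpus.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))

    # Words that share a consonant string are in conflict with each other.
    groups = defaultdict(list)
    for word in counts:
        groups[skeleton(word)].append(word)

    # Order each group by descending frequency; break ties alphabetically.
    return {
        abbr: sorted(words, key=lambda w: (-counts[w], w))
        for abbr, words in groups.items()
    }


# A tiny stand-in corpus: "mountain" x3, "maintain" x2, "mention" x1.
sample = (
    "We maintain the trail up the mountain. Did I mention the mountain? "
    "We maintain it; the mountain is steep."
)
expansions = build_expansions(sample)
print(expansions["mntn"])  # ['mountain', 'maintain', 'mention']
```

In this toy corpus "mountain" expands first only because it happens to occur most often in the sample. A larger corpus would produce whatever order its own word counts dictate.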
Of course, you can now override the order in which words expand if you prefer a different order, by using the MultiPAL feature in V7.
Next week, the interview continues as Steve answers some of the trickier questions like, “What is a dictionary ‘bug’ and how do you fix it?” and, “When someone suggests a new word to add to the dictionary, how do you decide whether to accept it?”