Spellchecking for VisualWorks
Return to home page
Comments Loading...
2008-10-29

Many moons ago I made a Spellchecker package for VisualWorks with my friend David Price. I used it in TypeLess my experimental irc client and in Bottom Line, the blog posting tool for BottomFeeder and recently it's found its way in to the RBCodeHighlighter so that you can spell check your code, comments and strings. It's also made it in to Martin Kobetic's pet project "Stampy" as well as our internal bug tracking software called Mars.

Spelling checking is really really useful... so it's no surprise to me that people wanted to use it in many things. The implementation we did originally was very efficient, speed-wise. It could spell check in less than a microsecond and likewise, find spell suggestions in less than a microsecond. The algorithms we applied were (and are) state of the art... however, we did nothing about the memory consumption.

The free English wordlist that comes with debian is ~900k big. The Spellchecker implementation we had made used 6.8mb. Woops.. we never actually looked at it, because it never mattered to us. But now that we're pondering putting it in to VisualWorks proper, the size does tend to stick out like a sore thumb.

Martin goaded and dared me to shrink its size. In half an hour I'd dropped it down to 1.95mb, which is a pretty good improvement. While munching on lunch, I realized I could shrink it even further and I got it own to first 1.33mb and then 1.17mb. I was pretty pleased at this point and going further would mean dumping objects and programming it like a C programmer.

In the last few years, especially with the OpenGL programming I've been doing, I've become more and more comfortable with using the powerful object system we have to do low level programming more easily than if I were in C itself. Having stopped at 1.17mb, I pontificated with Martin that in theory, it would be possible to shrink the dictionary down to 376kb at the most extreme and 640kb at its simplest.

I was reluctant to throw away the objects and try a new approach.. because it worked. Make it work, make it right, make it fast - it'd already been done, all three categories had been satisfied. More than that - a fourth category had been satisfied - make it efficient. So going any further would mean there was a fifth category could be satisfied - make it anal.

Martin suggested that I was simply trying to convince myself that it couldn't be done.. challenge accepted. I made a new package called Spellchecker2, which replaced Spellchecker and SpellcheckerEnglishDictionary. It stores its network of nodes in a DwordArray and the final outcome was this: Same speed for spell checking, word suggesting, but the memory foot print is a mere 484.5kb.

As well as that, I used the #once technique so that the DwordArray that is created for the vocabulary is shared between the CompiledMethod and the Vocabulary object you create. That means that the true foot print of the English dictionary running in the image is actually 484.5kb and nothing more.

The API changed between Spellchecker and Spellchecker2. For one thing - the dictionary is now stored in utf8, so if you want to do any spell checking, you have to convert your strings in to utf8 first. We could add some convenience methods for this.

You also have to transform the vocabulary object in to a mutable vocabulary before you can add words. This turns the raw byte data in to objects that you can manipulate and play with. When you're done, they collapse back in to the byte data and are saved on the vocabulary object. That means if you start adding your own words, you don't pollute the original.

vocabulary := Spellchecker2.English vocabulary.
vocabulary hasWord: ('testing' asByteArrayEncoding: #utf8).
vocabulary suggestWords: ('tseting' asByteArrayEncoding: #utf8).
vocabulary wordsStrings.
vocabulary mutableWhile: [:node | node addWord: ('smalltalk' asByteArrayEncoding: #utf8)]

So if you're interested, take a look and have a play. The software is free to use for any purposes. I'm hoping this version will be the version that makes it in to VisualWorks itself.

(Note: You need VisualWorks 7.7)