As I sit here in front of my shiny MacBook Pro, I'm trying to figure out how to word my argument right without sounding like I want to prematurely optimize Smalltalk. So here goes:
There is this neat package in the public Store repository called ComputingStreams (and ComputingStreamsTests). This work came out at roughly the same time we were doing StreamWrappers at Wizard, which was kind of exciting, because the two works mirror each other: StreamWrappers was about implementing new streams that wrap around other streams, while ComputingStreams is about a generic set of streams that can be customized with blocks to wrap around other streams.
The code works roughly along the lines of the regular Smalltalk enumeration protocols - do:, select:, detect:, reject:, collect: and so on - but instead of working on collections, these work on streams, so you can write code like this:
((((people selecting: [:each | each isMale]) rejecting: [:each | each age > 80]) collecting: [:each | each mother]) rejecting: [:each | each isNil]) do: [:each | each receiveGift]
Compare with the collection code example:
((((people select: [:each | each isMale]) reject: [:each | each age > 80]) collect: [:each | each mother]) reject: [:each | each isNil]) do: [:each | each receiveGift]
The two bits of code are almost identical, except that the stream version adds -ing: to the end of each message to indicate that it continues to "wrap up".
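To make the mechanics concrete, here's a minimal sketch of how one of these wrapping streams could be built. SelectingStream is my own illustrative name, not the actual class from the ComputingStreams package, and the sketch assumes the underlying stream never yields nil, since nil doubles as the end-of-stream sentinel:

Stream subclass: #SelectingStream
    instanceVariableNames: 'source block lookahead'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'ComputingStreams-Sketch'

SelectingStream class >> on: aStream select: aBlock
    ^self basicNew setSource: aStream block: aBlock

SelectingStream >> setSource: aStream block: aBlock
    source := aStream.
    block := aBlock

SelectingStream >> scanForNext
    "Advance the source to its next element satisfying the block; nil when exhausted."
    [source atEnd] whileFalse:
        [| each |
        each := source next.
        (block value: each) ifTrue: [^each]].
    ^nil

SelectingStream >> atEnd
    lookahead isNil ifTrue: [lookahead := self scanForNext].
    ^lookahead isNil

SelectingStream >> next
    | result |
    self atEnd ifTrue: [^nil].
    result := lookahead.
    lookahead := nil.
    ^result

Stream >> selecting: aBlock
    ^SelectingStream on: self select: aBlock

Nothing gets evaluated until somebody sends next or do: to the outermost stream. rejecting: is the same shape with the block negated, and collecting: is simpler still because it needs no lookahead.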
There is another part to this story that isn't entirely evident though - and that's the part I want to get to. Actually, there are two parts. The first has to do with duplicate protocol explosion. Here, let me give you a small taste of how popular the "do" paradigm is in Smalltalk:
keysDo:, treeDo:, datesDo:, itemsDo:, linesDo:, nodesDo:, pairsDo:, starsDo:, weeksDo:, yearsDo:, fieldsDo:, modelsDo:, monthsDo:, pixelsDo:, probesDo:, thingsDo:, valueDo:, bundlesDo:, classesDo:, membersDo:, methodsDo:, resultsDo:, allItemsDo:, bindingsDo:, childrenDo:, elementsDo:
That's a small taste - the list goes on and on. The big problem with this approach is that there's no way to pick up select:, reject: and collect: for "free" across all these different domains. You'd have to implement the other enumeration methods by hand: monthsSelect:, monthsReject:, monthsCollect:, so on and so forth. That doesn't seem like good programming to me.
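The fix is to expose the elements as a stream instead. Using a made-up Calendar class as the example - all the names here are hypothetical:

"Instead of hand-rolling each enumeration..."
Calendar >> monthsDo: aBlock
    months do: aBlock

"...answer a stream and inherit them all:"
Calendar >> monthsReadStream
    ^months readStream

(aCalendar monthsReadStream selecting: [:each | each hasHolidays])
    do: [:each | Transcript show: each name; cr]

One accessor replaces the whole monthsDo:/monthsSelect:/monthsReject:/monthsCollect: family.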
The second side of the story has to do with performance. Our people example above will work fine if we have, say, one hundred people in the collection and only one process running it at a time. Any more than that and we'll start to see noticeable garbage collection. Now imagine that we have 100,000 people in the collection and 1,000 concurrent users doing the search. Sure, the Smalltalk code will still be correct, but each run of the chain allocates four intermediate collections - one for each enumeration message before the do: - so the garbage collector is suddenly taxed with 4,000 collections of up to 100,000 elements each, all of which it then has to clean up.
If instead we use the ComputingStreams approach, this problem of "big collection" overload disappears - and more importantly, the developer didn't have to think about scalability, and even more importantly, they didn't accidentally write code that can't scale. Avoiding mistakes like that is what made Smalltalk famous - you didn't have to worry about how big your numbers were, because they would magically morph into LargeIntegers as soon as they overflowed.
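You can see that safety net in any workspace:

(SmallInteger maxVal + 1) class    "=> LargePositiveInteger - no overflow, no special code"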
So imagine, if you will, that everybody stopped creating *Do: methods and instead returned streams or collections from their models - preferably streams. That means you can programmatically construct a stream that lazily provides elements as the program iterates over it, giving you plenty of time-sharing between processes and avoiding big collection instantiations. It also means we get the rest of the enumeration APIs for free, and we can blend or "mashup" streams easily too.
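Here's one way such a lazily-computed stream could look - a sketch built on a hypothetical GeneratorStream class whose elements come from a pair of blocks, together with the collecting: protocol from ComputingStreams:

Stream subclass: #GeneratorStream
    instanceVariableNames: 'nextBlock atEndBlock'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'ComputingStreams-Sketch'

GeneratorStream class >> on: aNextBlock atEnd: anAtEndBlock
    ^self basicNew setNext: aNextBlock atEnd: anAtEndBlock

GeneratorStream >> setNext: aNextBlock atEnd: anAtEndBlock
    nextBlock := aNextBlock.
    atEndBlock := anAtEndBlock

GeneratorStream >> next
    ^nextBlock value

GeneratorStream >> atEnd
    ^atEndBlock value

| n naturals squares |
n := 0.
naturals := GeneratorStream on: [n := n + 1] atEnd: [false].
squares := naturals collecting: [:each | each * each].
squares next.    "1"
squares next.    "4"
squares next.    "9 - each element computed only when somebody asks"

No element exists before it's asked for, so there's nothing for the garbage collector to chew on, and other processes get scheduled in between elements.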
Today I was reviewing some code that added new protocol to iterate up the parents of the view hierarchy in our UI framework. Neat stuff - but it used the Do: pattern. Specifically, the two methods were superpartsDo: and findSuperpart:. It's a shame we don't adopt the ComputingStreams approach more readily, because then we could have superpartsReadStream instead, which you could detect on, collect on, select on, do on, so on and so forth, just like any other stream or collection - except that the superparts never get instantiated into a full collection unless you specifically need them to be, because it's a stream, and we could even use circular buffers under the hood.
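Built on the GeneratorStream sketch above, and assuming each part answers its parent via #superpart (nil at the root) - I'm guessing at the accessor and class names here - superpartsReadStream could be as small as this:

VisualPart >> superpartsReadStream
    "Answer a stream that lazily walks up the view hierarchy."
    | current |
    current := self.
    ^GeneratorStream
        on: [current := current superpart]
        atEnd: [current superpart isNil]

VisualPart >> findSuperpart: aBlock
    "The old findSuperpart: becomes a one-liner over the stream;
    answers nil when no ancestor satisfies the block."
    ^(self superpartsReadStream selecting: aBlock) next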
In many respects, the do:, select:, reject:, detect: and collect: protocols on Collection were a mistake - they should have been streaming protocols from the beginning. If they had been, we wouldn't be accidentally writing memory-hungry code right now, and our code would plug together even better. For example, if we suddenly change our people collection to be a people stream from Glorp, the rest of our code continues to work just fine, without scaling problems. Better yet, Glorp might be smart enough to respond to the *-ing: methods and construct a lazy SQL select for when you finally send #do: to it.
There is much that could be done here. I suppose the first step would be to prove its usefulness and push it into the Cincom Smalltalk base so that it's as easy to use as the collection enumeration APIs - then maybe people will start using it regularly.
Bruce Badger is attempting to kick-start the ANSI process - perhaps the streaming -ing: protocols should be raised there as a potential new addition to Smalltalk the language.