For a while now I've been pondering the best approach to adding more parsing support to Cincom Smalltalk. There are all kinds of parsing technology out there, but I always find that unless it's a recursive descent parser, I never feel entirely comfortable with giant state and transition tables produced by technology like Smacc.
Add to that the success that O'Meta has had in producing new parsers for multiple languages across multiple languages (Javascript, Smalltalk and Cola)... the outcomes speak for themselves really, very impressive stuff. Finally, one other thing that really made me start to ponder was the sheer number of hand written parsers that exist - they're pretty much all recursive descent parsers.
Newspeak showed us the virtues of being able to define a grammar, then add the interpretation of its output by subclassing. O'Meta showed us the virtues of using a BNF like syntax to parse using recursive descent parsers. Packrat showed us that recursive descent parsing could be made to run in linear time by memoizing. They even found a way to make Packrat parsing able to be left recursive.
Keeping with the idiom of "simplest thing that could possibly work", I built a new parser on top of Xtreams, an experimental new streaming library for Smalltalk that achieves a simplification of the existing Smalltalk-80 streams by using composition and transformations.
I did not implement a packrat parser, thus no memoizing and no left recursive, but then again, you don't actually -need- those. And when you do need them, you can easily implement them in a subclass of RecursiveDescentParser, so the door is open on that particular optimization, not closed.
So how do you go about implementing a new parser technology in Smalltalk (or any language for that matter?). Well, step #1 is to figure out what kind of objects you'll need - in the case of PEGs, you need ordered choices, ordered sets, nots, ands, terminals and a parser.. not much to it really.
Here's the PEG grammar, written in PEG:
Now we instantiate our objects based on the PEG specification, to create the objects how we expect them to be built once parsed from the PEG grammar itself:
This gives us the objects:
Now that we have the objects, we can parse any PEG grammar we want and check whether it fits the definition we have of it or not. In this case, parsing the PEG grammar itself yields us a true result, which means we've successfully parsed it.
That's all nice and exciting, but we really want to create parser objects from the PEG grammer, so that we can parse things with the output of what we just parsed. So now we decorate our objects with actionable code:
Here we have followed the Newspeak doctorine - that the grammar is separate from how we wish to re-interprete it in our environment. In this case, we're interpreting something that conforms to the PEG grammar as a parser.
Now that we have the ability to parse a PEG grammar and produce a new parser from it, we can leave behind our bootstrap - our original Smalltalk code, and build a PEG parser from the PEG grammer itself:
We add to the parsed objects, the same code that we added to the Smalltalk created objects:
At this point, we now have a PEG parser built from the PEG grammar. To test it out, we need a different grammar - so I knocked together a simple HTTP URL parser which will produce Smalltalk HttpURL and HttpUser objects. This could easily replace the hand-written URL parser that's implemented on URLwithPath class.
We can see from the output of running the new grammar against a http url, in this case the link to my blog, that we get a string of nodes in arrays. Now we can add some interpretation to this tree of nodes to produce the real Smalltalk object:
With the parser nodes decorated with interpretation code, we can see that we are able to produce a native HttpUrl object that will work with the rest of the system. We did it with a straight forward and easy to understand grammar.
If we wanted to head in to the Newspeak world, we might consider making a specialized class for writing grammars that have Smalltalk code, perhaps something like this:
Http <definition: HTTP User?:user String:host Port?:port Component*:path> ^HttpUser new user: user; host: host port: port; path: path; yourself
But for now, the fundamentals are there - objects, easy to manipulate and use. I'm interested in making a set of parsers for the more common languages out there: Smalltalk, C, CSS, Javascript, Forth, X86 Assembler, Self (or something like it), Applescript, SQL.
If you're interested in trying out PEG parsing for yourself, load up XtreamsDevelopment and look in the package Xtreams-Parsing. I don't know if I'll keep it there, since it's not really part of streaming it just happens to use it.