A while back I posted on the question of which should be considered more authoritative, source code or byte code. The conclusion I came to was that neither is ideal as a "canonical" representation of a program; an abstract syntax tree would better fill that role.
Well, that notion stuck with me and I've started working on a simple tree-shaped representation for Smalltalk code. The idea driving this project is fairly simple: there's no "one true representation" of a program. It's really quite an abstract thing and it needs to be represented differently in different contexts. However, the most abstract representation is the AST, which can be easily converted to other forms as needed.
The AST form is most natural for manipulation by tools such as browsers, debuggers, type inferencers, version control, translators etc. ASTs can be executed directly - this is how the Ruby interpreter works, for example - but Smalltalk traditionally compiles to bytecode, which can be more efficiently interpreted by the VM.
For presentation to the programmer, you want yet another form - class browsers and source code. And there may be other representations that are useful for presenting to the programmer: class diagrams, pattern summaries etc. (This is one of the core concepts of Intentional Programming, as Darius Clarke commented on my last post on this topic.)
The idea, then, is to shift between these representations as fluidly as possible while preserving as much of the available information as possible. To that end, the AST form preserves much of the formatting information that the programmer originally entered with the source code, and can reconstruct that source faithfully.
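One way to picture formatting preservation is to have each leaf of the tree remember the whitespace and comments that preceded it. The sketch below is in Python rather than Smalltalk, and it is only an illustration under that assumption, not the representation described in this post. The names `Token`, `tokenize` and `unparse` are hypothetical, and the tokenizer is deliberately crude: it handles identifiers, numbers, `:=`, double-quoted comments and single punctuation characters, nothing more.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    trivia: str  # whitespace and "comments" that appeared before the token
    text: str    # the token itself (empty for trailing trivia at end of input)


# Trivia is any run of whitespace and double-quoted Smalltalk-style comments.
# The token that follows is an identifier, a number, ':=' or one other character.
TOKEN = re.compile(
    r'(?P<trivia>(?:\s|"[^"]*")*)'
    r'(?P<text>[A-Za-z_]\w*|\d+|:=|\S)?'
)


def tokenize(source: str) -> List[Token]:
    """Split source into tokens, keeping the formatting in front of each one."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        tokens.append(Token(match.group("trivia"), match.group("text") or ""))
        pos = match.end()
    return tokens


def unparse(tokens: List[Token]) -> str:
    """Rebuild the original text by emitting each token after its trivia."""
    return "".join(token.trivia + token.text for token in tokens)


source = 'pop\n\t"remove the top element"\n\t^ elements removeLast  '
tokens = tokenize(source)
round_trip = unparse(tokens)
```

Because nothing is discarded, `round_trip` is character-for-character identical to `source`, comment and trailing spaces included. A real tree would hang these leaves off structured nodes, but the same principle applies.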
However, that goal shouldn't get in the way of optimizing a particular representation for its context, which is the whole point of multiple representations in the first place. I'm really interested in and excited by projects for optimizing Smalltalk execution, such as Eliot Miranda's AOStA or Bryce Kampjes' Exupery. In optimizing bytecodes or native code for fast execution, we may lose the information-equivalence between compiled methods and their ASTs, and that's OK.
In fact it's a good thing, because decoupling the representations used by the tools and the VM can make each more flexible and more powerful. Take the "senders" button in the browser, for example. If we optimize away certain message sends by inlining the methods they call, we interfere with the browser's ability to trace the senders. If the browser is operating on the AST, however, we don't have that problem. We are free to optimize the compiled methods for fast execution, the AST for ease of analysis, and the programmer's representation for clarity.
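The "senders" point can be illustrated with a toy model. This is a Python sketch under assumed names (`Variable`, `Literal`, `Send`, `Method`, `senders_of` are all hypothetical), not OmniBrowser's code or a real Smalltalk compiler. It models a method body as a small tree of message sends, then searches for senders of `isEmpty` twice: once in the tree as written, and once in a form where the call has been inlined.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Variable:
    name: str


@dataclass
class Literal:
    value: object


@dataclass
class Send:
    receiver: "Node"
    selector: str
    arguments: List["Node"] = field(default_factory=list)


Node = Union[Variable, Literal, Send]


@dataclass
class Method:
    class_name: str
    selector: str
    body: List[Node]


def selectors_sent(node):
    """Yield every selector sent anywhere inside a node."""
    if isinstance(node, Send):
        yield node.selector
        yield from selectors_sent(node.receiver)
        for argument in node.arguments:
            yield from selectors_sent(argument)


def senders_of(selector, methods):
    """Return the names of the methods whose bodies send the selector."""
    return [
        f"{m.class_name}>>{m.selector}"
        for m in methods
        if any(selector in selectors_sent(stmt) for stmt in m.body)
    ]


# 'elements size = 0', the body of a hypothetical Stack>>isEmpty
is_empty_body = Send(Send(Variable("elements"), "size"), "=", [Literal(0)])

# The methods as the programmer wrote them: pop sends isEmpty to self.
as_written = [
    Method("Stack", "isEmpty", [is_empty_body]),
    Method("Stack", "pop", [Send(Variable("self"), "isEmpty")]),
]

# The same methods after an optimizer has inlined isEmpty into pop.
as_optimized = [
    Method("Stack", "isEmpty", [is_empty_body]),
    Method("Stack", "pop", [is_empty_body]),
]

found_in_ast = senders_of("isEmpty", as_written)
found_in_optimized = senders_of("isEmpty", as_optimized)
```

Searching `as_written` finds `Stack>>pop`, while searching `as_optimized` finds nothing, because the send no longer exists there. A browser that queries the tree stays correct no matter how aggressively the executable form is rewritten.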
The first application of this new representation will be in OmniBrowser, which I'm in the process of adapting to operate on syntax trees rather than directly on the runtime. (Actually, OmniBrowser already has a layer of indirection between it and the runtime - this is what makes things like the Package Browser possible - so this will actually be a simplification of that layer.)
Further down the road, I'd also like to use the same package representations in Monticello, since they provide a much richer model of the package, and could allow versioning and merging at a finer grain than the current model allows.