A new parser for Ioke


Last week I finally bit the bullet and rewrote the Ioke parser. I’m pretty happy about the end result actually, but it does involve moving away from Antlr’s as a parser generator. In fact, the new parser is handwritten – and as such goes against my general opinion to generate everything possible. I would like to quickly take a look at the reasons for doing this and also what the new parser will give Ioke.

For reference, the way the parser used to work was that the Antlr generated lexer and parser gave the Ioke runtime an Antlr Tree structure. This tree structure was then walked and transformed into chained Message’s, which is the AST that Ioke uses internally. Several other things were also done at this stage, including separating message chains on comma-borders. Most significantly the processing to put together interpolated strings and regular expressions happened at this stage. Sadly, the code to handle all that was complex, ugly, slow and frail. After this stage, operator shuffling happened. That part is still the same.

There were several problems I wanted to solve, but the main one was the ugliness of the algorithm. It wasn’t clear from the parser how an interpolated expression mapped into the AST, and the generated code added several complications that frankly weren’t necessary.

Ioke is a language with an extremely simple base syntax. It is only slightly more complicated than the typical Lisp parser, and there is almost no parser-level productions needed. So the new parser does away with the lexer/parser distinction and does everything in one pass. There is no need for lookahead at the token level, so this turns out to be a clear win. The code is actually much simpler now, and the Message AST is created inline in the new parser. When it comes to interpolation, instead of the semantic predicates and global stacks I had to use in the Antlr parser, I just do the obvious recursive interpolation. The code is simple to understand and quite efficient too.

At the end of the day, I did expect to see some performance improvements too. They turned out to be substantial. Parsing is about 2.5 times faster, and startup speed has improved by about 30%. The distribution size will be substantially smaller since I don’t need to ship the Antlr runtime libraries. And building the project is also much faster.

But the real gain is actually in maintainability of the code. It will be much easier for me to extend the parser now. I can do nice things to make the syntax more open ended and more powerful in ways that would be very inconvenient in Antlr. The error messages are much better since I have control over all the error states. In fact, there are only 13 distinct error messages in the new parser, and they are all very clear on what has gone wrong – I never did the work in the old parser to support that, but I get that almost for free in the new one.

Another thing I’ve been considering is to add reader macros to Ioke – and that would also have been quite painful with the Antlr parser generator. So all in all I’m very happy about the new parser, and I think it will definitely make it easier for the project going forward.

This blog post is in no way saying that Antlr is bad in any way. I like Antlr a lot – it’s a great tool. But it just wasn’t the right tool for Ioke’s syntax.



New projects


OK, I’ve hinted and talked about in several contexts that I have some projects going on that haven’t been announced.  So today I’m going to just mention two of them (so that they will be released at some point). (And no, ioke won’t be on this list).

Xample

This one I have real hopes for. It’s already functioning to a degree. So what does it do? Well, I call it example-driven DSLs. Instead of writing a parser/regexps for handling external DSLs, Xample will take some examples and then derive code to do something based on those examples. It’s one of those things that will not do everything, but it might solve 80% of all the simple DSLs. Those cases where the overhead of creating the DSL framework would not be worth it. Xample will help in those cases in making it really easy to create quite sophisticated DSLs. It’s in Ruby, and I have plans that include a generic work bench for handling these DSLs. Since you have examples of how they should look, you can use the information in all kinds of cool ways.

I have just made the Xample repository public on github.

Antlr-ELisp

This is probably one of the worst cases of yak shaving I’ve ever started on. But it’s a good and worthwhile project. And now that I’ve mentioned it to Terence I guess there’s no way of getting away from it… =)

So, simple, Antlr-ELisp is a backend for Antlr, that allows you to create parsers for Emacs Lisp. This should make it quite easy to get Emacs modes written for a language, without having to resort to all the awful hacks Emacs generally does to handle different languages. This one is really in the infant stage right now, but it looks like it shouldn’t be too hard. Of course I’m not sure what performance will be like, but I’m trying to use as much macros as possible so the low level operations can still be efficient without looking ugly.

This project is also on github right now.

So, now I’m stuck – talking about these things on my blog means I will actually have to get both of them to a release, really soon now!



Antlr lexing problem


I should probably post this on a mailing list instead, but for now I want to document my problem here. If anyone has any good suggestions I’d appreciate it.

I’m using Antlr to lex a language. The language is fixed and has some cumbersome features. One in particular is being really annoying and giving me some trouble to handle neatly with Antlr 3.

This problem is about sorting out Identifiers. Now, to make things really, really simple, an identifier can consist of the letter “s” and the character “:” in any order, in any quantity. An identifier can also be the three operators “=”, “:=” and “::=”. That is the whole language. It’s really easy to handle with whitespace separation and so on. But these are the requirements that give me trouble. The first three are simple baseline examples:

  • “s” should lex into “s”
  • “s:” should lex into “s:”
  • “s::::” should lex into “s::::”
  • “s:=” should lex into “s” and “:=”
  • “s::=” should lex into “s:” and “:=”
  • etc.

Now, the problem is obviously that any sane way of lexing this will end up eating the last colon too. I can of course use a semantic predicate to make sure this isn’t allowed when the last character is a colon and the next is “=”. This helps for the 4th case, but not for the 5th.

Anyone care to help? =)