Re2j – a small lexer generator for Java


There is a tool called re2c, and it’s pretty neat. It lets you embed a regular-expression-based grammar in comments inside C code, and it transforms those comments into a basic lexer. A few things set re2c apart from similar tools. The feature set is deliberately limited (which is good), and the generated code is fast. You can also have several lexer sections in the same source file, and the productions in each comment apply only to the piece of code where that comment sits.
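To give a feel for what a lexer of this kind boils down to, here is a small hand-written Java sketch. It is not output from re2c, and everything in it is made up for illustration: the class name SketchLexer, the token kinds, and the two rules (decimal numbers and identifiers). The point is only the shape: a buffer, a cursor, and branching on the current character.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a hand-written scanner with the general
// shape of a generated lexer. None of these names come from re2c.
class SketchLexer {
    enum Kind { NUMBER, IDENT, EOF }

    private final byte[] data;   // the input buffer the lexer reads from
    private int cursor = 0;      // index of the next unread byte
    private int tokenStart = 0;  // index where the current token began

    SketchLexer(byte[] data) {
        this.data = data;
    }

    // Returns the current byte as an unsigned value, or -1 at end of input.
    private int peek() {
        return cursor < data.length ? (data[cursor] & 0xFF) : -1;
    }

    private static boolean isDigit(int c) {
        return c >= '0' && c <= '9';
    }

    private static boolean isLetter(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    // Scans one token. Each branch stands in for one rule:
    //   [0-9]+                  -> NUMBER
    //   [a-zA-Z_][a-zA-Z_0-9]*  -> IDENT
    Kind next() {
        while (peek() == ' ') {
            cursor++;            // skip spaces between tokens
        }
        tokenStart = cursor;
        int c = peek();
        if (c == -1) {
            return Kind.EOF;
        }
        if (isDigit(c)) {
            while (isDigit(peek())) {
                cursor++;
            }
            return Kind.NUMBER;
        }
        if (isLetter(c)) {
            while (isLetter(peek()) || isDigit(peek())) {
                cursor++;
            }
            return Kind.IDENT;
        }
        throw new IllegalStateException("unexpected byte at offset " + cursor);
    }

    // Text of the token most recently returned by next().
    String text() {
        return new String(data, tokenStart, cursor - tokenStart, StandardCharsets.US_ASCII);
    }

    // Convenience helper: tokenizes a whole string into "KIND:text" entries.
    static List<String> describe(String input) {
        SketchLexer lexer = new SketchLexer(input.getBytes(StandardCharsets.US_ASCII));
        List<String> out = new ArrayList<>();
        for (Kind k = lexer.next(); k != Kind.EOF; k = lexer.next()) {
            out.add(k + ":" + lexer.text());
        }
        return out;
    }
}
```

Calling SketchLexer.describe("foo 42 x1") gives the three entries IDENT:foo, NUMBER:42 and IDENT:x1. A real generated lexer would normally be a much denser state machine than this, but the moving parts are the same.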

As it happens, why the lucky stiff used re2c when he made Syck (the C-based YAML processor used in Ruby and many other languages). So when I set out to port Syck to Java, the first problem was to figure out the best way to port the re2c-based lexers. I ended up using Ragel for the implicit-scanner, and thought about doing the same for the token scanner, but Ragel is pretty painful to use for more than one main production in the same source file. The syntax is not exactly the same either, so switching would have added to the burden of porting the scanner.

At the end of the day the most pragmatic choice was to port the output generator in re2c to generate Java instead. This turned out to be pretty easy, and the result is now used in Yecht, which was merged as the YAML processor for JRuby a few days ago.

You can find re2j in my GitHub repository at http://github.com/olabini/re2j. The tool itself is still a C++ program, and it probably won’t compile very well on Windows. But it’s good enough for many small use cases. Everything works exactly as in re2c except for one small difference: you can define a parameter called YYDATA that points to the byte or char buffer the lexer should read from. For an example usage, take a look at the token scanner: http://github.com/olabini/yecht/blob/master/src/main/org/yecht/TokenScanner.re.

I haven’t put any compiled binaries out anywhere, and at some point it might be nice to merge this with the proper re2c project so you can give a flag to generate Java instead of C, but for now this is all there is to the project.