Porting Syck – A story of C, parser generators and general ugliness


As mentioned earlier, a few weeks back I decided to port Syck to Java, for the sake of JRuby. I will detail some interesting experiences in this blog post.

First some introductions. Syck is broadly divided into two pieces – the core library and the language adaptations. The language adaptations are the parts that are specific to each language and provide the language-level APIs. I won’t talk that much about this side of things – most of this post is about the core library.

Any YAML processor is divided into a parser and an emitter. The Syck emitter is pretty straightforward, and so is the Yecht port of it. The parser, on the other hand, needs to be able to do several different things. First of all, it needs to handle the fact that YAML is context dependent, which means you can’t do it with a typical parser generator. So the way all YAML engines do it is by having a pretty smart scanner/tokenizer, and then a straightforward parser on top of that. The scanner takes care of the context dependent bits (which are mainly about indentation) and abstracts them away so the parser can just concentrate on putting documents together.

Another piece that is generally necessary is something that figures out what type a value has. Since YAML supports several different core types, and a YAML engine should always strive to read things without needing hints, the processor needs to know that the YAML scalar “foo” has tag “tag:yaml.org,2002:str” while “42” has tag “tag:yaml.org,2002:int”. Of course there are many more types, including several different formats for integers, floats and timestamps. All of this handling is incidental to the parsing of the document, though. A scalar is a scalar no matter what type it is. So recognizing implicits is generally done neither in the token scanner nor in the parser, but in a separate implicit scanner.
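
Just to illustrate the idea – the following is not the actual Syck or Yecht implicit scanner (which is generated and handles far more formats), just a toy Java resolver showing what recognizing implicits boils down to:

import java.util.regex.Pattern;

// Toy implicit resolver: map a plain scalar to a YAML tag by matching it
// against a few patterns. The real implicit scanner covers many more cases.
class ImplicitSketch {
    private static final Pattern NULL = Pattern.compile("~|null|");
    private static final Pattern BOOL = Pattern.compile("true|false|yes|no");
    private static final Pattern INT = Pattern.compile("[-+]?[0-9]+");
    private static final Pattern FLOAT = Pattern.compile("[-+]?[0-9]*\\.[0-9]+([eE][-+]?[0-9]+)?");

    static String tagFor(String scalar) {
        if (NULL.matcher(scalar).matches())  return "tag:yaml.org,2002:null";
        if (BOOL.matcher(scalar).matches())  return "tag:yaml.org,2002:bool";
        if (INT.matcher(scalar).matches())   return "tag:yaml.org,2002:int";
        if (FLOAT.matcher(scalar).matches()) return "tag:yaml.org,2002:float";
        return "tag:yaml.org,2002:str";
    }

    public static void main(String[] args) {
        System.out.println(tagFor("foo")); // tag:yaml.org,2002:str
        System.out.println(tagFor("42"));  // tag:yaml.org,2002:int
    }
}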

Syck uses re2c for the token scanner and the implicit scanner, and it uses Bison for the parser. My first version of Yecht was as straightforward a port as I could make it. I ported the backend of re2c to generate Java, and then I used JACC instead of Bison. So far so good. But when it was done, this initial version was pretty slow. I benchmarked the core library by supplying a nodehandler that didn’t do anything, and then did the same thing for Syck. I also added a small benchmark that just ran the scanner on its own. And what I saw got me a bit depressed. I had gotten Yecht to be totally compatible with Syck, but it was buttslow. For comparison I used a random big YAML document without any real advanced features – just a combination of mappings, sequences and scalars. For reference, the Syck numbers were these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  13959ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  16101ms

And JvYAMLb (the engine I was replacing in JRuby):

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   5325ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15794ms

And my first version of Yecht:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  93658ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took 117213ms

Ouch. The scanner is 7 times slower, and so is the parsing. And comparing with JvYAMLb, it looks even worse. What’s going on here?

As it happens, I didn’t tell you exactly how I ported the token scanner. The Syck implementation of it uses about 10 different gotos to jump between different cases. Now, that’s a pretty fast operation in C. Using Java, I had to do something different. Since I wanted to make the first version of Yecht as close a port as possible, I decided to use a standard transformation to mimic the gotos: wrap the area in a while(true) loop, put the actual code in a labeled switch statement inside the loop, and use a variable containing a number that says which goto point to go to next. This code is a bit ugly, but if you define some constants you end up with code that looks reasonably close to the C code. Just replace “goto plain3;” with “gotoPoint = plain3; break gotoNext;”.
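
A small, self-contained sketch of what that looks like – the constant names and the toy scanning logic are made up purely for illustration, the real yylex() in Yecht is of course a lot bigger:

class GotoEmulationSketch {
    private static final int DOCUMENT = 0;
    private static final int PLAIN = 1;

    // Skips leading spaces and returns the index where a plain scalar starts,
    // jumping between "goto points" the way the ported scanner does.
    static int scan(String input) {
        int pos = 0;
        int gotoPoint = DOCUMENT;
        while (true) {
            gotoNext:
            switch (gotoPoint) {
            case DOCUMENT:
                if (pos < input.length() && input.charAt(pos) == ' ') {
                    pos++;
                    gotoPoint = DOCUMENT;   // was: goto document;
                    break gotoNext;
                }
                gotoPoint = PLAIN;          // was: goto plain;
                break gotoNext;
            case PLAIN:
                return pos;                 // start of a plain scalar
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(scan("   foo: bar")); // prints 3
    }
}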

The first thing I did after finding everything was slow was to try to pinpoint the place where all the performance disappeared. This is actually surprisingly hard. But in this case it turned out to be yylex, which contains the full token scanning logic including my homegrown gotos. Then Charles reminded me about one of those lessons we have learned while developing JRuby – namely that Hotspot really doesn’t like large switch statements. So my first step was to try to untangle the logic of all the gotos.

That was actually not that hard, since the logic bottomed out quite quickly. I managed to replace the switch-based goto-logic with separate methods that called each other instead. That was the only thing I did, and suddenly the scanner wasn’t a problem anymore. These were the numbers after I removed that goto-logic:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12207ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  31280ms

Yes, you’re reading that right. The scanning process became 7 times faster just from removing that switch statement and turning it into separate methods. Hotspot really doesn’t like large switch statements, and it does like smaller methods. In this case it also made it easier to find the methods that were the hotspots in the scanner.
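
For comparison, the fix amounted to turning each goto point into its own small method – here is the same toy example as above rewritten in that style (again, just an illustration, not actual Yecht code):

class MethodPerStateSketch {
    static int scan(String input) {
        return document(input, 0);
    }

    // Each former goto point becomes a small method; "goto x" becomes a call.
    private static int document(String input, int pos) {
        if (pos < input.length() && input.charAt(pos) == ' ') {
            return document(input, pos + 1);   // was: goto document;
        }
        return plain(input, pos);              // was: goto plain;
    }

    private static int plain(String input, int pos) {
        return pos;                            // start of a plain scalar
    }

    public static void main(String[] args) {
        System.out.println(scan("   foo: bar")); // prints 3
    }
}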

After this fix it was obvious that the parsing component was the main problem. Since removing large switch statements had worked so well in the scanner, I decided to try the same approach in the parser. The parser generator I used – JACC – happens to be one of those generators that generate only code, no tables. Once upon a time this was probably the right choice for getting good performance out of Java, but it’s definitely not the right choice anymore. So I switched from JACC to Jay, and ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12960ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  15411ms

Nice, huh? At this point I could have felt good about myself and stopped. But I decided to see if re2c/re2j was a bottleneck too. The two main methods in the token scanner that are used on basically all calls are “document()” and “plain()”. I rewrote these sections by hand. Re2j generates switch statements that in most cases take more than one jump to get to the right place. By doing some thinking about these algorithms I made the shortest path much shorter in these two methods (a sketch of the general idea follows after the numbers below). After fixing document() I got these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took  11226ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  14059ms

And after fixing plain() I got it down to these:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   9581ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took  12838ms
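
To give a feel for what that kind of hand rewrite means – this is a made-up illustration, not the actual document() or plain() code – the trick is to answer the overwhelmingly common characters with a couple of direct comparisons and only fall back to the general, generated machinery for the rare ones:

class FastPathSketch {
    // Fast path: most characters in a typical document are letters, digits
    // or spaces, so answer those with a few comparisons.
    static boolean isPlainCharacter(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == ' ') {
            return true;
        }
        return isPlainCharacterSlow(c);
    }

    // Stand-in for the generated chain of switches that handles the rest.
    private static boolean isPlainCharacterSlow(char c) {
        return "#,[]{}&*!|>'\"%@`".indexOf(c) == -1 && !Character.isISOControl(c);
    }

    public static void main(String[] args) {
        System.out.println(isPlainCharacter('f')); // true
        System.out.println(isPlainCharacter('#')); // false
    }
}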

At this point I had a hunch, and decided to check the performance of the implicit scanner. The first thing I did was write benchmarks to check how fast it was, and compare it with JvYAMLb and Syck. My hunch was right: the re2j-based implicit scanner was 10 times slower than the equivalent in Syck. The implicit scanner is called during scanning and parsing whenever a scalar node is finished, so I figured it might be contributing to the problem. But instead of rewriting it by hand, I decided to move from re2j to Ragel. Since the implicit scanner is pretty limited, this was a move that worked there, but it would never have worked for the token scanner. Ragel generates finite state machines in tables (a toy illustration of what that means follows the numbers below), and presto – it turned out to work like a charm. After rewriting the implicit scanner I ended up with these numbers:

scanning ../jruby_newyaml/bench/big_yaml.yml 10000 times took   4804ms
parsing  ../jruby_newyaml/bench/big_yaml.yml 10000 times took   8424ms
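
And to show what “finite state machines in tables” means in practice, here is a toy hand-written table-driven matcher for unsigned integers. It has nothing to do with the real Ragel output, but it has the same shape: the states and transitions live in arrays, and the hot loop is just array indexing, with no big switch statements for Hotspot to trip over.

class TableDfaSketch {
    // States: 0 = start, 1 = seen digits, 2 = dead. Input classes: 0 = digit, 1 = other.
    private static final int[][] TRANSITIONS = {
        {1, 2},  // from start
        {1, 2},  // from seen-digits
        {2, 2},  // from dead
    };
    private static final boolean[] ACCEPTING = {false, true, false};

    static boolean isInteger(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            int cls = Character.isDigit(s.charAt(i)) ? 0 : 1;
            state = TRANSITIONS[state][cls];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(isInteger("42"));  // true
        System.out.println(isInteger("foo")); // false
    }
}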

And that’s where I decided to stop. I’m not sure what the moral of this tale is, except to generate as much as possible, be careful with switch statements on Hotspot, and rewrite by hand the pieces that really matter.

Also, use the right tool for the job. JACC was obviously not right for me, but Jay seems to be. Re2j is still right for some parts of the token scanner, while Ragel is right for the implicit scanner. Of course, using the right tool for the job means knowing the alternatives. I knew about Ragel, and I know the implementations of both re2j and Ragel, so I could make educated decisions based on their characteristics.



Re2j – a small lexer generator for Java


There is a tool called re2c. It’s pretty neat. Basically it allows you to intersperse a regular expression based grammar in comments inside of C code, and those comments are transformed into a basic lexer. There are a few things that make re2c different from other similar tools. The first is that the supported feature set is pretty limited (which is good). The second is that the generated code is fast. The other good part is that you can have several sections in the same source file – the productions for any specific piece of code are constrained to that specific comment.

As it happens, why the lucky stiff used re2c when he made Syck (the C-based YAML processor used in Ruby and many other languages). So when I set out to port Syck to Java, the first problem was to figure out the best way to port the lexers that were written with re2c. I ended up using Ragel for the implicit scanner, and thought about doing the same for the token scanner, but Ragel is pretty painful to use for more than one main production in the same source file. The syntax is not exactly the same either, so switching would have added to the burden of porting the scanner.

At the end of the day the most pragmatic choice was to port the output generator in re2c to generate Java instead. This turned out to be pretty easy, and the result is now used in Yecht, which was merged as the YAML processor for JRuby a few days ago.

You can find re2j in my github repository at http://github.com/olabini/re2j. It is still a C++ program, and it probably won’t compile very well on Windows. But it’s good enough for many small use cases. Everything works exactly like re2c, except for one small difference: you can define a parameter called YYDATA that points to a byte or char buffer that should be the place to read from. For an example usage, take a look at the token scanner: http://github.com/olabini/yecht/blob/master/src/main/org/yecht/TokenScanner.re.

I haven’t put any compiled binaries out anywhere, and at some point it might be nice to merge this with the proper re2c project so you can give a flag to generate Java instead of C, but for now this is all there is to the project.