Ragel performance


I did some performance testing on the old and new Resolver implementations. The test set has some stupid cases that exercise the bad parts of both implementations (like longest match, where it can’t be decided what type something is until we have backtracked about 20 characters). I placed these 24 strings in an array, and pounded on it with an instance of the ResolverImpl, which is used in exactly the same way on all scalar values in a YAML document. The objective is to find out if the value is an implicit type or not. So basically, we give it a String, and get back a tag URI. So it’s not like I’m parsing a language or anything. I’m just doing some recognizing here.
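A timing loop of the kind described could look roughly like this. It is only a sketch: the class and method names are made up for illustration, the resolver is passed in as a plain function rather than a real ResolverImpl, and there is no JIT warm-up, so the numbers it produces are rough.

```java
import java.util.function.Function;

// Hypothetical timing harness; not the code used for the measurements in this post.
class ResolveBenchmark {
    // Runs every input through the resolver `rounds` times and returns elapsed milliseconds.
    static long timeMillis(Function<String, String> resolver, String[] inputs, int rounds) {
        int sink = 0; // consume the results so the calls cannot be optimized away
        long start = System.nanoTime();
        for (int round = 0; round < rounds; round++) {
            for (String input : inputs) {
                String tag = resolver.apply(input);
                if (tag != null) sink += tag.length();
            }
        }
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        if (sink == -1) System.out.println(sink); // keeps `sink` observably used
        return elapsedMillis;
    }

    public static void main(String[] args) {
        String[] inputs = {"42", "3.14", "true", "plain text"};
        // Stand-in resolver: tags everything as a string.
        long ms = timeMillis(value -> "tag:yaml.org,2002:str", inputs, 100_000);
        System.out.println(inputs.length * 100_000 + " resolves took " + ms + "ms");
    }
}
```

Passing the resolver as a function means the same loop can time both implementations on identical input.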

The old implementation was based on a Map of Lists, where the first letter of the string to resolve was used as a key to find a list of patterns to try sequentially. This worked fine, and made it extensible, but it was not very fast. This is the baseline. For 24 different strings, iterated 100 000 times for a total of 2 400 000 resolves, it takes 7879ms. That’s OK, but not great.
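For readers who want to picture that lookup, here is a minimal sketch in modern Java. It is not the actual JvYAML code: the class name, the method names and the three sample patterns are assumptions made for illustration, and only the tag URIs are the standard YAML ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical first-character-indexed resolver; illustrative only.
class FirstCharResolver {
    // One registered rule: a regexp and the tag it implies.
    private static final class Rule {
        final Pattern pattern;
        final String tag;
        Rule(Pattern pattern, String tag) { this.pattern = pattern; this.tag = tag; }
    }

    private final Map<Character, List<Rule>> rulesByFirstChar = new HashMap<>();

    // Registers a rule under every character a matching value may start with.
    void addImplicitResolver(String tag, String regexp, String firstChars) {
        Rule rule = new Rule(Pattern.compile(regexp), tag);
        for (char c : firstChars.toCharArray()) {
            rulesByFirstChar.computeIfAbsent(c, k -> new ArrayList<>()).add(rule);
        }
    }

    // Tries the candidate rules in registration order; falls back to the string tag.
    String resolve(String value) {
        if (!value.isEmpty()) {
            List<Rule> candidates = rulesByFirstChar.get(value.charAt(0));
            if (candidates != null) {
                for (Rule rule : candidates) {
                    if (rule.pattern.matcher(value).matches()) return rule.tag;
                }
            }
        }
        return "tag:yaml.org,2002:str";
    }

    // Sample rules, far simpler than the real YAML type patterns.
    static FirstCharResolver withSampleRules() {
        FirstCharResolver r = new FirstCharResolver();
        r.addImplicitResolver("tag:yaml.org,2002:int", "[-+]?[0-9]+", "-+0123456789");
        r.addImplicitResolver("tag:yaml.org,2002:float", "[-+]?[0-9]+\\.[0-9]*", "-+0123456789");
        r.addImplicitResolver("tag:yaml.org,2002:bool", "true|false", "tf");
        return r;
    }

    public static void main(String[] args) {
        System.out.println(withSampleRules().resolve("3.14"));
    }
}
```

The cost is easy to see here: a value starting with a digit may run through several full regexp matches before the right one is found.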

Now, the new Ragel implementation is dead simple. It’s just a translation of the regexps in the aforementioned Patterns into a state machine. At EOF out actions (%/ for people in the know about Ragel), I execute an action that sets a local variable to a tag, and at the end of the resolve method that tag is returned. Dead simple, and not exercising the full strength of Ragel, of course.
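To give a feel for the idea, here is a tiny hand-written state machine in Java. It is not Ragel output and not the real ResolverImpl: the class name, the states and the two patterns it recognizes (simple integers and floats) are assumptions made for illustration, as is the choice to fall back to the string tag when nothing matches.

```java
// Hand-written sketch of a recognizer in the style of a generated state machine.
class TinyTagMachine {
    private static final int START = 0, SIGN = 1, INT = 2, FRACTION = 3, ERROR = 4;

    static String resolve(String value) {
        int state = START;
        // One pass over the input, one transition per character, no backtracking.
        for (int i = 0; i < value.length() && state != ERROR; i++) {
            char c = value.charAt(i);
            boolean digit = c >= '0' && c <= '9';
            switch (state) {
                case START:
                    state = digit ? INT : (c == '-' || c == '+') ? SIGN : ERROR;
                    break;
                case SIGN:
                    state = digit ? INT : ERROR;
                    break;
                case INT:
                    state = digit ? INT : (c == '.') ? FRACTION : ERROR;
                    break;
                case FRACTION:
                    state = digit ? FRACTION : ERROR;
                    break;
                default:
                    state = ERROR;
            }
        }
        // Plays the role of the EOF actions: the state at end of input picks the tag.
        String tag;
        switch (state) {
            case INT: tag = "tag:yaml.org,2002:int"; break;
            case FRACTION: tag = "tag:yaml.org,2002:float"; break;
            default: tag = "tag:yaml.org,2002:str";
        }
        return tag;
    }

    public static void main(String[] args) {
        System.out.println(resolve("3.14"));
    }
}
```

Every pattern is tested at the same time in a single pass, which is where the speed comes from compared with trying regexps one after another.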
So, for the same number of resolves, this ResolverImpl takes 1288ms. That’s about 6.1 times as fast (7879 / 1288 ≈ 6.1). Ain’t it nice to have a friend such as Ragel? And the best part is, for harder tasks, these improvements would be even larger.
Finite State Machines are your friends. All your base are belongs to us.



Results of jvYAMLb


Well, the YAML-based loading is in JRuby trunk. On the way, some parts of the codebase got seriously simplified. Very nice. The final result, with regard to performance, is a speedup of about 20-30%. But the important gain is in memory usage. The new implementation takes only about one fourth of the memory the original used. So that’s great.

Regarding the Resolver, as I mentioned in the last post, it required a different approach, since regular JvYAML uses regular expressions to recognize implicit tags. Since that approach isn’t good with byte arrays, I decided to use Ragel to generate a recognizer. That approach was very successful. As soon as I got that working it was the obvious approach. Ragel is good. Ragel is great. Ragel is wonderful. I will use the same approach for regular JvYAML to get away from all those Java regexps.

So, next step will be to do the same conversion of the emitter. Of course, at that point performance isn’t that important. It’s more about memory usage and the need to get away from another external dependency in JRuby.



Faster YAML with byte processing


As noted in my last post, I have started work on converting JvYAML into JvYAMLb. Right now I have finished the work on the Scanner and the Parser, and it’s looking quite good. The numbers I reported in the last post for regular JvYAML performance were wrong, though. We’re looking at about 7.8s to 10.0s for scanning that 3.5MB gemspec file. (And that’s only the scanning, not file IO.) But with the Scanner converted to use bytes and ByteList, the same processing takes 2.8s. That’s a substantial difference. But it doesn’t end with that.

As I said I also converted the Parser. It doesn’t do any String processing at all, so I didn’t expect either a speedup or slowdown except for that from the Scanner. But… Before, parsing the gemspec took 18.515s, but after, it runs in 4s. That’s a dramatic speedup, and I don’t really know where it comes from. Unless the earlier implementation generated so much more garbage, and used more memory, that it was noticeable in speed. Anyway, this looks good for JRuby YAML processing, since I expect big reductions in complexity in the callpath and generation of objects after the YAML processor is byted all the way through.

But tomorrow it’s time to work on the Resolver, and that’s going to be hard. Optimally, it would be nice to have a byte-based Regexp engine. And maybe that would be something for JRuby too, you know? Our Regular Expressions must be dead slow now that they have to convert to strings all the time.



Announcing JvYAMLb, a fork


The conversion to using byte-arrays as the basis of our String work in JRuby has led me to realize that JvYAML just doesn’t cut it anymore. The performance wasn’t good to begin with, and it’s even worse having to convert EVERY SINGLE STRING read into bytes. That’s no good. As an example of why something needs to be done, I’m going to describe the transformations that happen to data in JRuby when executing this code:

YAML.load_file "gems.yml"

First, the file is opened, and wrapped inside a RandomAccessFile. Then data is read from it by YAML. Reading will proceed like this:
1. Bytes are read through the RAF, hopefully in chunks.
2. Those bytes are wrapped in a RubyString so they can be returned from the IO#read method.
3. An IOReader wraps that RubyIO object, gets the RubyString and converts it from bytes into a String, and this String gets converted into a char array.
4. That char array is returned to the YAML Scanner.
5. The chars from the char array are collected in a StringBuffer, and saved in various Strings as token values.
6. The parser, resolver and constructor work on these Strings in various ways.
7. The JRubyConstructor takes these Strings and creates RubyString objects from them, in the process converting each String back to a byte array.
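Stripped of the JRuby classes, the conversions in steps 3, 5 and 7 can be sketched with plain JDK types. This is only an illustration: the class and method names are made up, and the ISO-8859-1 charset is an assumption chosen because it maps every byte to exactly one char.

```java
import java.nio.charset.StandardCharsets;

// Illustrative round trip using JDK types only; not the actual JRuby code path.
class ConversionRoundTrip {
    static byte[] roundTrip(byte[] fileBytes) {
        // Step 3: bytes -> String -> char array (two copies of the data).
        String decoded = new String(fileBytes, StandardCharsets.ISO_8859_1);
        char[] chars = decoded.toCharArray();

        // Step 5: chars collected in a StringBuffer, then saved as a token String.
        StringBuffer buffer = new StringBuffer();
        for (char c : chars) {
            buffer.append(c);
        }
        String tokenValue = buffer.toString();

        // Step 7: the String is converted back to a byte array.
        return tokenValue.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = "name: jvyaml\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(roundTrip(input).length + " bytes after the round trip");
    }
}
```

The output holds the same bytes as the input, so every intermediate copy is pure overhead.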

Is there any doubt that this process is slow? Well, it hasn’t been that big of a problem until now, since we are doing so well on performance in other parts of the system.

So, the radical decision is to rewrite JvYAML, making it more SYCK-compliant, working with InputStreams and byte-arrays, and in the process get away from several of the steps above. So that’s what I’m going to do. I hereby create JvYAMLb. It will only be a part of the JRuby codebase, but it will be reasonably separate, so it can be extracted for other purposes. I will not stop work on regular JvYAML, but will maintain both projects.

Since the objective of this new project is blazing speed, I will post some numbers on this now and again. But first I will show you the speed of the regular system. JvYAML’s Scanner can scan an old gem source index (about 3.5MB) of 435654 tokens in about 1654ms. This is the baseline I’m going to use to test performance, and I’ll post more on this as soon as the byte-based Scanner is ready to try out.



The FINAL OpenSSL post?


Possibly.

I’ve checked in all functionality I will add to OpenSSL support in JRuby at this point. Of course, there will be more, but not concentrated in a spurt like this. Tomorrow I will modify the build process and then merge everything I’ve done into trunk.

Let’s back up a little. What have I accomplished? This: All OpenSSL tests from MRI run (except PKCS#7). That includes tests of SSL server and SSL client. Simple https-request also works. This is sweet. Everything else that has tests in Ruby works. But… this is also the problem. Roughly half of Ruby’s OpenSSL library is not tested at all. And since the current OpenSSL initiative on my part is based on tests, I haven’t done anything that isn’t tested for.

So, some things won’t work. There is no support for Diffie-Hellman keys right now, for example. Will be easy to add when the time comes, but there isn’t any testing so I haven’t felt the need.

The only thing not there, as I said, is PKCS#7. That was just too involved. I’ll take care of that some other time, when someone says they want it… Or someone else can do it? =)

So, what this boils down to is that JRuby trunk will have OpenSSL support sometime tomorrow. Hopefully it will be useful and I can get on to other JRuby things. I have a few hundred bugs I would like to fix, for example…

Oh yeah, that’s true. Tomorrow will also be YAML day. I’ll probably fix some bugs and cut a new release of JvYAML. It’s that time, the bug count is bigger than it was, and JRuby needs some fixes. So that’s the order of the day for tomorrow. First OpenSSL and then YAML. If you have any comments on this, please mail me or comment directly here.

G’night.



YAML and JRuby – the last bit


An hour ago I sent the patches to make JRuby’s YAML support completely Java-based. More specifically, what I have done is remove RbYAML completely and instead use the newly developed 0.2-support of JvYAML. There were a few different parts that had to be done to make this possible, especially since most of the interface to YAML was Ruby-based, and used the slow Java proxy-support to interact with JvYAML.

So, what’s involved in an operation like this? Well, first I created custom versions of the Representer and the Serializer. (I have had a custom JRubyConstructor since May.) These weren’t that big, mostly just delegating to the objects themselves to decide how they wanted to be serialized. And that leads me to the RubyYAML-class, which is what will get loaded when you write “require 'yaml'” in JRuby from now on. It contains two important parts. The first is the module YAML and the singleton methods on this module, which are the main interface to YAML functionality in Ruby. This was implemented in RbYAML until now.

The next part is several implementations of the methods “taguri” and “to_yaml_node” on various classes. These methods are used to handle the dumping, and it’s really there that most of the dumping action happens. For example, the taguri method for Object says that the tag for a typical Ruby object should be “!ruby/object:#{self.class.name}”. The “to_yaml_node” for a Set says that it should be represented as a map where the values of the set are keys, and the values for these keys are null.
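The two rules mentioned above can be sketched in a few lines of plain Java. This is an illustration only: the class and method names are made up, and the real methods in JRuby build YAML nodes rather than a java.util.Map.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helpers mirroring the taguri and to_yaml_node rules described in the text.
class DumpRules {
    // Mirrors "!ruby/object:#{self.class.name}" for a given Ruby class name.
    static String tagUriForObject(String rubyClassName) {
        return "!ruby/object:" + rubyClassName;
    }

    // A set is represented as a mapping: each element becomes a key with a null value.
    static <T> Map<T, Object> setAsMapping(Set<T> set) {
        Map<T, Object> mapping = new LinkedHashMap<>();
        for (T element : set) {
            mapping.put(element, null);
        }
        return mapping;
    }

    public static void main(String[] args) {
        Set<String> gems = new LinkedHashSet<>();
        gems.add("rake");
        gems.add("rails");
        System.out.println(tagUriForObject("Gem::Specification"));
        System.out.println(setAsMapping(gems));
    }
}
```

A LinkedHashMap is used here so the keys come out in the order the set yields them; that ordering is a choice of this sketch, not something the text specifies.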

So, when this support gets into JRuby trunk it will mean a few things, but nothing that is really apparent for the regular JRuby user. The most important benefits of this are partly performance and partly correctness. Performance will be increased since we now have Java all the way, and correctness since I have had the chance to add lots of unit tests and also to fix many bugs in the process. Also, this release makes YAML 1.0-support a reality, which means that communication with MRI will work much better from now on.

So, enjoy. If we’re lucky, it will get into the next minor release of JRuby, which probably will be here quite soon.



Announcing JvYAML 0.2.1


The last few days have been spent integrating the JvYAML dumper with JRuby, and making YAML support in JRuby totally implemented in Java. As a side effect I have been able to root out a few bugs in JvYAML. Enough of them to warrant a minor release, actually. So, what’s new? Working binary support, better handling of null types, better 1.0-support and a few hooks to make it possible to remove anchors in places where they don’t make sense. (Like empty sequences.)

The URL is http://jvyaml.dev.java.net and I recommend that everyone upgrade.



Announcing JvYAML 0.2


I’m very pleased to announce that JvYAML 0.2 was released a few minutes ago. The new release contains all the things I’ve talked about earlier and a few extra things I felt would fit well. The important parts of this release are:

  • The Dumper – JvYAML is now a complete YAML processor, not just a loader.
  • Loading and dumping JavaBeans – This feature is necessary for most serious usage of YAML. It allows people to read configuration files right into their bean objects.
  • Loading and dumping specific implementations of mappings and sequences. Very nice if you happen to need your mapping to be a TreeMap instead of a HashMap.
  • Configuration options to allow 1.0-compatibility with regard to the ! versus !! tag prefixes.
  • The simplified interface has been substantially improved, adding several utility methods.
  • Lots and lots of bug fixes.

So, as you can see, this release is really something. I am planning on spending a few nights this week integrating it with JRuby too. And soon after that we will be able to have YAML completely in Java-land. That is great news for performance. It also makes it easier to just have one YAML implementation to fix bugs in, instead of two.

A howto? Oh, you want a guide to the new features? Hmm. Well, OK, but there really isn’t much to show. How to dump an object and get the YAML string back:

 YAML.dump(obj);

or dump directly to a file:

 YAML.dump(obj,new FileWriter("/path/to/file.yaml"));

or dump with version 1.0 instead of 1.1:

 YAML.dump(obj, YAML.options().version("1.0"));

dumping a JavaBean:

 String beanString = YAML.dump(bean);

and loading it back again:

 YAML.load(beanString);

That’s more or less it. Nothing fancy. Of course, all the different parts underneath are still there, and you can provide your own implementation of YAMLFactory to add your own specific hacks. If you want to dump your object in a special way, you can implement the YAMLNodeCreator interface, and your own object will be in charge of creating the information that should be used to represent your object.