Ola Bini: Programming Language Synchronicity

June 4th, 2010

RSpec matchers and regexp comments – a possibly useful hack

A few days back, I was sitting at my client and working on a hacky spec to validate some assumptions in a very dirty data set. I wanted to figure out some limits. The basic idea was that I needed to go through an entry for every day the last 8 years. Getting these entries is potentially expensive, and the validation was based on checking that a specific value never turns up. This was quite easy, and I ended up with something like this:

require 'spec_helper'

describe "Data invariant" do
  it "holds" do
    (8*365).times do |n|
      date = n.days.ago
      calculate_token_at(date).should_not == "MAGIC TOKEN"
    end
  end
end

Here you can see that I simply use a method to calculate the invariant, then use “should_not ==” to find out if it’s true. Nothing fancy. The problem comes when I want to get information about a failure. Now, I could insert a print statement. That means I’d have to look at all the output until I get to the end, to see which one failed. I could also rescue all exceptions, print the offending information and then reraise. But the best solution would be to give RSpec a failure message. Now, you can definitely do this for RSpec in other matchers, but I couldn’t find a way of doing it with the == matcher. One thing I could have done was to just write my own matcher. That also seemed inefficient. This was a throwaway thing, run once and then delete.
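For comparison, the rescue-and-reraise option mentioned above could look roughly like the sketch below. It is plain Ruby rather than RSpec; check_each and the stubbed calculate_token_at are hypothetical names made up for the illustration, and depending on how RSpec raises its failures the rescue clause may need adjusting.

```ruby
# Hypothetical stand-in for the expensive lookup used in the spec.
def calculate_token_at(day)
  day == 3 ? "MAGIC TOKEN" : "ok"
end

# Runs the block for each input. On failure, re-raises the same exception
# class with the offending input appended to the message.
def check_each(inputs)
  inputs.each do |input|
    begin
      yield input
    rescue StandardError => e
      raise e.class, "#{e.message} (failed on: #{input.inspect})", e.backtrace
    end
  end
end

message = begin
  check_each(0...5) do |day|
    raise "unexpected token" if calculate_token_at(day) == "MAGIC TOKEN"
  end
  nil
rescue RuntimeError => e
  e.message
end

puts message
```

It works, but it is a fair amount of ceremony for something that is meant to be run once.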

What I ended up doing was actually quite elegant, in a very disgusting way. It works, and it might be useful for someone else, sometime. But don’t EVER do anything like this in code you will save.

require 'spec_helper'

describe "Data invariant" do
  it "holds" do
    (8*365).times do |n|
      date = n.days.ago
      calculate_token_at(date).should_not =~ /\AMAGIC TOKEN(?#Invariant failed on: #{date})\Z/
    end
  end
end

So why does this work? Well, it turns out that you can have comments in regular expressions. And you can interpolate arbitrary values into regexps, just like with strings. So I can embed the failure information in a comment in the regexp. This will only be displayed when the expectation fails, that is, when the value does match, since RSpec by default says something like “expected MAGIC TOKEN to not match /\AMAGIC TOKEN(?#Invariant failed on: 2010-06-04)\Z/”, so you get the information necessary. The comment does not contribute to the matching in any way. There’s another subtle point here. I haven’t used ^ and $ for anchoring the pattern. Instead I use \A and \Z. The reason is that otherwise, my regexp wouldn’t have the same behavior as comparing against a string, since ^ and $ match the beginning and end of lines too, not only the beginning and end of buffer. (Strictly speaking, \Z still accepts a single trailing newline before the end of the string; \z is the anchor that only matches at the very end.)
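To see that the comment really is inert, here is a small sketch in plain Ruby, outside of RSpec. The sample strings and variable names are made up for the illustration.

```ruby
require 'date'

date = Date.new(2010, 6, 4)

plain     = /\AMAGIC TOKEN\Z/
commented = /\AMAGIC TOKEN(?#Invariant failed on: #{date})\Z/

# Both patterns accept and reject the same strings.
samples = ["MAGIC TOKEN", "MAGIC TOKENS", "A MAGIC TOKEN", ""]
same_results = samples.all? { |s| (plain =~ s) == (commented =~ s) }

# The comment survives in the pattern's source, which is what a failure
# message that prints the regexp will show.
shows_date = commented.source.include?("2010-06-04")

puts same_results
puts shows_date
puts commented.inspect
```

The interpolated value must not contain a closing parenthesis, since that would end the comment early.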

Anyway, I thought I’d share this. In basically all cases, don’t do this. But it’s still a bit funny.

7 Comments | By Ola Bini | In: ruby | tags: hacks, programming, regular expressions, ruby, tricks, unintended consequences. | #

July 28th, 2009

Re2j – a small lexer generator for Java

There is a tool called re2c. It’s pretty neat. Basically it allows you to intersperse a regular expression based grammar in comments inside of C code, and those comments will be transformed into a basic lexer. There are a few things that make re2c different from other similar tools. The first one is that the supported features are pretty limited (which is good). The code generated is fast. The other good part is that you can have several sections in the same source file. The productions for any specific piece of code are constrained to the specific comment.

As it happens, why the lucky stiff used re2c when he made Syck (the C-based YAML processor used in Ruby and many other languages). So when I set out to port Syck to Java, the first problem was to figure out the best way to port the lexers using re2c. I ended up using Ragel for the implicit-scanner, and thought about doing the same for the token scanner, but Ragel is pretty painful to use for more than one main production in the same source file. The syntax is not exactly the same either, so it would add to the burden of porting the scanner if I decided to switch.

At the end of the day the most pragmatic choice was to port the output generator in re2c to generate Java instead. This turned out to be pretty easy, and the result is now used in Yecht, which was merged as the YAML processor for JRuby a few days ago.

You can find re2j in my github repository at http://github.com/olabini/re2j. This is still a C++ program, and it probably won’t compile very well on windows. But it’s good enough for many small use cases. Everything works exactly as re2c except for one small difference, namely that you can define a parameter called YYDATA that points to a byte or char buffer that should be the place to read from. For an example usage, take a look at the token scanner: http://github.com/olabini/yecht/blob/master/src/main/org/yecht/TokenScanner.re.

I haven’t put any compiled binaries out anywhere, and at some point it might be nice to merge this with the proper re2c project so you can give a flag to generate Java instead of C, but for now this is all there is to the project.

No Comments | By Ola Bini | In: blogging | tags: generator, java, re2c, re2j, re2java, regular expressions, scanner. | #

April 23rd, 2009

NRegex separated into its own project

If you’re interested in my regular expression engine for .NET, you can now download and build it from http://github.com/olabini/nregex. The files will shortly be removed from the Ioke source tree.

1 Comment | By Ola Bini | In: blogging | tags: .net, ioke, nregex, regular expressions. | #

April 21st, 2009

Opinions on C# and .NET

After my recent exposure to C#, I thought I’d write up my thoughts about it and .NET. These will all mostly be in comparison with Java, rather than Ruby – since the implementation is a port of a Java project.

C# and Java started out very similar to each other. They still are, really. But they have grown in different directions. Some things are very nice, some things seem nice, but I didn’t use them, and some things are really problematic. When reading this, I might come off as harsh on C# and .NET. That’s not really my intent – Java and the JVM have their problems too, and I wouldn’t dare to suggest whether C# or Java is better.

The largest difference between C# and Java that really made a large change for me was that C# doesn’t have local anonymous classes. Java has, and these are highly useful. Of course, C# has delegates with lambda expressions instead, and they solve much the same problem. But there are two problems with delegates that make it impossible to use them for all cases. First, an anonymous type in Java can implement several interdependent methods. You can factor behavior local to that piece of code. That doesn’t work with delegates. Instead you’ll have to resort to ugly hacks (in Ioke I make each NativeMethod have references to two different delegates that interact with each other). The second problem is once again the question of intent. I spoke about this in the last post, and I will mention it again. Interfaces are about intent, and they get less useful if you can’t express intent well with them. That’s why the generic Func delegates might not be a good solution in all cases.

The second thing I noticed was the proliferation of “primitive” types. I knew at some level that C# had unsigned and signed versions of things, but I’d forgotten it. It’s actually pretty nice to have those available.

Enums in C# are quite bad compared to Java. The main distinction is that they are based on integers. This gives some fairly strange results in some cases. The one that really bit me was when I forgot to give a default value to an enum field – and expected the default to be null. That isn’t true. The default value for an enum will be the value in it that maps to 0 – which is usually the first element of the enum list. I recommend people using enums to always explicitly init them.

Extension methods seem very useful, and they have been used to add some really nice things in the .NET core library. That said, I didn’t use them for my implementation, so I don’t have any real experience with them.

One thing that really surprised me about .NET was that there is still no support for arbitrary precision math – neither big nums nor big decimals. I ended up implementing that myself, so now there is at least one open source library with liberal license that people can use.

Same thing with regular expressions. The implementation in .NET obviously works, but there are too many incompatibilities in the implementation. Especially the handling of named groups is so different I couldn’t get it to work for Ioke. I ended up implementing NRegex, which is a perl5.6 compatible regular expression engine. It supports named groups, is thread safe, supports look ahead and look behind, and is compliant with level 1 of Unicode Regular Expression Guidelines.

At the end of the day, it was an interesting experience, and nothing surprised me that much. Not really. Most of the things are nitpicks. If it weren’t for one small detail…

Namely equality and hash codes for collections. Why in the name of anything holy doesn’t .NET provide implementations for Equals and GetHashCode? In this day and age? Even if it was a mistake from the beginning, why couldn’t they have fixed that when adding the generic collections? I don’t expect to have to implement these things myself. I especially don’t expect to have to provide my own subclasses of any collection I need to work with. This seriously annoyed me, and made the whole thing take some time, since the bugs produced by it were very hard to pinpoint. And oh yeah, when we’re talking about collections, it’s good to keep in mind that ArrayList.Sort is _not_ stable. It’s using quick sort. If you want a stable sort you’ll have to implement a merge sort or something like that for yourself. This also came as a surprise to me, but it was easily found at least. Since I had a pretty good test suite… =)

Anyway. That’s it.

8 Comments | By Ola Bini | In: blogging | tags: .net, arbitrary precision math, big decimal, bigdecimal, c#, collections, nregex, regular expressions. | #

November 17th, 2008

The magic it variable in if, or solving regular expressions in Ioke

I’ve spent some time trying to figure out how to handle regular expression matching in Ioke. I really like how Ruby allows you to use literal regexps and an infix operator for matching. That’s really nice and I think it reads well. The problem with it is that as soon as you want to get access to the actual match result, not just a yes or no, you have two choices – either you use the ‘match’ method instead of ‘=~’, or you use the semi-globals, like $1, or $&, etc. I’ve never liked the globals, so I try to avoid them – and I happen to think it’s good style to avoid them.

The problem is that then you can’t do the matching as well, and the code doesn’t read as well. I’ve tried to figure out how to solve this problem in Ioke, and I think I know what to do.
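The two Ruby choices look roughly like this. The pattern and variable names are only there for illustration.

```ruby
str = "foo bar"

# Choice 1: infix operator, with the results read back from the semi-globals.
if str =~ /(\w+) (\w+)/
  via_globals = [$1, $2]
end

# Choice 2: the match method, with the results read from the returned
# MatchData. It avoids the globals, but the condition reads less naturally.
match = /(\w+) (\w+)/.match(str)
via_match = match ? [match[1], match[2]] : nil

puts via_globals.inspect
puts via_match.inspect
```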

The solution is to introduce a magic variable – but it’s distinctly different from the Ruby globals. For one, it’s not a global variable. It’s only available inside the lexical context of an ‘if’ or ‘unless’ method. It’s also a lexical variable, meaning it can be captured by a closure. And finally, it’s a general solution to more things than the regular expression problem. The Lisp community has known about this for a long time. In Lisp the macro is generally called aif. But I decided to just integrate it with the if and unless methods.

What does it look like? Well, for matching something and extracting two values from it, you can do this:

str = "foo bar"
if(#/(.*?) (.*?)/ =~ str,
  "first  element: #{it[1]}" println
  "second element: #{it[2]}" println)

The interpolation syntax is the same as in Ruby.

The solution is simple. An if-method, or unless-method will always create a new lexical scope including a value for the variable ‘it’, that is the result of the condition. That means that you can do a really complex operation in the condition part of an if, and then use the result inside of that. In the case of regular expressions, the =~ invocation will return a MatchData-like object if the match succeeds. If it fails, it will return nil. The MatchData object is something that can be indexed with the [] method to get the groups.
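For readers more at home in Ruby, the same idea can be approximated with a block. This is only a sketch of the concept and not how Ioke implements it: aif is a hypothetical helper, and the block parameter plays the role that ‘it’ plays in Ioke.

```ruby
# Anaphoric-if sketch: runs the block only when the condition value is
# truthy, and hands that value to the block.
def aif(value)
  yield value if value
end

result = aif(/(\w+) (\w+)/.match("foo bar")) do |m|
  "first element: #{m[1]}, second element: #{m[2]}"
end

# No match means the condition value is nil and the block never runs.
missing = aif(/(\w+) (\w+)/.match("foobar")) { |m| m[1] }

puts result
puts missing.inspect
```

The difference is that in Ioke the name is bound for you, so nothing extra has to be written at the call site.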

The end result is that the it variable will be available where you want it, but not otherwise. Of course, this will incur a cost on every if/unless invocation. But following my hard line of doing things without regard for optimization, and only with regard for expressability, this seems like the right way to do it.

It’s still not totally good, because it’s magic. But it’s magic which solves a specific problem and makes some things much more natural to express. I’m not 100% comfortable with it, but I’m pretty close. Your thoughts?

24 Comments | By Ola Bini | In: ioke | tags: ioke, magic it, regular expressions. | #

June 17th, 2008

Testing Regular Expressions

Something has been worrying me a bit lately. Being test infected and all, and working for ThoughtWorks, where testing is part of the life blood, I think more and more about these issues. And one thing I’ve started noticing is that regular expressions seem to be a total blind spot in many cases. I first started thinking about it when I changed a quite complicated regular expression in RSpec. Now RSpec has coverage tests as part of their build, and if the test coverage is less than 100%, the build will fail. Now, since I had changed something to add new functionality, but hadn’t added any tests for it, I instinctively assumed that it would be caught by the coverage tool.

Guess what? It wasn’t. Of course, if I had changed the regexp to do something that the surrounding code couldn’t support, one of the tests for surrounding lines of code would have caught it, but I got no mention from the coverage tool that I needed more tests to fully handle the regular expressions. This is logical if you think about it. There is no way that a coverage tool could find all the regular expressions in your source code, and then make sure that all branches and alternatives of each regular expression were exercised. So that means that the coverage tool doesn’t do anything with them at all.
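A small sketch of the blind spot, using a made-up helper called affirmative? that has nothing to do with the RSpec code in question:

```ruby
# Hypothetical helper: four alternatives packed into a single line of code.
def affirmative?(answer)
  !!(answer =~ /\A(?:y|yes|ok|sure)\z/i)
end

# One call executes every line of the method, so a line coverage tool would
# report it as fully covered although only the "y" alternative was tried.
covered_by_one_call = affirmative?("y")

# Exercising each alternative, plus some negatives, has to be done by hand.
positives = %w[y YES ok Sure].all? { |s| affirmative?(s) }
negatives = ["", "yess", "no", "y\n", "okay"].none? { |s| affirmative?(s) }

puts covered_by_one_call
puts positives
puts negatives
```

Adding a fifth alternative to the pattern would not change the coverage report at all.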

OK, I can live with that, but it’s still one of those points that would be very good to keep in mind. Every time you write a regular expression in your code, you need to take special care to actually exercise that part of the code with many inputs. What is many in this case? That’s another part of the problem – it depends on the regular expression. It depends on how complicated it is, how long it is, how many special operators are used, and so on. There is no real way around it. To test a regular expression, you really need to understand how they work. The corollary is obvious – to use a regular expression in your code, you need to know how to test it. Conclusion – you need to understand regular expressions.

In many code bases I haven’t seen any tests for regular expressions at all. In most cases these have been crafted by writing them outside the code, testing them by hand, and then putting them in the code. This is brittle to say the least. In the cases where there are tests, it’s much more common that they only test positives, and not negatives. And I’ve seldom heard of code bases with enough tests for regular expressions. One of the problems is that in a language like Ruby, they are so easy to use, so you stick them in all over the place. A standard refactoring could help here, by extracting all literal regular expressions to constants. But then the problem becomes another – as soon as you use regular expressions to extract values from a string, it’s a pain to not have the regular expression at the same place as the extracted groups are used. Example:

PhoneRegexp = /(\d{3})-?(\d{4})-?(\d{4})/
# 200 lines of code
if phone_number =~ PhoneRegexp
  puts "phone number is: #$1-#$2-#$3"
end

If the regular expression had been at the same place as the usage of the $1, $2 and $3 it would have been easy to tie them to the parts of the string. In this case it would be easy anyway, but in more complicated cases it’s more complicated. The solution to this is easy – the dollar numbers are evil: don’t use them. Instead use an idiom like this:

area, number, extension = PhoneRegexp.match(phone_number).captures

In Ruby 1.9 you will be able to use named captures, and that will make it even easier to make readable usage of the extracted parts of a string. But fact is, the difference between the usage point and the definition point can still cause trouble. A way of getting around this would be to take any complicated regular expression and putting it inside of a specific class for only that purpose. The class would then encapsulate the usage, and would also allow you to test the regular expression more or less in isolation. In the example above, maybe creating a PhoneNumberParser would be a good idea.
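A minimal sketch of what such a PhoneNumberParser could look like, assuming Ruby 1.9 or later for the named captures. The class layout, the parse method and the added \A and \z anchors are my assumptions for the illustration; the phone regexp above is unanchored.

```ruby
# Keeps the pattern and the use of its groups in one place, so the regexp
# can be tested more or less in isolation.
class PhoneNumberParser
  PATTERN = /\A(?<area>\d{3})-?(?<number>\d{4})-?(?<extension>\d{4})\z/

  # Returns a hash of the named parts, or nil when the text does not match.
  def self.parse(text)
    match = PATTERN.match(text)
    return nil unless match
    { area: match[:area], number: match[:number], extension: match[:extension] }
  end
end

# Positive cases: with and without dashes.
with_dashes    = PhoneNumberParser.parse("123-4567-8901")
without_dashes = PhoneNumberParser.parse("12345678901")

# Negative cases: too short, trailing newline, leading junk.
rejected = ["123-4567-890", "123-4567-8901\n", "x123-4567-8901"].map do |text|
  PhoneNumberParser.parse(text)
end

puts with_dashes.inspect
puts rejected.inspect
```

Returning nil for a failed match also avoids the NoMethodError that calling captures on a nil match result would raise.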

At the end of the day, regular expressions are an extremely complicated feature, and in general we don’t test the usage of them enough. So you should start. Begin by first creating both positive and negative tests for them. Figure out the boundaries, and see where they can go wrong. Know regular expressions well enough to know what happens in these strange circumstances. Think about unicode characters. Think about whitespace. Think about greedy and lazy matching. As an example of something that took a long time to cause trouble; what’s wrong with this regexp that tries to discern if a string is a select statement or not?

/^\s*\(*\s*SELECT\W+/i

And this example actually covers most of the ground, already. It checks case insensitive. It checks for white space before any optional parenthesis, and for any white space after. It makes sure that the word SELECT isn’t continued by checking for at least one non word character. So what’s wrong with it? Well… It’s the caret. Imagine if we had a string like this:

"INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

The regular expression will in fact match this, even though it’s not a select statement. Why? Well, it just so happens that the caret matches the beginning of lines, not the beginning of strings. The dollar sign works the same way, matching the end of lines. How do you solve it? Change the caret to \A and the dollar sign to \Z and it will work as expected. A similar problem can show up with the “.” to match any character. Depending on which language you are using, the dot might or might not match a newline. Always make sure you know which one you want, and what you don’t want.
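Both pitfalls are easy to reproduce in Ruby. The variable names below are only for the illustration.

```ruby
sql = "INSERT INTO foo(a,b,c)\nSELECT * FROM bar"

line_anchored   = /^\s*\(*\s*SELECT\W+/i
string_anchored = /\A\s*\(*\s*SELECT\W+/i

# The caret also matches right after the newline, so the INSERT statement is
# wrongly classified as a select.
false_positive = !(line_anchored =~ sql).nil?

# \A only matches at the very start of the string.
correctly_rejected = (string_anchored =~ sql).nil?
still_accepts      = !(string_anchored =~ "  (select * from bar)").nil?

# In Ruby the dot does not match a newline unless the /m flag is given.
dot_default   = !("a\nb" =~ /a.b/).nil?
dot_multiline = !("a\nb" =~ /a.b/m).nil?

puts [false_positive, correctly_rejected, still_accepts].inspect
puts [dot_default, dot_multiline].inspect
```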

Finally, these are just some thoughts I had while writing it. There is much more advice to give, but it can be condensed to this: understand regular expressions, and test them. The dot isn’t as simple as it seems. Regular expressions are a full blown language, even though it’s not Turing complete (in most implementations). That means that you can’t test it completely, in the general case. This doesn’t mean you shouldn’t try to cover all eventualities.

How are you testing your regular expressions? How much?

10 Comments | By Ola Bini | In: Uncategorized | tags: regular expressions, ruby, test. | #

November 27th, 2007

Joni merged to JRuby trunk

This is a glorious day! Joni (Marcin’s incredible Java port of the Oniguruma regexp engine) has been merged to JRuby trunk. It seems to work really well right now.

I did some initial testing, and the Petstore numbers are more or less the same as before, actually. This is explained by the fact that I did the integration quite quickly and tried to get stuff working without concern for performance. We will go through the implementations and tune them for Joni soon, and this will absolutely give JRuby a valuable boost.

Marcin is also continuing to improve Joni performance, so over all this is a very nice approach.

Happy merge day!

1 Comment | By Ola Bini | In: Uncategorized | tags: joni, jruby, oniguruma, regular expressions. | #

November 25th, 2007

JRuby regular expression update

It’s been some time since I wrote about what’s happening in JRuby trunk right now, and what we’re working on. The reason is I’ve been really boring. All my time I’ve spent on Regular Expressions and the REJ implementation. Well, that’s ended now. After Marcin got the Oniguruma port close enough, we are both focusing on that instead. REJ’s implementation had some fundamental problems that would make it really hard to get better performance. In this regard, Joni is a better implementation. Also, Marcin is incredible at optimization so if everything goes as planned, we’re looking at better general Regular Expression performance, better compatibility and a much more competent implementation.

And boy am I bored by this now. =) I’d really like to get back to fixing bugs and get JRuby ready for the next release. That might happen soon, though – I’ve spent the weekend getting Joni integrated with JRuby inside a branch and today reached the goal of getting everything to compile. Also, easier programs run, like jirb. Our test suite fails, though, so there are still things to do. But getting everything compiling and ditching JRegex is a major point on the way of replacing JRegex in JRuby core. It shouldn’t be too far off, and I think it will be fair to say we will have Joni in JRuby 1.1. Actually, 1.1 is really going to be an awesome release.

1 Comment | By Ola Bini | In: Uncategorized | tags: joni, jruby, oniguruma, regular expressions. | #

October 26th, 2007

Current state of Regular Expressions

As I’ve made clear earlier, the current regular expression situation has once again become impractical. To reiterate the history: We began with regular Java regex support. This started to cave in when we found out that the algorithm used is actually recursive, and fails for some common regexps used inside Rails among others. To fix that, we integrated JRegex instead. That’s the engine 1.0 was released with and is still the engine in use. It works fairly well, and is fast for a Java engine. But not fast enough. In particular, there is no support for searching for exact strings and failing fast, and the engine requires us to transform our byte[]-strings to char[] or String. Not exactly optimal. Another problem is that compatibility with MRI suffers, especially in the multi byte support.

There are two solutions currently on the way. Core developer Marcin is working on a port of the 1.9 regexp engine Oniguruma. This port still has some way to go, and is not integrated with JRuby. The other effort is called REJ, and is a port of the MRI engine I did a few months back. I’ve freshened up the work and integrated it with JRuby in a branch. At the moment this work actually seems to go quite well, but there are some snags.

First of all, let me point out that this approach gives us more or less total multibyte compatibility for 1.8, which is quite nice.

When doing benchmarking, I’m generally using Rails as the bar. I have a series of regular expressions that Petstore uses for each request, and I’m using these to check performance. As a first datapoint, JRuby+REJ is faster at parsing regexps than JRuby trunk for basically all regexps. This ranges from slightly faster to twice as fast.

Most of the Rails regexen are actually faster in REJ than in JRuby trunk, but the problem is that some of them are actually quite a bit slower. 4 of the 22 Rails regexps are slower, by between 20 and 250 percent. There is also this one: /.*_f/ =~ “_fxxxxxxxxxxxxxxxxxxxxxxx” which basically runs about 10x slower than JRuby trunk. Not nice at all.

In the end, the problem is backtracking. Since REJ is a straight port of the MRI code, the backtracking is also ported. But it seems that Java is unusually bad at handling that specific algorithm, and it performs quite badly. At the moment I’m continuing to look at it and trying to improve performance in all ways possible, so we’ll see what happens. Charles Nutter has also started to look at it.

But what’s really interesting is that I reran my Petstore benchmarks with the current REJ code. To rehash, my last results with JRuby trunk looked like this:

controller :   1.804000   0.000000   1.804000 (  1.804000)
view       :   5.510000   0.000000   5.510000 (  5.510000)
full action:  13.876000   0.000000  13.876000 ( 13.876000)

But the results from rerunning with REJ were interesting, to say the least. I expected bad results because of the bad backtracking performance, but it seems the other speed improvements make up for it:

controller :   1.782000   0.000000   1.782000 (  1.782000)
view       :   4.735000   0.000000   4.735000 (  4.735000)
full action:  12.727000   0.000000  12.727000 ( 12.727000)

As you can see, the improvement is quite large in the view numbers. It is also almost there compared to MRI which had 4.57. Finally, the full action is better by a full second too. Again, MRI is 9.57s and JRuby 12.72. It’s getting closer. I am quite optimistic right now, provided that we manage to fix the remaining problems with backtracking, our regexp engine might well be a great boon to performance.

2 Comments | By Ola Bini | In: Uncategorized | tags: jruby, performance, regular expressions. | #

October 14th, 2007

JRuby discovery number one

After my last entry I’ve spent lots of time checking different parts of JRuby, trying to find the one true bottleneck for Rails. Of course, I still haven’t found it (otherwise I would have said YAY in the subject for this blog). But I have found a few things – for example, symbols are slow right now, but Bill’s work will make them better. And it doesn’t affect Rails performance at all.

But the discovery I made was when I looked at the performance of the regular expressions used in Rails. There are exactly 50 of them for each request, so I did a script that checked the performance of each of them against MRI. And I found that there was one in particular that had really interesting performance when comparing MRI to JRuby. In fact, it was between 200 and 1000 times slower. What’s worse, the performance wasn’t linear.

So which regular expression was the culprit? Well, /.*?\n/m. That doesn’t look too bad. And in fact, this expression displayed not one, but two problems with JRuby. The first one is that any regular expression engine should be able to fail fast on something like this, simply because there is a literal string that always needs to be part of the input for this expression to match. In MRI, that part of the engine is called bm_search, and is a very fast way to fail. JRuby doesn’t have that. Marcin is working on a port of Oniguruma though, so that will fix that part of the problem.
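The fail-fast idea can be imitated by hand. The sketch below is not how bm_search works internally; it only shows the principle of checking for the required literal before starting the real match. first_line is a hypothetical helper.

```ruby
LINE = /.*?\n/m

# Returns the first line including its newline, or nil. The pattern cannot
# match unless the input contains a newline, so check for that literal
# first and skip the regexp engine entirely when it is absent.
def first_line(str)
  return nil unless str.include?("\n")
  match = LINE.match(str)
  match && match[0]
end

no_newline = first_line("abc" * 4000)
with_lines = first_line("Content-Type: text/html\r\nSet-Cookie: abc")

puts no_newline.inspect
puts with_lines.inspect
```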

But wait, if you grep for this regexp in the Rails sources you won’t find it. So where was it actually used? Here is the kicker: it was used in JRuby’s implementation of String#each_line. So, let’s take some time to look at a quick benchmark for each_line:

require 'benchmark'

str = "Content-Type: text/html; charset=utf-8\r\nSet-Cookie: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa "

TIMES=100_000

puts "each_line on small string with several lines"
10.times do
  puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end

str = "abc" * 15

puts "each_line on short string with no line divisions"
10.times do
  puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end

str = "abc" * 4000

puts "each_line on large string with no line divisions"
10.times do
  puts(Benchmark.measure{TIMES.times { str.each_line{} }})
end

As you can see, we simply measure the performance of doing 100 000 each_line calls on three different strings. The first one is a short string with several newlines, the second is a short string with no newlines, and the last is a long string with no newlines. How does MRI run this benchmark?

each_line on small string with several lines
 0.160000   0.000000   0.160000 (  0.157664)
 0.150000   0.000000   0.150000 (  0.160450)
 0.160000   0.000000   0.160000 (  0.171563)
 0.150000   0.000000   0.150000 (  0.157854)
 0.150000   0.000000   0.150000 (  0.154578)
 0.150000   0.000000   0.150000 (  0.154547)
 0.160000   0.000000   0.160000 (  0.158894)
 0.150000   0.000000   0.150000 (  0.158064)
 0.150000   0.010000   0.160000 (  0.156975)
 0.160000   0.000000   0.160000 (  0.156857)
each_line on short string with no line divisions
 0.080000   0.000000   0.080000 (  0.086789)
 0.090000   0.000000   0.090000 (  0.084559)
 0.080000   0.000000   0.080000 (  0.093477)
 0.090000   0.000000   0.090000 (  0.084700)
 0.080000   0.000000   0.080000 (  0.089917)
 0.090000   0.000000   0.090000 (  0.084176)
 0.080000   0.000000   0.080000 (  0.086735)
 0.090000   0.000000   0.090000 (  0.085536)
 0.080000   0.000000   0.080000 (  0.084668)
 0.090000   0.000000   0.090000 (  0.090176)
each_line on large string with no line divisions
 3.350000   0.020000   3.370000 (  3.404514)
 3.330000   0.020000   3.350000 (  3.690576)
 3.320000   0.040000   3.360000 (  3.851804)
 3.320000   0.020000   3.340000 (  3.651748)
 3.340000   0.020000   3.360000 (  3.478186)
 3.340000   0.020000   3.360000 (  3.447704)
 3.330000   0.020000   3.350000 (  3.448651)
 3.350000   0.010000   3.360000 (  3.489842)
 3.350000   0.020000   3.370000 (  3.429135)
 3.350000   0.010000   3.360000 (  3.372925)

OK, this looks reasonable. The large string is obviously taking more time to search, but not incredibly much time. What about trunk JRuby?

each_line on small string with several lines
32.668000   0.000000  32.668000 ( 32.668000)
30.785000   0.000000  30.785000 ( 30.785000)
30.824000   0.000000  30.824000 ( 30.824000)
30.878000   0.000000  30.878000 ( 30.877000)
30.904000   0.000000  30.904000 ( 30.904000)
30.826000   0.000000  30.826000 ( 30.826000)
30.550000   0.000000  30.550000 ( 30.550000)
32.331000   0.000000  32.331000 ( 32.331000)
30.971000   0.000000  30.971000 ( 30.971000)
30.537000   0.000000  30.537000 ( 30.537000)
each_line on short string with no line divisions
 7.472000   0.000000   7.472000 (  7.472000)
 7.350000   0.000000   7.350000 (  7.350000)
 7.516000   0.000000   7.516000 (  7.516000)
 7.252000   0.000000   7.252000 (  7.252000)
 7.313000   0.000000   7.313000 (  7.313000)
 7.262000   0.000000   7.262000 (  7.262000)
 7.383000   0.000000   7.383000 (  7.383000)
 7.786000   0.000000   7.786000 (  7.786000)
 7.583000   0.000000   7.583000 (  7.583000)
 7.529000   0.000000   7.529000 (  7.529000)
each_line on large string with no line divisions

Ooops. That doesn’t look so good… And also, where are the last ten lines? Eh… It’s still running. It’s been running for two hours to produce the first line. That means that it’s taking at least 7200 seconds, which is more than 2000 times slower than MRI. But in fact, since the matching of the regular expression above is not linear but grows much faster than that with the length of the string, I don’t expect this to finish any time soon.

There are a few interesting lessons to take away from this exercise:

There may still be implementation problems like this in many parts of JRuby – performance will improve by quite much every time we find something like this. I haven’t measured Rails performance after this is fixed, and I don’t expect it to actually fix the whole problem, but I think I’ll see better numbers.
Understand regular expressions. Why is /.*?\n/ so incredibly bad for strings over a certain length? In this case it’s the combination of .* and ?. What would be a better implementation in almost all cases? /[^\n]*\n/. Notice that there is no backtracking in this implementation, and because of that, this regexp will have performance O(n) while the earlier one was O(n^2). Learn and know these things. They are the difference between usage and expertise.
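A quick way to check both claims on your own Ruby is sketched below: first that the two patterns pick out the same lines, then how they behave on a string without any newline. The string sizes and repeat counts are arbitrary and kept small on purpose, and the timings depend entirely on the engine, so no numbers are shown here.

```ruby
require 'benchmark'

LAZY_DOT    = /.*?\n/m
NEGATED_SET = /[^\n]*\n/

# Both patterns should pick out the same lines.
with_lines    = "Content-Type: text/html\r\nSet-Cookie: abc\r\nbody"
lazy_lines    = with_lines.scan(LAZY_DOT)
negated_lines = with_lines.scan(NEGATED_SET)
puts lazy_lines == negated_lines

# Time a failing match against a string that contains no newline at all.
no_newline = "abc" * 400
[LAZY_DOT, NEGATED_SET].each do |pattern|
  seconds = Benchmark.realtime { 10.times { pattern =~ no_newline } }
  puts format("%-14s %.4f s", pattern.inspect, seconds)
end
```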

11 Comments | By Ola Bini | In: Uncategorized | tags: jruby, performance, regular expressions. | #
