Merging Ruby Hashes And Bang Method Usage

The other day something curious happened that made me question how I use bang methods (and their non-bang cousins). Here is how I normally look at using bang methods in Ruby: always use the non-bang version, unless you have a really, really good reason not to. It just seems somehow safer this way. For example, when you're passing an object (String, Array etc.) around as a parameter, don't modify the parameter itself unless that is exactly the behaviour you're looking for; use methods that don't modify the original object, but rather work on a copy, and then return the copy to the calling code. Maybe I am being too pedantic about this, do come and let me know if I am :). Regardless, I have trained myself to avoid bang methods unless I really need them, but a few days ago I needed to merge some hashes.
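To illustrate what I mean (a contrived little example, nothing to do with the hashes yet – the bang version modifies the object it is called on, while the non-bang version works on a copy and returns it):

def shout(words)
  words.upcase! # bang: modifies the caller's string in place
end

def shouted(words)
  words.upcase # non-bang: returns an upcased copy, the original is untouched
end

greeting = "hello"
puts shouted(greeting) # HELLO
puts greeting          # hello - still intact
shout(greeting)
puts greeting          # HELLO - the caller's object has been modified

Anyway, back to those hashes.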

You see, I had a file full of smaller hashes (serialized) that I wanted to read in and merge into one big hash. As usual, I tried using the non-bang version of the merge method and it just crawled. Let's recreate this code:

require 'benchmark'

main_hash = {}
time = Benchmark.realtime do
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    main_hash = main_hash.merge(simple_hash) # non-bang merge returns a brand new hash every time
  end
end
puts "Time elapsed #{time} seconds"
puts "Main hash key count: #{main_hash.keys.count}"

This produces:

Time elapsed 13.7789120674133 seconds
Main hash key count: 10000

As you can see, it took 13 seconds to merge only 10000 tiny hashes into one. I had significantly more than ten thousand, and mine were somewhat larger. The surprising thing was that when you replace the merge method with its bang equivalent (the full bang version is shown below, for reference), the output produced is:

Time elapsed 28.7179946899414 milliseconds
Main hash key count: 10000
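
Here is the bang version in full, for reference – the only change that matters is the merge line inside the loop:

require 'benchmark'

main_hash = {}
time = Benchmark.realtime do
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    main_hash.merge!(simple_hash) # merge in place, main_hash is never copied
  end
end
puts "Time elapsed #{time} seconds"
puts "Main hash key count: #{main_hash.keys.count}"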

That's 28 MILLIseconds, which is about 500 times faster. Now admittedly, this use case was probably a good candidate for using the bang version of merge from the start, but that is not immediately obvious, and such a major disparity in performance made me question my non-bang convention. Was this the case with all the bang/non-bang pairs of methods in Ruby? Should I perhaps forget about the non-bang methods altogether and switch to bang exclusively? Well, the first question is easy to answer, since we can test it out empirically. Let's pick another bang/non-bang pair and see if we get the same performance disparity. We'll use Array's flatten:

require 'benchmark'

array = []
time = Benchmark.realtime do
  (1..1000).each do |number|
    string = "string#{number}"
    array << string
    inner_array = []
    (1..50).each do |inner_number|
      inner_array << inner_number
    end
    array << inner_array
    array = array.flatten
  end
end
puts "Time elapsed #{time} seconds"

When using the non-bang version, the output is:

Time elapsed 5.95429491996765 seconds

But when we switch to the bang version (array.flatten!):

Time elapsed 6.41582012176514 seconds

Hmm, there is almost no difference; in fact, the non-bang version is a little faster. I tried similar code with String's reverse bang/non-bang pair and the results were similar to what we got for the array: the performance differences were negligible. What gives?

The good news is that using non-bang methods is fine if, like me, this is your preference. Unless, of course, you're trying to merge hashes, in which case we need to dig deeper still. Let's hook up a profiler to our original code and see what we get (I'll cover Ruby profiling in more detail in a later post; for now, just bear with me). Here is our new code:

require 'benchmark'
require 'profiler'

main_hash = {}
time = Benchmark.realtime do
  Profiler__::start_profile
  (1..10000).each do |number|
    simple_hash = {}
    simple_hash[number.to_s] = number
    main_hash = main_hash.merge(simple_hash)
  end
  Profiler__::stop_profile
  Profiler__::print_profile($stderr)
end
puts "Time elapsed #{time} seconds"
puts "Main hash key count: #{main_hash.keys.count}"

The profiler output looks like this:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 92.95    12.78     12.78    10000     1.28     1.28  Hash#initialize_copy
  4.51    13.40      0.62        1   620.00 13750.00  Range#each
  1.89    13.66      0.26    10000     0.03     1.30  Hash#merge
  0.51    13.73      0.07    10000     0.01     0.01  Fixnum#to_s
  0.15    13.75      0.02    10000     0.00     0.00  Hash#[]=
  0.00    13.75      0.00        1     0.00 13750.00  #toplevel
Time elapsed 14.436450958252 seconds
Main hash key count: 10000

The culprit here is clear – initialize_copy is where we are spending the vast majority of our time. So, what is initialize_copy? It is basically a hook method that Ruby provides which is invoked after an object has been cloned (using dup or clone). When you clone or dup an instance in Ruby, you usually get a shallow copy, so the fields in the cloned instance will still be referencing the same objects as the fields in the original instance. You can override initialize_copy to fix this issue. More info on cloning and initialize_copy can be found here.

So we can now intuitively say that when the merge method creates a copy of the hash to work on, it needs to iterate through all the members of the original hash and copy them into the new hash, to make sure they are not still referencing the same objects. Only after it has done this can it go through the second hash (i.e. other_hash) and merge the values from there into this new copy of our original hash. Of course, as the hash gets bigger from merging more and more keys into it, it takes longer and longer to copy the values across, so we spend more and more time inside initialize_copy, which becomes our bottleneck. On the other hand, if we use the bang version of merge, we never need to clone our hash, so this bottleneck does not exist and everything is fast. Makes sense, right?
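
As a quick aside, here is a minimal sketch of the kind of thing initialize_copy is there for (a made-up Playlist class, purely for illustration – this is not how Hash implements it in C):

class Playlist
  def initialize
    @tracks = []
  end

  attr_reader :tracks

  def add(track)
    @tracks << track
    self
  end

  # Ruby calls this hook on the new object after dup/clone. Without it the
  # copy's @tracks would still point at the very same Array as the original.
  def initialize_copy(source)
    super
    @tracks = source.tracks.dup
  end
end

original = Playlist.new.add("one")
copy = original.dup
copy.add("two")
puts original.tracks.inspect # ["one"] - the original is not affected by changes to the copy

In Hash's case the equivalent copying has to touch every key/value pair, which is exactly the work showing up against initialize_copy in the profile above.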

It does make sense; the only problem I have is this: why does the same thing not happen when we try to flatten the array? Let's hook the profiler up to the array flattening code in the same way.
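
The profiled version is just the earlier array code wrapped in the same profiler calls:

require 'benchmark'
require 'profiler'

array = []
time = Benchmark.realtime do
  Profiler__::start_profile
  (1..1000).each do |number|
    string = "string#{number}"
    array << string
    inner_array = []
    (1..50).each do |inner_number|
      inner_array << inner_number
    end
    array << inner_array
    array = array.flatten
  end
  Profiler__::stop_profile
  Profiler__::print_profile($stderr)
end
puts "Time elapsed #{time} seconds"

Running it produces output similar to the following: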

 %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 48.73     1.34      1.34     1000     1.34     1.34  Array#flatten
 31.27     2.20      0.86     1001     0.86     4.04  Range#each
 10.18     2.48      0.28    21000     0.01     0.01  Fixnum#to_s
  9.82     2.75      0.27    22000     0.01     0.01  Array#<<
  0.00     2.75      0.00        1     0.00  2750.00  #toplevel
Time elapsed 3.29541611671448 seconds

As you can see, initialize_copy is not even in the picture. So does that mean we can clone an array (when using the non-bang version of flatten) without having to then go over all the elements to make sure they are not referencing the same object? I had a quick look at the C code that backs both the Array and the Hash implementation in Ruby, but I don't know Ruby native code well enough to get a definitive answer without spending an inordinate amount of time on it (*sigh* another thing to go on the list of things to learn :)). If you do know your way around Ruby's C code, please enlighten the rest of us regarding this interesting issue. In the meantime the lesson is this: non-bang methods are fine, but if you're going to merge a bunch of hashes into one, use the bang version of merge. The other lesson is that C is still as relevant as ever if you want to know interesting things :). Although, I probably need to go through my copy of K&R before digging into Ruby's C code – it has been a while. Anyways, that's a story for another time.

Image by rachel_r

  • http://www.thomthom.net/ Thomas Thomassen

    I try to use bang whenever I can. This is because when I write plugins for Sketchup I always have lots of data to iterate over, and generally bang methods are faster. It seems to be the overhead of creating variables and new objects.

    I also investigated the difference between the Set class and Array. I initially used array << x unless array.include?(x). And then I tried the Set class – which was faster if the collection was large enough. (For smaller collections it was quicker with a regular Array.)

    But then someone pointed out to me .uniq! (http://forums.sketchucation.com/viewtopic.php?f=180&t=23760#p222229) . It was much faster to just add all the various values blindly into the array and perform a .uniq! at the end instead of using the Set class that performed a hash lookup every time.

    As for the overhead of creating objects – I’ve even moved to using the for loop instead of the each loop, since the variables in the each loop are local to the block and will be recreated every time the loop iterates, whereas in the for loop they are reused.

    Because I always iterate over large collections I’ve been reading up here and there on what tips people come up with. And one of the common points is to use destructive methods, because creating objects is expensive.
    http://www.igvita.com/2008/07/08/6-optimization-tips-for-ruby-mri/

    As for the specific example you mention, I’m not sure why Hash and Array differ like that. I also had a look at the C code – but I’ve only recently picked up Ruby C extensions, and I’m not familiar enough with it yet to make a judgement. But it’ll be interesting to hear what people have to say.

    But does your example reflect what you wanted to do? Iterate over a collection of small hashes and merge them into one big one?
    Your example creates a hash on each iteration – and this is what is costing you time. But in the real-world scenario you describe you’d already have these hashes.
    Or was the main point of the post to highlight the difference between Array and Hash?

    • http://www.skorks.com Alan Skorkin

      Hi Thomas,

      It’s good to know about Set vs Array and especially about uniq!. I am not sure if for vs each will make that much of a difference – have you found it significant?

      The example does indeed reflect what I was doing, the only difference is I wasn’t creating the small hashes, but rather was deserializing them from a file. Creating the hashes was actually not taking a lot of time comparatively according to the profiling I did.

      I guess I had several points when I wrote this, firstly to highlight the fact that using the non-bang merge can be really slow. Secondly, that most non-bang methods don’t really have this issue. Thirdly, knowing your data structures and how they work under the hood can be really helpful. I had other thoughts as well, but I think those were the main ones :).

      • http://www.thomthom.net/ Thomas Thomassen

        As to for vs each: we (the plugin developers that linger around SCF) have a thread going where this topic was brought up. http://forums.sketchucation.com/viewtopic.php?f=180&t=25305#p220844

        The difference can be significant. Although this will differ between Ruby 1.8 and 1.9 – Sketchup uses 1.8. If you are creating variables within the loop then the for loop will be faster, because the variables are not local to the loop. The difference in the test was at times double. But if you init the variables used in the each loop beforehand, then the difference isn’t very noticeable.

        • http://www.skorks.com Alan Skorkin

          Yeah that makes sense, good thing to keep in mind if performance is important.

  • Amanda

    I haven’t looked at the internals of the Ruby runtime (and there’s wiggle room for different runtimes to implement things differently), but the obvious way to implement immutable value-based (ie. “non-bang”) hash operations is by creating a new backing table, then re-hashing every entry from the original hash into the new hash. This is like re-inserting every single item all over again. It would be an O(n) operation on the number of entries in both hashes, and the hashing function itself can be relatively expensive (particularly if your keys aren’t simple integers – get rid of the “.to_s” when you insert your values in your hash benchmark and you should see a significant difference).

    By contrast, the bang version would just be an O(n) operation on the number of entries in the new hash, and, if you’re lucky, no new backing table needs to be reallocated cos everything fits into the existing one.

    The obvious way to implement array flattening, on the other hand, depends on how arrays are implemented under the covers, and the obvious way to implement those is with native C arrays of pointers to objects, which makes them ungrowable. Then, like a Java ArrayList, as you add things to the Ruby array, new C arrays are created and the contents copied across, usually doubling the size each time (which gives you the amortised O(n) insertion time of these kinds of auto-expanding vector/array collections).

    Flattening would then depend on traversing the array and recursively looking for array sub-elements, copying their contents into a new native array, doing the reallocate-copy thing to the new array as you go. The non-bang and bang versions would then basically do the same thing (since the underlying arrays are fixed length anyway), except that the bang version eventually replaces its inner backing array with the final copy whereas the non-bang version creates a new Ruby array backed by the final copy. If the arrays aren’t deeply nested, this operation can be fairly cheap because the contents can be block-copied (using something like memcpy()) rather than copied one element at a time.

    I don’t think anything too magical is going on here. Maybe just a good example of why it’s important to pay attention when learning data structures? :P

    Hope some of this made sense!

    • http://www.skorks.com Alan Skorkin

      Hi Amanda,

      Integer keys would indeed make things faster, but only overall, not in a relative sense, i.e. the ratio of the difference between bang and non-bang for merge would still be the same I believe.

      And you’re right about the O(n) for hash – that’s basically the thought I had, and the C code seems to confirm this. I also had thoughts similar to yours regarding how things work with Arrays (block copy etc.), but when I tried to confirm this through the C code it got a bit confusing – it is much more complex than what is being done for Hash – so I didn’t really get a definitive sense of how things were being done.

      You’re probably right regarding nothing magical happening, the confusion for me came from the fact that Array does seem to override initialize_copy (just like Hash) and yet it is not being used in a similar fashion. And yeah one of the reasons I highlighted all this in the first place was to show that knowing your data structures under the hood is worthwhile :).

      • Amanda

        Non-bang spends a much higher proportion of time re-hashing than bang, so I think faster hashing will change the proportions of total time. In theory, if hashing was free (ie. takes zero time), then the ratio between non-bang and bang would be much closer to 1:1.

        I think the initialize_copy thing can just be explained by arrays not really needing any initialisation – you create them and that’s it, no rehashing or any book keeping needed.

  • Korny

    It’s interesting to compare this to Hamster, an immutable collection class for ruby:
    http://github.com/harukizaemon/hamster

    On my laptop: using merge took 34.6141049861908 seconds
    using merge! took 0.06062912940979 seconds
    using Hamster.hash.merge took 0.595756053924561 seconds

    - so Hamster is still 10x slower than merge! but it’s a lot faster than repeatedly copying the hash. Pretty impressive, really, as Hamster is ruby code, while Hash is optimised in C.

    Also it lets you use idioms like the following to build hashes:
    results = (1..10000).inject(Hamster.hash) {|hash, value| hash.put(value.to_s, value)}
    You can do this in ruby as well:
    results = (1..10000).inject({}){|hash, value| hash[value.to_s] = value; hash}
    - but it’s ugly to have to return “hash” from the block. (of course, the hamster version is 10x as slow… :)

    • http://www.skorks.com Alan Skorkin

      Hey Korny,

      I haven’t actually used hamster before, I should give it a play-around – looks interesting.

  • http://alicebobandmallory.com/ Jonas Elfström

    I guess it was only an example and that it wasn’t possible to do something like this instead?

    main_hash = {}
    time = Benchmark.realtime do
      simple_hash = {}
      (1..10000).each do |number|
        simple_hash[number.to_s] = number
      end
      main_hash = main_hash.merge(simple_hash)
    end
    puts "Time elapsed #{time} seconds"
    puts "Main hash key count: #{main_hash.keys.count}"

    Time elapsed 0.133629083633423 seconds

    • http://www.skorks.com Alan Skorkin

      Hi Jonas,

      Yeah, if you can restrict it to just doing the merge once, that certainly helps a lot. My example was contrived in that I was creating the small hashes, but in my real problem I already had a bunch of little hashes that I wanted to merge into one, so no matter how I spun it I would have to do a lot of merges.

  • Marek Kowalcze

    Hi
    I ran some tests on JRuby to compare, if someone’s interested:
    http://gist.github.com/386194

    In general, in JRuby the difference between merge and merge! is even larger.
    On my machine:
    # On Ruby i686-linux 1.9.1 merge! is 42.8203960017727 faster than merge.
    # On Ruby java 1.8.7 merge! is 772.361024112707 faster than merge. (first run)
    However, using nailgun (http://sourceforge.net/projects/nailgun/):
    # On Ruby java 1.8.7 merge! is 92.3685496183206 faster than merge.
    # On Ruby java 1.8.7 merge! is 102.961025482799 faster than merge.
    # On Ruby java 1.8.7 merge! is 207.121184088807 faster than merge.
    # …

    • http://www.skorks.com Alan Skorkin

      Hey Marek,

      Very interesting, well it’s good to know that JRuby is at least more or less consistent :).

  • Marek Kowalcze

    When I added the --1.9 switch (which is compatible with 1.9.2dev) it fell to about 70 times faster with merge! (using nailgun).


  • Zatoichi

    Thank you bro!!! Very useful

  • http://twitter.com/nickclarson Nick

    Thanks!!@!! that helped out a ton!