Serializing (And Deserializing) Objects With Ruby

Serialization is one of those things you can easily do without until, all of a sudden, you really need it one day. That's pretty much how it went with me. I was happily using and learning Ruby for months before I ever ran into a situation where serializing a few objects would really have made my life easier. Even then I avoided looking into it: you can very easily convert the important data from an object into a string and write that out to a file, and when you need to, you just read the file, parse the string and recreate the object. What could be simpler? Of course, it could be much simpler indeed, especially when you're dealing with a deep hierarchy of objects. I think that being weaned on languages like Java, you come to expect operations like serialization to be non-trivial. Don't get me wrong, it is not really difficult in Java, but neither is it simple, and if you want your serialized objects to be human-readable then you're into 3rd party library land, where things can get easier or harder depending on your needs. Suffice to say, bad experiences in the past don't fill you with a lot of enthusiasm for the future.

When I started looking into serialization in Ruby, I fully expected to have to look into 3rd party solutions – surely the serialization mechanisms built into the language couldn't possibly fit my needs easily. As usual, I was pleasantly surprised. Now, like that proverbial hammer, serialization seems to be useful all the time :). Anyway, I'll let you judge for yourself; let's take a look at the best and most common options you have when it comes to serialization with Ruby.

Human-Readable Objects

Ruby has two object serialization mechanisms built right into the language. One is used to serialize into a human readable format, the other into a binary format. I will look into the binary one shortly, but for now let’s focus on human readable. Any object you create in Ruby can be serialized into YAML format, with pretty much no effort needed on your part. Let’s make some objects:

require "yaml"
 
class A
  def initialize(string, number)
    @string = string
    @number = number
  end
 
  def to_s
    "In A:\n   #{@string}, #{@number}\n"
  end
end
 
class B
  def initialize(number, a_object)
    @number = number
    @a_object = a_object
  end
 
  def to_s
    "In B: #{@number} \n  #{@a_object.to_s}\n"
  end
end
 
class C
  def initialize(b_object, a_object)
    @b_object = b_object
    @a_object = a_object
  end
 
  def to_s
    "In C:\n #{@a_object} #{@b_object}\n"
  end
end
 
a = A.new("hello world", 5)
b = B.new(7, a)
c = C.new(b, a)
 
puts c

Since we created a to_s method, we can see the string representation of our object tree:

In C:
 In A:
   hello world, 5
 In B: 7
  In A:
   hello world, 5

To serialize our object tree we simply do the following:

serialized_object = YAML::dump(c)
puts serialized_object

Our serialized object looks like this:

--- !ruby/object:C
a_object: &id001 !ruby/object:A
  number: 5
  string: hello world
b_object: !ruby/object:B
  a_object: *id001
  number: 7

If we now want to get it back:

puts YAML::load(serialized_object)

This produces output which is exactly the same as what we had above, which means our object tree was reproduced correctly:

In C:
 In A:
   hello world, 5
 In B: 7
  In A:
   hello world, 5

Of course, you almost never want to serialize just one object; it is usually an array or a hash. In this case you have two options: either you serialize the whole array/hash in one go, or you serialize each value separately. The rule here is simple: if you always need to work with the whole set of data and never parts of it, just write out the whole array/hash; otherwise, iterate over it and write out each object separately. The reason you do this is almost always to share the data with someone else.
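
Serializing a whole array in one go really is trivial; here is a minimal sketch, assuming the A class from above (the file path is just an example):

array = (1..3).map { |index| A.new("hello world", index) }
 
File.open("/home/alan/tmp/array.yaml", "w") do |file|
  file.puts YAML::dump(array)
end
 
# loading gives us back an array of A objects in one go
# (newer Psych versions may need YAML.unsafe_load, or a permitted_classes
# option, to deserialize custom classes like this)
puts YAML::load(File.read("/home/alan/tmp/array.yaml"))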

Writing out the whole array/hash in one fell swoop is as simple as the sketch above. Doing it one object at a time is a little more complicated: we don't want to write the objects out to a whole bunch of files, but rather all of them to one file, and we also want to be able to easily read them back in again, which can be tricky since YAML serializes each object as a multi-line string. Here is a trick you can use: when you write the objects out, separate them with two newlines, e.g.:

File.open("/home/alan/tmp/blah.yaml", "w") do |file|
  (1..10).each do |index|
    file.puts YAML::dump(A.new("hello world", index))
    file.puts ""
  end
end

The file will look like this:

--- !ruby/object:A
number: 1
string: hello world

--- !ruby/object:A
number: 2
string: hello world

...

Then when you want to read all the objects back, simply set the input record separator to be two newlines e.g.:

array = []
$/="\n\n"
File.open("/home/alan/tmp/blah.yaml", "r").each do |object|
  array << YAML::load(object)
end
 
puts array

The output is:

In A:
   hello world, 1
In A:
   hello world, 2
In A:
   hello world, 3
...

Which is exactly what we expect – handy. By the way, I will be covering things like the input record separator in an upcoming series of posts I am planning to do about Ruby one-liners, so don't forget to subscribe if you don't want to miss it.
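
Incidentally, if you'd rather not touch the global $/ at all, IO#each also accepts a record separator directly; here is a minimal sketch of the same read loop done that way:

array = []
File.open("/home/alan/tmp/blah.yaml", "r") do |file|
  # pass the two-newline separator straight to each, leaving $/ alone
  file.each("\n\n") { |object| array << YAML::load(object) }
end
 
puts array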

A 3rd Party Alternative

Of course, if we don't want to resort to tricks like that, but still want to keep our serialized objects human-readable, we have another alternative which is basically as common as the Ruby built-in serialization mechanisms – JSON. JSON support in Ruby is provided by a 3rd party library; all you need to do is:

gem install json

or

gem install json-pure

The second one is if you want a pure Ruby implementation (no native extensions).

The good thing about JSON is the fact that it is even more human-readable than YAML. It is also a "low-fat" alternative to XML and can be used to transport data over the wire by AJAX calls that require data from the server (that's the simple one-sentence explanation :)). The other good news when it comes to serializing objects to JSON using Ruby is that each object is saved to the file on a single line, so we don't have to resort to tricks when saving multiple objects and reading them back again.
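
For example, here is a rough sketch of the one-object-per-line approach using plain hashes, which JSON handles natively (the file path is just an example):

require "json"
 
File.open("/home/alan/tmp/blah.json", "w") do |file|
  (1..10).each do |index|
    # each hash serializes to a single line of JSON
    file.puts({ "string" => "hello world", "number" => index }.to_json)
  end
end
 
array = []
File.open("/home/alan/tmp/blah.json", "r").each do |line|
  array << JSON.parse(line)
end
puts array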

There is bad news of course, in that your objects won't automagically be converted to JSON unless all you're using is hashes, arrays and primitives. You need to do a little bit of work to make sure your custom objects are serializable. Let's make one of the classes we introduced previously serializable using JSON.

require "json"
 
class A
  def initialize(string, number)
    @string = string
    @number = number
  end
 
  def to_s
    "In A:\n   #{@string}, #{@number}\n"
  end
 
  def to_json(*a)
    # represent the object as a hash containing the 'json_class' key,
    # then let the json library turn that hash into a JSON string
    {
      "json_class"   => self.class.name,
      "data"         => {"string" => @string, "number" => @number }
    }.to_json(*a)
  end
 
  def self.json_create(o)
    # called when parsing to turn the hash back into an A instance
    new(o["data"]["string"], o["data"]["number"])
  end
end

Make sure you don't forget to require 'json', otherwise you'll get funny behaviour. Now you can simply do the following:

a = A.new("hello world", 5)
json_string = a.to_json
puts json_string
puts JSON.parse(json_string)

Which produces output like this:

{"json_class":"A","data":{"string":"hello world","number":5}}
In A:
   hello world, 5

The first string is our serialized JSON string, and the second is the result of outputting our deserialized object, which gives the output that we expect.
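
One caveat worth noting: newer versions of the json gem disable this class-recreating behaviour by default (for security reasons), so if JSON.parse only hands you back a plain hash, you need to opt in explicitly. A minimal sketch:

# with newer json gems, additions must be enabled explicitly for
# json_create to be called and an A instance to be returned
puts JSON.parse(json_string, :create_additions => true)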

As you can see, we implement two methods:

  • to_json – called on the object instance and allows us to convert an object into a JSON string.
  • json_create – allows us to call JSON.parse, passing in a JSON string, which will convert the string into an instance of our object.

You can also see that, when converting our object into a JSON string, we need to make sure that we end up with a hash that contains the 'json_class' key. We also need to make sure that we only use hashes, arrays, strings and primitives (i.e. integers, floats etc., not really primitives in Ruby, but you get the picture).
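
If a custom object contains other custom objects, the same pattern nests naturally, as long as every class in the tree implements both methods. Here is a rough sketch for the B class from earlier, assuming the A class above already has its to_json and json_create (and a reasonably recent json gem for the create_additions option):

class B
  def initialize(number, a_object)
    @number = number
    @a_object = a_object
  end
 
  def to_json(*a)
    # the nested A object is serialized by its own to_json
    {
      "json_class" => self.class.name,
      "data"       => { "number" => @number, "a_object" => @a_object }
    }.to_json(*a)
  end
 
  def self.json_create(o)
    # by the time this is called, the nested A has already been recreated
    new(o["data"]["number"], o["data"]["a_object"])
  end
end
 
b = B.new(7, A.new("hello world", 5))
puts b.to_json
puts JSON.parse(b.to_json, :create_additions => true)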

So, JSON has some advantages and some disadvantages. I like it because it is widely supported, so you can send data around and have it be recognised by other apps. I don't like it because you need to do some work to make sure your objects are easily serializable; if you don't need to send your data anywhere, but simply want to share it locally, that extra work is a bit of a pain.

Binary Serialization

The other serialization mechanism built into Ruby is binary serialization using Marshal. It is very similar to YAML and just as easy to use; the only difference is that it's not human-readable, since it stores your objects in a binary format. You use Marshal exactly the same way you use YAML, just replace the word YAML with Marshal :)

a = A.new("hello world", 5)
puts a
serialized_object = Marshal::dump(a)
puts Marshal::load(serialized_object)

The output is:

In A:
   hello world, 5
In A:
   hello world, 5

As you can see from the output, the objects before and after serialization are the same. You don't even need to require anything :). The thing to watch out for when outputting multiple marshalled objects to the same file is the record separator. Since you're writing binary data, it is not inconceivable that a newline will accidentally end up somewhere inside a record, which will stuff everything up when you try to read the objects back in. So, two rules of thumb to remember:

  • don't use puts when outputting marshalled objects to a file (use print instead); this way you avoid the extraneous newline that puts appends
  • use a record separator other than a newline; you can make up anything unlikely (below, and in the benchmarks further down, I use '---_---' as the separator)
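
Here is a rough sketch of that approach, assuming the A class from earlier (the file path is just an example); the same '---_---' separator is passed to each when reading the records back:

objects = (1..3).map { |index| A.new("hello world", index) }
 
# write the marshalled objects out with print, separated by '---_---'
File.open("/home/alan/tmp/marshal_many.dat", "wb") do |file|
  objects.each do |object|
    file.print Marshal::dump(object)
    file.print "---_---"
  end
end
 
# read them back, using the same string as the record separator
restored = []
File.open("/home/alan/tmp/marshal_many.dat", "rb") do |file|
  file.each("---_---") { |record| restored << Marshal::load(record.chomp("---_---")) }
end
puts restored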

The disadvantage of Marshal is the fact that its output is not human-readable. The advantage is its speed.

Which One To Choose?

It's simple: if you need to be able to read your serialized data, then you have to go with one of the human-readable formats (YAML or JSON). I'd go with YAML, purely because you don't need to do any work to get your custom objects to serialize properly, and the fact that it serializes each object as a multi-line string is not such a big deal (as I showed above). The only times I would go with JSON (aside from the whole wide-support and sending-it-over-the-wire deal) are when you need to be able to easily edit your data by hand, or when you need human-readable data and you have a lot of it to deal with (see benchmarks below).

If you don't really need to be able to read your data, then always go with Marshal, especially if you have a lot of data.

Here is a situation I commonly have to deal with. I have a CSV file, or some other kind of data file; I want to read it and parse it, creating an object (or at least a hash) per row, to make the data easier to deal with. What I like to do is read this CSV file and, as I create my objects, serialize them to a file at the same time using Marshal. This way I can operate on the whole data set, or parts of it, by simply reading the serialized objects back in, which is orders of magnitude faster than parsing the CSV file again.
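
A minimal sketch of that workflow, assuming Ruby's standard csv library and a hypothetical two-column CSV file (the paths and column names are just examples):

require "csv"
 
# parse each CSV row into a hash and marshal it straight out to a file,
# one record at a time, separated by '---_---'
File.open("/home/alan/tmp/people.dat", "wb") do |out|
  CSV.foreach("/home/alan/tmp/people.csv", :headers => true) do |row|
    out.print Marshal::dump(row.to_hash)
    out.print "---_---"
  end
end
 
# later, read the marshalled hashes back in, much faster than re-parsing the CSV
people = []
File.open("/home/alan/tmp/people.dat", "rb") do |file|
  file.each("---_---") { |record| people << Marshal::load(record.chomp("---_---")) }
end

Now, let's do some benchmarks. I will create 500000 objects (a relatively small set of data) and serialize them all to a file using all three methods.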

require "benchmark"
 
def benchmark_serialize(output_file)
  Benchmark.realtime do
    File.open(output_file, "w") do |file|
      (1..500000).each do |index|
        yield(file, A.new("hello world", index))
      end
    end
  end
end
 
puts "YAML:"
time = benchmark_serialize("/home/alan/tmp/yaml.dat") do |file, object|
  file.puts YAML::dump(object)
  file.puts ""
end
puts "Time: #{time} sec"
 
puts "JSON:"
time = benchmark_serialize("/home/alan/tmp/json.dat") do |file, object|
  file.puts object.to_json
end
puts "Time: #{time} sec"
 
puts "Marshal:"
time = benchmark_serialize("/home/alan/tmp/marshal.dat") do |file, object|
  file.print Marshal::dump(object)
  file.print "---_---"
end
puts "Time: #{time} sec"

The results:

YAML:
Time: 45.9780583381653 sec
JSON:
Time: 5.44697618484497 sec
Marshal:
Time: 2.77714705467224 sec

What about deserializing all the objects:

def benchmark_deserialize(input_file, array, input_separator)
  $/=input_separator
  Benchmark.realtime do
    File.open(input_file, "r").each do |object|
      array << yield(object)
    end
  end
end
 
array1 = []
puts "YAML:"
time = benchmark_deserialize("/home/alan/tmp/yaml.dat", array1, "\n\n") do |object|
  YAML::load(object)
end
puts "Array size: #{array1.length}"
puts "Time: #{time} sec"
 
array2 = []
puts "JSON:"
time = benchmark_deserialize("/home/alan/tmp/json.dat", array2, "\n") do |object|
  JSON.parse(object)
end
puts "Array size: #{array2.length}"
puts "Time: #{time} sec"
 
array3 = []
puts "Marshal:"
time = benchmark_deserialize("/home/alan/tmp/marshal.dat", array3, "---_---") do |object|
  Marshal::load(object.chomp)
end
puts "Array size: #{array3.length}"
puts "Time: #{time} sec"

The results:

YAML:
Array size: 500000
Time: 19.4334170818329 sec
JSON:
Array size: 500000
Time: 18.5326402187347 sec
Marshal:
Array size: 500000
Time: 14.6655268669128 sec

As you can see, it is significantly faster to serialize objects when you're using Marshal, although JSON is only about 2 times slower; YAML gets left in the dust. When deserializing, the differences are not as apparent, although Marshal is still the clear winner. The more data you have to deal with, the more telling these results will be. So, for pure speed – choose Marshal. For speed and human readability – choose JSON (at the expense of having to add methods to custom objects). For human readability with relatively small sets of data – go with YAML.

That's pretty much all you need to know, but it is not all I have to say on serialization. One of the more interesting (and cool) features of Ruby is how useful blocks can be in many situations, so you will inevitably run into a situation where you want to serialize a block, and this is where you will find trouble! We will deal with block serialization issues, and what (if anything) you can do about them, in a subsequent post. More Ruby soon :).

Images by Andrew Mason and just.Luc

  • Steve Conover

    My own testing shows yajl-ruby (for json) matching or beating Marshal dump/load – check it out:

    http://github.com/brianmario/yajl-ruby

    • http://www.skorks.com Alan Skorkin

      Hey Steve,

      Looks really cool, I will definitely check it out, and thanks for putting me onto YAJL itself, that looks awesome :).

  • Ricardo

    The Puppet project had huge problems with both Marshal and YAML and ended up going with JSON because they couldn’t get either “native” format to work consistently.

    • http://www.skorks.com Alan Skorkin

      That’s interesting, I’d love to hear what specific issues you had with Marshal and YAML, purely for curiosity’s sake.

      • http://www.selleo.com Tomasz Bak

        I used YAML for serializing both report templates and report data. The structures were not very complicated; the business case was to have flexibility in defining defect reports in a manufacturing process.

        I have occasionally experienced corrupted YAML in the report data, which was sometimes hundreds of YAML-serialized objects stored in a single database text field (PostgreSQL).

  • Mark Wilden

    Or use Marshal to serialize and convert to YAML when you need human readability.

    • http://www.skorks.com Alan Skorkin

      Hey Mark,

      Yeah there is always that option :). Although from the comments above, it looks like JSON seems to be the winner for people if you want consistent and good serialization. I guess Marshal and YAML are more of a quick and dirty type of thing.

      • Mark Wilden

        JSON or YAML, sure. My point was just that it’s not necessary to store data in human-readable form in order for it to be human-readable. I think the fashion these days is to use JSON for serialization, but I’ve found Marshal to be quicker, easier, and smaller.

  • Korny

    I strongly suggest, instead of setting the global $/, you just pass the separator to IO.each:
    File.open("/home/alan/tmp/blah.yaml", "r").each("\n\n") do |object|

    Otherwise you are setting a global variable that would affect any later code – what happens when you call a library method that internally calls IO.each just after your code?

    Generally, these perl-style $ variables are deprecated and to be avoided – with the exception of folks writing one-liners, and regexp processing, imho, as regexp processing without magic variables is ugly.
    (Though I saw some cunning stuff with regexps and string array processing – quick quiz, what does "foobar"[/o(.*)a/,1] return?)

    • http://www.skorks.com Alan Skorkin

      That’s true, you would need to remember to reset the variable if you’re using it. I’ve been doing quite a bit of stuff with one liners lately so I’ve gotten used to using the $x family of special variables :), they are a bit addictive.

  • http://www.arbia.co.uk Roja

    Having utilised serialisation rather a lot, in a number of different projects, I can’t recommend highly enough looking into MessagePack. It’s significantly faster than JSON, Marshal and YAML and doesn’t suffer from the version-incompatibility problems of Marshal. It also produces extremely compact binary representations, making it ideal for communications protocols.

    Peter cooper did a bit of a write-up over at RI: http://www.rubyinside.com/messagepack-binary-object-serialization-3150.html

    • http://www.skorks.com Alan Skorkin

      Hi Roja,

      Cheers, that one seems really useful as well, I now have two serialization alternatives to look at besides the ones I covered in the post :).

    • Steve Agalloco

      MessagePack is crazy fast. Check out the results of this benchmark comparing it with other utilities: http://gist.github.com/290425

      • http://www.skorks.com Alan Skorkin

        That is pretty damn fast, although it would be interesting to see how it works with objects other than arrays/hashes full of primitives. Maybe I’ll do some investigation.

  • Luca Simone

    Alternatively, you could use http://msgpack.sourceforge.net; it’s even faster and lighter than Protocol Buffers!

  • bluehavana

    You should definitely check out the YAML.each_document method for your array example.

    http://ruby-doc.org/stdlib/libdoc/yaml/rdoc/classes/YAML.html#M001845

    • http://www.skorks.com Alan Skorkin

      That’s handy, I wasn’t aware of it, should have looked at the API a bit better, cheers :).

  • Delano Mandelbaum

    Ya, I’ll +1 on using yajl-ruby for JSON. Its streaming support is super powerful (you can start deserializing before you have a complete object) and I’ve also encountered string encoding issues with the json lib.

    Note that it’s also possible to serialize Proc objects to their source content. I added support to Storable for this ( http://github.com/delano/storable ) a few days ago. The original code was by Florian Gross:

    http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/147706

    • http://www.skorks.com Alan Skorkin

      That’s really cool about procs, I will definitely have a play around with it and incorporate it into the post I plan to write about serialization and procs. The biggest question for me is how something like that would deal with objects inside the proc that have gone out of scope everywhere else except the proc itself (the proc being a closure and all).

      • Delano Mandelbaum

        If there are references inside the Proc to objects created elsewhere, you would have to run it in a context that contains those objects. With instance variables, that’s pretty straight forward because you simply use instance_eval. Local variables are another story — the cleanest solution I can think of would be to define methods with the same names as the vars.

  • http://blog.marc-andre.ca Marc-André Lafortune

    It might be important to note that most libs will not be able to {de}serialize all data types. Marshal is the only mechanism that will round-trip all objects (except some, like Procs, …). YAML is pretty good too. Others will fail (even on simple stuff like symbols).

    As for performance, the yaml library is entirely new in the upcoming 1.9.2.

    • http://www.skorks.com Alan Skorkin

      The thing with something like JSON, though, is that you provide your own implementation of how to serialize and deserialize, so that will always be as solid as the code you yourself write.

      • Delano Mandelbaum

        That’s true. The other issue with Marshal (and other binary formats) is that you can’t (safely) marshal between versions or implementations of Ruby (e.g. 1.8 to 1.9 or MRI to JRuby).

        • Korny

          Yeah – that’s why I always prefer something like JSON or YAML – not only are the others implementation dependent, they are my-code-version dependent.
          I guess it depends *why* you are serializing data – if it’s only temporary, it’s handy to use Marshal or whatever is fastest. But if it’s not ephemeral data, it’s nice to have a generic format I can de-serialise at any time in the future, even if I’ve changed my object model in the meantime.

          • http://blog.marc-andre.ca Marc-Andre Lafortune

            Marshal provides a mechanism to be backward compatible should you change your data format and need it: just specialize marshal_load and marshal_dump.

        • http://blog.marc-andre.ca Marc-Andre Lafortune

          I believe you are mistaken.
          First, all implementations of Ruby are mandated to respect the Marshal binary format for a given marshal version. Check RubySpec (core/marshal).
          Second, the marshal version has changed between 1.8 and 1.9, but it is backward compatible and mostly forward compatible.

  • LucaB

    I really enjoyed the post, very useful, as its comments are.
    In my recent experience, serialization is fundamental for skipping database access during the creation of objects related to the current one.
    The de-serialization phase is also smooth; I don’t even need to explicitly refer to an array.
    For example:

    MyClass.write_associated_objects_to_yaml_file (a method which accesses the database and uses to_yaml) gives me a YAML file which already represents an array.

    When I do MyClass.read_associated_objects_from_yaml_file, I automatically get an array filled with objects of the right type (MyOtherClass), using "YAML.load_file 'my_yaml.yml'".

    I only have to be careful and do
    load 'my_other_class.rb'
    otherwise I get a bunch of "YAML::Object"s, which are not that useful… :-)

    Thanks, this is a very interesting argument,
    LucaB

  • ac

    i’ve also had problems with yaml and whitespace… for example:

    YAML::load("\n".to_yaml) # => ""

    for that reason i went with json, only to discover that it can’t handle binary payloads in strings, so i ended up stripping out all strings into a separate data block and replacing them with their index in the data block and length.

    (i wanted to also use it with other languages, so marshal wasn’t an option)

  • Sagar

    I enjoyed the post! Very useful and well organized set. ;-) Thank you!

  • Pavel Chernov

    This is what I’ve searched for!
    Thank you!

  • http://in.linkedin.com/in/fahimbabarpatel/ Fahim Babar Patel

    +1….great job….Jajakallah