Writing A More Ruby-ish Array Intersection Function And Sorting Structs

In my previous post on boolean search, we wrote an efficient list (array) intersection function. But, to be honest, I wasn’t very happy with it. Considering the fact that we were using Ruby, it just wasn’t very Ruby-like. So, this time I am going to try and show how that function can be re-written to not only be more Ruby-ish but to also be tighter and easier to understand. In addition I will look at a couple of interesting aspects of OpenStruct, which will lead nicely into my next post on skip pointers (I’ve spoken about covering skip pointers before).

The Ruby-ish Intersection Function

The last time we implemented an array intersection function it looked like this:

def intersect_lists(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0
 
  while(current_list1_index < list1.length && 
                          current_list2_index < list2.length)
    if list1[current_list1_index] == list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
      current_list2_index += 1
    elsif list1[current_list1_index] < list2[current_list2_index]
      current_list1_index += 1
    else
      current_list2_index += 1
    end
  end
  final_list
end

There is actually nothing wrong with the code itself. It is an efficient implementation. The reason it is efficient is due to the fact that we only need to walk through each of the lists once, in order to produce the intersected list. So if the sizes of our two lists are x and y, the intersection takes x+y operations, i.e. the complexity of the algorithm is O(x+y). If we were writing that function in C, everything would be fine at this point, but we’re using Ruby. What we really want our function to look like is something along the lines of:

def intersect_lists(list1, list2)
  list1.each do |item1|
    list2.each do |item2|
      #stuff
    end
  end
end

This would actually work, but we go back to the naive array intersection implementation where we scan through the whole of one list for each of the elements in the other. What we really want is to be able to exit the inner each iterator to allow the outer one to advance, but then restart where we left off the next time we enter the inner one. This way we can compare all the values in both arrays to produce the intersection while still scanning both lists only once. Fortunately with Ruby, we can simply open up the array class and create just such an iterator:

class Array
  def each_continued
    if !@current_index
      reset_current_index
    end
    while @current_index < self.size
      next_index = yield(self[@current_index],@current_index)
      if next_index
        @current_index = next_index
        break
      end
      @current_index += 1
    end
  end
 
  def reset_current_index
    @current_index = 0
  end
end

As you can see our new iterator uses a class level variable (@current_index) to keep track of where the iteration is at. We stop the iteration based on the value that is returned from the block that needs to be passed to this iterator. If the value returned from the block is nil, we continue to iterate. When the value returned is a number, we set it to be the index from which we will restart the iteration next time the method is called. This allows us to not only stop and restart the iteration where we left off, but also roll the iteration forward or back if necessary. This is actually important as we will see shortly. There is only one issue with this implementation, the fact that we need to use a class level variable. The problem is the fact that when one full iteration through the array completes, you need to remember to rewind (i.e. must call reset_current_index) – not great. The only way to overcome this would be keep track of the current position of the iteration externally and pass this value in as a parameter to the iterator, but we won’t worry about that for now, for our purposes, resetting the array is not such a big deal.

We can now re-implement our intersection function, it will look like this:

def rubyish_intersect_lists(list1, list2)
  list3 = []
  list1.each do |value1|
    list2.each_continued do |value2, index2|
      if value1 == value2
        list3 << value2
        index2 + 1
      elsif value1 < value2
        index2
      end
    end
  end
  list3
end

You’ll probably agree that’s much tighter and more Ruby-ish looking. There are two important things to note, in order to understand how it works.

  1. When the values from both lists match, we copy the result into our intersection array, we then need to advance both lists. In order to do this, we get the block of the inner iterator to return the index of the next value in the list. This will cause the inner iterator to terminate, but will also prime it to restart from the index we return out of the block (see what I mean about the importance of giving our iterator the ability to restart the iteration from any index we want). This way we advance our inner iterator and naturally advance the outer one, since the inner one has exited.
  2. When the value of the array in the inner loop is bigger than the outer one, we want to keep advancing the outer iterator, but keep the inner one where it is. We do this very easily, by making the block of the inner iterator return the current index of the list. This has the effect of exiting the inner iterator and priming it to restart in exactly the same place on the next iteration, and of course it naturally advances the outer iterator.

The only other scenario is when the value of the outer loop iterator is bigger then the inner one, in which case we want to advance the inner one. This will happen by default, and so needs no special code to handle it.

Sorting OpenStructs

As I said, this post actually a lead into another post, which will be about skip pointers. This in itself, is no big deal, we still want to intersect lists, the only issue is, they have to be lists of objects rather than lists of primitive integers. After all you actually need somewhere to put the skip pointers (if you don’t really get it, don’t worry, I’ll explain in the next post). When I was playing around with skip pointers initially I decided to use OpenStruct to wrap the integers (which are the values of the arrays we were going to intersect). In hindsight I should have created an object, but I was a bit too far along and didn’t want to go around changing a bunch of code. Here is the problem.

In order to test my intersection function, I generate arrays of integers, using the following function.

def create_raw_mock_postings_list(min_postings, max_postings, corpus_size)
  postings_list = []
  num_postings = min_postings + rand(max_postings)
  num_postings.times do
    postings_list << rand(corpus_size) + 1
  end
  postings_list.sort.uniq
end

This allows me to dummy up what a posting list for a term would look like (a particular number of document ids each of which is a random number between 1 and the size of the corpus). The important bits are in the end i.e. postings_list.sort.uniq. It is very important that the posting list is sorted and all values are distinct. This is fine when your value are integers, it all just works. It would also be fine if the value were objects, since all you would need to do would be to implement the spaceship operator (<=>, so that you can compare them to each other) for everything to once again work (which is why I should have used an object). I, of course, was using OpenStruct and I don’t know of any way to implement the comparison operator.

When it comes to sort, this is reasonably easy to overcome, since that function takes a block that it can use for comparison, but when it comes to uniq, we’re out of luck. I needed a function similar to the following:

def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size)
  postings_list = []
  num_postings = min_postings + rand(max_postings)
  num_postings.times do
    wrapped_id = OpenStruct.new
    wrapped_id.value = rand(corpus_size) + 1
    postings_list << wrapped_id
  end
  postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value}
end

Ruby goodness came to the rescue, since we can, once again, open up Array and create our own uniq function. Like so:

class Array
  def block_uniq
    uniqness_hash = {}
    uniqed_array = []
    self.each do |value|
      if block_given?
        uniqness_value = yield(value)
      else
        uniqness_value = value
      end
      if !uniqness_hash.key?(uniqness_value)
        uniqed_array << value
        uniqness_hash[uniqness_value] = nil
      end
    end
    uniqed_array
  end
end

We use a hash, to make sure the items are unique (i.e. if it is already in the hash, discard it). If no block is given to this function it will just operate like a normal uniq function, but when the block is there, it will use the return value of the block rather than the array items themselves to do it’s job. Implementing this function allows our create_wrapped_mock_postings_list function to work as intended. One last thing to note is the fact that our new block_uniq function preserves the order of the elements of the array we are uniquing, always good to keep things neat.

Intersecting Structs

Now that we’re dealing with structs rather than integers, our previous intersection function is no longer correct (for exactly the same reason, why sort and uniq were such a pain, I really should have used an object instead of OpenStruct, it’s a good learning experience, but let that be a lesson to me for later).

We could just slightly alter our intersection function to work with our structs:

def rubyish_intersect_for_wrapped(list1, list2)
  list3 = []
  list1.each do |item1|
    list2.each_continued do |item2, index2|
      if item1.value == item2.value
        list3 << item2
        index2 + 1
      elsif item1.value < item2.value
        index2
      end
    end
  end
  list3
end

This is actually what I did, but is not necessarily a good solution, since we can augment the original function to work either with integers or with structs or anything else, in a similar way to how we implemented our block_uniq function. For example:

def rubyish_intersect_for_wrapped(list1, list2)
  list3 = []
  list1.each do |item1|
    list2.each_continued do |item2, index2|
      if yield(item1) == yield(item2)
        list3 << item2
        index2 + 1
      elsif yield(item1) < yield(item2)
        index2
      end
    end
  end
  list3
end

We can now call this function like so, when we want to work with structs:

rubyish_intersect_for_wrapped(list1, list2) {|item| item.value}

when we have just plain integers, we would do:

rubyish_intersect_for_wrapped(list1, list2) {|item| item}

Of course, it would be nicer if we had used the block_given? method so that we don’t have to supply any block at all in the second case, but I’ll leave that one as an exercise :).

We’re now ready to tackle skip pointers and see if can actually get better performance (than O(x+y)) from our intersection function. This is coming up very shortly

Image by Thomas Hawk

  • apeiros

    I’m not quite sure whether I might have missed the point, but I’d write an interesect manually like this:
    def interesect(a,b)
    stash = Hash.new(0)
    a.each do |item| stash[item] += 1 end
    b.each do |item| stach[item] += 1 end
    stash.reject { |item,count| count < 2 }.keys
    end

    This uses eql? instead of ==, but that should be an issue.
    The implementation of each_continued is problematic as it is not reentrant. You'd better go via an external iterator there (IMO).
    Also, an @variable is named 'instance_variable', and as the name says it is not really related with the class. It is only attached to one object.
    Another issue I have is with the naming: 'rubyish_intersect_lists'. As far as I understand your implementation, you can only work with *sorted* lists. This should IMO be reflected by the name: rubyish_intersect_sorted_lists

    What might be confusing is your choice of words regarding openstructs: "I really should have used an object instead of OpenStruct". OpenStructs are of course objects too.

    Retrofitting my above intersect for mapped values could look like this:
    def interesect_by(*enumerables)
    stash = Hash.new(0)
    mapping = {}
    enumerables.each do |enumerable|
    enumerable.each do |item|
    mapped = yield(item)
    mapping[mapped] = item
    stash[mapped] += 1
    end
    end
    occurrences = enumerables.length
    mapping.values_at(*stash.reject { |item,count| count < occurrences }.keys)
    end

    The implementation of your block_uniq (which I'd rename to 'uniq_by', following other method's pattern) could be simplified by detecting the missing block early and resorting to uniq:
    def uniq_by
    return uniq unless block_given?

    Last but not least: thank you for writing this article :)

    Regards
    Stefan, aka apeiros

    • http://www.skorks.com Alan Skorkin

      Hi Stefan,

      You’ve made some awesome comments, so i’ll tackle them one by one. Firstly your intersect function would work fine, thanks for sharing it. The reason I didn’t do it this way is as follows. This article is part of a serious of articles on the internals of search engines. The intersection operation is a very important one when it comes to search as it is the most used (have a look at my previous articles regarding this). Considering the amounts of data that an intersection operation must work with when it comes to search you want it to be as efficient as possible. In the case of your implementation it is almost as efficient as the one I supply, but not quite. My intersect is O(x+y) as I mentioned in the article the one you have written is O(x+y+however long it takes to do the reject). It might seem like a very minor difference but when it comes to intersection in information retrieval even a minor inefficiency is significant. Infact my next post on this topic will cover this in more detail.
      You’re probably right, I should have have gone with an external iterator, I was tossing up, but in kept as it is in the end (one of those slip-second decisions), would probably go the other way if I was writing this post again.

      You’re right, it can only work with sorted lists, as postings lists are kept in sorted order when it comes to search indexes, and you’re again right, the naming should reflect it :).

      Yeah I didn’t explain myself as well as I could have. I meant to say “I should have created a class instead of using OpenStruct” :), OpenStruct is most certainly an object.

      I’ll need to study your interesect_by function further, I feel like I’ll be able to learn some stuff from it :).

      Thanks for the tips regarding simplifying the block_uniq, they are all very good points and well worth taking on board.

      Lastly, you’re welcome, and thanks for reading :). I appreciate you taking the time to put down your thoughts, gives me and others the opportunity to see a different point of view and learn something in the process.

      • Trevor

        I dont think you cant really say that O(x+y) is more efficient that O(x+y+some other value). They are all just O(n)…linear time based on the size of your input.

        • http://www.skorks.com Alan Skorkin

          Hey Trevor,

          You’re right of course, the point I was trying to make is that in a search environment when you’re potentially intersecting very large lists, any inefficiencies you can shave off will matter. So while they are both linear time as far as pure big Oh notation is concerned, one is faster than the other :).

  • http://terrbear.org Terry

    Are you implementing the intersection for illustration only? Why not rely on ruby’s native array intersection?

    • http://www.skorks.com Alan Skorkin

      Hi terry,

      Yeah I am mainly doing it to get familiar with the algorithm when it comes to intersecting lists in an information retrieval environment. It is all part of me digging into how search engines work under the hood and this stuff is just the basics of that.

  • apeiros

    OK, I see. I didn’t read the previous article, so I’m obviously missing some information. Is the target to intersect 2 sorted lists of numbers or 2 sorted list of comparables?
    If it is numbers, and the main target is performance, you can further speed it up by not advancing +1 but in quadratic steps and search within the covered space logarithmically. I even have a C extension lying around doing exactly that for arrays of Fixnums (I had a very specific case where performance for that precise operation mattered a lot).
    Btw., Big-O for your and my approach is both O(n) (reject is O(n), n = number of items in both lists together), so if they differ, then only in a constant factor k ;-)

    regards
    Stefan, aka apeiros

    • http://www.skorks.com Alan Skorkin

      In the simple case it is to intersect 2 sorted lists of numbers, but of course real life is never a simple case :). In a real search index the docIds will always end up having to be augmented with extra info so it ends up an intersection of 2 sorted lists of comparables (that’s my current understanding anyway :)).

      I’d actually love to see that C extension purely because I am curious and would love to know how it works . If you have it lying around somewhere accessible (e.g. github) could you point me at it? Don’t worry if it’s too much of a pain.

      You’re right about O(n), but as I said in one of the above comments, when your lists are at least 10s of thousands of entries and potentially millions of entries, everything makes a difference. I’ll get into this kind of stuff more in my next post on this topic, which will be basically a continuation of this post. Which incidentally will also be about trying to improve the performance of an intersection operation (would be good to have your input once I write it).

  • http://github.com/dkubb/veritas Dan Kubb

    I’ve been working on a simple query system that performs most of the simple relational algebra operations in-memory (join, intersection, union, etc) called Veritas. The intersection algorithm I ended up using was: http://bit.ly/ckktz2 .. I also have code to handle restriction, boolean operators and most common simple matchers. The README in the project root provides more information about the API.

    While I can certainly understand your comment about wanting to use the most efficient implementation possible, and agree that an efficient algorithm should be used, I decided to approach this problem a bit differently. Rather than optimizing the code itself, I’ve been choosing the most efficient algorithm and writing the simplest possible ruby code with tests and only optimizing it after benchmarking and profiling.

    • http://www.skorks.com Alan Skorkin

      Hi Dan,

      That looks really cool and your approach (optimizing after benchmarking) sounds like a sensible one also. For me this is more a learning exercise rather than a full project which is why I go straight for the things that interest me such as the most efficient implementations etc.

      • http://github.com/dkubb/veritas Dan Kubb

        Yeah, the stuff I’m doing is also a learning experience. I’m hoping that this system will become the foundation for DataMapper 2.0, and it looks like it will, but it’s not guaranteed. I mostly just wanted to teach myself relational algebra while figuring out a way to apply to to non-RDBMS systems.

  • Mitch Kuppinger

    Use enumerators:

    def intersection(seta,setb)
    ea = seta.each; a = ea.first
    eb = setb.each; b= eb.first
    result = []
    begin
    while true
    case a b
    when -1: a = ea.next
    when +1: b = eb.next
    else result << a
    end
    end
    rescue
    result
    end
    end

    this of course assumes seta and setb are ordered

    • http://www.skorks.com Alan Skorkin

      Ah I see what you’re getting at, pretty cool. This definitely has possibilities, I’ll have to play with this further

  • Mitch Kuppinger

    I should point out that this is ruby 1.9 syntax and usage. I think there has been a backport of this into 1.8.7. the ruby facets library may have some of the same functionality. I have not played with either of those. David Black’s ‘The Well -Grounded Rubyist’ has a nice section on Enumerators. It was my intro tho them. The 1.9 API docs are also helpful.

    The begin..end pair are not needed in a method definition. The while is just to setup an infinite loop that is exited when an error (StopIteration ) is raised. So I can clean the example up a bit:

      def sorted_set_intersection(seta,setb)
      	result = []
      	ea = seta.each
      	eb = setb.each
      	a = ea.first
      	b = eb.first
      	loop do
      	  case a < = > b
      		when -1: a = ea.next
      		when +1: b = eb.next
      		else result << a
      	  end
      	end
      rescue StopIteration
      	result
      end
    

    Enumerators raise the StopIteration error when they move beyond the end of the enumerable by single stepping through.

    The method also depends on an appropriate definition of < = >

    • Mitch Kuppinger

      Corrected code. When matching values are found both lists have to be advanced. The termination of the infinite loop was incorrectly done. The case statement uses a spaceship operator. This does not render well here so I have tried to indicate its presence by adding a surrounding pair of angle brackets.

      def sorted_list_intersection(list1,list2)
        	result = []
        	e1, e2 = list1.each,  list2.each
        	a, b = e1.next,  e2.next
        	while true
        	  begin
        		case a < = > b
        		  when -1 then a = e1.next
        		  when +1 then b = e2.next
        		  else result << a; a = e1.next; b = e2.next
        		end
        	  rescue StopIteration;  return result
        	  end
        	end
        end
      
      • http://www.skorks.com Alan Skorkin

        I’ve put back the spaceship operator into the case statement by putting in a space before and after the = sign i.e. < = > :).

  • Mitch Kuppinger

    Oops! special characters were removed. The last line of my last post should read:

    The method also depends on an appropriate definition of the ‘space-ship operator’. “”

  • Mitch Kuppinger

    Sets in 1.9 support intersection, union and difference methods. Sets in 1.9 are not, I think, ordered. I have no idea how intersection, for example, is implemented so cannot guess how its performance would compare with the intersection method in your post. Perhaps one of your readers has some information.

    • http://www.skorks.com Alan Skorkin

      Intersection is implemented in C for arrays, so it is much faster than a pure ruby solution like mine. I’ve actually posted some quick time comparisons in one of my previous posts. Browsing the C source, it does look like the algorithm used is similar to the one I describe, although difficult to say for sure without spending way longer trying to grok the C code.

  • Mitch Kuppinger

    I’ve played a little further with the Enumerator based list intersect method. There are some optimizations possible.

    The if-then-else construct is faster than case.

    Taking advantage of the fact that finding the intersection of list pairs is faster on lists of raw integers than on lists of ‘wrapped’ integers, I tried doing the intersect on raw lists that exactly mirror the target wrapped lists. I store the index of the matching integers in the raw list intersect result. I then use that result to pull out the corresponding wrapped integers from wrapped list1. In ‘wraps w/ prebuilt’ raws I used the raws lists that were already created for testing raw list intersection methods. In ‘wraps build raws’ I build the raw lists from scratch within the method.

    The latter optimization could, of course, be applied to your methods.

    ENUMERATOR BASED INTERSECT: CASE vs IF-THEN-ELSE
    ---------------------------------------------------------
    list1 size: 1064  list2 size: 4014
    intersect list size: 416
    benchmark repeats: 1000
                 user     system      total        real
    case:   17.140000   0.000000  17.140000 ( 17.171875)
    if:     14.438000   0.000000  14.438000 ( 14.484375)
    
    ENUMERATOR BASED INTERSECT: RAW LISTS AS PROXIES
    ---------------------------------------------------------
    list1 size: 1064  list2 size: 4014
    intersect list size: 416
    benchmark repeats: 1000
                                  user     system      total        real
    raws only:               15.375000   0.031000  15.406000 ( 15.406250)
    wraps only:              49.250000   0.031000  49.281000 ( 49.390625)
    wraps w/ prebuilt raws:  17.328000   0.000000  17.328000 ( 17.359375)
    wraps build raws:        28.515000   0.016000  28.531000 ( 28.593750)
    

    I cleaned up the code a bit:

      def isect_raw_indexed(list1,list2)
        result = []
        e1, e2 = list1.each_with_index, list2.each_with_index
        a, b = e1.next, e2.next
        while true
          if    a[0] < = > b[0] then b = e2.next
          else result << a[1] ;a = e1.next; b = e2.next
          end
        end
        rescue StopIteration; return result
      end
    
      def isect_wrapped(list1,list2)
        result = []
        e1, e2 = list1.each, list2.each
        a, b = e1.next, b = e2.next
        while true
          if    a.value < = > b.value then b = e2.next
          else result << a.value ;a = e1.next; b = e2.next
          end
        end
        rescue StopIteration; return result
      end
    
      def isect_wrapped_with_indexed_raws(raws1, raws2, list1)
        isects = isect_raw_indexed(raws1, raws2)
        isects.inject([]) {|acc,item| acc << list1[item]}
      end
    
      def isect_wrapped_with_indexed_raws2(list1,list2)
        r1 = list1.map {|i| i.value}
        r2 = list2.map {|i| i.value}
        isects = isect_raw_indexed(r1, r2)
        isects.inject([]) {|acc,item| acc << list1[item]}
      end
    
    • http://www.skorks.com Alan Skorkin

      Hi Mitch,

      You’ve done more with this now than I have :), thanks for taking the time to look at this so closely and then share your findings with everyone.

  • http://www.twitter.com/seanmdixon Sean Dixon

    Hello All,

    Came across this posted will looking for a good way to intersect multiple arrays of values. Here is the monkey patch I came up with for my app in case any one else finds it useful:

    class Array
    def intersect
    inject{|sum, n| sum & n}
    end
    end

    Example:
    ruby-1.9.2-p290 :006 > [[2,2,4,5,6,78,1,],[3,2,4,6,3,78],[5,78,4,2,6,7,4,3]].intersect
    => [2, 4, 6, 78]

    Thanks!

    • http://www.twitter.com/seanmdixon Sean Dixon

      Wow, sorry for the obvious lack of proofing. I meant to type “Came across this posting while looking…”

      -Sean