Faster List Intersection Using Skip Pointers

IntersectionAs I mentioned in my previous post about array intersection, all the work we did was to enable us to experiment with skip pointers. You may remember me saying that list intersection was the most important operation when it comes to search engines. This is because in web search, most queries are implicitly intersections. Consider, when your query contains multiple words (e.g. "car repairs", "britney spears songs" etc.) the search engine will translate this into – ("car AND repairs", "britney AND spears AND songs"), which means it will be intersecting 2 or more postings lists in order to return a result. Because intersection is so crucial, search engines try to speed it up in any way possible. One such way is to use skip pointers.

What Are Skip Pointers

As you recall from my previous post if the sizes of our two posting lists are x and y, the intersection takes x+y operations. Skip pointers allow us to potentially complete the intersection in less than x+y operations. Here is how. At indexing time, we augment our posting lists of document identifiers with a set of "short cut" pointers that reference values further along in the list e.g.:

Skip Pointers

This way, if the situation allows, we can avoid processing parts of the list by following this pointer whenever this will have no impact on our intersection operation.

Here is an example, let us say we want to merge two posting lists for the words "car" and "repairs" which have been augmented with skip pointers like so:

Skip Pointers

We would start using our normal intersection algorithm. We continue advancing through both lists until we have matched 12 and advanced to the next item in each list. At this point the "car" list is sitting on 48 and the "repairs" list is on 13, but 13 has a skip pointer. We check the value our skip pointer is pointing at (i.e. 29) and if this value is less than the current value of the "car" list (which it is), we follow our skip pointer and jump to this value in the list. This allows us to skip 3 items in the "repairs" list since we know they weren't going to match anyway. Using this technique we can complete an intersection operation faster than we would have been able to otherwise.

We can immediately see a problem. If we augment our data with skip pointers there is now extra overhead for storing these as well as checking for their existence, as well as the overhead involved in actual skip pointer comparisons. This can negate any advantage we gain for our intersections. In practice you don't want to have too many skips for this very reason, but you also don't want to have too few as this defeats the purpose of having skips in the first place.

One simple heuristic which has been shown to work well, is to have sqrt(n) evenly spaced skip pointers for a posting list of size n (this is decent but not the best as it doesn't take into account the distribution of your query terms). What we are going to try and do is test empirically whether or not skip pointers will allow us to improve the speed of an intersection operation.

The Test Data

The first thing to do is to generate some dummy posting lists. We use the following function:

def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size)
  postings_list = []
  num_postings = min_postings + rand(max_postings)
  num_postings.times do
    wrapped_id = OpenStruct.new
    wrapped_id.value = rand(corpus_size) + 1
    postings_list << wrapped_id
  end
  postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value}
end

This generates a list of between min_postings and max_postings document ids each of which is a number between 1 and corpus_size. We also need to wrap each of our raw numbers in a container since we know we will be augmenting them with skip pointers.

Putting in the skips turned out to be reasonably complex, making sure they were evenly spaced according to our heuristic (sqrt(n)) as well as making sure everything was correctly handled when we got to the end of the list. The augment_wrapped_list_with_skips does a reasonable job, using two other helper functions. There is nothing really revolutionary about these, they were just extremely fiddly to write.

def calculate_skip_index_for(skip_number, values_to_skip)
  skip_index = skip_number * values_to_skip
  skip_index = skip_index - 1 if skip_index > 0
  skip_index
end
 
def build_skip_index_hash_for(list)
  skip_indexes = {}
  number_of_skips = Math.sqrt(list.size).to_i
  number_to_skip = (list.size/number_of_skips).to_i
  (0..number_of_skips - 1).each do |skip_number|
    skip_index = calculate_skip_index_for(skip_number, number_to_skip)
    skip_indexes[skip_index] = calculate_skip_index_for(skip_number + 1, number_to_skip)
  end
  skip_indexes
end
 
def augment_wrapped_list_with_skips(list)
  skip_indexes = build_skip_index_hash_for list
  list.each_with_index do |list_item, list_index|
    if skip_indexes.key? list_index
      list_item.skip_index = skip_indexes[list_index]
      list_item.skip_value = list[skip_indexes[list_index]].value
    end
  end
end

The New Intersection Function

Now that we're able to augment our lists with skips, all we need as an intersection function that will take advantage of this:

def rubyish_intersect_for_wrapped_with_skips(list1, list2)
  list3 = []
  skips = 0
  list1.each_with_skips do |item1, index1|
    next_index1 = nil
    list2.each_continued do |item2, index2|
      if item1.value == item2.value
        list3 << item2
        index2 + 1
      elsif item1.value < item2.value
        while !item1.skip_index.nil? && item1.skip_value <= item2.value
          next_index1 = item1.skip_index
          item1 = list1[next_index1]
          skips += 1
        end
        index2
      else
        next_index2 = nil
        while !item2.skip_index.nil? && item2.skip_value <= item1.value
          next_index2 = item2.skip_index
          item2 = list2[next_index2]
          next_index1 = index1
          skips += 1
        end
        next_index2
      end
    end
    next_index1
  end
  list3
end

As you can see it is quite a bit more complex than the previous intersection functions we have written. This does nothing more than what I described above:

  • looks for a skip pointer
  • if found, checks the value it is pointing at
  • figures out if we're able to follow the skip by comparing the current value of the other list with the value the skip is pointing at
  • follows the skip pointer if possible

It could certainly use a helper function or two, but I believe keeping it all in one place makes the algorithm a bit clearer. One thing to note is the fact that the outer each iterator is no longer just a standard one, we've had to open up our Array class and add the following:

class Array
  def each_with_skips
    current_index = 0
    while current_index < self.size
      next_index = yield(self[current_index], current_index)
      if next_index
        current_index = next_index
      else
        current_index += 1
      end
    end
  end
end

This essentially allows us to return a value from the block, which will be used as the index to continue the iteration from on the next pass, allowing us to "skip" a number of values when we're able to follow the skip pointers of the outer iterator. The inner iterator still uses the "each_continued" function that we wrote in the last post as it already had the ability to skip values when necessary. If you can re-write this intersection function to be tighter and more "ruby-ish", do share it, I'd be intersected to see what other people can come up with.

The Experiment

We're now ready to run our experiment to see if skip pointers can actually make a difference. All we need is a method to time the execution of our intersection functions. Fortunately I have written about timing Ruby code before (how lucky is that :)), so we'll just take the timing method from there:

def time_method(method=nil, *args)
  beginning_time = Time.now
  if block_given?
    yield
  else
    self.send(method, args)
  end
  end_time = Time.now
  puts "Time elapsed #{(end_time - beginning_time)*1000} milliseconds"
end

We want to compare three things:

  • intersecting two lists of raw integers (not augmented by skip pointers and not wrapped by another object)
  • intersecting two lists of "wrapped" integers (wrapped by another object to possibly contain more data) without using skip pointers
  • intersecting two lists of "wrapped" integers (wrapped by another object to possibly contain more data, such as the skips) using skip pointers

Here is the code:

def create_raw_mock_postings_list(min_postings, max_postings, corpus_size)
  postings_list = []
  num_postings = min_postings + rand(max_postings)
  num_postings.times do
    postings_list << rand(corpus_size) + 1
  end
  postings_list.sort.uniq
end
 
def wrap_raw_list(list)
  wrapped_list = []
  list.each do |item|
    wrapped_item = OpenStruct.new
    wrapped_item.value = item
    wrapped_list << wrapped_item
  end
  wrapped_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value}
end
 
CORPUS_SIZE = 100000000
 
raw_list1 = create_raw_mock_postings_list(100, 200, CORPUS_SIZE)
raw_list2 = create_raw_mock_postings_list(30000, 50000, CORPUS_SIZE)
 
wrapped_list1 = wrap_raw_list(raw_list1)
wrapped_list2 = wrap_raw_list(raw_list2)
 
wrapped_list1_with_skips = augment_wrapped_list_with_skips(wrapped_list1)
wrapped_list2_with_skips = augment_wrapped_list_with_skips(wrapped_list2)
 
puts ""
puts "RAW"
puts "---------------------------------------------------------"
time_method(:rubyish_intersect_for_raw, raw_list1, raw_list2)
puts ""
puts "WRAPPED"
puts "---------------------------------------------------------"
time_method(:rubyish_intersect_for_wrapped, wrapped_list1, wrapped_list2)
puts ""
puts "WRAPPED WITH SKIPS"
puts "---------------------------------------------------------"
wrapped_list1_with_skips.reset_current_index
wrapped_list2_with_skips.reset_current_index
time_method(:rubyish_intersect_for_wrapped_with_skips, wrapped_list1_with_skips, wrapped_list2_with_skips)

Running this, we get results similar to the following:

RAW
---------------------------------------------------------
Time elapsed 59.634 milliseconds

WRAPPED
---------------------------------------------------------
Time elapsed 265.142 milliseconds

WRAPPED WITH SKIPS
---------------------------------------------------------
Time elapsed 245.459 milliseconds

I won't bore you by running this dozens of times here, suffice to say I did run this lots and lots of times while tweaking the parameters :) (i.e. the sizes of the lists, the corpus size). Here is what I found.

No matter what you do, intersecting a "raw" (not wrapped in a decorator and no skip pointers) list is always significantly faster. This is to be expected, but ultimately means nothing since it is not possible to augment these raw values with skip pointers anyway. Regardless in a real search engine index, document ids in posting lists will often need to contain extra data in addition to skip pointers (e.g. positional information etc.) and so will almost never be raw numbers. What we're really interested in is comparing "wrapped" lists with and without skips. As you can see from the output above, skip pointers made the list intersection slightly faster, but this is not always the case. Here are some interesting facts that I have found:

  • when both lists are sufficiently small, we never skip but the time difference is negligible due to the size of the lists
  • for larger lists of similar size (i.e. both lists around 30000 values) skip pointers are followed rarely and are therefore actually detrimental to the performance of the intersection function (due to the overhead of checking the skip values but being unable to follow), with the times being 10% to 40% slower for lists with skips
  • when posting list sizes are orders of magnitude different (i.e. list of size 200 vs list of size 20000 like the code above), skip pointers begin to come into their own since the larger list is able to skip a lot more often, the improvement is around 10% to 40% (10 is more likely than 40)

As you can see, skip pointers certainly can improve the speed of an intersection function, but only in certain situations. Of course there are a couple of things to be aware of. Firstly, random generation of posting lists data may not be representative of the distribution of the document ids, in a posting list, in a real search engine index. When randomly generated, document ids end up evenly distributed across the size of the corpus, in real life, this is likely not to be the case, which may allow us to skip a lot more often, improving performance. Secondly, we augmented our data with skip pointers in a somewhat naive fashion (sqrt(n)). If we can come up with another heuristic that will fit our data better (through data analysis), we may be able to improve performance even more. Well, there we go, some interesting code, a new concept and a modified intersection algorithm, don't forget to subscribe to my feed if you enjoyed it. As always, if you have something to add or have a question, I'd love to hear from you. 

  • Pingback: Dew Drop – March 20, 2010 | Alvin Ashcraft's Morning Dew

  • Amanda

    Hi Alan,

    Thanks for this interesting post.

    I think your random data generation is definitely a problem as your benchmark results are basically just showing that two barrels of apples with 50/50 mixtures of red and green fruit have… well, a lot of mixture. (More mixture = smaller homogenous “chunks” = less skipping.) No surprises there. :P

    Also, minor point about implementation…

    Quote: “No matter what you do, intersecting a “raw” (not wrapped in a decorator and no skip pointers) list is always significantly faster.”

    This is an unfortunate effect of languages which hafta autobox everything and don’t support composite value types that are statically known. You wouldn’t see such a significant overhead with the “wrapped” results if you implemented it with such a language (eg. C or C#).

    • Amanda

      Just realised my apple barrel analogy is broken. Make that two barrels of apples of myriad colors, with each barrel having some of each color.

      …or just forget the apple barrels. >.<

    • http://www.skorks.com Alan Skorkin

      Yeah, that’s pretty much the feeling I got after I started getting my results. Unfortunately the only real way to find out what the distribution would be in real life is to index and analyse a real corpus, I just need to “steal” some data from somewhere :). The other way is to ask someone who already know, anyone :)?

  • http://comonad.com/ Edward Kmett

    You might have better results using real skip lists. One problem you have is that with such a large granularity it is hard to ever use the skip list pointer. In a real skip list your nodes can have a number of pointers, where the random distribution of the sizes of the pointers is geometric. (50% with size 1, 25% with size 2, 12.5% with size 3, etc.). This randomized property is important because it allows you to cons onto the skip list or tweak the middle of the list without having to shuffle all of your skip pointers later in the list.

    Once you have that you can choose in your intersection operator to search through the skip pointers for the next item, and you can bound how high up in the skip pointers you need to go based on some quick checks at the start of the intersection or merge.

    • http://www.skorks.com Alan Skorkin

      Hi Edward,

      I’ll definitely give skip lists a look, the other thing that has been suggested is balanced binary trees. Lots of interesting things to have a look at, thanks for the suggestion – much appreciated.

  • http://comonad.com/ Edward Kmett

    Perhaps a better way to put it is that searching for a matching item in a skip list is O(log n). Searching for it in your structure here is O(sqrt n). Skip list performance will dominate your approach asymptotically.

    • http://rgrig.blogspot.com/ Radu Grigore

      If you want to find a matching item in O(lg n) then all you need is binary search on the original sorted array, no links required.

      • http://www.skorks.com Alan Skorkin

        Hey Radu,

        That’s actually a pretty good idea, I might play around with it and do some experiments to see how the timings compare.

        • http://rgrig.blogspot.com/ Radu Grigore

          Actually, I don’t expect binary search to do well here because it doesn’t use the cache nicely. But, if you try, let us know. :)

          If you have two lists of the same size n that are equal, then you need Ω(n) time just to output the result, so the naive merging algorithm is optimal. In other words, you may only expect to do better if (1) you look at the average-case instead of the worst-case or (2) lists have significantly different sizes. The second point leads to the idea of iterating through the short list and checking for each element if it is in the other list. Checking if an integer is in a set is done in expected constant time with hash-tables. You might want to try it.

          Because hash-tables are not terribly cache-friendly (which reminds me, use linear probing not chaining if you want it fast!) you could install a Bloom filter in front of it that quickly discards many of the elements that aren’t in the set. This is probably worth it only for the big lists/hashtables.

          This brings the worst-case down from O(m+n) to O(min(m,n)). Analyzing at the average case is more tricky.

  • http://rgrig.blogspot.com/ Radu Grigore

    You may want to read about skip lists. The intersection of two sorted skip lists would be O(n lg lg n) worst case if I’m not mistaken, but you might get some improvements in practice (just as you did with your ‘heuristic’ O(sqrt(n)) skipping).

    • http://www.skorks.com Alan Skorkin

      Cheers Radu, I am on to it :), gotta love it when you find out about new things.

  • Pingback: TiggrSuccess Blog

  • Mitch Kuppinger

    You might consider the use of Enumerators here as well. The examples are written (and tested) for ruby 1.9.1. I’m assuming the raw, wrapped and wrapped_with_skips lists are structured as you described them in your post. I have no real world experience with the issues you are writing about. However, the problems seemed to be an excellent opportunity to learn more about Enumerators. I hope you and your readers find the following similarly useful.

    module SortedListOps 
    # This is the method for a 'raw' list (a sorted list of unique integers)
      def isect_raw(list1,list2)
      	result = []
      	e1, e2 = list1.each, list2.each
      	a, b = e1.next, e2.next
      	while true
      	  begin
      		case a < = > b
      		  when -1 then a = e1.next
      		  when +1 then b = e2.next
      		  else result << a ;a = e1.next; b = e2.next
      		end
      	  rescue StopIteration; return result
      	  end
      	end
      end
    
    # This does the same for a list of integers  now wrapped in ostructs.
      def isect_wrapped(list1,list2)
      	result = []
      	e1, e2 = list1.each, list2.each
      	a, b = e1.next, b = e2.next
      	loop do
      	  begin
      		case a.value < = > b.value
      		  when -1 then a = e1.next
      		  when +1 then b = e2.next
      		  else result << a.value ;a = e1.next; b = e2.next
      		end
      	  rescue StopIteration; return result
      	  end
      	end
      end
    
    # Finally the intersection method for sorted ostructs containing an integer value and a skip_value representing the value of the ostruct one skip_size further along in the list.
      def isect_wrapped_with_skips(list1,list2)
      	result = []
      	skip_size1 = Math.sqrt(list1.size).to_i
      	skip_size2 = Math.sqrt(list2.size).to_i
      	s = 1
      	e1 = list1.enum_for(:skip, s)
      	e2 = list2.enum_for(:skip, s)
      	a, b = e1.next, b = e2.next
      	loop do
      	  begin
      		case a.value < = > b.value
      		  when -1 then s = (a.skip_value && (a.skip_value < b.value) ? skip_size1 : 1); a = e1.next
      		  when +1 then s = (b.skip_value && (b.skip_value< a.value) ? skip_size2 : 1); b = e2.next
      		  else result << a.value; s = 1; a = e1.next; b = e2.next
      		end
      	  rescue StopIteration; return result
      	  end
      	end
      end
    end
    
    class Array
      def skip(s = 1)
      	current_index = -1
      	loop do
      	  current_index += s
      	  raise StopIteration unless current_index < self.length
      	  yield self[current_index] || self.last
      	end
      end
    end
    

    The isect_wrapped_with_skips method depends on defining a skip iterator for Array. I take advantage of the fact that the arguments passed to an enumerator created by enumerable.enum_for(:method, *args) are available to the enumerator where ever it is used. Assignments to those args are seen within the enumerator and thus can be used to affect the enumerator behavior. Note also that you cannot use enumerator.first to set the initial values. This does not set the position within the enumerable in a way to support the correct behavior of the first use of enumerator.next. This seems to me to be a bug but I'll have to investigate further.

    When I put these routines through your timing routines they appear to be quite fast but I haven't had time to compare them directly with your versions.

    I'm very glad to have found your blog. It has been very stimulating. Thanks!

    btw. My reply to your post (http://www.skorks.com/2010/03/writing-a-more-ruby-ish-array-intersection-function-and-sorting-structs/) contained several errors. I wrote that code off the top of my head and failed to test it at all before submitting it. :-( The current code does work in general but there are edge cases that need to be tested before I would use in any production code.

    • Mitch Kuppinger

      Once again I am defeated by the spaceship operator! Your rendering of a reply removes the spaceship operator characters. That is unfortunately the test I use in each of the case statements in the cod I submitted above.

      Is there anything I can do to avoid similar problems in the future?

      • http://www.skorks.com Alan Skorkin

        This is WordPress trying to be smart and trying to interpret anything in less than and greater than signs as an html tag, and since the spaceship html tag doesn’t exist it just removes it. If you put spaces around the = sign, it should prevent WordPress from removing it
        i.e. < = >

        The other thing you can do is to wrap the code in a “pre” tag e.g.:

        def hash  
           title.hash 
        end  
        

        but I believe it would still be better to put spaces in the spaceship operator just to make sure.

        I’ve fixed up your comment for the moment. If it does happen again despite all this, I’ll have a look at fixing it in a more fundamental way.

    • http://www.skorks.com Alan Skorkin

      Hi Mitch,

      This is awesome, I’ll need to spend a little bit of time playing around with your code before I can say anything intelligent so bear with me :). It might even end up as a separate post :).

      I hope you continue to find my blog interesting.

  • http://www.gameranx.com/ Atul Kash

    I didn’t know about Skip Pointers to begin with, thanks for enlightening me

    • http://www.skorks.com Alan Skorkin

      Any time :)

  • Mitch Kuppinger

    Alan,

    I decided to try comparing the performance of my approach to yours. I tried your timing method but was not satisfied because on my slowish hardware the resolution appears to be about 15.625 microseconds and thus on a single pass I could not distinguish the relative speeds of each approach. I then changed to Benchmark only to discover that you had already blogged about it! I think Benchmark gives a better picture of the relative performances. I ran 3 types of lists through the benchmarking process just as you had done: raw integer lists, wrapped lists and wrapped lists with skips. I stored the integer values of the list members that exist in the intersection in the output list. This made it easy to verify that each intersection method returned the same output. I created the lists once and stored them as yaml to allow repeated testing.

    The results surprised me:

        list1 size: 1064      list2 size: 4014      intersect list size: 416
        1000 passes:
    		Enumerator Based Intersect	        Skorks Rubyish Intersect			
    		user	system	 total	 real		user	system	 total	 real
    raws:	21.59	0.02	21.61	21.64		0.89	0.00	0.89	0.89
    wraps:	38.55	0.00	38.55	38.56		9.05	0.00	9.05	9.05
    skips:	51.53	0.02	51.55	51.69		0.34	0.00	0.34	0.34
    

    The file containing the input lists in yaml occupied 678 KB on disk and I had 2+ meg of RAM available as I started this run. I don’t think these results can be explained by memory problems. eg using virtual RAM on disk. It appears that the Enumerator based approach is much slower than the approach you used! I thought my suggestion was much easier to read and understand but the apparent performance difference has to make your code the clear winner. ;-)

    btw. I also ran this with far larger lists that clearly exceeded available memory and found that the Enumerator based approach appeared faster. I think that may be because that approach is not supposed to require as much memory so perhaps there was less HD thrashing going on.

    The take home message is of course:

    When performance matters, test your code for performance.

    I can send you the code for my tests, if you would like.

    • http://www.skorks.com Alan Skorkin

      Hi Mitch,

      That’s really interesting. Yeah I’d love for you to send me the code you used, I’ll play around with it myself. If you don’t mind I might even do a separate post about it (if I can fit it in that is, so much to do so little time :)).

  • Burhan

    “We continue advancing through both lists until we have matched 12 and advanced to the next item in each list.”

    I didn’t get why 12?