IntersectionAs I mentioned in my previous post about array intersection, all the work we did was to enable us to experiment with skip pointers. You may remember me saying that list intersection was the most important operation when it comes to search engines. This is because in web search, most queries are implicitly intersections. Consider, when your query contains multiple words (e.g. “car repairs”, “britney spears songs” etc.) the search engine will translate this into – (car AND repairs”, “britney AND spears AND songs), which means it will be intersecting 2 or more postings lists in order to return a result. Because intersection is so crucial, search engines try to speed it up in any way possible. One such way is to use skip pointers.

What Are Skip Pointers

As you recall from my previous post if the sizes of our two posting lists are x and y, the intersection takes x+y operations. Skip pointers allow us to potentially complete the intersection in less than x+y operations. Here is how. At indexing time, we augment our posting lists of document identifiers with a set of “short cut” pointers that reference values further along in the list e.g.:

Skip Pointers

This way, if the situation allows, we can avoid processing parts of the list by following this pointer whenever this will have no impact on our intersection operation.

Here is an example, let us say we want to merge two posting lists for the words “car” and “repairs” which have been augmented with skip pointers like so:

Skip Pointers

We would start using our normal intersection algorithm. We continue advancing through both lists until we have matched 12 and advanced to the next item in each list. At this point the “car” list is sitting on 48 and the “repairs” list is on 13, but 13 has a skip pointer. We check the value our skip pointer is pointing at (i.e. 29) and if this value is less than the current value of the “car” list (which it is), we follow our skip pointer and jump to this value in the list. This allows us to skip 3 items in the “repairs” list since we know they weren’t going to match anyway. Using this technique we can complete an intersection operation faster than we would have been able to otherwise.

We can immediately see a problem. If we augment our data with skip pointers there is now extra overhead for storing these as well as checking for their existence, as well as the overhead involved in actual skip pointer comparisons. This can negate any advantage we gain for our intersections. In practice you don’t want to have too many skips for this very reason, but you also don’t want to have too few as this defeats the purpose of having skips in the first place.

One simple heuristic which has been shown to work well, is to have sqrt(n) evenly spaced skip pointers for a posting list of size n (this is decent but not the best as it doesn’t take into account the distribution of your query terms). What we are going to try and do is test empirically whether or not skip pointers will allow us to improve the speed of an intersection operation.

The Test Data

The first thing to do is to generate some dummy posting lists. We use the following function:

ruby def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do wrapped_id = OpenStruct.new wrapped_id.value = rand(corpus_size) + 1 postings_list << wrapped_id end postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value} end

This generates a list of between _minpostings and _maxpostings document ids each of which is a number between 1 and _corpussize. We also need to wrap each of our raw numbers in a container since we know we will be augmenting them with skip pointers.

Putting in the skips turned out to be reasonably complex, making sure they were evenly spaced according to our heuristic (sqrt(n)) as well as making sure everything was correctly handled when we got to the end of the list. The augment_wrapped_list_with_skips does a reasonable job, using two other helper functions. There is nothing really revolutionary about these, they were just extremely fiddly to write.

```ruby def calculate_skip_index_for(skip_number, values_to_skip) skip_index = skip_number * values_to_skip skip_index = skip_index - 1 if skip_index > 0 skip_index end

def build_skip_index_hash_for(list) skip_indexes = {} number_of_skips = Math.sqrt(list.size).to_i number_to_skip = (list.size/number_of_skips).to_i (0..number_of_skips - 1).each do |skip_number| skip_index = calculate_skip_index_for(skip_number, number_to_skip) skip_indexes[skip_index] = calculate_skip_index_for(skip_number + 1, number_to_skip) end skip_indexes end

def augment_wrapped_list_with_skips(list) skip_indexes = build_skip_index_hash_for list list.each_with_index do |list_item, list_index| if skip_indexes.key? list_index list_item.skip_index = skip_indexes[list_index] list_item.skip_value = list[skip_indexes[list_index]].value end end end```

The New Intersection Function

Now that we’re able to augment our lists with skips, all we need as an intersection function that will take advantage of this:

ruby def rubyish_intersect_for_wrapped_with_skips(list1, list2) list3 = [] skips = 0 list1.each_with_skips do |item1, index1| next_index1 = nil list2.each_continued do |item2, index2| if item1.value == item2.value list3 << item2 index2 + 1 elsif item1.value < item2.value while !item1.skip_index.nil? && item1.skip_value <= item2.value next_index1 = item1.skip_index item1 = list1[next_index1] skips += 1 end index2 else next_index2 = nil while !item2.skip_index.nil? && item2.skip_value <= item1.value next_index2 = item2.skip_index item2 = list2[next_index2] next_index1 = index1 skips += 1 end next_index2 end end next_index1 end list3 end

As you can see it is quite a bit more complex than the previous intersection functions we have written. This does nothing more than what I described above:

  • looks for a skip pointer
  • if found, checks the value it is pointing at
  • figures out if we’re able to follow the skip by comparing the current value of the other list with the value the skip is pointing at
  • follows the skip pointer if possible

It could certainly use a helper function or two, but I believe keeping it all in one place makes the algorithm a bit clearer. One thing to note is the fact that the outer each iterator is no longer just a standard one, we’ve had to open up our Array class and add the following:

ruby class Array def each_with_skips current_index = 0 while current_index < self.size next_index = yield(self[current_index], current_index) if next_index current_index = next_index else current_index += 1 end end end end

This essentially allows us to return a value from the block, which will be used as the index to continue the iteration from on the next pass, allowing us to “skip” a number of values when we’re able to follow the skip pointers of the outer iterator. The inner iterator still uses the “_eachcontinued” function that we wrote in the last post as it already had the ability to skip values when necessary. If you can re-write this intersection function to be tighter and more “ruby-ish”, do share it, I’d be intersected to see what other people can come up with.

The Experiment

We’re now ready to run our experiment to see if skip pointers can actually make a difference. All we need is a method to time the execution of our intersection functions. Fortunately I have written about timing Ruby code before (how lucky is that :)), so we’ll just take the timing method from there:

ruby def time_method(method=nil, *args) beginning_time = Time.now if block_given? yield else self.send(method, args) end end_time = Time.now puts "Time elapsed #{(end_time - beginning_time)*1000} milliseconds" end

We want to compare three things:

  • intersecting two lists of raw integers (not augmented by skip pointers and not wrapped by another object)
  • intersecting two lists of “wrapped” integers (wrapped by another object to possibly contain more data) without using skip pointers
  • intersecting two lists of “wrapped” integers (wrapped by another object to possibly contain more data, such as the skips) using skip pointers

Here is the code:

```ruby def create_raw_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do postings_list << rand(corpus_size) + 1 end postings_list.sort.uniq end

def wrap_raw_list(list) wrapped_list = [] list.each do |item| wrapped_item = OpenStruct.new wrapped_item.value = item wrapped_list << wrapped_item end wrapped_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value} end

CORPUS_SIZE = 100000000

raw_list1 = create_raw_mock_postings_list(100, 200, CORPUS_SIZE) raw_list2 = create_raw_mock_postings_list(30000, 50000, CORPUS_SIZE)

wrapped_list1 = wrap_raw_list(raw_list1) wrapped_list2 = wrap_raw_list(raw_list2)

wrapped_list1_with_skips = augment_wrapped_list_with_skips(wrapped_list1) wrapped_list2_with_skips = augment_wrapped_list_with_skips(wrapped_list2)

puts “” puts “RAW” puts “———————————————————” time_method(:rubyish_intersect_for_raw, raw_list1, raw_list2) puts “” puts “WRAPPED” puts “———————————————————” time_method(:rubyish_intersect_for_wrapped, wrapped_list1, wrapped_list2) puts “” puts “WRAPPED WITH SKIPS” puts “———————————————————” wrapped_list1_with_skips.reset_current_index wrapped_list2_with_skips.reset_current_index time_method(:rubyish_intersect_for_wrapped_with_skips, wrapped_list1_with_skips, wrapped_list2_with_skips)```

Running this, we get results similar to the following:

RAW
---------------------------------------------------------
Time elapsed 59.634 milliseconds

WRAPPED
---------------------------------------------------------
Time elapsed 265.142 milliseconds

WRAPPED WITH SKIPS
---------------------------------------------------------
Time elapsed 245.459 milliseconds

I won’t bore you by running this dozens of times here, suffice to say I did run this lots and lots of times while tweaking the parameters :) (i.e. the sizes of the lists, the corpus size). Here is what I found.

No matter what you do, intersecting a “raw” (not wrapped in a decorator and no skip pointers) list is always significantly faster. This is to be expected, but ultimately means nothing since it is not possible to augment these raw values with skip pointers anyway. Regardless in a real search engine index, document ids in posting lists will often need to contain extra data in addition to skip pointers (e.g. positional information etc.) and so will almost never be raw numbers. What we’re really interested in is comparing “wrapped” lists with and without skips. As you can see from the output above, skip pointers made the list intersection slightly faster, but this is not always the case. Here are some interesting facts that I have found:

  • when both lists are sufficiently small, we never skip but the time difference is negligible due to the size of the lists
  • for larger lists of similar size (i.e. both lists around 30000 values) skip pointers are followed rarely and are therefore actually detrimental to the performance of the intersection function (due to the overhead of checking the skip values but being unable to follow), with the times being 10% to 40% slower for lists with skips
  • when posting list sizes are orders of magnitude different (i.e. list of size 200 vs list of size 20000 like the code above), skip pointers begin to come into their own since the larger list is able to skip a lot more often, the improvement is around 10% to 40% (10 is more likely than 40)

As you can see, skip pointers certainly can improve the speed of an intersection function, but only in certain situations. Of course there are a couple of things to be aware of. Firstly, random generation of posting lists data may not be representative of the distribution of the document ids, in a posting list, in a real search engine index. When randomly generated, document ids end up evenly distributed across the size of the corpus, in real life, this is likely not to be the case, which may allow us to skip a lot more often, improving performance. Secondly, we augmented our data with skip pointers in a somewhat naive fashion (sqrt(n)). If we can come up with another heuristic that will fit our data better (through data analysis), we may be able to improve performance even more. Well, there we go, some interesting code, a new concept and a modified intersection algorithm, don’t forget to subscribe to my feed if you enjoyed it. As always, if you have something to add or have a question, I’d love to hear from you.