As I mentioned in my previous post about array intersection, all the work we did was to enable us to experiment with skip pointers. You may remember me saying that list intersection was the most important operation when it comes to search engines. This is because in web search, most queries are implicitly intersections. Consider, when your query contains multiple words (e.g. "car repairs", "britney spears songs" etc.) the search engine will translate this into – ("car AND repairs", "britney AND spears AND songs"), which means it will be intersecting 2 or more postings lists in order to return a result. Because intersection is so crucial, search engines try to speed it up in any way possible. One such way is to use skip pointers.
What Are Skip Pointers
As you recall from my previous post if the sizes of our two posting lists are x and y, the intersection takes x+y operations. Skip pointers allow us to potentially complete the intersection in less than x+y operations. Here is how. At indexing time, we augment our posting lists of document identifiers with a set of "short cut" pointers that reference values further along in the list e.g.:

This way, if the situation allows, we can avoid processing parts of the list by following this pointer whenever this will have no impact on our intersection operation.
Here is an example, let us say we want to merge two posting lists for the words "car" and "repairs" which have been augmented with skip pointers like so:

We would start using our normal intersection algorithm. We continue advancing through both lists until we have matched 12 and advanced to the next item in each list. At this point the "car" list is sitting on 48 and the "repairs" list is on 13, but 13 has a skip pointer. We check the value our skip pointer is pointing at (i.e. 29) and if this value is less than the current value of the "car" list (which it is), we follow our skip pointer and jump to this value in the list. This allows us to skip 3 items in the "repairs" list since we know they weren't going to match anyway. Using this technique we can complete an intersection operation faster than we would have been able to otherwise.
We can immediately see a problem. If we augment our data with skip pointers there is now extra overhead for storing these as well as checking for their existence, as well as the overhead involved in actual skip pointer comparisons. This can negate any advantage we gain for our intersections. In practice you don't want to have too many skips for this very reason, but you also don't want to have too few as this defeats the purpose of having skips in the first place.
One simple heuristic which has been shown to work well, is to have sqrt(n) evenly spaced skip pointers for a posting list of size n (this is decent but not the best as it doesn't take into account the distribution of your query terms). What we are going to try and do is test empirically whether or not skip pointers will allow us to improve the speed of an intersection operation.
The Test Data
The first thing to do is to generate some dummy posting lists. We use the following function:
def create_wrapped_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do wrapped_id = OpenStruct.new wrapped_id.value = rand(corpus_size) + 1 postings_list << wrapped_id end postings_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value} end
This generates a list of between min_postings and max_postings document ids each of which is a number between 1 and corpus_size. We also need to wrap each of our raw numbers in a container since we know we will be augmenting them with skip pointers.
Putting in the skips turned out to be reasonably complex, making sure they were evenly spaced according to our heuristic (sqrt(n)) as well as making sure everything was correctly handled when we got to the end of the list. The augment_wrapped_list_with_skips does a reasonable job, using two other helper functions. There is nothing really revolutionary about these, they were just extremely fiddly to write.
def calculate_skip_index_for(skip_number, values_to_skip) skip_index = skip_number * values_to_skip skip_index = skip_index - 1 if skip_index > 0 skip_index end def build_skip_index_hash_for(list) skip_indexes = {} number_of_skips = Math.sqrt(list.size).to_i number_to_skip = (list.size/number_of_skips).to_i (0..number_of_skips - 1).each do |skip_number| skip_index = calculate_skip_index_for(skip_number, number_to_skip) skip_indexes[skip_index] = calculate_skip_index_for(skip_number + 1, number_to_skip) end skip_indexes end def augment_wrapped_list_with_skips(list) skip_indexes = build_skip_index_hash_for list list.each_with_index do |list_item, list_index| if skip_indexes.key? list_index list_item.skip_index = skip_indexes[list_index] list_item.skip_value = list[skip_indexes[list_index]].value end end end
The New Intersection Function
Now that we're able to augment our lists with skips, all we need as an intersection function that will take advantage of this:
def rubyish_intersect_for_wrapped_with_skips(list1, list2) list3 = [] skips = 0 list1.each_with_skips do |item1, index1| next_index1 = nil list2.each_continued do |item2, index2| if item1.value == item2.value list3 << item2 index2 + 1 elsif item1.value < item2.value while !item1.skip_index.nil? && item1.skip_value <= item2.value next_index1 = item1.skip_index item1 = list1[next_index1] skips += 1 end index2 else next_index2 = nil while !item2.skip_index.nil? && item2.skip_value <= item1.value next_index2 = item2.skip_index item2 = list2[next_index2] next_index1 = index1 skips += 1 end next_index2 end end next_index1 end list3 end
As you can see it is quite a bit more complex than the previous intersection functions we have written. This does nothing more than what I described above:
- looks for a skip pointer
- if found, checks the value it is pointing at
- figures out if we're able to follow the skip by comparing the current value of the other list with the value the skip is pointing at
- follows the skip pointer if possible
It could certainly use a helper function or two, but I believe keeping it all in one place makes the algorithm a bit clearer. One thing to note is the fact that the outer each iterator is no longer just a standard one, we've had to open up our Array class and add the following:
class Array def each_with_skips current_index = 0 while current_index < self.size next_index = yield(self[current_index], current_index) if next_index current_index = next_index else current_index += 1 end end end end
This essentially allows us to return a value from the block, which will be used as the index to continue the iteration from on the next pass, allowing us to "skip" a number of values when we're able to follow the skip pointers of the outer iterator. The inner iterator still uses the "each_continued" function that we wrote in the last post as it already had the ability to skip values when necessary. If you can re-write this intersection function to be tighter and more "ruby-ish", do share it, I'd be intersected to see what other people can come up with.
The Experiment
We're now ready to run our experiment to see if skip pointers can actually make a difference. All we need is a method to time the execution of our intersection functions. Fortunately I have written about timing Ruby code before (how lucky is that :)), so we'll just take the timing method from there:
def time_method(method=nil, *args) beginning_time = Time.now if block_given? yield else self.send(method, args) end end_time = Time.now puts "Time elapsed #{(end_time - beginning_time)*1000} milliseconds" end
We want to compare three things:
- intersecting two lists of raw integers (not augmented by skip pointers and not wrapped by another object)
- intersecting two lists of "wrapped" integers (wrapped by another object to possibly contain more data) without using skip pointers
- intersecting two lists of "wrapped" integers (wrapped by another object to possibly contain more data, such as the skips) using skip pointers
Here is the code:
def create_raw_mock_postings_list(min_postings, max_postings, corpus_size) postings_list = [] num_postings = min_postings + rand(max_postings) num_postings.times do postings_list << rand(corpus_size) + 1 end postings_list.sort.uniq end def wrap_raw_list(list) wrapped_list = [] list.each do |item| wrapped_item = OpenStruct.new wrapped_item.value = item wrapped_list << wrapped_item end wrapped_list.sort{|x,y| x.value <=> y.value}.block_uniq{|item| item.value} end CORPUS_SIZE = 100000000 raw_list1 = create_raw_mock_postings_list(100, 200, CORPUS_SIZE) raw_list2 = create_raw_mock_postings_list(30000, 50000, CORPUS_SIZE) wrapped_list1 = wrap_raw_list(raw_list1) wrapped_list2 = wrap_raw_list(raw_list2) wrapped_list1_with_skips = augment_wrapped_list_with_skips(wrapped_list1) wrapped_list2_with_skips = augment_wrapped_list_with_skips(wrapped_list2) puts "" puts "RAW" puts "---------------------------------------------------------" time_method(:rubyish_intersect_for_raw, raw_list1, raw_list2) puts "" puts "WRAPPED" puts "---------------------------------------------------------" time_method(:rubyish_intersect_for_wrapped, wrapped_list1, wrapped_list2) puts "" puts "WRAPPED WITH SKIPS" puts "---------------------------------------------------------" wrapped_list1_with_skips.reset_current_index wrapped_list2_with_skips.reset_current_index time_method(:rubyish_intersect_for_wrapped_with_skips, wrapped_list1_with_skips, wrapped_list2_with_skips)
Running this, we get results similar to the following:
RAW --------------------------------------------------------- Time elapsed 59.634 milliseconds WRAPPED --------------------------------------------------------- Time elapsed 265.142 milliseconds WRAPPED WITH SKIPS --------------------------------------------------------- Time elapsed 245.459 milliseconds
I won't bore you by running this dozens of times here, suffice to say I did run this lots and lots of times while tweaking the parameters :) (i.e. the sizes of the lists, the corpus size). Here is what I found.
No matter what you do, intersecting a "raw" (not wrapped in a decorator and no skip pointers) list is always significantly faster. This is to be expected, but ultimately means nothing since it is not possible to augment these raw values with skip pointers anyway. Regardless in a real search engine index, document ids in posting lists will often need to contain extra data in addition to skip pointers (e.g. positional information etc.) and so will almost never be raw numbers. What we're really interested in is comparing "wrapped" lists with and without skips. As you can see from the output above, skip pointers made the list intersection slightly faster, but this is not always the case. Here are some interesting facts that I have found:
- when both lists are sufficiently small, we never skip but the time difference is negligible due to the size of the lists
- for larger lists of similar size (i.e. both lists around 30000 values) skip pointers are followed rarely and are therefore actually detrimental to the performance of the intersection function (due to the overhead of checking the skip values but being unable to follow), with the times being 10% to 40% slower for lists with skips
- when posting list sizes are orders of magnitude different (i.e. list of size 200 vs list of size 20000 like the code above), skip pointers begin to come into their own since the larger list is able to skip a lot more often, the improvement is around 10% to 40% (10 is more likely than 40)
As you can see, skip pointers certainly can improve the speed of an intersection function, but only in certain situations. Of course there are a couple of things to be aware of. Firstly, random generation of posting lists data may not be representative of the distribution of the document ids, in a posting list, in a real search engine index. When randomly generated, document ids end up evenly distributed across the size of the corpus, in real life, this is likely not to be the case, which may allow us to skip a lot more often, improving performance. Secondly, we augmented our data with skip pointers in a somewhat naive fashion (sqrt(n)). If we can come up with another heuristic that will fit our data better (through data analysis), we may be able to improve performance even more. Well, there we go, some interesting code, a new concept and a modified intersection algorithm, don't forget to subscribe to my feed if you enjoyed it. As always, if you have something to add or have a question, I'd love to hear from you.
Related posts:
- Writing A More Ruby-ish Array Intersection Function And Sorting Structs
- Timing Ruby Code – It Is Easy With Benchmark
- Lets Roll Our Own Boolean Query Search Engine
- A Wealth Of Ruby Loops And Iterators
- Ruby Procs And Lambdas (And The Difference Between Them)
- Serializing (And Deserializing) Objects With Ruby
- How To Write A Name Generator (In Ruby)
{ 2 trackbacks }
{ 19 comments… read them below or add one }
Hi Alan,
Thanks for this interesting post.
I think your random data generation is definitely a problem as your benchmark results are basically just showing that two barrels of apples with 50/50 mixtures of red and green fruit have… well, a lot of mixture. (More mixture = smaller homogenous “chunks” = less skipping.) No surprises there. :P
Also, minor point about implementation…
Quote: “No matter what you do, intersecting a “raw” (not wrapped in a decorator and no skip pointers) list is always significantly faster.”
This is an unfortunate effect of languages which hafta autobox everything and don’t support composite value types that are statically known. You wouldn’t see such a significant overhead with the “wrapped” results if you implemented it with such a language (eg. C or C#).
Just realised my apple barrel analogy is broken. Make that two barrels of apples of myriad colors, with each barrel having some of each color.
…or just forget the apple barrels. >.<
Yeah, that’s pretty much the feeling I got after I started getting my results. Unfortunately the only real way to find out what the distribution would be in real life is to index and analyse a real corpus, I just need to “steal” some data from somewhere :). The other way is to ask someone who already know, anyone :)?
You might have better results using real skip lists. One problem you have is that with such a large granularity it is hard to ever use the skip list pointer. In a real skip list your nodes can have a number of pointers, where the random distribution of the sizes of the pointers is geometric. (50% with size 1, 25% with size 2, 12.5% with size 3, etc.). This randomized property is important because it allows you to cons onto the skip list or tweak the middle of the list without having to shuffle all of your skip pointers later in the list.
Once you have that you can choose in your intersection operator to search through the skip pointers for the next item, and you can bound how high up in the skip pointers you need to go based on some quick checks at the start of the intersection or merge.
Hi Edward,
I’ll definitely give skip lists a look, the other thing that has been suggested is balanced binary trees. Lots of interesting things to have a look at, thanks for the suggestion – much appreciated.
Perhaps a better way to put it is that searching for a matching item in a skip list is O(log n). Searching for it in your structure here is O(sqrt n). Skip list performance will dominate your approach asymptotically.
If you want to find a matching item in O(lg n) then all you need is binary search on the original sorted array, no links required.
Hey Radu,
That’s actually a pretty good idea, I might play around with it and do some experiments to see how the timings compare.
Actually, I don’t expect binary search to do well here because it doesn’t use the cache nicely. But, if you try, let us know. :)
If you have two lists of the same size n that are equal, then you need Ω(n) time just to output the result, so the naive merging algorithm is optimal. In other words, you may only expect to do better if (1) you look at the average-case instead of the worst-case or (2) lists have significantly different sizes. The second point leads to the idea of iterating through the short list and checking for each element if it is in the other list. Checking if an integer is in a set is done in expected constant time with hash-tables. You might want to try it.
Because hash-tables are not terribly cache-friendly (which reminds me, use linear probing not chaining if you want it fast!) you could install a Bloom filter in front of it that quickly discards many of the elements that aren’t in the set. This is probably worth it only for the big lists/hashtables.
This brings the worst-case down from O(m+n) to O(min(m,n)). Analyzing at the average case is more tricky.
You may want to read about skip lists. The intersection of two sorted skip lists would be O(n lg lg n) worst case if I’m not mistaken, but you might get some improvements in practice (just as you did with your ‘heuristic’ O(sqrt(n)) skipping).
Cheers Radu, I am on to it :), gotta love it when you find out about new things.
You might consider the use of Enumerators here as well. The examples are written (and tested) for ruby 1.9.1. I’m assuming the raw, wrapped and wrapped_with_skips lists are structured as you described them in your post. I have no real world experience with the issues you are writing about. However, the problems seemed to be an excellent opportunity to learn more about Enumerators. I hope you and your readers find the following similarly useful.
The isect_wrapped_with_skips method depends on defining a skip iterator for Array. I take advantage of the fact that the arguments passed to an enumerator created by enumerable.enum_for(:method, *args) are available to the enumerator where ever it is used. Assignments to those args are seen within the enumerator and thus can be used to affect the enumerator behavior. Note also that you cannot use enumerator.first to set the initial values. This does not set the position within the enumerable in a way to support the correct behavior of the first use of enumerator.next. This seems to me to be a bug but I'll have to investigate further.
When I put these routines through your timing routines they appear to be quite fast but I haven't had time to compare them directly with your versions.
I'm very glad to have found your blog. It has been very stimulating. Thanks!
btw. My reply to your post (http://www.skorks.com/2010/03/writing-a-more-ruby-ish-array-intersection-function-and-sorting-structs/) contained several errors. I wrote that code off the top of my head and failed to test it at all before submitting it. :-( The current code does work in general but there are edge cases that need to be tested before I would use in any production code.
Once again I am defeated by the spaceship operator! Your rendering of a reply removes the spaceship operator characters. That is unfortunately the test I use in each of the case statements in the cod I submitted above.
Is there anything I can do to avoid similar problems in the future?
This is Wordpress trying to be smart and trying to interpret anything in less than and greater than signs as an html tag, and since the spaceship html tag doesn’t exist it just removes it. If you put spaces around the = sign, it should prevent Wordpress from removing it
i.e. < = >
The other thing you can do is to wrap the code in a “pre” tag e.g.:
but I believe it would still be better to put spaces in the spaceship operator just to make sure.
I’ve fixed up your comment for the moment. If it does happen again despite all this, I’ll have a look at fixing it in a more fundamental way.
Hi Mitch,
This is awesome, I’ll need to spend a little bit of time playing around with your code before I can say anything intelligent so bear with me :). It might even end up as a separate post :).
I hope you continue to find my blog interesting.
I didn’t know about Skip Pointers to begin with, thanks for enlightening me
Any time :)
Alan,
I decided to try comparing the performance of my approach to yours. I tried your timing method but was not satisfied because on my slowish hardware the resolution appears to be about 15.625 microseconds and thus on a single pass I could not distinguish the relative speeds of each approach. I then changed to Benchmark only to discover that you had already blogged about it! I think Benchmark gives a better picture of the relative performances. I ran 3 types of lists through the benchmarking process just as you had done: raw integer lists, wrapped lists and wrapped lists with skips. I stored the integer values of the list members that exist in the intersection in the output list. This made it easy to verify that each intersection method returned the same output. I created the lists once and stored them as yaml to allow repeated testing.
The results surprised me:
list1 size: 1064 list2 size: 4014 intersect list size: 416 1000 passes: Enumerator Based Intersect Skorks Rubyish Intersect user system total real user system total real raws: 21.59 0.02 21.61 21.64 0.89 0.00 0.89 0.89 wraps: 38.55 0.00 38.55 38.56 9.05 0.00 9.05 9.05 skips: 51.53 0.02 51.55 51.69 0.34 0.00 0.34 0.34The file containing the input lists in yaml occupied 678 KB on disk and I had 2+ meg of RAM available as I started this run. I don’t think these results can be explained by memory problems. eg using virtual RAM on disk. It appears that the Enumerator based approach is much slower than the approach you used! I thought my suggestion was much easier to read and understand but the apparent performance difference has to make your code the clear winner. ;-)
btw. I also ran this with far larger lists that clearly exceeded available memory and found that the Enumerator based approach appeared faster. I think that may be because that approach is not supposed to require as much memory so perhaps there was less HD thrashing going on.
The take home message is of course:
When performance matters, test your code for performance.
I can send you the code for my tests, if you would like.
Hi Mitch,
That’s really interesting. Yeah I’d love for you to send me the code you used, I’ll play around with it myself. If you don’t mind I might even do a separate post about it (if I can fit it in that is, so much to do so little time :)).