Lets Roll Our Own Boolean Query Search Engine

boolean Of course, we’re not going to write a full-on search engine in this one post, that would take at least 2 :). But, surprisingly enough, given our knowledge of inverted indexes (which I talked about previously), we can actually cobble together a very basic boolean information retrieval system relatively easily, all we need is a little bit more knowledge and a couple of algorithms, so guess what this post is going to be about.

Boolean Queries

So, what are boolean queries? I guess the best way to explain it is to contrast boolean queries with the type of search we know best – web search. Web search is an example of a ranked retrieval system. In a ranked system users typically use free text queries to define their search parameters and the results returned by the system are ranked in order of relevance (hence the name). A boolean system on the other hand has the following properties:

  • users employ a special syntax (i.e. operators such as AND, OR, NOT etc) to define their queries
  • the results are not ranked by relevance (being a boolean system)

Let us use an example to illustrate the point. In web search the users searching for ‘christmas tree’, is basically asking the following:

“Give me all the documents which contain the phrase ‘christmas tree’ in order from most relevant to least relevant.”

On the other hand, if we’re dealing with a boolean system the user would have to frame his query in a language the system can understand e.g.:

christmas AND tree – “Give me all documents that contain the word christmas and the word tree”

christmas OR tree – “Give me all documents that contain the word christmas or the word tree”

etc.

You get the picture. Given a collection of documents which we have used to create an inverted index, we need minimal effort (relatively) to expand our index into a basic boolean retrieval system.

Basic Boolean Queries And The Scope Of Our System

All we really need is a query parser which will understand the syntax of our queries and some sort of query executor which can access our index and spit out the results that we’re after. So, lets get implementing and comment as we go.

First thing first, we need an inverted index. We have the one from my previous post, but it was a tiny toy index and won’t let us appreciate the scope of the problem, we could of course find a bunch of documents to index, but that will add a whole level of complexity that we don’t really need. The best thing to do therefore is to dummy up an inverted index. We don’t need real documents, all we need is a dictionary of terms with a postings list (of dummy document ids) for each term – living on disk. This will allow us to generate posting lists of a size which will let us appreciate the scope of the search problem. Let’s dummy up our inverted index:

POSTINGS_LIST_LOCATION = "/home/alan/tmp"
MIN_POSTIN_LIST_SIZE = 8000
MAX_POSTING_LIST_SIZE = 10000
 
def create_mock_postings_list(min_postings, max_postings)
  postings_list = []
  num_postings = min_postings + rand(max_postings-min_postings)
  num_postings.times do
    postings_list << rand(max_postings)
  end
  postings_list.uniq.sort
end
 
def write_mock_postings_list postings, file_name
  File.open(file_name, "w") do |file|
    file.write(postings)
  end
end
 
def file_name_for_word(path, word)
  "#{path}/#{word}.postings"
end
 
def create_test_inverted_index(dictionary, path_to_posting_files)
  inverted_index = {}
  dictionary.each do |word|
    postings_list = create_mock_postings_list(MIN_POSTIN_LIST_SIZE, MAX_POSTING_LIST_SIZE)
    value = OpenStruct.new
    value.postings_file = file_name_for_word(path_to_posting_files, word)
    value.document_frequency = postings_list.length
    write_mock_postings_list(postings_list.join(","), value.postings_file)
    inverted_index[word] = value
  end
  inverted_index
end
 
dictionary = %w{hello world ruby quick fox lazy dog random stuff blah yadda}
index = create_test_inverted_index(dictionary, POSTINGS_LIST_LOCATION)

As you can see our dictionary contains only a few terms:

  • we create a posting list of random size for each term
  • we’ve set our posting list maximum size to be 10000 and minimum size to be 8000 to make sure there is sufficient crossover between the posting lists for each term, it also makes the posting lists large enough to make them interesting since we’re essentially saying that our collection is 10000 documents (in real life document collections are usually much larger, but 10000 is enough for our purposes)
  • our inverted index is a little “musclier” since it no longer keeps posting lists in memory, they are instead living on disk with the index only keeping the on-disk location

We’re now ready to start writing our query executor (we’ll look at query parsing a little later). We’ll only concern ourselves with very simple queries that use one or two terms and one or more of three operators (AND, OR, NOT) e.g.:

  • hello AND world
  • lazy OR dog
  • random AND NOT stuff
  • blah OR NOT yadda

Yes, it does mean we can only handle four types of queries (unless you count single term queries), but this will be enough to illustrate the complexity of the problem as well as providing a reasonably usable system in the end.

Naive Vs Efficient Implementations

Lets look at our all the possible queries our system will need to handle in order of complexity.

Single Term

A single term query is trivial, all we need to do is retrieve the postings list for the term from our inverted index. At this point all we would need to do is fetch the documents for all the ids and return them to the user (for our purposes, once the list of ids is returned we consider the query satisfied). Here is the implementation (without calling code):

def load_postings_list_for_word(word)
  postings = []
  postings_list_file = file_name_for_word(POSTINGS_LIST_LOCATION, word)
  File.open(postings_list_file).each do |file|
    file.each do |line|
      line.split(',').each {|id| postings << id.to_i}
    end
  end
  postings
end
 
def single_term_query(word)
  load_postings_list_for_word(word)
end

We will be able to reuse the load_postings_list_for_word, helper for other query types as well.

2 Term AND

These are queries such as (hello AND world), where we basically want to find all documents which contain both terms, in boolean algebra terms – a conjunction.

and In the case of our inverted index, it simply means we want to intersect the posting lists for the two terms and return the result. Of course, things are a little more complicated than that. We need to do this efficiently. Unlike enterprise software development, where implementing things naively is often a good thing (keeps things simple) – in search you will quickly pay for being naive – with an unusable system. A naive way to implement the postings list intersection might be something like this:

def naive_and_words(word1, word2)
  final_list = []
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1.each do |id|
    if postings_list2.include? id
      final_list << id
    end
  end
  final_list
end

This would mean we’re potentially scanning the whole of one posting list for each value in the other, considering the fact that we’re dealing with very large lists, this will grind our system to halt. We can be smarter about our implementation where we scan both posting lists only once:

def intersect_lists(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0
 
  while(current_list1_index < list1.length && current_list2_index < list2.length)
    if list1[current_list1_index] == list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
      current_list2_index += 1
    elsif list1[current_list1_index] < list2[current_list2_index]
      current_list1_index += 1
    else
      current_list2_index += 1
    end
  end
  final_list
end
 
def and_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
 
  intersect_lists(postings_list1, postings_list2)
end

This is somewhat more complex, but will allow our system to scale to very large posting lists. Of course we can simply use Ruby’s built-in operators to perform the intersection:

def ruby_and_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1 & postings_list2
end

This is infact much simpler and works just as well, but we don’t learn any interesting algorithms from doing it this way :). Here is a sample run where we timed the intersection of the posting lists for two words using all three methods.

Elapsed time (naive): 1.3064 sec
Elapsed time (our AND): 0.027263 sec
Elapsed time (ruby AND): 0.012857 sec

As you can see our “good” implementation is much faster than the naive one, but still more than twice as slow as Ruby’s one, but then again we have to remember that Ruby’s implementation is written in C :).

Challenge/Question!

Can you write an implementation of the intersection algorithm, in pure Ruby, that would approach the speed of Ruby’s core one (which is written in C)?

2 Term OR

These are queries such as (hello OR world), where we want all the documents that contain either or both of the words – a disjunction:

or I won’t bother with the naive implementation (the story is the same just a different operator). Here we need to walk through the posting lists for the two terms and add everything we find to our result list – without duplicates:

def add_lists(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0
 
  while(current_list1_index < list1.length || current_list2_index < list2.length)
    if current_list1_index >= list1.length
      final_list << list2[current_list2_index]
      current_list2_index += 1
    elsif current_list2_index >= list2.length
      final_list <<; list1[current_list1_index]
      current_list1_index += 1
    elsif list1[current_list1_index] == list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
      current_list2_index += 1
    elsif list1[current_list1_index] < list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
    else
      final_list << list2[current_list2_index]
      current_list2_index += 1
    end
  end
  final_list
end
 
def or_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  add_lists(postings_list1, postings_list2)
end

Ruby will of course let us do this even more easily (if we’re dealing with sets or arrays that is):

def ruby_or_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  result = postings_list1 + postings_list2
end

The result of comparing the time will be similar is this case as well, our implementation does well, but Ruby’s C implementation is more than 2 times faster.

2 Term AND NOT

Now we’re introducing the NOT operator (hello AND NOT world). The situation is similar (again :)) to the regular AND, but the algorithm is slightly different in that we want to find all the documents that contain the first term and don’t contain the second:

def list_difference(list1, list2)
  final_list = []
  current_list1_index = 0
  current_list2_index = 0
 
  while(current_list1_index < list1.length)
    if current_list2_index >= list2.length || list1[current_list1_index] < list2[current_list2_index]
      final_list << list1[current_list1_index]
      current_list1_index += 1
    elsif list1[current_list1_index] == list2[current_list2_index]
      current_list1_index += 1
      current_list2_index += 1
    else
      current_list2_index += 1
    end
  end
  final_list
end
 
def and_not_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
 
  list_difference(postings_list1, postings_list2)
end

Ruby has a handy operator for this also:

def ruby_and_not_words(word1, word2)
  postings_list1 = load_postings_list_for_word(word1)
  postings_list2 = load_postings_list_for_word(word2)
  postings_list1 - postings_list2
end

As usual Ruby’s C implementation is faster, but our implementation scales well also.

2 Term OR NOT (also single term NOT)

These last two are a little trickier. We want to find all documents that contain the first term or the documents that don’t contain the second, which basically means we want to find all documents in our collection that don’t contain the second term, which seems to be equivalent to a single term NOT (i.e. hello OR NOT world = NOT world). Do feel free to correct me if I am wrong here! Why is this trickier? Because our inverted index only contains associations between terms and the documents they belong two, there are no associations for the documents a term doesn’t belong to. Unless we augment our index with further information when we construct it, we really only have two ways out of this situation that I can see:

  1. if we assume that we know how many documents our collection contains in total and our documents have consecutive ids then we can create a list of all document ids and take away the ids we don’t want (i.e. perform an AND NOT operation)
  2. we go through the whole dictionary, retrieve each posting list and construct a list of all document ids in the collection that way, then we once again perform the AND NOT operation against the posting list of the term we don’t want – this doesn’t seem very efficient.

Is augmenting the index at construction time the only real way to solve this problem or can anyone see a better way of doing this?

Query Parsing And Optimization Of Complex Queries

Since we so handily restricted the possible queries we want to process, writing a parser would be relatively trivial. However there are caveats even here. When dealing with large document collections (almost always for search problems), we always try to minimize the time we take to generate our results as much a we can. So is there anything more we can do for our queries? Well, it turns out that at least in the case of AND queries we can. If we sort the query terms by frequency (remember we store document frequency as part of index construction) we can begin processing our terms by taking the smallest frequency term first. If we look at our implementation of the AND query, we can see that we stop once the smaller list is exhausted. Therefore starting with the smallest frequency term first allows us to do the least amount of work and speed up our processing even more.

We’re now in the realm of query optimization. For a two term query this may not be such a big win, but if we decide to allow more complex queries (i.e. AND with arbitrary number of terms), the savings will begin to add up. If we were to allow arbitrary boolean queries we can quickly end up with a parser and optimizer that is quite complex, consider:

((a AND b OR C) AND NOT d) OR e AND f OR NOT g AND (h AND NOT i) – scary and ugly

The good news is that the query above can be optimized. For example we can re-write it in conjunctive normal form or disjunctive normal form (err, probably – I haven’t actually tried).

Challenge!

Have a go at writing a parser/optimizer that will take an arbitrary query that uses AND, OR, NOT and break it down into one of the normal forms. For some extra points re-arrange it for the most efficient query execution time. This is a non-trivial problem, I hope you remember your boolean algebra :). De Morgan’s law and distributive law would be a good place to start I think.

Problems With Boolean Systems

As fun as boolean systems are, we can immediately spot some issues.

  • you need to learn the syntax, for some system this can get extremely complex, especially when more than just three operators are involved, infact people have made careers from being expert at using a particular boolean system
  • when using AND type queries you will usually get very relevant results but few of them (high precision, low recall), while when using OR type queries will usually net you low precision and high recall, there doesn’t seem to be a decent way to combine these to get a happy medium
  • searching for multi-word phrases that denote a single concept becomes problematic e.g. toilet paper, unless the system includes some type of proximity operator and even then it is not perfect

Ok, so I lied, we didn’t really put together a complete boolean retrieval system in this post. But you know what – we got close, and we learned a bit about search and boolean queries, and we got to practice our coding skills, and we gained an appreciation for how much complexity is hidden behind even seemingly simple search concepts. I do believe this is more than enough for one sitting. I’d love to hear any comments/criticisms you may have and if you can expand on anything I’ve said then please do, I don’t fancy myself an expert by any stretch of the imagination, so any knowledge sharing is welcome. If you enjoyed this post and would like to read more of my musings then do consider subscribing.

Image by rikhei

  • Tordek

    “hello OR NOT world” is NOT the same as “NOT world”.

    The trivial case: “hello world” will match the first query (since hello is there), but will not match the second.

    • http://www.skorks.com Alan Skorkin

      If you draw this case out with the two intersecting circles you’ll find that the area colored by both queries is the same, which makes me think they are indeed equivalent.

      “hello world” – is not the trivial case. In a boolean system you can’t do multi-word queries without an operator unless the tokens used during indexing are also multi-word.

      • Tordek

        If you actually draw this case out with the two intersecting circles you’ll find that the area colored by both queries is NOT the same.

        “hello world” is not a query in my example; it’s an item you search for. IE, a document containing both “hello” and “world”.

        Basic logic:

        p|q|p OR NOT q| NOT q
        0|0|1         |1
        0|1|0         |0
        1|0|1         |1
        1|1|1         |0 <- Different.
        
        • http://www.skorks.com Alan Skorkin

          Ah yes, now I see it, you’re right the intersecting part of the two circles is in play for the OR NOT case but not in play for the NOT case (as per your truth table). Unfortunately, this doesn’t remove any of the complexity of actually finding a list of documents that would satisfy either query :(.

          Thanks for picking up on my error, much appreciated.

  • CarolN

    I’m not sure why you make the distinction “the results are not ranked by relevance (being a boolean system)”– many search engines are BOTH a boolean retrieval system and a ranked retrieval system. There’s nothing incompatible about doing both boolean queries and ranking, and having ranking doesn’t make something NOT a boolean retrieval system…

    • http://www.skorks.com Alan Skorkin

      The way I understand it, a boolean retrieval system is called that not because you do AND, OR, NOT queries, it is called that due to the fact that the results are either found or not found (i.e. either 1 or 0). There is no fuzziness like there would be in a ranked system such as a web search engine.

      All ranked retrieval systems are also boolean retrieval systems (kinda) in that you have to find the results before you can rank them by relevance.