Search Fundamentals – Basic Indexing

A little while ago I wrote a tiny little crawler. At the time I promised myself that, having dabbled in crawling, I would also cover searching, indexing and other web-search related activities. Little did I know back then just how sparse my knowledge in the area actually was, although some of the comments I received regarding the limitations of my crawler should probably have clued me in :). I can’t rightly say that I am an expert now, but I did learn a little bit over the last few months, enough to know just how complex this area can be. Regardless, I feel that I have reached a milestone in that I can now write about search without making a complete idiot out of myself (I hope :)), which means it is time to fulfill that promise I made to myself.

I’d actually love to cover the ins and outs of writing a big-boy crawler, but that’s a story for another post. For the moment I’ll begin with the fundamentals:

  • some basic terminology for you to throw around at parties
  • the anatomy of a basic index, the fundamental underpinning of any search engine

Why Index?

Wherever we find a search engine, we also find a set of documents. If your set is small enough, you can probably scan it by hand. But if you need a search engine, chances are your document collection is big, and if you’re looking at writing a search engine for the web, it is very, very big. It is ludicrous to expect a search engine to scan all the documents every time you run a query; it would take forever. To make queries fast and efficient, a search engine pre-processes the documents and creates an index.
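To see what we’re avoiding, here is a sketch of the naive alternative: scanning every document on every query (the method name and documents are made up for illustration):

```ruby
# Naive query: scan every document for the query word, every single time.
# The cost grows linearly with the total size of the collection, which is
# hopeless once the collection is large.
def naive_search(documents, word)
  documents.each_index.select do |i|
    documents[i].split.include?(word)
  end
end

docs = ["the quick brown fox", "a lazy dog"]
naive_search(docs, "fox") # => [0]
```

An index does this scanning work once, up front, so queries don’t have to.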

The Heart Of Every Search Engine

At the core of every modern search engine is an inverted index (this is a standard term, the reason for which will become clear shortly). The most basic thing we want to do is be able to quickly tell whether a word occurs in any of the documents in our collection. We could assign an id to each of our documents and then associate with that id all the words that occur in the document, but this is rather inefficient for obvious reasons (duplicate words and all). Instead we invert the concept. We take all the words/terms that occur in all the documents in our collection – this is called the vocabulary (also standard terminology) – and then map each term to the set of document ids it occurs in. Each document id is called a posting and a set of document ids is a postings list. So, the most basic inverted index is a dictionary of terms, each of which is associated with a postings list.
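In Ruby terms, the most basic inverted index is nothing more than a hash from each term in the vocabulary to its postings list. A toy example (the terms and ids here are made up, not taken from the documents we’ll index below):

```ruby
# A toy inverted index: term => postings list (sorted document ids).
inverted_index = {
  "brown" => [0, 1],
  "dog"   => [0, 1, 2],
  "fox"   => [0, 1],
  "quick" => [0]
}

# Answering "which documents contain 'fox'?" is now a single lookup,
# regardless of how many documents there are.
inverted_index["fox"] # => [0, 1]
```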

It goes without saying that an inverted index is built in advance to support future queries. On the surface this is done in a manner you would expect:

  • go through all the documents, assign each an id and tokenize each one for words
  • process all the tokens (linguistic processing), to produce a list of normalized tokens
  • for each token create a postings list, i.e. a list of document ids it occurs in

Of course those 3 simple steps hide infinite layers of complexity. What we want to end up with is a sorted list of terms, each associated with a list of document ids. Even with this basic inverted index we can start storing some extra information, such as the document frequency for each term (the number of documents the term occurs in). This extra information will eventually become useful when we want to rank our search results.
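The linguistic processing step in particular hides a lot. The code below skips it entirely (our documents are conveniently lowercase and punctuation-free), but even a minimal normalizer might look something like this sketch:

```ruby
# Minimal token normalization: lowercase everything and strip punctuation
# from the ends of each token. Real engines go much further: stemming,
# stop-word removal, handling hyphenation, and so on.
def normalize(token)
  token.downcase.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, "")
end

tokens = "The quick, brown FOX!".split.map { |t| normalize(t) }
# => ["the", "quick", "brown", "fox"]
```

Without this step, "The", "the" and "the," would all end up as separate terms in the vocabulary.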

Let’s Play With Some Code

Too much theory without any practice would be cruel and unusual considering we’re hardcore developer types, so a good exercise at this point is to construct a small inverted index for a set of documents, to help crystallize the concepts. A small set (2-3 docs) is best so that the results can be checked visually. Here is my 10 minute attempt in Ruby (it actually took longer, but it should have taken 10 minutes – need more skillz :)).

Before we begin the indexing we pre-process our documents to produce a dictionary of terms, while at the same time retaining the document id for each term:

require 'ostruct'
 
documents = ["the quick brown fox jumped over the lazy dog while carrying a chicken and ran away",
  "a fox and a dog can never be friends a fox and a cat is a different matter brown or otherwise",
  "i don't have a dog as i am not really a dog person but i do have a cat it ran away but came back by itself"]
 
def pre_process_documents(documents)
  dictionary = []
  documents.each_with_index do |document, index|
    # tokenize on whitespace; each token becomes a (term, doc_id) record
    document.split.each do |word|
      record = OpenStruct.new
      record.term = word
      record.doc_id = index
      dictionary << record
    end
  end
  sort_initial_dictionary dictionary
end
 
def sort_initial_dictionary(dictionary)
  # sort the records alphabetically by term
  dictionary.sort do |record1, record2|
    record1.term <=> record2.term
  end
end
 
initial_dictionary = pre_process_documents documents

Some things to note for this phase are:

  • the documents we are indexing live in an in-memory array; in real life they would live on disk or on the web or wherever, but I didn’t bother with that – for simplicity
  • the ids assigned to the documents are simply a sequence: each document gets the next id as we encounter it for the first time, and in our case we make do with the array index. This is in fact analogous to how a real search engine index would do it
  • we didn’t really have to sort in this initial phase, but sorting makes me happy, so there

We are now ready to begin constructing our inverted index:

def index(dictionary)
  inverted_index_hash = {}
  dictionary.each do |record|
    # append the doc id to the term's postings list, keeping the list
    # sorted and free of duplicates
    postings_list = inverted_index_hash[record.term] || []
    postings_list << record.doc_id
    inverted_index_hash[record.term] = postings_list.sort.uniq
  end
  finalize_index inverted_index_hash.sort
end
 
def finalize_index(index)
  final_inverted_index = []
  index.each do |term_array|
    # hash.sort gave us [term, postings] pairs; wrap each in a record
    # and store the document frequency alongside
    final_record = OpenStruct.new
    final_record.term = term_array[0]
    final_record.postings = term_array[1]
    final_record.frequency = term_array[1].size
    final_inverted_index << final_record
  end
  final_inverted_index
end
 
final_index = index initial_dictionary

There are a few more things of note here:

  • I used a hash to begin constructing my inverted index, which I then converted to an array for the final index. This was done for the sake of simplicity once again, but as you might imagine it is not really optimal, especially as the index gets larger. I should really be looking at some sort of balanced binary tree, which would keep things sorted and still allow reasonably quick access
  • the postings lists are kept in memory together with the terms; for a larger index these would probably live on disk, and each term in memory would hold a reference to the on-disk location of its postings list
  • after the index is finished I go over it again to compute and store the document frequency for each term. This may not be terribly efficient, but it is a lot more efficient than computing these on the fly at query time – the larger your document collection, the more the savings add up. There are many other things we can compute and store at this point to potentially speed up our queries; it is a classic trade-off

All that’s left to do is print out the final index to make sure everything looks the way we expect.

def print_index(index)
  index.each do |term_struct|
    puts "#{term_struct.term} (#{term_struct.frequency}) -> #{term_struct.postings.inspect}"
  end
end
 
print_index(final_index)

This produces the following output:

a (3) -> [0, 1, 2]
am (1) -> [2]
and (2) -> [0, 1]
as (1) -> [2]
away (2) -> [0, 2]
back (1) -> [2]
be (1) -> [1]
brown (2) -> [0, 1]
but (1) -> [2]
by (1) -> [2]
came (1) -> [2]
can (1) -> [1]
carrying (1) -> [0]
cat (2) -> [1, 2]
chicken (1) -> [0]
different (1) -> [1]
do (1) -> [2]
dog (3) -> [0, 1, 2]
don't (1) -> [2]
fox (2) -> [0, 1]
friends (1) -> [1]
have (1) -> [2]
i (1) -> [2]
is (1) -> [1]
it (1) -> [2]
itself (1) -> [2]
jumped (1) -> [0]
lazy (1) -> [0]
matter (1) -> [1]
never (1) -> [1]
not (1) -> [2]
or (1) -> [1]
otherwise (1) -> [1]
over (1) -> [0]
person (1) -> [2]
quick (1) -> [0]
ran (2) -> [0, 2]
really (1) -> [2]
the (1) -> [0]
while (1) -> [0]

We can visually verify that our index looks correct based on the input (our initial collection of 3 documents). There are many things we can note about this output, but for the moment only one is significant:

  • As we can see, most of the words in our index occur in only one document and only a couple occur in all three. This is due to the size of our collection, i.e. it is bound to happen when indexing a small number of documents. As we add more and more documents, the postings lists for all the terms will begin to grow (pretty self-explanatory really).
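Although querying is a topic for another day, the index we just built can already answer simple queries. A sketch, assuming an index shaped like our final index (an array of term/postings/frequency records sorted by term – built inline here so the snippet stands alone):

```ruby
require 'ostruct'

# A miniature index in the same shape as the final index above:
# records sorted by term, each with a postings list and frequency.
index = [["brown", [0, 1]], ["dog", [0, 1, 2]], ["fox", [0, 1]]].map do |term, postings|
  OpenStruct.new(term: term, postings: postings, frequency: postings.size)
end

# Single-term lookup: binary search works because the array is sorted by term.
def lookup(index, term)
  record = index.bsearch { |r| term <=> r.term }
  record ? record.postings : []
end

# Two-term AND query: intersect the two postings lists.
def and_query(index, term1, term2)
  lookup(index, term1) & lookup(index, term2)
end

lookup(index, "fox")           # => [0, 1]
and_query(index, "fox", "dog") # => [0, 1]
```

The AND query is the reason postings lists are kept sorted in real engines: sorted lists can be intersected in a single linear pass, which matters when the lists hold millions of ids.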

You’re welcome to go through a similar exercise of constructing an index for some simple documents; it is actually a reasonably decent code kata, and if you do I’d love to hear about it. If you want something a little bit more involved, consider using some real text documents (rather than contrived ones) and fetching them from disk. Avoid using hashes and arrays for the final index – be a real man/woman and use some sort of tree. Make sure that, with your final index, the postings lists live on disk while the dictionary lives in memory, and that everything still works. There are quite a number of ways to make the exercise a little bit more complex and interesting (unless you’re afraid of a little extra complexity, but then you wouldn’t be the developer I think you are; we eat extra complexity for breakfast, mmmmm complexity). Of course for a web search engine even the term dictionary is potentially too big to be kept in memory, but we won’t get into distributed stuff at this time; that’s a whole different kettle of fish.

This is a very basic explanation of very basic indexing, but surprisingly enough even Google’s mighty index is just a much more advanced descendant of just such an inverted index. I’ll try to get into how we can give our basic index a little bit more muscle in a later post. In the meantime, if you want to know more about search, information retrieval and other topics in a similar vein, here are some resources (these are the books that I used, so they are all tops :)). For a gentler, more mainstream introduction have a look at Programming Collective Intelligence and Collective Intelligence In Action, but if you want to get straight into the nitty gritty and are not afraid of a little bit of maths, try Introduction To Information Retrieval – this one is a textbook though, with lots of exercises and references and stuff, you’ve been warned. As always, I welcome any comments/questions you might have, even if you just want to say hi :).

Image by koalazymonkey
