How Search Engines Process Documents Before Indexing

From my last two posts about searching (i.e. basic indexing and boolean queries), you might get an impression that writing a search engine is a fairly straight forward process. Infact nothing could be further from the truth. To highlight this I am going to explore one of the issues that I have glossed over previously in a little bit more detail – that of linguistic preprocessing of documents before indexing occurs. This is a massive topic and I have no hope of covering all of it in any kind of detail in just one post, instead I will try and touch on some of the main concepts and issues involved, to give everyone an idea of the scope that even this tiny area of information retrieval can cover.

So, let’s assume we have just fetched a document, what exactly do we need to do before we begin indexing?

Determining Document Encoding And Getting A Character Sequence

The first thing we need to be aware of is the fact that all we have is a byte stream. We don’t necessarily know what kind of document we have just fetched. Of course if we are building a custom search system where we know the kinds of documents that we’ll be dealing with then this is not the case. But, for a complex search engine that needs to deal with multiple document formats, before doing anything else, we need to determine the encoding of the document. We need to do this in order to correctly turn our byte stream into a character stream (some encoding systems use multiple bytes per character e.g. unicode). This may be simple for some documents as they will contain their encoding as metadata, otherwise it is somewhat more complex. I won’t go into too much more detail, you can look here for a little bit more info about determining Unicode, do look around if you’re interested and don’t forget to let me know what you find :).

Once we have determined the encoding, our work is not yet over. Our ultimate aim is to get a character stream, but the document may be an archive (e.g. zip) which we may need to extract in order to work with it’s contents. Alternatively a document may be a binary format (e.g. pdf, or MS Word document) where getting the character stream is quite a bit more involved. Even when this is not the case, most documents will contain spurious data which we don’t want to deal with (e.g. html or xml tags) and will therefore need to strip from our document. Having done all of that we eventually end up with a character sequence that we can process further. It is also worth noting that, this part of the system will potentially need to be highly extensible in order to cater for different encodings and document formats (including future ones).

Find The Document Unit

Having extracted the relevant characters from the raw document, we now need to determine exactly what constitutes a document for the purposes of our index. In other words we need to find what’s known as a document unit – i.e. the chunk of text that will be treated as a distinct document by our indexer. This is not as simple as it might appear. Consider – we have pulled-in a document which contains an e-mail conversation thread, do you treat each e-mail as a separate document, for indexing purposes, or do you just take the whole thread? What if some of the e-mails in the thread have attachments, do you ignore them, or treat each one as a separate document, or make it part of the e-mail that it was attached to? Taking this idea further, let’s say you’re indexing very large documents like books. Is the whole book one document? If someone does a multi-word query you don’t necessarily want the book to come back as a result if one of the words appears at the start and the other at the end – they may be completely unrelated and would not satisfy the query. So, do you index the book with each chapter being a document, how about each paragraph? If your document units are too granular however you potentially miss out on concepts that ARE related and would thereby fail to return relevant results. As you can see it is a precision versus recall trade-off. Depending on your document collection (what you’re trying to index), you may already have a clearly delineated document unit, or you may employ heuristic methods to work out the document units based on the content.

Tokenizing The Document

Once we know our document unit, it is time to tokenize. Once again this is deceptively complex. We want to split out document units into parts that will form the terms of our index. In the simplest case we want to split into words on whitespace and punctuation. But, this is only of limited utility. For example if we have words such as “are not” everything is fine, but what about “aren’t”. The semantic meaning is the same so we potentially want to get the same two words “are not” out of it, but if we simply split on punctuation and whitespace, we’ll get “aren t” – not quite as useful. There are countless examples like that one. But wait, there is more, what about names and terms such as C++, where discarding the plus signs will alter the meaning significantly. Then there are special cases such as URLs, dates, IP addresses, etc. that only make sense if taken as a single token. And if that wasn’t enough consider terms that are made up of multiple words which lose their meaning (or have different meanings) when split up, for example Baden-Baden, San Francisco, Sydney Opera House etc. There are many other similar considerations that we need to be aware of. The crucial thing to remember is this, whatever tokenizer we apply to our documents during indexing we also need to apply to our queries at search time, otherwise you will get at best unpredictable results – pretty obvious. As you can see, tokenization can be a complicated problem and if you’re working with documents in multiple languages, it’s an order of magnitude more complicated still, as each language presents it’s own set of challenges (if you speak a second language you’ll see what I mean, if you don’t, consider the fact that some Asian languages are written without spaces :)).

Removing Stop Words (Or Not)

Once we have tokenized successfully, we are almost ready to index, but indexing some of the words adds very little or no value. These are normally called stop words (think words such as: a, an, the etc.) and are often discarded from the index entirely. To find the stop words we can sort our list of terms by frequency and pick the most frequent ones according to their lack of semantic value (this can be done by hand as stop lists are normally quite small). While stop words usually don’t add much value and can be discarded safely, this is not always the case. For example, think of the phrase “to be or not to be” where every word is potentially a stop word, it is not the best outcome when the system returns no results for that kind of query. For this reason web search engines don’t usually employ stop lists. If a search engine does decide to use a stop list, it must once again be applied not only to the documents during indexing, but also to the query during the search.

Token Normalization

If you think that we are pretty much ready to start indexing, you would be wrong. The next thing we want to do with our tokens is to normalize them. This means we turn each token we have into its canonical form. For example, as far as we are concerned the terms: co-operation and cooperation are equivalent. So we can say that the canonical form of both of those terms is – cooperation. We want searches for either of the terms (co-operation or cooperation) to return results applicable to both. We can do this in a couple of ways:

Equivalence classing – this is an implicit process that is based on a set of rules and heuristics. For example we can say that the canonical form of a word with a hyphen is the same word without one and we will therefore create an equivalence class between those words (e.g. co-operation, cooperation -> cooperation). These equivalence classes will be applied to the document terms during indexing as well as to the queries at runtime so that a query that contains one of the words in the equivalence class, will return documents that contains any of the words.
Synonym lists – this is a more explicit and labour intensive process, but can give more control. Synonym lists are constructed by hand (e.g. think – chair, stool, seat, throne). These synonym lists can then be applied in one of two ways. At indexing time you can index the same document under all the terms that are synonyms of each other, this requires a bit more space, but at query time any of the synonym terms will match the document. You may also apply the synonym lists at runtime whereby you expand the query to contain more terms (i.e. all the synonyms) which means the query takes a bit longer to execute. How you decide to do it is up to you, it is a classic space/time trade-off.

There are other normalization steps that can be applied to our terms, such as case folding, whereby we lower-case all our terms. This will of course need to be applied to the query as well, but can come back to bite us, especially where proper names are concerned. For example Crunchy Nut Cornflakes is a breakfast serial, crunchy nut cornflakes are just three random words. Despite this, case folding is normally used by most search engines.

Stemming And Lemmatization

We’ve already done a lot of processing on our documents and terms, but there is still more to do. Different words often derive from the same root e.g. differ, different, difference, differing etc. It make sense that a query for one of these words will potentially be satisfied by a different form of it. We can therefore try to equate all of these words for indexing purposes. For this we can use either stemming or lemmatization (which are, I guess, types of normalization but deserve a section of their own in my opinion).

Stemming basically concerns itself with chopping off suffixes using various heuristics to find the more of less correct base form of the word. There are several well known stemming algorithms. The Lovins algorithm is one of the earliest stemmers and the Porter stemmer is one of the best known. Stemming tends to increase recall while harming precision.
Lemmatization is a more complex process using morphological analysis and vocabularies to arrive at the base form.

Neither stemming or lemmatization tend to have a lot of benefit when dealing with English language documents (which doesn’t mean they are never used :)), however languages with richer morphologies will often benefit to a greater degree.

And there you have it. I hope this has given you an appreciation for how complex even a small area, like preprocessing, can be when it comes to search engines. Of course I am far from an expert, so if you have something to add or want to correct me :), then do feel free to leave a comment. I hope you’ve been enjoying my posts about searching, since there is more to come. In my previous post on $1.

Images by Shawn Econo and francescomucio

Table of Contents