How To Write A Simple Web Crawler In Ruby

I had an idea the other day: to write a basic search engine – in Ruby (did I mention I've been playing around with Ruby lately?). I am well aware that there are perfectly adequate Ruby crawlers available, such as RDig or Mechanize. But I don't want to use any libraries for the higher level functions of my search engine (crawling, indexing, querying etc.), at least not for the first version. The main idea is to learn (while doing something fun and interesting), and the best way to learn is sometimes to do things the hard way. Now that I've ensured I don't get eaten alive for not reusing existing code, we can move on :).

I will examine all the different aspects of what makes a search engine (the anatomy) in a later post. In the meantime, I believe doing something like this gives you an opportunity to experience first-hand all the different things you have to keep in mind when writing a search engine. It gives you a chance to learn why we do SEO the way we do; it lets you play with different Ruby language features, database access and ranking algorithms, not to mention simply cutting some code for the experience. And you get all this without touching Rails; nothing against Rails, but I prefer to get comfortable with Ruby (by itself) first.

Well, to dive right into it, I decided to write the crawler first; after all, you can't have a search engine without a nice web crawler. First things first, some basic features:

  • should be able to crawl the web (basic general functionality)
  • must be able to limit the depth of the crawl (otherwise it will potentially keep going forever, and will also eventually run out of memory because of the way it's written)
  • must be able to limit the number of pages to crawl (even with a depth-limited crawl, there might still be too many pages to get through, depending on the starting set of domains)
  • must be able to crawl just a single domain (you would also be able to limit this by number of pages)
  • the only output it will produce is to print out the fact that it is crawling a url

If you want to dive right into the code and explore, you can download it here. For the rest, here is how it works. Firstly, to run it, do the following:

ruby search-engine-main.rb -c web -d 3 -p 100 -f 'urls.txt'

where:

-c (is either ’web’ or ‘domain’)

-d (is the depth of the crawl, it will only look at links this many levels below the initial urls)

-p (is the page limit, will not crawl more than this many pages regardless of other parameters)

-f (the file to use as input, simply contains a list of starting urls on separate lines, will use the first one from this file for domain crawl)
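
The CommandLineArgumentParser and UrlStore classes that handle all of this live in the download rather than in this post, and they aren't very interesting. To give you a rough idea though, a stripped-down sketch (my simplification here, not the exact code in the download) using Ruby's standard OptionParser would look something like this:

  require 'optparse'

  # Parses the -c, -d, -p and -f switches described above.
  class CommandLineArgumentParser
    WEB_CRAWLER = 'web'
    DOMAIN_CRAWLER = 'domain'

    attr_reader :crawl_type, :crawl_depth, :page_limit, :url_file

    def initialize
      @crawl_type = WEB_CRAWLER
      @crawl_depth = 3
      @page_limit = 100
      @url_file = 'urls.txt'
    end

    def parse_arguments
      OptionParser.new do |opts|
        opts.on('-c CRAWL_TYPE')  { |type|  @crawl_type  = type }
        opts.on('-d CRAWL_DEPTH') { |depth| @crawl_depth = depth.to_i }
        opts.on('-p PAGE_LIMIT')  { |limit| @page_limit  = limit.to_i }
        opts.on('-f URL_FILE')    { |file|  @url_file    = file }
      end.parse!
    end
  end

  # Reads the list of starting urls from the input file, one url per line.
  class UrlStore
    def initialize(url_file)
      @urls = File.readlines(url_file).map { |line| line.strip }.reject { |line| line.empty? }
    end

    def get_urls
      @urls
    end

    def get_url
      @urls.first
    end
  end

Note the to_i calls on the depth and page limit; as you'll see in the comments, treating those values as strings was an actual bug in the first version of the code.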

Our nice and clean entry point looks like this :):

argument_parser = CommandLineArgumentParser.new
argument_parser.parse_arguments
spider = Spider.new
url_store = UrlStore.new(argument_parser.url_file)
spider.crawl_web(url_store.get_urls, argument_parser.crawl_depth, argument_parser.page_limit) if argument_parser.crawl_type == CommandLineArgumentParser::WEB_CRAWLER
spider.crawl_domain(url_store.get_url, argument_parser.page_limit) if argument_parser.crawl_type == CommandLineArgumentParser::DOMAIN_CRAWLER

We don't really care about this too much since this is not where the real fun bits are, so let's move on to that.

The Spider

The main worker class here is the Spider class. It contains two public methods, crawl_web and crawl_domain. Crawling the web looks like this:

  def crawl_web(urls, depth=2, page_limit = 100)
    depth.times do
      next_urls = []
      urls.each do |url|
        url_object = open_url(url)
        next if url_object == nil
        url = update_url_if_redirected(url, url_object)
        parsed_url = parse_url(url_object)
        next if parsed_url == nil
        @already_visited[url]=true if @already_visited[url] == nil
        return if @already_visited.size == page_limit
        next_urls += (find_urls_on_page(parsed_url, url) - @already_visited.keys)
        next_urls.uniq!
      end
      urls = next_urls
    end
  end

As you can see, it is not recursive, as this makes it easier to limit the depth of the crawl. You can also tell that we take special care to handle server side redirects. The class also keeps the urls it has already visited in memory, to guard against getting into loops visiting the same few pages over and over. This is obviously not the most efficient way and will not scale anywhere past a few thousand pages, but for our simple crawler it is fine. We can improve it later.
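
The helper methods crawl_web leans on (open_url, update_url_if_redirected and parse_url) aren't shown here. Roughly speaking they are thin wrappers around open-uri and Hpricot; a sketch along these lines gives you the idea (this is a simplification rather than line-for-line what's in the download, and I've put the 'crawling url' output in parse_url, though it could live elsewhere):

  require 'open-uri'
  require 'hpricot'

  # Fetch the page, returning nil if the url can't be opened for any reason.
  def open_url(url)
    open(url)
  rescue
    puts "Unable to open url: #{url}"
    nil
  end

  # open-uri follows server side redirects, so the object we get back may have
  # a different base_uri to the url we originally asked for.
  def update_url_if_redirected(url, url_object)
    return url_object.base_uri.to_s if url != url_object.base_uri.to_s
    url
  end

  # Parse the page with Hpricot, returning nil if parsing blows up.
  def parse_url(url_object)
    doc = Hpricot(url_object)
    puts "Crawling url: #{url_object.base_uri}"
    doc
  rescue
    nil
  end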

Crawling a domain looks like this:

  def crawl_domain(url, page_limit = 100)
    return if @already_visited.size == page_limit
    url_object = open_url(url)
    return if url_object == nil
    parsed_url = parse_url(url_object)
    return if parsed_url == nil
    @already_visited[url]=true if @already_visited[url] == nil
    page_urls = find_urls_on_page(parsed_url, url)
    page_urls.each do |page_url|
      if urls_on_same_domain?(url, page_url) and @already_visited[page_url] == nil
        crawl_domain(page_url, page_limit)
      end
    end
  end

This time the algorithm is recursive, as it is easier to crawl a domain this way, and we don't need to limit the depth. Everything else is pretty much the same, except that we take special care to only crawl links on the same domain, and we no longer need to worry about redirection.
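
The urls_on_same_domain? check is about as simple as it sounds: compare the hosts of the two urls. A quick sketch (mine, rather than the exact code in the download) using the standard URI library:

  require 'uri'

  # Two urls are considered to be on the same domain if their hosts match.
  def urls_on_same_domain?(url, other_url)
    URI.parse(url).host == URI.parse(other_url).host
  rescue URI::InvalidURIError
    false
  end

Note that a straight host comparison treats www.example.com and example.com as different domains, which may or may not be what you want.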

To get the links from the page I use Hpricot. I know I said I didn't want to use too many libraries, but parsing html by hand would just be torture :). Here's how I find all the links:

  def find_urls_on_page(parsed_url, current_url)
    urls_list = []
    parsed_url.search('a[@href]').each do |x|
      new_url = x['href'].split('#')[0]
      unless new_url == nil
        if relative?(new_url)
          new_url = make_absolute(current_url, new_url)
        end
        urls_list.push(new_url)
      end
    end
    return urls_list
  end

The challenging part here is handling the relative urls and making them absolute. I didn't do anything fancy here. There is a helper module that I created (UrlUtils – yeah I know, great name :)); it doesn't have anything too interesting in it, just blood, sweat and string handling code. That's pretty much all there is to it.
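
If you'd rather cheat than hand-roll the string handling, the standard URI library can get you most of the way there (there is some discussion of this in the comments below). The two methods that find_urls_on_page relies on then boil down to something like this sketch (a URI-based approximation, not the string handling code that's actually in the download):

  require 'uri'

  # A url is relative if it doesn't carry its own scheme and host.
  def relative?(url)
    URI.parse(url).relative?
  rescue URI::InvalidURIError
    false
  end

  # Resolve a relative url against the page it was found on, e.g.
  # make_absolute('http://example.com/foo/', '../bar') gives 'http://example.com/bar'.
  def make_absolute(current_url, relative_url)
    URI.join(current_url, relative_url).to_s
  rescue URI::InvalidURIError
    nil
  end

Now there are a few points that we need to note about this crawler.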

It is not an 'enterprise strength' type of crawler, so don't go trying to unleash it on the whole of the web. Do make liberal use of the depth and page limiters; I wouldn't try to get it to handle more than a few thousand pages at a time (for the reasons I noted above). Some other limitations are as follows:

  • won't work behind a proxy
  • won't handle ftp, https (maybe), etc. properly
  • client side redirects – you're out of luck
  • if the link is not in an href attribute (e.g. generated by JavaScript) – no good
  • relative url handling could probably be improved some more
  • will not respect no-follow and similar rules that other bots do, e.g. robots.txt (there is a rough sketch of a check after this list)
  • if the list of starting urls is too big, it can grind to a halt
  • probably a whole bunch of other limitations that I haven’t even begun thinking of
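
On the robots.txt point, a very naive check isn't a huge amount of extra work. Something like the following sketch gives you the flavour (it isn't part of the download, it ignores per-user-agent sections, and a genuinely polite bot would need to do quite a bit more):

  require 'open-uri'
  require 'uri'

  # Very naive robots.txt check: grab the file for the url's host and see if any
  # Disallow rule (for any user agent) matches the start of the url's path.
  def allowed_by_robots_txt?(url)
    uri = URI.parse(url)
    robots_url = "#{uri.scheme}://#{uri.host}/robots.txt"
    disallowed_paths = []
    open(robots_url).each_line do |line|
      disallowed_paths << $1 if line =~ /^\s*Disallow:\s*(\S+)/
    end
    path = uri.path.empty? ? '/' : uri.path
    !disallowed_paths.any? { |disallowed| path.index(disallowed) == 0 }
  rescue
    true # no robots.txt, or we couldn't read it, so assume we're allowed
  end

You would call something like this before adding a url to the list of pages to crawl.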

Of course there is an easy way to make this guy a little bit more scalable and useful – introduce indexing. This would be the next thing to do, because even a simple little search engine needs some indexing. It would allow us to stop storing so much stuff in memory, and also to retain more info about pages, which would potentially let us make better decisions, but that's a story for another time :). Finally, it probably goes without saying that I didn't test this guy exhaustively, and I only ran it on XP (too lazy to boot into my Linux :)). Anyway, have a play with it if you like, and feel free to suggest improvements and point out issues (or just say hello in general) while I start thinking about an indexer.

You can download all the code here.

Images by jpctalbot and mkreyness

  • Victor

    Thank you for sharing this. Found it on Reddit.

  • http://www.dancingbison.com Vasudev Ram

    Interesting post, thanks for sharing. I'd read another post on the same topic (writing a web spider) at IBM developerWorks, IIRC. Don't have the URL handy but it can be googled for.

    I liked your code in the examples above – seems well-refactored.

    Minor nit plus question – why do you use the if-after-imperative style, as with “return if …” or ” if …” ?

    (I’m aware that Ruby, following Perl, supports that style, and some general reasons for it.)

    In the case of the "return if", it's not really much of an issue to me; the code is still clear.
    But in the case of the method calls (with arguments) followed by the if, the if part is hidden on the right side of the screen (the code box). Like in the line that starts like this:

    spider.crawl_web(url_store.get_urls, …

    So it’s not obvious that there is an if there, unless you are reading carefully and scroll to the right …

    It’s your call, of course, just mentioning my opinion …

    – Vasudev

    • http://www.skorks.com Alan Skorkin

      Regarding using the 'if' after, I was actually just trying it out to see how it feels :). You're right that if a line scrolls off the visible part of the screen it is quite possible to miss the fact that there is an 'if', which is an issue, as I like to be explicit with my code.

      It does make things ‘look’ cleaner as things fit on one line rather than 3. I probably need to look at some more ruby code to decide if I like it or not.

  • grimnir

    Thanks a lot, interesting article.
    I already have an idea how to use this :)

    • http://www.skorks.com Alan Skorkin

      You’re welcome, would you like to share your idea? (unless it’s a secret big idea that will make you millions that is :), in which case I’d keep it to yourself)

  • Pieter

    I get the same error all the time. Any ideas?

    >ruby search-engine-main.rb -c web -d 1 -p 1000 -f ‘urls.txt’
    ./spider.rb:46:in `crawl_web’: undefined method `times’ for “1”:String (NoMethodError)
    from search-engine-main.rb:9

    • http://www.skorks.com Alan Skorkin

      Yeah, I probably wasn’t as clear on this in the post as I could have been. Your problem is with the depth limiting (-d 1). I guess the easiest way to explain it is to say that this parameter must have a minimum value of 2.

      The reason this is so is because the number includes your initial set of urls that you provide in the ‘url.txt’ file. Therefore if you try and set -d to be 1 it would in theory try to only visit the urls that you provide in the file.

      In practice it doesn’t really work (as you’ve discovered), probably because calling ‘times’ for a value of 1 doesn’t make too much sense. I should probably update the code to handle a depth limit value of 1 as a special case. In the meantime, use -d 2 as a minimum and it should work.

      • Pieter

        Thanks for the quick response. I am using the stock standard url.txt file that came with the package for testing purpose and with a depth of two I still get the same error (I have tried 3 as well)

        D:\Scripts\Webcrawler>ruby search-engine-main.rb -c web -d 2 -p 1 -f ‘urls.txt’
        ./spider.rb:46:in `crawl_web’: undefined method `times’ for “2”:String (NoMethodError) from search-engine-main.rb:9

        Could it be that my ruby configuration is missing something? I have hpricot, can get uri or uri-utils (I assume they are standard) and I am running the script from the D: drive if that could be a factor

        • Pieter

          What I want to do with the script is to actually just follow the links in the url.txt for a depth of 1 i.e. check link1 in url.txt, get links on the first page, return, check link2 in url.txt, get links etc etc rather than spider through the remainder of the web

          • http://www.skorks.com Alan Skorkin

            Hi, sorry about that, it is a little bug :). If you ran the crawler with all default it was all fine, but if you supplied the depth it is treating it as a string and therefore doesn’t work (I get the same on my end). Give me a few minutes, I’ll fix it up and upload again.

            To do what you want to do you would need to set the depth to be 2, then it would spider through the original urls as well as all the links it finds on those pages. If you set the depth to be 1 it would only visit 2 pages, the ones in urls.txt.

          • http://www.skorks.com Alan Skorkin

            It’s fixed now and I’ve uploaded the new version. Try downloading it again and giving it a try, should work fine now. By the way the page limit wasn’t working either, for the same reason the depth limit wasn’t working, a to_i call when parsing the arguments took care of it.

            Thanks for finding this bug for me, really shows the value of testing stuff, even a little bit :).

  • Pieter

    Thanks Alan, that was quick. Running as we speak, will let you know the outcome. So far it has not given any errors, and seems fine. What is the output, to a file or just to the console ?

    • http://www.skorks.com Alan Skorkin

      It just outputs to the console at the moment. Once I add an indexer into the mix then all sorts of things will start to become possible, so stay tuned :). Might also do a little bit more testing next time :).

      • Pieter

        I can help with testing with pleasure, as long as you do not ask me for any code!! Ok, you can ask, but it would be on your own risk basis :-)

        • http://www.skorks.com Alan Skorkin

          I might just take you up on that, once I get my act together and write the indexer :).

  • Pieter

    All went well up to this point:

    d:/Ruby/lib/ruby/1.8/timeout.rb:60:in `rbuf_fill’: execution expired (Timeout::Error)
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill’
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:116:in `readuntil’
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:126:in `readline’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:2020:in `read_status_line’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:2009:in `read_new’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:1050:in `request’
    from d:/Ruby/lib/ruby/1.8/open-uri.rb:248:in `open_http’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:543:in `start’
    … 11 levels…
    from ./spider.rb:18:in `crawl_web’
    from ./spider.rb:16:in `times’
    from ./spider.rb:16:in `crawl_web’
    from search-engine-main.rb:9

    • http://www.skorks.com Alan Skorkin

      That one I have seen before, it seems to only happen intermittently though. Try running it a couple more times and see if it happens for you again. I just ran it through with depth 2 and it completed fine. Probably best way for me to handle it would be to catch the exception that causes the method to time out. It is possibly due to page taking too long to come back but not too sure without further investigation.

  • Pieter

    What I have also just picked up is that if the links in the url.txt file do not end with a trailing /, the crawler will only traverse the url.txt file, but will not crawl further irrespective of the depth setting. Perhaps it would be useful to do a check on the input and to add the / if it is missing?

    • http://www.skorks.com Alan Skorkin

      Hmmm, now that one shouldn’t happen at all…

      I just tried it here and it seemed to work fine regardless of trailing slash, I’ll need to investigate this some more. I am presuming you’re still using the default ‘urls.txt’?

  • Pieter

    No, I used some real links I wanted to crawl this time. But, I tested it again and you are right it does work with or without. The difference however was that in case 1 (which did not work) the number of links in url.txt was greater than the pages=100 setting (i.e. 400 links) and all it did was traverse through 100 links in url.txt and stopped, without doing any further crawling beyond that level (obviously as it has reached the pages limit). Would it be better if the links in the url.txt file are excluded from the page count?

    • http://www.skorks.com Alan Skorkin

      Ahh, ok well that explains it. The reason that the urls in urls.txt are included is that if you hook up an indexer to the crawler then it will index the pages in urls.txt as well as crawling the links and indexing those pages. So it actually makes sense to count them as well.

      It is possible to make this configurable in case people didn’t want to index the urls that they supply.

      • Pieter

        It works great. I look forward to see the progress (I have grabbed your RSS feed for that!). On the point above, the ‘risk’ of course is that as you do not know the number of links on that site or page, that you may never have a full index of that website, and will have to repeat with different page counts by trial and error until you have found all the links on a site (even if you crawl only at depth 2)

        The alternative could of course be to make the page count per url.txt link, so that it will always traverse the full url.txt link, but only ‘-p ####’ per link.

        Another way could be to keep the page count down by adding a blacklist or whitelist into the mix so that links to adservers, google etc could be excluded and not crawled.

        Look forward to your progress in this, and will grab and test when you update again, best regards

        • http://www.skorks.com Alan Skorkin

          Thanks man, I really appreciate all the feedback and testing and thanks for the ideas as well, they are certainly something that can be included in future versions. I am trying to keep it pretty simple for the moment. Once I’ve built a nice simple vertical slice, there are all sorts of different ways to expand and make it more configurable.

  • http://www.dancingbison.com Vasudev Ram

    >Regarding using the ‘if’ after, i was actually just trying it out to see how it feels :).

    Good reason :)

  • pramod

    Hi
    This is a really good one. Can you give any references to books for learning scraping or crawling in Ruby? I am new to this and very excited about it.

  • http://blog.segment7.net Eric Hodel

    Why didn’t you use the built-in URI library?

    require ‘uri’
    uri = URI.parse ‘http://example.com/foo/bar/baz’
    uri + ‘../../other/place’

    • http://www.skorks.com Alan Skorkin

      Yeah I tried that, but the amount of work needed to get that to work ends up being similar.

      It depends on what the relative url looks like, as well as the absolute url of the host, e.g.
      if the relative url starts with a / then it is relative to the base of the host,
      if it doesn't start with a slash then it is relative to its current location,
      and then you still need to work out how many ../ you need to prepend to your relative url for everything to be correct. These are just some of the considerations.

      You’re right though you can certainly do this using the uri library as well.

  • Georgios M

    Well done on this. :) Will you be making additions to this soon? (Indexing and Searching the index?)

    Thanks

    • http://www.skorks.com Alan Skorkin

      Timely question, I’ve actually been learning quite a bit about search and crawling and indexing etc lately, I am planning to do a whole lot of posts about search, indexing, etc. I also want to do more with this crawler as there are a few things that I am now aware of that I wasn’t before which would be worth discussing. So stay tuned, once my life settles a little and I get back into the groove, there will be a whole lot of interesting (hopefully) stuff coming up.