How To Write A Simple Web Crawler In Ruby

I had an idea the other day: to write a basic search engine – in Ruby (did I mention I've been playing around with Ruby lately?). I am well aware that there are perfectly adequate Ruby crawlers available, such as RDig or Mechanize. But I don't want to use any libraries for the higher level functions of my search engine (crawling, indexing, querying etc.), at least not for the first version. The main idea is to learn (while doing something fun and interesting), and the best way to learn is sometimes to do things the hard way. Now that I've ensured I don't get eaten alive for not reusing existing code, we can move on :).

I will examine all the different aspects of what makes a search engine (the anatomy) in a later post. In the meantime, I believe doing something like this gives you an opportunity to experience first-hand all the different things you have to keep in mind when writing a search engine. It gives you a chance to learn why we do SEO the way we do; it lets you play with different Ruby language features, database access and ranking algorithms, not to mention simply cutting some code for the experience. And you get all this without touching Rails; nothing against Rails, but I prefer to get comfortable with Ruby (by itself) first.

Well, to dive right into it, I decided to write the crawler first; after all, you can't have a search engine without a nice web crawler. First things first, some basic features:

  • should be able to crawl the web (basic general functionality)
  • must be able to limit the depth of the crawl (otherwise it will potentially keep going forever, and will also eventually run out of memory because of the way it's written)
  • must be able to limit the number of pages to crawl (even with a depth-limited crawl, there might still be too many pages to get through, depending on the starting set of domains)
  • must be able to crawl just a single domain (you would also be able to limit this by number of pages)
  • the only output it will produce is to print out the fact that it is crawling a url

If you want to dive right into the code and explore, you can download it here. For the rest, here is how it works. Firstly, to run it, do the following:

ruby search-engine-main.rb -c web -d 3 -p 100 -f 'urls.txt'

where:

-c (is either ’web’ or ‘domain’)

-d (is the depth of the crawl, it will only look at links this many levels below the initial urls)

-p (is the page limit, will not crawl more than this many pages regardless of other parameters)

-f (the file to use as input, simply contains a list of starting urls on separate lines, will use the first one from this file for domain crawl)
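
The CommandLineArgumentParser and UrlStore classes that handle all of this live in the download rather than in this post, and they aren't very interesting. To give you a rough idea though, a stripped-down sketch (my simplification here, not the exact code in the download) using Ruby's standard OptionParser would look something like this:

  require 'optparse'

  # Parses the -c, -d, -p and -f switches described above.
  class CommandLineArgumentParser
    WEB_CRAWLER = 'web'
    DOMAIN_CRAWLER = 'domain'

    attr_reader :crawl_type, :crawl_depth, :page_limit, :url_file

    def initialize
      @crawl_type = WEB_CRAWLER
      @crawl_depth = 3
      @page_limit = 100
      @url_file = 'urls.txt'
    end

    def parse_arguments
      OptionParser.new do |opts|
        opts.on('-c CRAWL_TYPE')  { |type|  @crawl_type  = type }
        opts.on('-d CRAWL_DEPTH') { |depth| @crawl_depth = depth.to_i }
        opts.on('-p PAGE_LIMIT')  { |limit| @page_limit  = limit.to_i }
        opts.on('-f URL_FILE')    { |file|  @url_file    = file }
      end.parse!
    end
  end

  # Reads the list of starting urls from the input file, one url per line.
  class UrlStore
    def initialize(url_file)
      @urls = File.readlines(url_file).map { |line| line.strip }.reject { |line| line.empty? }
    end

    def get_urls
      @urls
    end

    def get_url
      @urls.first
    end
  end

Note the to_i calls on the depth and page limit; as you'll see in the comments, treating those values as strings was an actual bug in the first version of the code.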

Our nice and clean entry point looks like this :):

argument_parser = CommandLineArgumentParser.new
argument_parser.parse_arguments
spider = Spider.new
url_store = UrlStore.new(argument_parser.url_file)
spider.crawl_web(url_store.get_urls, argument_parser.crawl_depth, argument_parser.page_limit) if argument_parser.crawl_type == CommandLineArgumentParser::WEB_CRAWLER
spider.crawl_domain(url_store.get_url, argument_parser.page_limit) if argument_parser.crawl_type == CommandLineArgumentParser::DOMAIN_CRAWLER

We don't really care about this too much since this is not where the real fun bits are, so let's move on to that.

The Spider

The main worker class here is the Spider class. It contains two public methods, crawl_web and crawl_domain. Crawling the web looks like this:

  def crawl_web(urls, depth=2, page_limit = 100)
    depth.times do
      next_urls = []
      urls.each do |url|
        url_object = open_url(url)
        next if url_object == nil
        url = update_url_if_redirected(url, url_object)
        parsed_url = parse_url(url_object)
        next if parsed_url == nil
        @already_visited[url]=true if @already_visited[url] == nil
        return if @already_visited.size == page_limit
        next_urls += (find_urls_on_page(parsed_url, url) - @already_visited.keys)
        next_urls.uniq!
      end
      urls = next_urls
    end
  end

As you can see, it is not recursive, as this makes it easier to limit the depth of the crawl. You can also tell that we take special care to handle server side redirects. The class also keeps the urls it has already visited in memory, to guard against getting into loops visiting the same few pages over and over. This is obviously not the most efficient way and will not scale anywhere past a few thousand pages, but for our simple crawler it is fine. We can improve it later.
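
The helper methods crawl_web leans on (open_url, update_url_if_redirected and parse_url) aren't shown here. Roughly speaking they are thin wrappers around open-uri and Hpricot; a sketch along these lines gives you the idea (this is a simplification rather than line-for-line what's in the download, and I've put the 'crawling url' output in parse_url, though it could live elsewhere):

  require 'open-uri'
  require 'hpricot'

  # Fetch the page, returning nil if the url can't be opened for any reason.
  def open_url(url)
    open(url)
  rescue
    puts "Unable to open url: #{url}"
    nil
  end

  # open-uri follows server side redirects, so the object we get back may have
  # a different base_uri to the url we originally asked for.
  def update_url_if_redirected(url, url_object)
    return url_object.base_uri.to_s if url != url_object.base_uri.to_s
    url
  end

  # Parse the page with Hpricot, returning nil if parsing blows up.
  def parse_url(url_object)
    doc = Hpricot(url_object)
    puts "Crawling url: #{url_object.base_uri}"
    doc
  rescue
    nil
  end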

Crawling a domain looks like this:

  def crawl_domain(url, page_limit = 100)
    return if @already_visited.size == page_limit
    url_object = open_url(url)
    return if url_object == nil
    parsed_url = parse_url(url_object)
    return if parsed_url == nil
    @already_visited[url]=true if @already_visited[url] == nil
    page_urls = find_urls_on_page(parsed_url, url)
    page_urls.each do |page_url|
      if urls_on_same_domain?(url, page_url) and @already_visited[page_url] == nil
        crawl_domain(page_url, page_limit)
      end
    end
  end

This time the algorithm is recursive, as it is easier to crawl a domain this way, and we don't need to limit the depth. Everything else is pretty much the same, except that we take special care to only crawl links on the same domain, and we no longer need to worry about redirection.
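
The urls_on_same_domain? check is about as simple as it sounds: compare the hosts of the two urls. A quick sketch (mine, rather than the exact code in the download) using the standard URI library:

  require 'uri'

  # Two urls are considered to be on the same domain if their hosts match.
  def urls_on_same_domain?(url, other_url)
    URI.parse(url).host == URI.parse(other_url).host
  rescue URI::InvalidURIError
    false
  end

Note that a straight host comparison treats www.example.com and example.com as different domains, which may or may not be what you want.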

To get the links from the page I use Hpricot. I know I said I didn't want to use too many libraries, but parsing html by hand would just be torture :). Here's how I find all the links:

  def find_urls_on_page(parsed_url, current_url)
    urls_list = []
    parsed_url.search('a[@href]').each do |x|
      new_url = x['href'].split('#')[0]
      unless new_url == nil
        if relative?(new_url)
          new_url = make_absolute(current_url, new_url)
        end
        urls_list.push(new_url)
      end
    end
    return urls_list
  end

The challenging part here is handling the relative urls and making them absolute. I didn't do anything fancy here. There is a helper module that I created (UrlUtils – yeah I know, great name :)); it doesn't have anything too interesting in it, just blood, sweat and string handling code. That's pretty much all there is to it.
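
If you'd rather cheat than hand-roll the string handling, the standard URI library can get you most of the way there (there is some discussion of this in the comments below). The two methods that find_urls_on_page relies on then boil down to something like this sketch (a URI-based approximation, not the string handling code that's actually in the download):

  require 'uri'

  # A url is relative if it doesn't carry its own scheme and host.
  def relative?(url)
    URI.parse(url).relative?
  rescue URI::InvalidURIError
    false
  end

  # Resolve a relative url against the page it was found on, e.g.
  # make_absolute('http://example.com/foo/', '../bar') gives 'http://example.com/bar'.
  def make_absolute(current_url, relative_url)
    URI.join(current_url, relative_url).to_s
  rescue URI::InvalidURIError
    nil
  end

Now there are a few points that we need to note about this crawler.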

It is not an 'enterprise strength' type of crawler, so don't go trying to unleash it on the whole of the web. Do make liberal use of the depth and page limiters; I wouldn't try to get it to handle more than a few thousand pages at a time (for the reasons I noted above). Some other limitations are as follows:

  • won't work behind a proxy
  • won't handle ftp, https (maybe), etc. properly
  • client side redirects – you're out of luck
  • if the link is not in an href attribute (e.g. generated by JavaScript) – no good
  • relative url handling could probably be improved some more
  • will not respect no-follow and similar rules that other bots do, e.g. robots.txt (there is a rough sketch of a check after this list)
  • if the list of starting urls is too big, it can grind to a halt
  • probably a whole bunch of other limitations that I haven’t even begun thinking of
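
On the robots.txt point, a very naive check isn't a huge amount of extra work. Something like the following sketch gives you the flavour (it isn't part of the download, it ignores per-user-agent sections, and a genuinely polite bot would need to do quite a bit more):

  require 'open-uri'
  require 'uri'

  # Very naive robots.txt check: grab the file for the url's host and see if any
  # Disallow rule (for any user agent) matches the start of the url's path.
  def allowed_by_robots_txt?(url)
    uri = URI.parse(url)
    robots_url = "#{uri.scheme}://#{uri.host}/robots.txt"
    disallowed_paths = []
    open(robots_url).each_line do |line|
      disallowed_paths << $1 if line =~ /^\s*Disallow:\s*(\S+)/
    end
    path = uri.path.empty? ? '/' : uri.path
    !disallowed_paths.any? { |disallowed| path.index(disallowed) == 0 }
  rescue
    true # no robots.txt, or we couldn't read it, so assume we're allowed
  end

You would call something like this before adding a url to the list of pages to crawl.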

Of course there is an easy way to make this guy a little bit more scalable and useful – introduce indexing. This would be the next thing to do, because even a simple little search engine needs some indexing. It would allow us to stop storing so much stuff in memory, and also to retain more info about pages, which would potentially let us make better decisions, but that's a story for another time :). Finally, it probably goes without saying that I didn't test this guy exhaustively, and I only ran it on XP (too lazy to boot into my Linux :)). Anyway, have a play with it if you like, and feel free to suggest improvements and point out issues (or just say hello in general) while I start thinking about an indexer.

You can download all the code here.

Images by jpctalbot and mkreyness

  • Victor

    Thank you for sharing this. Found it on Reddit.

  • http://www.dancingbison.com Vasudev Ram

    Interesting post, thanks for sharing. I'd read another post on the same topic (writing a web spider) at IBM developerWorks, IIRC. Don't have the URL handy but it can be googled for.

    I liked your code in the examples above – seems well-refactored.

    Minor nit plus question – why do you use the if-after-imperative style, as with “return if …” or ” if …” ?

    (I’m aware that Ruby, following Perl, supports that style, and some general reasons for it.)

    In the case of the "return if", it's not really much of an issue to me; the code is still clear.
    But in the case of the method calls (with arguments) followed by the if, the if part is hidden on the right side of the screen (the code box). Like in the line that starts like this:

    spider.crawl_web(url_store.get_urls, …

    So it’s not obvious that there is an if there, unless you are reading carefully and scroll to the right …

    It’s your call, of course, just mentioning my opinion …

    – Vasudev

    • http://www.skorks.com Alan Skorkin

      Regarding using the 'if' after, I was actually just trying it out to see how it feels :). You're right that if a line scrolls off the visible part of the screen it is quite possible to miss the fact that there is an 'if', which is an issue, as I like to be explicit with my code.

      It does make things ‘look’ cleaner as things fit on one line rather than 3. I probably need to look at some more ruby code to decide if I like it or not.

  • grimnir

    Thanks a lot, interesting article.
    I already have an idea how to use this :)

    • http://www.skorks.com Alan Skorkin

      You’re welcome, would you like to share your idea? (unless it’s a secret big idea that will make you millions that is :), in which case I’d keep it to yourself)

  • Pieter

    I get the same error all the time. Any ideas?

    >ruby search-engine-main.rb -c web -d 1 -p 1000 -f ‘urls.txt’
    ./spider.rb:46:in `crawl_web’: undefined method `times’ for “1”:String (NoMethodError)
    from search-engine-main.rb:9

    • http://www.skorks.com Alan Skorkin

      Yeah, I probably wasn’t as clear on this in the post as I could have been. Your problem is with the depth limiting (-d 1). I guess the easiest way to explain it is to say that this parameter must have a minimum value of 2.

      The reason this is so is because the number includes your initial set of urls that you provide in the ‘url.txt’ file. Therefore if you try and set -d to be 1 it would in theory try to only visit the urls that you provide in the file.

      In practice it doesn’t really work (as you’ve discovered), probably because calling ‘times’ for a value of 1 doesn’t make too much sense. I should probably update the code to handle a depth limit value of 1 as a special case. In the meantime, use -d 2 as a minimum and it should work.

      • Pieter

        Thanks for the quick response. I am using the stock standard url.txt file that came with the package for testing purpose and with a depth of two I still get the same error (I have tried 3 as well)

        D:\Scripts\Webcrawler>ruby search-engine-main.rb -c web -d 2 -p 1 -f ‘urls.txt’
        ./spider.rb:46:in `crawl_web’: undefined method `times’ for “2”:String (NoMethodError) from search-engine-main.rb:9

        Could it be that my ruby configuration is missing something? I have hpricot, can get uri or uri-utils (I assume they are standard) and I am running the script from the D: drive if that could be a factor

        • Pieter

          What I want to do with the script is to actually just follow the links in the url.txt for a depth of 1 i.e. check link1 in url.txt, get links on the first page, return, check link2 in url.txt, get links etc etc rather than spider through the remainder of the web

          • http://www.skorks.com Alan Skorkin

            Hi, sorry about that, it is a little bug :). If you ran the crawler with all default it was all fine, but if you supplied the depth it is treating it as a string and therefore doesn’t work (I get the same on my end). Give me a few minutes, I’ll fix it up and upload again.

            To do what you want to do you would need to set the depth to be 2, then it would spider through the original urls as well as all the links it finds on those pages. If you set the depth to be 1 it would only visit 2 pages, the ones in urls.txt.

          • http://www.skorks.com Alan Skorkin

            It’s fixed now and I’ve uploaded the new version. Try downloading it again and giving it a try, should work fine now. By the way the page limit wasn’t working either, for the same reason the depth limit wasn’t working, a to_i call when parsing the arguments took care of it.

            Thanks for finding this bug for me, really shows the value of testing stuff, even a little bit :).

  • Pieter

    Thanks Alan, that was quick. Running as we speak, will let you know the outcome. So far it has not given any errors, and seems fine. What is the output, to a file or just to the console ?

    • http://www.skorks.com Alan Skorkin

      It just outputs to the console at the moment. Once I add an indexer into the mix then all sorts of things will start to become possible, so stay tuned :). Might also do a little bit more testing next time :).

      • Pieter

        I can help with testing with pleasure, as long as you do not ask me for any code!! Ok, you can ask, but it would be on your own risk basis :-)

        • http://www.skorks.com Alan Skorkin

          I might just take you up on that, once I get my act together and write the indexer :).

  • Pieter

    All went well up to this point:

    d:/Ruby/lib/ruby/1.8/timeout.rb:60:in `rbuf_fill’: execution expired (Timeout::Error)
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill’
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:116:in `readuntil’
    from d:/Ruby/lib/ruby/1.8/net/protocol.rb:126:in `readline’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:2020:in `read_status_line’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:2009:in `read_new’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:1050:in `request’
    from d:/Ruby/lib/ruby/1.8/open-uri.rb:248:in `open_http’
    from d:/Ruby/lib/ruby/1.8/net/http.rb:543:in `start’
    … 11 levels…
    from ./spider.rb:18:in `crawl_web’
    from ./spider.rb:16:in `times’
    from ./spider.rb:16:in `crawl_web’
    from search-engine-main.rb:9

    • http://www.skorks.com Alan Skorkin

      That one I have seen before, it seems to only happen intermittently though. Try running it a couple more times and see if it happens for you again. I just ran it through with depth 2 and it completed fine. Probably best way for me to handle it would be to catch the exception that causes the method to time out. It is possibly due to page taking too long to come back but not too sure without further investigation.

  • Pieter

    What I have also just picked up is that if the links in the url.txt file do not end with a trailing /, the crawler will only traverse the url.txt file, but will not crawl further irrespective of the depth setting. Perhaps it would be useful to do a check on the input and to add the / if it is missing?

    • http://www.skorks.com Alan Skorkin

      Hmmm, now that one shouldn’t happen at all…

      I just tried it here and it seemed to work fine regardless of trailing slash, I’ll need to investigate this some more. I am presuming you’re still using the default ‘urls.txt’?

  • Pieter

    No, I used some real links I wanted to crawl this time. But, I tested it again and you are right it does work with or without. The difference however was that in case 1 (which did not work) the number of links in url.txt was greater than the pages=100 setting (i.e. 400 links) and all it did was traverse through 100 links in url.txt and stopped, without doing any further crawling beyond that level (obviously as it has reached the pages limit). Would it be better if the links in the url.txt file are excluded from the page count?

    • http://www.skorks.com Alan Skorkin

      Ahh, ok well that explains it. The reason that the urls in urls.txt are included is that if you hook up an indexer to the crawler then it will index the pages in urls.txt as well as crawling the links and indexing those pages. So it actually makes sense to count them as well.

      It is possible to make this configurable in case people didn’t want to index the urls that they supply.

      • Pieter

        It works great. I look forward to see the progress (I have grabbed your RSS feed for that!). On the point above, the ‘risk’ of course is that as you do not know the number of links on that site or page, that you may never have a full index of that website, and will have to repeat with different page counts by trial and error until you have found all the links on a site (even if you crawl only at depth 2)

        The alternative could of course be to make the page count per url.txt link, so that it will always traverse the full url.txt link, but only ‘-p ####’ per link.

        Another way could be to keep the page count down by adding a blacklist or whitelist into the mix so that links to adservers, google etc could be excluded and not crawled.

        Look forward to your progress in this, and will grab and test when you update again, best regards

        • http://www.skorks.com Alan Skorkin

          Thanks man, I really appreciate all the feedback and testing and thanks for the ideas as well, they are certainly something that can be included in future versions. I am trying to keep it pretty simple for the moment. Once I’ve built a nice simple vertical slice, there are all sorts of different ways to expand and make it more configurable.

  • http://www.dancingbison.com Vasudev Ram

    >Regarding using the ‘if’ after, i was actually just trying it out to see how it feels :).

    Good reason :)

  • pramod

    Hi
    This is a really good one. Can you give any references to books for learning scraping or crawling in Ruby? I am new to this and very excited about it.

  • http://blog.segment7.net Eric Hodel

    Why didn’t you use the built-in URI library?

    require ‘uri’
    uri = URI.parse ‘http://example.com/foo/bar/baz’
    uri + ‘../../other/place’

    • http://www.skorks.com Alan Skorkin

      Yeah I tried that, but the amount of work needed to get that to work ends up being similar.

      It depends on what the relative url looks like, as well as the absolute url of the host, e.g.
      if the relative url starts with a / then it is relative to the base of the host,
      if it doesn't start with a slash then it is relative to its current location,
      and then you still need to work out how many ../ you need to prepend to your relative url for everything to be correct. These are just some of the considerations.

      You’re right though you can certainly do this using the uri library as well.

  • Georgios M

    Well done on this. :) Will you be making additions to this soon? (Indexing and Searching the index?)

    Thanks

    • http://www.skorks.com Alan Skorkin

      Timely question, I’ve actually been learning quite a bit about search and crawling and indexing etc lately, I am planning to do a whole lot of posts about search, indexing, etc. I also want to do more with this crawler as there are a few things that I am now aware of that I wasn’t before which would be worth discussing. So stay tuned, once my life settles a little and I get back into the groove, there will be a whole lot of interesting (hopefully) stuff coming up.