<?xml version="1.0" encoding="UTF-8"?><rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
> <channel><title>Comments on: How To Write A Simple Web Crawler In Ruby</title> <atom:link href="http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/feed/" rel="self" type="application/rss+xml" /><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/</link> <description>For the betterment of the software craft...</description> <lastBuildDate>Mon, 21 Nov 2011 13:57:06 +0000</lastBuildDate> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.1.2</generator> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-3336</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Wed, 18 Nov 2009 10:19:52 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-3336</guid> <description>Timely question, I&#039;ve actually been learning quite a bit about search and crawling and indexing etc lately, I am planning to do a whole lot of posts about search, indexing, etc. I also want to do more with this crawler as there are a few things that I am now aware of that I wasn&#039;t before which would be worth discussing. So stay tuned, once my life settles a little and I get back into the groove, there will be a whole lot of interesting  (hopefully) stuff coming up.</description> <content:encoded><![CDATA[<p>Timely question, I&#8217;ve actually been learning quite a bit about search and crawling and indexing etc lately, I am planning to do a whole lot of posts about search, indexing, etc. I also want to do more with this crawler as there are a few things that I am now aware of that I wasn&#8217;t before which would be worth discussing. So stay tuned, once my life settles a little and I get back into the groove, there will be a whole lot of interesting  (hopefully) stuff coming up.</p> ]]></content:encoded> </item> <item><title>By: Georgios M</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-3329</link> <dc:creator>Georgios M</dc:creator> <pubDate>Thu, 12 Nov 2009 14:49:43 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-3329</guid> <description>Well done on this. :) Will you be making additions to this soon? (Indexing and Searching the index?)
Thanks</description> <content:encoded><![CDATA[<p>Well done on this. :) Will you be making additions to this soon? (Indexing and Searching the index?)</p><p>Thanks</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1863</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Tue, 04 Aug 2009 08:14:31 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1863</guid> <description>Yeah I tried that, but the amount of work needed to get that to work ends up being similar.
Depending on what the relative url looks like as well as the absolute url of the host e.g.
if relative url starts with a / then it is relative to the base of the host
if it doesn&#039;t start with a slash then it is relative to its current location
then you still need to calculate how many ../ to prepend to your relative url for everything to be correct. These are just some of the considerations.
You&#039;re right though you can certainly do this using the uri library as well.</description> <content:encoded><![CDATA[<p>Yeah I tried that, but the amount of work needed to get that to work ends up being similar.</p><p>Depending on what the relative url looks like as well as the absolute url of the host e.g.<br
/> if relative url starts with a / then it is relative to the base of the host<br
/> if it doesn&#8217;t start with a slash then it is relative to its current location<br
/> then you still need to calculate how many ../ to prepend to your relative url for everything to be correct. These are just some of the considerations.</p><p>You&#8217;re right though you can certainly do this using the uri library as well.</p> ]]></content:encoded> </item> <item><title>By: Eric Hodel</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1857</link> <dc:creator>Eric Hodel</dc:creator> <pubDate>Mon, 03 Aug 2009 23:13:11 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1857</guid> <description>Why didn&#039;t you use the built-in URI library?
require &#039;uri&#039;
uri = URI.parse &#039;http://example.com/foo/bar/baz&#039;
uri + &#039;../../other/place&#039;</description> <content:encoded><![CDATA[<p>Why didn&#8217;t you use the built-in URI library?</p><p>require &#8216;uri&#8217;<br
/> uri = URI.parse &#8216;<a
href="http://example.com/foo/bar/baz" rel="nofollow">http://example.com/foo/bar/baz</a>&#8216;<br
/> uri + &#8216;../../other/place&#8217;</p> ]]></content:encoded> </item> <item><title>By: pramod</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1852</link> <dc:creator>pramod</dc:creator> <pubDate>Mon, 03 Aug 2009 07:57:38 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1852</guid> <description>Hi
This is a really good one. Can you recommend any books for learning scraping or crawling in Ruby? I am new to this and very excited about it.</description> <content:encoded><![CDATA[<p>Hi<br
/> This is a really good one. Can you recommend any books for learning scraping or crawling in Ruby? I am new to this and very excited about it.</p> ]]></content:encoded> </item> <item><title>By: Vasudev Ram</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1836</link> <dc:creator>Vasudev Ram</dc:creator> <pubDate>Sat, 01 Aug 2009 15:51:14 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1836</guid> <description>&gt;Regarding using the ‘if’ after, i was actually just trying it out to see how it feels :).
Good reason :)</description> <content:encoded><![CDATA[<p>&gt;Regarding using the ‘if’ after, i was actually just trying it out to see how it feels :).</p><p>Good reason :)</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1819</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Wed, 29 Jul 2009 13:51:26 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1819</guid> <description>Thanks man, I really appreciate all the feedback and testing and thanks for the ideas as well, they are certainly something that can be included in future versions. I am trying to keep it pretty simple for the moment. Once I&#039;ve built a nice simple vertical slice, there are all sorts of different ways to expand and make it more configurable.</description> <content:encoded><![CDATA[<p>Thanks man, I really appreciate all the feedback and testing and thanks for the ideas as well, they are certainly something that can be included in future versions. I am trying to keep it pretty simple for the moment. Once I&#8217;ve built a nice simple vertical slice, there are all sorts of different ways to expand and make it more configurable.</p> ]]></content:encoded> </item> <item><title>By: Pieter</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1818</link> <dc:creator>Pieter</dc:creator> <pubDate>Wed, 29 Jul 2009 13:46:51 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1818</guid> <description>It works great. I look forward to see the progress (I have grabbed your RSS feed for that!). On the point above, the &#039;risk&#039; of course is that as you do not know the number of links on that site or page, that you may never have a full index of that website, and will have to repeat with different page counts by trial and error until you have found all the links on a site (even if you crawl only at depth 2)
The alternative could of course be to make the page count per url.txt link, so that it will always traverse the full url.txt link, but only &#039;-p ####&#039; per link.
Another way could be to keep the page count down by adding a blacklist or whitelist into the mix so that links to adservers, google etc could be excluded and not crawled.
Look forward to your progress in this, and will grab and test when you update again, best regards</description> <content:encoded><![CDATA[<p>It works great. I look forward to see the progress (I have grabbed your RSS feed for that!). On the point above, the &#8216;risk&#8217; of course is that as you do not know the number of links on that site or page, that you may never have a full index of that website, and will have to repeat with different page counts by trial and error until you have found all the links on a site (even if you crawl only at depth 2)</p><p>The alternative could of course be to make the page count per url.txt link, so that it will always traverse the full url.txt link, but only &#8216;-p ####&#8217; per link.</p><p>Another way could be to keep the page count down by adding a blacklist or whitelist into the mix so that links to adservers, google etc could be excluded and not crawled.</p><p>Look forward to your progress in this, and will grab and test when you update again, best regards</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1817</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Wed, 29 Jul 2009 13:34:37 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1817</guid> <description>Ahh, ok well that explains it. The reason that the urls in urls.txt are included is that if you hook up an indexer to the crawler then it will index the pages in urls.txt as well as crawling the links and indexing those pages.  So it actually makes sense to count them as well.
It is possible to make this configurable in case people didn&#039;t want to index the urls that they supply.</description> <content:encoded><![CDATA[<p>Ahh, ok well that explains it. The reason that the urls in urls.txt are included is that if you hook up an indexer to the crawler then it will index the pages in urls.txt as well as crawling the links and indexing those pages.  So it actually makes sense to count them as well.</p><p>It is possible to make this configurable in case people didn&#8217;t want to index the urls that they supply.</p> ]]></content:encoded> </item> <item><title>By: Pieter</title><link>http://www.skorks.com/2009/07/how-to-write-a-web-crawler-in-ruby/comment-page-1/#comment-1816</link> <dc:creator>Pieter</dc:creator> <pubDate>Wed, 29 Jul 2009 13:28:53 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=894#comment-1816</guid> <description>No, I used some real links I wanted to crawl this time. But, I tested it again and you are right it does work with or without. The difference however was that in case 1 (which did not work) the number of links in url.txt was greater than the pages=100 setting (i.e. 400 links) and all it did was traverse through 100 links in url.txt and stopped, without doing any further crawling beyond that level (obviously as it has reached the pages limit). Would it be better if the links in the url.txt file are excluded from the page count?</description> <content:encoded><![CDATA[<p>No, I used some real links I wanted to crawl this time. But, I tested it again and you are right it does work with or without. The difference however was that in case 1 (which did not work) the number of links in url.txt was greater than the pages=100 setting (i.e. 400 links) and all it did was traverse through 100 links in url.txt and stopped, without doing any further crawling beyond that level (obviously as it has reached the pages limit). Would it be better if the links in the url.txt file are excluded from the page count?</p> ]]></content:encoded> </item> </channel> </rss>
