<?xml version="1.0" encoding="UTF-8"?><rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
> <channel><title>Comments on: How Search Engines Process Documents Before Indexing</title> <atom:link href="http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/feed/" rel="self" type="application/rss+xml" /><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/</link> <description>For the betterment of the software craft...</description> <lastBuildDate>Mon, 21 Nov 2011 13:57:06 +0000</lastBuildDate> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.1.2</generator> <item><title>By: Tesfaye</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-5902</link> <dc:creator>Tesfaye</dc:creator> <pubDate>Thu, 03 Jun 2010 05:46:42 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-5902</guid> <description>Thank u Skorkin, for ur prompt response!
Yeah I have seen stuff around lucene ...&quot;Indexing and searching&quot;...that is okay. My reason to go for Nutch was to use its advantage of crawling and interface design. Anyways I&#039;ll check around.
CHEERS!
- Tesfaye</description> <content:encoded><![CDATA[<p>Thank u Skorkin, for ur prompt response!<br
/> Yeah I have seen stuff around lucene &#8230;&#8221;Indexing and searching&#8221;&#8230;that is okay. My reason to go for Nutch was to use its advantage of crawling and interface design. Anyways I&#8217;ll check around.<br
/> CHEERS!<br
/> - Tesfaye</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-5878</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Wed, 02 Jun 2010 14:36:33 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-5878</guid> <description>I know of Nutch, but I&#039;ve never used it in anger. Nutch builds on top of Apache Lucene. Nutch provides a crawler and a bunch of other stuff, but it is Lucene that does the heavy lifting, such as indexing. I would recommend reading up on Lucene a bit.
I am not precisely sure, but you should find something along the lines of a pipeline of processor classes that you can configure in Lucene that do a bunch of stuff during indexing, such as possibly stemming etc. You should be able to write your own class that conforms to the same interface as these processors and should be able to configure it as one of the stages in the pipeline.
Like I said, this is only a guess, but an educated one :). See how you go with that. You can find books on Lucene (there are a few around) to help you out, look around. Hope this helps.</description> <content:encoded><![CDATA[<p>I know of Nutch, but I&#8217;ve never used it in anger. Nutch builds on top of Apache Lucene. Nutch provides a crawler and a bunch of other stuff, but it is Lucene that does the heavy lifting, such as indexing. I would recommend reading up on Lucene a bit.</p><p>I am not precisely sure, but you should find something along the lines of a pipeline of processor classes that you can configure in Lucene that do a bunch of stuff during indexing, such as possibly stemming etc. You should be able to write your own class that conforms to the same interface as these processors and should be able to configure it as one of the stages in the pipeline.</p><p>Like I said, this is only a guess, but an educated one :). See how you go with that. You can find books on Lucene (there are a few around) to help you out, look around. Hope this helps.</p> ]]></content:encoded> </item> <item><title>By: Tesfaye</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-5877</link> <dc:creator>Tesfaye</dc:creator> <pubDate>Wed, 02 Jun 2010 14:14:46 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-5877</guid> <description>Alan Skorkin,
Oh! I read your article and found it really interesting. KEEP IT UP!
I have a question:
I am working on a local search engine for my master&#039;s thesis. I configured &quot;Nutch&quot; (this is an open source search engine) and am able to play with it. I found Nutch automatically indexing the documents it has crawled. But I want to do some preprocessing on the crawled documents before indexing. So can u please tell me how to go about it (if u have any exposure to it)? Or can you tell me which group to join for discussion? Because that will greatly simplify my job.
Thanking u in advance and waiting for ur response soon!
-Tesfaye</description> <content:encoded><![CDATA[<p>Alan Skorkin,<br
/> Oh! I read your article and found it really interesting. KEEP IT UP!</p><p>I have a question:<br
/> I am working on a local search engine for my master&#8217;s thesis. I configured &#8220;Nutch&#8221; (this is an open source search engine) and am able to play with it. I found Nutch automatically indexing the documents it has crawled. But I want to do some preprocessing on the crawled documents before indexing. So can u please tell me how to go about it (if u have any exposure to it)? Or can you tell me which group to join for discussion? Because that will greatly simplify my job.</p><p>Thanking u in advance and waiting for ur response soon!<br
/> -Tesfaye</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-3765</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Tue, 02 Mar 2010 23:11:59 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-3765</guid> <description>Hi Bert,
I am glad you&#039;ve been enjoying it. Yeah, it&#039;s amazing just how much goes on behind the scenes when it comes to how search engines work. It is reasonably easy to set up a search system using something like Lucene, but this kind of deeper knowledge really helps when you want to tweak it to perform better, or when you need to debug issues.</description> <content:encoded><![CDATA[<p>Hi Bert,</p><p>I am glad you&#8217;ve been enjoying it. Yeah, it&#8217;s amazing just how much goes on behind the scenes when it comes to how search engines work. It is reasonably easy to set up a search system using something like Lucene, but this kind of deeper knowledge really helps when you want to tweak it to perform better, or when you need to debug issues.</p> ]]></content:encoded> </item> <item><title>By: Bert Willems</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-3763</link> <dc:creator>Bert Willems</dc:creator> <pubDate>Tue, 02 Mar 2010 15:22:40 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-3763</guid> <description>Hello Alan,
Thank you for this great article, it is written nicely.
I have done several search engine implementations using Lucene, but these posts made me fully aware of all the decisions involved in constructing a search engine, and I am looking forward to your next article.
Best regards,
Bert</description> <content:encoded><![CDATA[<p>Hello Alan,</p><p>Thank you for this great article, it is written nicely.</p><p>I have done several search engine implementations using Lucene, but these posts made me fully aware of all the decisions involved in constructing a search engine, and I am looking forward to your next article.</p><p>Best regards,<br
/> Bert</p> ]]></content:encoded> </item> <item><title>By: iPad Links: Monday, March 1, 2010 &#171; Mike Cane&#39;s iPad Test</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-3759</link> <dc:creator>iPad Links: Monday, March 1, 2010 &#171; Mike Cane&#39;s iPad Test</dc:creator> <pubDate>Mon, 01 Mar 2010 23:54:45 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-3759</guid> <description>[...] Metadata! More Important Than Ever! How I Made $6K With My eBook The $75 eBook: A True Story How Search Engines Process Documents Before Indexing Principles behind a Freemium Pricing Model CopyNazis On The [...]</description> <content:encoded><![CDATA[<p>[...] Metadata! More Important Than Ever! How I Made $6K With My eBook The $75 eBook: A True Story How Search Engines Process Documents Before Indexing Principles behind a Freemium Pricing Model CopyNazis On The [...]</p> ]]></content:encoded> </item> <item><title>By: Alan Skorkin</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-3758</link> <dc:creator>Alan Skorkin</dc:creator> <pubDate>Mon, 01 Mar 2010 23:16:50 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-3758</guid> <description>Hi Kartik,
I don&#039;t know of a pure Ruby lemmatizer, but this might be helpful http://english.rubyforge.org/. It contains a stemmer as well as a lot of other language utilities.</description> <content:encoded><![CDATA[<p>Hi Kartik,</p><p>I don&#8217;t know of a pure Ruby lemmatizer, but this might be helpful <a
href="http://english.rubyforge.org/" rel="nofollow">http://english.rubyforge.org/</a>. It contains a stemmer as well as a lot of other language utilities.</p> ]]></content:encoded> </item> <item><title>By: Kartik Rao</title><link>http://www.skorks.com/2010/03/how-search-engines-process-documents-before-indexing/comment-page-1/#comment-3756</link> <dc:creator>Kartik Rao</dc:creator> <pubDate>Mon, 01 Mar 2010 18:52:24 +0000</pubDate> <guid
isPermaLink="false">http://www.skorks.com/?p=1369#comment-3756</guid> <description>Hey Alan,
I&#039;m currently building a small search engine using Ruby on Rails and I found your article extremely helpful.
Now, do you know of any good English lemmatizer written in Ruby?
Cheers,
Kartik</description> <content:encoded><![CDATA[<p>Hey Alan,</p><p>I&#8217;m currently building a small search engine using Ruby on Rails and I found your article extremely helpful.</p><p>Now, do you know of any good English lemmatizer written in Ruby?</p><p>Cheers,<br
/> Kartik</p> ]]></content:encoded> </item> </channel> </rss>
