Fetching RSS Feeds With Ruby From Behind A Proxy

I was trying to fetch some RSS feeds with Ruby the other day. I just needed something quick (but not dirty, I don’t like dirty) to validate a couple of things I was trying out with Rails. Fetching feeds is not that difficult and there is enough info around, but unfortunately I was behind a proxy at the time, and this is where things went south quickly. It seems the Ruby ecosystem has a natural hatred for corporate proxies; it has come back to bite me time and time again. This really needs to be rectified if we want Ruby to penetrate further into the corporate world, but that’s a story for another post.

Fetching Feeds Using Feedzirra

The first thing I tried was Feedzirra, which I found out about from Railscasts. At first sight everything seemed ok: it respects the proxy settings you have configured in your environment (i.e. the http_proxy environment variable), as it uses a fork of Curb (libcurl bindings for Ruby). However, if you haven’t configured the proxy within your environment, everything falls over, since there doesn’t seem to be a way to supply your proxy settings to Feedzirra directly. I would be happy to be corrected here! I wanted to fetch feeds from within a Rails app, and the Rails process doesn’t necessarily inherit your shell’s environment variables – no good (I am sure there are ways to make Rails do so, but like I said – don’t want dirty).

What was more frustrating was the fact that diagnosing this problem was a pain. Normally to fetch a feed with Feedzirra you have to do the following:

feed = Feedzirra::Feed.fetch_and_parse(feed_url)

You can then iterate over the entries:

feed.entries.each do |entry|
  puts entry.title # do something with each entry here
end

Without the proxy settings though, I was getting 0 (zero) as the return value of the initial call. Not very intuitive! Of course I suspected that the proxy was at fault, but still. After reading the documentation a little more closely I came across the following bit:

feed = Feedzirra::Feed.fetch_and_parse(feed_url,
  :on_success => lambda { |feed| puts feed.title },
  :on_failure => lambda { |url, response_code, response_header, response_body| puts response_body })

This lets you define custom behaviour for successes and failures, which made everything simpler: the failure block you supply gets invoked when the feed is not fetched successfully, and its response_code parameter contains the actual response code (407, proxy authentication required). Still, diagnosing the problem didn’t give me a cure. I am sure there must be a way to patch something to allow Feedzirra to take proxy settings directly. I had a quick look at the code, but it seemed I would need to patch multiple things to get it to work, since Feedzirra hangs off several other libraries. Instead I decided to take a lower-level look at my feed fetching (and parsing).

Fetching Feeds Using Open-uri

Apparently, there is a way to work with RSS feeds that is built into Ruby. After a quick glance this looked pretty much as simple as Feedzirra, so I decided to give it a go. Guess what problem I ran into? It doesn’t play nice with proxies – what a freaking surprise! The Ruby RSS library relies on open-uri to actually fetch the feed, which should pick up the proxy settings from your environment as well as give you the ability to pass the proxy configuration directly. But even a simple stand-alone script (with the environment configured correctly) refused to work for me. After searching around for a bit I found this little tidbit.

Apparently open-uri has a bit of a bug/issue whereby the proxy username and password don’t get picked up. Long story short, if you want to get it to work you need to patch the OpenURI module slightly. Fortunately this is easy to do with Ruby: just open up the module and go for your life. Like so:

require 'open-uri' # load the original first so the definition below overrides it

module OpenURI
 def OpenURI.open_http(buf, target, proxy, options) # :nodoc:
   if proxy
     raise "Non-HTTP proxy URI: #{proxy}" if proxy.class != URI::HTTP
   end
 
   if target.userinfo && "1.9.0" <= RUBY_VERSION
     # don't raise for 1.8 because compatibility.
     raise ArgumentError, "userinfo not supported.  [RFC3986]"
   end
 
   require 'net/http'
   klass = Net::HTTP
   if URI::HTTP === target
     # HTTP or HTTPS
     if proxy
       klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)
     end
     target_host = target.host
     target_port = target.port
     request_uri = target.request_uri
   else
     # FTP over HTTP proxy
     target_host = proxy.host
     target_port = proxy.port
     request_uri = target.to_s
   end
 
   http = klass.new(target_host, target_port)
   if target.class == URI::HTTPS
     require 'net/https'
     http.use_ssl = true
     http.verify_mode = OpenSSL::SSL::VERIFY_PEER
     store = OpenSSL::X509::Store.new
     store.set_default_paths
     http.cert_store = store
   end
 
   header = {}
   options.each {|k, v| header[k] = v if String === k }
 
   resp = nil
   http.start {
     req = Net::HTTP::Get.new(request_uri, header)
     if options.include? :http_basic_authentication
       user, pass = options[:http_basic_authentication]
       req.basic_auth user, pass
     end
     http.request(req) {|response|
       resp = response
       if options[:content_length_proc] && Net::HTTPSuccess === resp
         if resp.key?('Content-Length')
           options[:content_length_proc].call(resp['Content-Length'].to_i)
         else
           options[:content_length_proc].call(nil)
         end
       end
       resp.read_body {|str|
         buf << str
         if options[:progress_proc] && Net::HTTPSuccess === resp
           options[:progress_proc].call(buf.size)
         end
       }
     }
   }
   io = buf.io
   io.rewind
   io.status = [resp.code, resp.message]
   resp.each {|name,value| buf.io.meta_add_field name, value }
   case resp
   when Net::HTTPSuccess
   when Net::HTTPMovedPermanently, # 301
        Net::HTTPFound, # 302
        Net::HTTPSeeOther, # 303
        Net::HTTPTemporaryRedirect # 307
     throw :open_uri_redirect, URI.parse(resp['location'])
   else
     raise OpenURI::HTTPError.new(io.status.join(' '), io)
   end
 end
end

All of that is just a big copy/paste from the actual open-uri.rb source file; all you really need to worry about is one line. Completely off-topic, does anyone else feel like that method could use a bit of a refactor? :) Back on topic: in the original open-uri, the line is:

klass = Net::HTTP::Proxy(proxy.host, proxy.port)

We changed it to:

klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)
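To see what the two extra arguments change without touching the network, here is a minimal sketch. The host names and credentials are placeholders I made up for illustration. Net::HTTP::Proxy returns a Net::HTTP subclass that remembers the proxy details, and nothing is sent until a request is actually started.

```ruby
require 'net/http'

# Placeholder proxy details, for illustration only.
proxy_host = 'proxy.example.com'
proxy_port = 8080

# What the original open-uri line builds: proxy host and port, no credentials.
without_auth = Net::HTTP::Proxy(proxy_host, proxy_port)

# What the patched line builds: the same proxy plus user name and password.
with_auth = Net::HTTP::Proxy(proxy_host, proxy_port, 'username', 'password')

# Instantiating the class opens no connection; it only records the settings.
plain_http = without_auth.new('feeds.example.org', 80)
auth_http  = with_auth.new('feeds.example.org', 80)

puts plain_http.proxy_user.inspect
puts auth_http.proxy_user.inspect
```

The first object carries no proxy user at all, which is consistent with the 407 response seen earlier; the second one has the credentials available when it talks to the proxy.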

After doing this the standalone script started fetching feeds without any trouble. All you need to do is the following (as per the link above):

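require 'open-uri'
require 'rss'

source = "feed_url"
content = ""
# On Ruby 3.0 and later, call URI.open(source) instead of open(source)
open(source) { |s| content = s.read }
rss = RSS::Parser.parse(content, false) # false skips feed validation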
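Once the content is parsed, working with it is straightforward. The sketch below uses a tiny hand-written RSS 2.0 document in place of a fetched feed, so it runs without a network or a proxy; the feed titles and links are invented. It assumes the rss library is available (on recent Ruby versions it ships as a bundled gem and may need installing separately).

```ruby
require 'rss'

# A small hand-written RSS 2.0 document standing in for fetched content.
content = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example Feed</title>
      <link>http://example.com/</link>
      <description>A sample feed for illustration</description>
      <item>
        <title>First post</title>
        <link>http://example.com/first</link>
      </item>
      <item>
        <title>Second post</title>
        <link>http://example.com/second</link>
      </item>
    </channel>
  </rss>
XML

# Same call as in the article; false turns off validation.
rss = RSS::Parser.parse(content, false)

puts rss.channel.title
rss.items.each do |item|
  puts "#{item.title} (#{item.link})"
end
```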

Now we just want the whole thing to work in Rails.

Getting It To Work In Rails

As I mentioned above, we don’t want to rely on Rails inheriting our environment variables, so we need to be able to supply the proxy configuration directly. This is fairly easy to do when you’re fetching the contents of the feed url:

open(source, :proxy => "proxy_url") { |s| content = s.read }
rss = RSS::Parser.parse(content, false)

The actual proxy url will have to be in the correct format. Assuming you’re on Linux, something along the lines of:

http://username:password@proxy-host:proxy-port
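The format matters because the patched open_http method reads the proxy details off a parsed URI object (proxy.host, proxy.port, proxy.user and proxy.password) and rejects anything that is not a URI::HTTP. A quick sketch of what Ruby’s URI library extracts from such a URL, using placeholder values of my own:

```ruby
require 'uri'

# Placeholder proxy URL; host, port and credentials are invented.
proxy = URI.parse('http://username:password@proxy.example.com:8080')

puts proxy.class    # the patched method raises unless this is URI::HTTP
puts proxy.host
puts proxy.port
puts proxy.user
puts proxy.password
```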

The only thing left to take care of now is to find a place for our open-uri monkey-patch and to get Rails to recognise that we patched a module in the first place. This is precisely what the:

RAILS_ROOT/config/initializers

directory is for. You can put the open-uri monkey-patch here and Rails will auto-magically load it. Alternatively you can put the patch file under:

RAILS_ROOT/lib

But you will still need to create a file under the initializers directory and then require the monkey-patch file from there. Incidentally this file would also be a good place to put any of the other extraneous requires if you don’t want to have them floating around your Rails controllers or models.

This whole ‘fetching feeds from behind a proxy‘ thing ended up being a lot more complicated than it had to be, or at least that’s how it felt. On the positive side, at least I learned a lot, so there is always a silver lining.



  • Korny

    The open-uri class in Ruby 1.9.1 fixes the above stuff – but you need to set an option :proxy_http_basic_authentication instead of just :proxy if you want to specify a username and password.
    The Rdoc is at http://ruby-doc.org/ruby-1.9/classes/OpenURI/OpenRead.html

    As an example, to use the http_proxy environment variable as it is commonly used, rather than in perfect-RFC-land where the open-uri authors seem to live:
    def fetch_stuff(url)
      open_uri_opts = {}
      if ENV['http_proxy']
        uri = URI.parse(ENV['http_proxy'])
        if uri.user || uri.password
          open_uri_opts[:proxy_http_basic_authentication] = [uri, uri.user, uri.password]
        else
          open_uri_opts[:proxy] = uri
        end
      end
      result = nil
      open(url, open_uri_opts) do |s|
        result = s.read
      end
      result
    end

    • http://www.skorks.com Alan Skorkin

      I guess that’s better than nothing although not really ideal either is it :).

  • Tom Meier
    • http://www.skorks.com Alan Skorkin

      And this is why I love open source :)

  • mosa

    That worked for me !!! Thanks a lot. I just changed line 216 in /usr/lib/ruby/1.8/open-uri.rb and wooooo, it worked.