Fetching RSS Feeds With Ruby From Behind A Proxy

I was trying to fetch some RSS feeds with Ruby the other day. I just needed something quick (but not dirty, I don’t like dirty) to validate a couple of things I was trying out with Rails. It is not that difficult to fetch feeds, there is enough info around, but unfortunately I was behind a proxy at the time and this is where things went south quickly. It just seems that the Ruby ecosystem has this natural hatred for corporate proxies, it’s come to bite me time and time again. This is something that really needs to be rectified if we want Ruby to penetrate further into the corporate world, but that’s a story for another post.

Fetching Feeds Using Feedzirra

The first thing I tried to do was use Feedzirra, which I found out about from Railscasts. At first sight everything seemed ok, it will respect the proxy settings you have configured in your environment (i.e. the _httpproxy environment variable) as it uses a fork of Curb (libcurl bindings for Ruby). However if you haven’t configured the proxy within your environment, everything falls over since there doesn’t seem to be a way to supply your proxy settings to Feedzirra directly. I would be happy to be corrected here! I wanted to fetch feeds from within a Rails app, and Rails doesn’t inherit your environment variables – no good (I am sure there are ways to make Rails do so, but like I said – don’t want dirty).

What was more frustrating was the fact that diagnosing this problem was a pain. Normally to fetch a feed with Feedzirra you have to do the following:

ruby feed = Feedzirra::Feed.fetch_and_parse(feed_url)

You can then iterate over the entries:

ruby feed.entries.each do |entry| end

Without the proxy settings though, I was getting 0 (zero) as the return value of the initial call. Not very intuitive! Of course I suspected that the proxy was at fault, but still. After reading the documentation a little more closely I came across the following bit:

ruby feed = Feedzirra::Feed.fetch_and_parse(feed_url, :on_success => lambda {|feed| puts feed.title }, :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })

This lets you define custom behaviour for successes and failures, this made everything simpler, since the failure block you supply gets invoked when the feed is not fetched successfully and the _responsecode parameter of the block contains the correct response code (407 proxy authentication required). Still, diagnosing the problem doesn’t provide me with a cure. I am sure there must be a way to patch something to allow Feedzirra to take proxy settings directly. I had a quick look at the code, but it seemed like I would need to patch multiple things in order to get it to work since Feedzirra hangs off several other libraries. Instead I decided to take a lower level look at my feed fetching (and parsing).

Fetching Feeds Using Open-uri

Apparently, there is a way to work with RSS feeds that is built into Ruby. After a quick glance this looked pretty much as simple as Feedzirra so I decided to give it a go. Guess what problem I ran into? Doesn’t play nice with proxy – what a freaking surprise! The Ruby RSS library relies on open-uri to actually fetch the feed which should pick up the proxy settings from your environment as well as give you the ability to pass the proxy configuration directly, but even a simple stand-alone script (with the environment configured correctly) refused to work for me. After searching around for a bit I found this little tidbit.

Apparently open-uri has a bit of a bug/issue whereby the proxy username and password don’t get picked up. Long story short, if you want to get it to work you need to patch the OpenURI module slightly. Fortunately this is easy to do with Ruby, just open up the module and go for your life. Like so:

```ruby module OpenURI def OpenURI.open_http(buf, target, proxy, options) # :nodoc: if proxy raise “Non-HTTP proxy URI: #{proxy}” if proxy.class != URI::HTTP end

if target.userinfo && “1.9.0” <= RUBY_VERSION # don’t raise for 1.8 because compatibility. raise ArgumentError, “userinfo not supported. [RFC3986]” end

require ‘net/http’ klass = Net::HTTP if URI::HTTP === target # HTTP or HTTPS if proxy klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password) end target_host = target.host target_port = target.port request_uri = target.request_uri else # FTP over HTTP proxy target_host = proxy.host target_port = proxy.port request_uri = target.to_s end

http = klass.new(target_host, target_port) if target.class == URI::HTTPS require ‘net/https’ http.use_ssl = true http.verify_mode = OpenSSL::SSL::VERIFY_PEER store = OpenSSL::X509::Store.new store.set_default_paths http.cert_store = store end

header = {} options.each {|k, v| header[k] = v if String === k }

resp = nil http.start { req = Net::HTTP::Get.new(request_uri, header) if options.include? :http_basic_authentication user, pass = options[:http_basic_authentication] req.basic_auth user, pass end http.request(req) {|response| resp = response if options[:content_length_proc] && Net::HTTPSuccess === resp if resp.key?(‘Content-Length’) options[:content_length_proc].call(resp[‘Content-Length’].to_i) else options[:content_length_proc].call(nil) end end resp.read_body {|str| buf << str if options[:progress_proc] && Net::HTTPSuccess === resp options[:progress_proc].call(buf.size) end } } } io = buf.io io.rewind io.status = [resp.code, resp.message] resp.each {|name,value| buf.io.meta_add_field name, value } case resp when Net::HTTPSuccess when Net::HTTPMovedPermanently, # 301 Net::HTTPFound, # 302 Net::HTTPSeeOther, # 303 Net::HTTPTemporaryRedirect # 307 throw :open_uri_redirect, URI.parse(resp[‘location’]) else raise OpenURI::HTTPError.new(io.status.join(’ ‘), io) end end end```

All of that is just a big copy/paste from the actual open-uri.rb source file, all you really need to worry about is one line. Completely off-topic, does anyone else feel like that method could use a bit of a refactor :). Back on topic, in the original open-uri, the line is:

ruby klass = Net::HTTP::Proxy(proxy.host, proxy.port)

we changed it to:

ruby klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)

After doing this the standalone script started fetching feeds without any trouble. All you need to do is the following (as per the link above):

ruby source = "feed_url" content = "" open(source) { |s| content = s.read } rss = RSS::Parser.parse(content, false)

Now we just want the whole thing to work in Rails.

Getting It To Work In Rails

As I mentioned above, we don’t want Rails to inherit our environment variables, so we need to be able to supply the proxy configuration directly. This is fairly easy to do when you’re fetching the contents of the feed url:

ruby open(source, :proxy => "proxy_url") { |s| content = s.read } rss = RSS::Parser.parse(content, false)

The actual proxy url will have to be in the correct format. Assuming you’re on Linux, something along the lines of:

http://username:password@proxy-host:proxy-port

The only thing left to take care of now, is to find a place for our open-uri monkey-patch and also get Rails to recognise the fact that we actually patched a module in the first place. This is precisely what the:

RAILS_ROOT/config/initializers

directory is for. You can put the open-uri monkey-patch here and Rails will auto-magically load it. Alternatively you can put the patch file under:

RAILS_ROOT/lib

But you will still need to create a file under the initializers directory and then require the monkey-patch file from there. Incidentally this file would also be a good place to put any of the other extraneous requires if you don’t want to have them floating around your Rails controllers or models.

This whole ‘fetching feeds from behind a proxy’ thing ended up being a lot more complicated than it had to be, or at least that’s how it felt. On the positive side, at least I learned a lot, so there is always a silver lining.

For more tips and opinions on software development, process and people subscribe to skorks.com today.

Image by Stewart Ho

Table of Contents

Fetching Feeds Using Feedzirra

Fetching Feeds Using Open-uri

Getting It To Work In Rails