# Fetching RSS Feeds With Ruby From Behind A Proxy

I was trying to fetch some RSS feeds with Ruby the other day. I just needed something quick (but not dirty, I don’t like dirty) to validate a couple of things I was trying out with Rails. Fetching feeds is not difficult and there is plenty of information around, but I was behind a proxy at the time, and that is where things went south quickly. The Ruby ecosystem seems to have a natural hatred for corporate proxies; it has come back to bite me time and time again. This really needs to be fixed if we want Ruby to penetrate further into the corporate world, but that’s a story for another post.

## Fetching Feeds Using Feedzirra

The first thing I tried was Feedzirra, which I found out about from Railscasts. At first sight everything seemed OK: it respects the proxy settings configured in your environment (i.e. the http_proxy environment variable), since it uses a fork of Curb (libcurl bindings for Ruby). However, if you haven’t configured the proxy in your environment, everything falls over, since there doesn’t seem to be a way to supply your proxy settings to Feedzirra directly. I would be happy to be corrected here! I wanted to fetch feeds from within a Rails app, and Rails doesn’t inherit your environment variables, so that was no good (I am sure there are ways to make Rails do so, but like I said, I don’t want dirty).
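For what it’s worth, one workaround that is not too dirty might be to set the variable from inside the app, only for the duration of the fetch. The sketch below assumes that Curb (via libcurl) reads `http_proxy` from the process environment at request time; the `with_http_proxy` helper and the proxy URL are hypothetical names made up for illustration, not part of Feedzirra.

```ruby
# Hypothetical helper: temporarily sets http_proxy for the current process,
# runs the block, then restores whatever value was there before.
def with_http_proxy(proxy_url)
  previous = ENV['http_proxy']
  ENV['http_proxy'] = proxy_url
  yield
ensure
  # Assigning nil removes the variable again if it was not set before.
  ENV['http_proxy'] = previous
end

# Usage. The Feedzirra call is commented out so that the sketch runs without
# the gem or a network connection.
with_http_proxy('http://username:password@proxy.example.com:8080') do
  # feed = Feedzirra::Feed.fetch_and_parse(feed_url)
  puts ENV['http_proxy']
end
```

The `ensure` clause matters here: even if the fetch raises, the rest of the process does not keep running with a proxy setting it never asked for.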

What was more frustrating was the fact that diagnosing this problem was a pain. Normally to fetch a feed with Feedzirra you have to do the following:

feed = Feedzirra::Feed.fetch_and_parse(feed_url)

You can then iterate over the entries:

feed.entries.each do |entry|
end

Without the proxy settings though, I was getting 0 (zero) as the return value of the initial call. Not very intuitive! Of course I suspected that the proxy was at fault, but still. After reading the documentation a little more closely I came across the following bit:

feed = Feedzirra::Feed.fetch_and_parse(feed_url,
:on_success => lambda {|feed| puts feed.title },
:on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })

This lets you define custom behaviour for successes and failures, which made everything simpler: the failure block you supply is invoked when the feed cannot be fetched, and its response_code parameter contains the actual response code (407, proxy authentication required). Still, diagnosing the problem doesn’t cure it. I am sure there must be a way to patch something so that Feedzirra takes proxy settings directly. I had a quick look at the code, but it seemed I would need to patch multiple things to get it working, since Feedzirra hangs off several other libraries. Instead I decided to take a lower-level look at my feed fetching (and parsing).
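A slightly friendlier failure handler could translate the response code into a hint rather than dumping the response body. This is only a sketch: `describe_fetch_failure` is a hypothetical helper, the list of codes is illustrative, and the lambda simply reuses the parameter list from the `:on_failure` example above.

```ruby
# Hypothetical diagnostic helper: turns a response code into a readable hint.
def describe_fetch_failure(url, response_code)
  hint = case response_code
         when 407      then 'proxy authentication required'
         when 404      then 'feed not found'
         when 500..599 then 'server error'
         else               'unexpected response'
         end
  "Could not fetch #{url}: #{response_code} (#{hint})"
end

# Same four parameters as the :on_failure lambda shown earlier; only the
# first two are used here.
on_failure = lambda do |url, response_code, response_header, response_body|
  warn describe_fetch_failure(url, response_code)
end

# Commented out so the sketch runs without the gem or a network connection:
# feed = Feedzirra::Feed.fetch_and_parse(feed_url, :on_failure => on_failure)
```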

## Fetching Feeds Using Open-uri

Apparently, there is a way to work with RSS feeds that is built into Ruby. After a quick glance this looked pretty much as simple as Feedzirra, so I decided to give it a go. Guess what problem I ran into? It doesn’t play nice with the proxy. What a freaking surprise! The Ruby RSS library relies on open-uri to actually fetch the feed, which should pick up the proxy settings from your environment as well as let you pass the proxy configuration directly, but even a simple stand-alone script (with the environment configured correctly) refused to work for me. After searching around for a bit I found this little tidbit.

Apparently open-uri has a bit of a bug/issue whereby the proxy username and password don’t get picked up. Long story short, if you want to get it to work you need to patch the OpenURI module slightly. Fortunately this is easy to do with Ruby, just open up the module and go for your life. Like so:

module OpenURI
  def OpenURI.open_http(buf, target, proxy, options) # :nodoc:
    if proxy
      raise "Non-HTTP proxy URI: #{proxy}" if proxy.class != URI::HTTP
    end

    if target.userinfo && "1.9.0" <= RUBY_VERSION
      # don't raise for 1.8 because compatibility.
      raise ArgumentError, "userinfo not supported.  [RFC3986]"
    end

    require 'net/http'
    klass = Net::HTTP
    if URI::HTTP === target
      # HTTP or HTTPS
      if proxy
        # Patched line: also pass the proxy username and password.
        klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)
      end
      target_host = target.host
      target_port = target.port
      request_uri = target.request_uri
    else
      # FTP over HTTP proxy
      target_host = proxy.host
      target_port = proxy.port
      request_uri = target.to_s
    end

    http = klass.new(target_host, target_port)
    if target.class == URI::HTTPS
      require 'net/https'
      http.use_ssl = true
      http.verify_mode = OpenSSL::SSL::VERIFY_PEER
      store = OpenSSL::X509::Store.new
      store.set_default_paths
      http.cert_store = store
    end

    header = {}
    options.each {|k, v| header[k] = v if String === k }

    resp = nil
    http.start {
      req = Net::HTTP::Get.new(request_uri, header)
      if options.include? :http_basic_authentication
        user, pass = options[:http_basic_authentication]
        req.basic_auth user, pass
      end
      http.request(req) {|response|
        resp = response
        if options[:content_length_proc] && Net::HTTPSuccess === resp
          if resp.key?('Content-Length')
            options[:content_length_proc].call(resp['Content-Length'].to_i)
          else
            options[:content_length_proc].call(nil)
          end
        end
        resp.read_body {|str|
          buf << str
          if options[:progress_proc] && Net::HTTPSuccess === resp
            options[:progress_proc].call(buf.size)
          end
        }
      }
    }
    io = buf.io
    io.rewind
    io.status = [resp.code, resp.message]
    resp.each {|name,value| buf.io.meta_add_field name, value }
    case resp
    when Net::HTTPSuccess
    when Net::HTTPMovedPermanently, # 301
         Net::HTTPFound, # 302
         Net::HTTPSeeOther, # 303
         Net::HTTPTemporaryRedirect # 307
      throw :open_uri_redirect, URI.parse(resp['location'])
    else
      raise OpenURI::HTTPError.new(io.status.join(' '), io)
    end
  end
end

All of that is just a big copy/paste from the actual open-uri.rb source file; all you really need to worry about is one line. Completely off-topic, does anyone else feel like that method could use a bit of a refactor :). Back on topic: in the original open-uri, the line is:

klass = Net::HTTP::Proxy(proxy.host, proxy.port)

We changed it to:

klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)
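To see what those two extra arguments carry, here is a small standalone sketch. The proxy URL is a placeholder, not a real host; everything else is plain `URI` and `Net::HTTP` from the standard library.

```ruby
require 'uri'
require 'net/http'

# Placeholder proxy URL; substitute your own host, port and credentials.
proxy = URI.parse('http://username:password@proxy.example.com:8080')

puts proxy.host     # proxy.example.com
puts proxy.port     # 8080
puts proxy.user     # username
puts proxy.password # password

# The four-argument form used in the patched line. With only host and port,
# the credentials embedded in the URL never reach Net::HTTP.
klass = Net::HTTP::Proxy(proxy.host, proxy.port, proxy.user, proxy.password)
puts klass.proxy_class?
```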

After doing this the standalone script started fetching feeds without any trouble. All you need to do is the following (as per the link above):

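require 'open-uri'
require 'rss'

source = "feed_url"
content = ""
open(source) { |s| content = s.read }
# The second argument (false) turns off strict validation of the feed.
rss = RSS::Parser.parse(content, false)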

Now we just want the whole thing to work in Rails.

## Getting It To Work In Rails

As I mentioned above, we don't want Rails to inherit our environment variables, so we need to be able to supply the proxy configuration directly. This is fairly easy to do when you're fetching the contents of the feed url:

open(source, :proxy => "proxy_url") { |s| content = s.read }
rss = RSS::Parser.parse(content, false)

The actual proxy url will have to be in the correct format. Assuming you're on Linux, something along the lines of:

http://username:password@proxy-host:proxy-port
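Because the patched method raises for anything that is not a plain `URI::HTTP` proxy, it can help to check the configured value before using it. A minimal sketch, where `valid_proxy_url?` is a hypothetical helper and the URLs are placeholders:

```ruby
require 'uri'

# Hypothetical helper: true only for a well-formed http:// proxy URL.
# URI::HTTPS is a subclass of URI::HTTP, so instance_of? rejects it, in the
# same way as the `proxy.class != URI::HTTP` check in the patch.
def valid_proxy_url?(proxy_url)
  uri = URI.parse(proxy_url)
  uri.instance_of?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

puts valid_proxy_url?('http://username:password@proxy.example.com:8080')
puts valid_proxy_url?('ftp://proxy.example.com:21')
```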

The only thing left to take care of now is to find a place for our open-uri monkey-patch and to get Rails to recognise that we patched a module in the first place. This is precisely what the:

RAILS_ROOT/config/initializers

directory is for. You can put the open-uri monkey-patch here and Rails will auto-magically load it. Alternatively you can put the patch file under:

RAILS_ROOT/lib

But you will still need to create a file under the initializers directory and then require the monkey-patch file from there. Incidentally this file would also be a good place to put any of the other extraneous requires if you don't want to have them floating around your Rails controllers or models.
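As a rough sketch, such an initializer might look like the following. Both file names are hypothetical, and the fallback to the current directory is only there so the snippet also runs outside a Rails app.

```ruby
# config/initializers/open_uri_proxy.rb (hypothetical file name)
require 'open-uri'

# RAILS_ROOT is defined by Rails; fall back to the current directory so this
# sketch can also be run on its own.
app_root = defined?(RAILS_ROOT) ? RAILS_ROOT : Dir.pwd

# Hypothetical location of the file holding the monkey-patch.
patch_file = File.join(app_root, 'lib', 'open_uri_proxy_patch.rb')

# Load the patch after open-uri itself, so that our definition of
# OpenURI.open_http replaces the stock one.
require patch_file if File.exist?(patch_file)
```

The order is the point: if the patch were loaded before open-uri, the library's own definition would overwrite it.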

This whole 'fetching feeds from behind a proxy' thing ended up being a lot more complicated than it had to be, or at least that's how it felt. On the positive side, at least I learned a lot, so there is always a silver lining.



• Korny

The open-uri class in Ruby 1.9.1 fixes the above stuff – but you need to set an option :proxy_http_basic_authentication instead of just :proxy if you want to specify a username and password.

As an example, to use the http_proxy environment variable as it is commonly used, rather than in perfect-RFC-land where the open-uri authors seem to live:
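def fetch_stuff(url)
  open_uri_opts = {}
  if ENV['http_proxy']
    uri = URI.parse(ENV['http_proxy'])
    if uri.user
      # Credentials in the proxy URL: hand them to open-uri separately.
      open_uri_opts[:proxy_http_basic_authentication] = [uri, uri.user, uri.password]
    else
      open_uri_opts[:proxy] = uri
    end
  end
  result = nil
  open(url, open_uri_opts) do |s|
    result = s.read
  end
  result
end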

• http://www.skorks.com Alan Skorkin

I guess that’s better than nothing although not really ideal either is it :).

• Tom Meier
• http://www.skorks.com Alan Skorkin

And this is why I love open source :)

• mosa

That worked for me !!! Thanks a lot. I just changed line 216 in /usr/lib/ruby/1.8/open-uri.rb and wooooo, it worked.