Nov 20 2008

Feed Me

Our first frost should come any time now, and I want to have warning of it so we can rescue our tomatoes. Well, I have a link to the NWS Area Forecast Discussion in my bookmarks which I try to read every day, but some days I forget. What I need is a feed. But as far as I can tell only Honolulu is cool enough to get a feed for AFDs. Weird. So what I needed was to create a feed from an existing web page.

I thought I would find a website that offers a service like this, but I didn’t (in a few short minutes of searching). The sites I did find only got halfway there: they were very complicated to set up and/or they didn’t display the body of the page, only a link. The whole point is that I want to read it in my feed reader!

So I whipped up this Ruby code (standard libs only, no gems required):

#! /usr/bin/ruby
require 'erb'
require 'open-uri'
require 'ostruct'
require 'time'   # Time#rfc822 and Time#iso8601

# configuration
channel = OpenStruct.new(:url => "http://www.srh.noaa.gov/data/EPZ/AFDEPZ",
                         :title => "EPZAFD",
                         :description => "National Weather Service Area Forecast Discussion, El Paso TX/Santa Teresa NM")
item = OpenStruct.new(:url => channel.url,
                      :title => "Area Forecast Discussion",
                      :date => Time.now)

# fetch the page (URI.open: plain open() stopped handling URLs in Ruby 3.0)
afd = URI.open(item.url)
item.date = afd.last_modified unless afd.last_modified.nil?
item.body = "<pre>" + afd.read + "</pre>"

# emit
include ERB::Util
template = ERB.new <<EOF
Content-Type: application/rss+xml

<?xml version="1.0"?>
<rss version="2.0">
   <channel>
      <title><%= h channel.title %></title>
      <link><%= h channel.url %></link>
      <description><%= h channel.description %></description>
      <lastBuildDate><%= h Time.now.rfc822 %></lastBuildDate>
      <generator>feedme</generator>
      <item>
         <title><%= h item.title %></title>
         <description><%= h item.body %></description>
         <pubDate><%= h item.date.rfc822 %></pubDate>
         <guid><%= h "#{channel.url}?date=#{item.date.iso8601}" %></guid>
      </item>
   </channel>
</rss>
EOF
puts template.result(binding)

Nothing overly fancy here. I use open-uri to fetch the page, pull out the Last-Modified header (if it exists), and shoehorn both into an ERB template for the RSS.
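
One detail that is easy to miss: the item body is HTML, but it goes through h on its way into <description>, so the feed carries it as escaped text and the feed reader turns it back into markup. A quick illustration, using a made-up forecast string rather than a real AFD:

```ruby
require 'erb'

# Made-up sample text, standing in for the fetched AFD.
body = "<pre>LOWS 28-32 & FROST LIKELY</pre>"

# The same escaping that the template's h calls perform.
escaped = ERB::Util.html_escape(body)
puts escaped
# prints: &lt;pre&gt;LOWS 28-32 &amp; FROST LIKELY&lt;/pre&gt;
```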

In this case I just made it executable, slapped a Content-Type header before the output, and called it as a CGI. You could just as well run a cron job to update a file on disk (in which case remove the Content-Type header from the template).
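
A rough sketch of that cron variant might look like the following. It is not what I actually run: the write_feed helper, the trimmed-down template, and the temp-file-then-rename step are all assumptions made for illustration. The idea is simply to render the same markup without the header and write it somewhere the web server can serve it.

```ruby
require 'erb'
require 'ostruct'
require 'time'

# Hypothetical helper: render the feed (no Content-Type header) and write it
# to +path+. Writing to a temp file and renaming it means a feed reader
# should never fetch a half-written file.
def write_feed(path, channel, item)
  template = ERB.new(<<~EOF)
    <?xml version="1.0"?>
    <rss version="2.0">
       <channel>
          <title><%= ERB::Util.h channel.title %></title>
          <link><%= ERB::Util.h channel.url %></link>
          <description><%= ERB::Util.h channel.description %></description>
          <item>
             <title><%= ERB::Util.h item.title %></title>
             <description><%= ERB::Util.h item.body %></description>
             <pubDate><%= ERB::Util.h item.date.rfc822 %></pubDate>
          </item>
       </channel>
    </rss>
  EOF
  tmp = "#{path}.tmp"
  File.write(tmp, template.result(binding))
  File.rename(tmp, path)
end
```

You would build channel and item exactly as in the CGI script, then call something like write_feed("afd.xml", channel, item) from the cron job, with the path pointing wherever your web server looks for static files.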

Once I found the pure text version of the AFD, it was just a matter of slapping it between <pre> tags, but if you had some actual screen scraping to do you might want to look at Hpricot, which makes that really easy. In particular, I could have used the URL http://www.crh.noaa.gov/product.php?site=NWS&issuedby=EPZ&product=AFD&format=txt&version=1&glossary=1 and done

...
require 'hpricot'
...
item.body = (doc/"#content").to_html

which is in fact how I started out. But this page doesn’t have a Last-Modified header, so item.date falls back to Time.now and the guid changes on every run. That means my feed reader would always show it as a new item (every time the cron job updated, or every time I hit the CGI script, either way). Luckily I found the text-only URL that doesn’t have this problem.
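
If you are stuck with a page that has no Last-Modified header, one workaround (not something I needed, so treat it as an untested idea) is to build the guid from a digest of the page text instead of the date. The item_guid helper below is hypothetical; it assumes your feed reader uses the guid to decide what counts as new, and that the page doesn’t embed anything that changes on every request.

```ruby
require 'digest'
require 'time'

# Hypothetical helper: a guid that only changes when the item really changes.
# With a Last-Modified time it matches the script's ?date= scheme; without
# one it falls back to a SHA1 digest of the page text.
def item_guid(url, last_modified, body)
  if last_modified
    "#{url}?date=#{last_modified.iso8601}"
  else
    "#{url}?sha1=#{Digest::SHA1.hexdigest(body)}"
  end
end
```

In the script you would call it as item_guid(channel.url, afd.last_modified, item.body) and put the result in the <guid> element.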