Feed Me
Our first frost should come any time now, and I want to have warning of it so we can rescue our tomatoes. Well, I have a link to the NWS Area Forecast Discussion in my bookmarks which I try to read every day, but some days I forget. What I need is a feed. But as far as I can tell only Honolulu is cool enough to get a feed for AFDs. Weird. So what I needed was to create a feed from an existing web page.
I thought I would find a website that offers a service like this, but I didn’t (in a few short minutes of searching). I found websites that got halfway there, but they were very complicated to set up and/or they didn’t display the body of the page, only a link. The whole point is that I want to read it in my feed reader!
So I whipped up this Ruby code (standard libs only, no gems required):
#! /usr/bin/ruby
require 'erb'
require 'open-uri'
require 'ostruct'
# configuration
channel = OpenStruct.new(:url => "http://www.srh.noaa.gov/data/EPZ/AFDEPZ",
                         :title => "EPZAFD",
                         :description => "National Weather Service Area Forecast Discussion, El Paso TX/Santa Teresa NM")
item = OpenStruct.new(:url => channel.url,
                      :title => "Area Forecast Discussion",
                      :date => Time.now)
# fetch the page
# URI.open is needed on current Rubies: a bare open(url) no longer fetches URLs as of Ruby 3.0
afd = URI.open(item.url)
item.date = afd.last_modified unless afd.last_modified.nil?
item.body = "<pre>" + afd.read + "</pre>"
# emit
include ERB::Util
template = ERB.new <<EOF
Content-Type: application/rss+xml

<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title><%= h channel.title %></title>
<link><%= h channel.url %></link>
<description><%= h channel.description %></description>
<lastBuildDate><%= h Time.now.rfc822 %></lastBuildDate>
<generator>feedme</generator>
<item>
<title><%= h item.title %></title>
<description><%= h item.body %></description>
<pubDate><%= h item.date.rfc822 %></pubDate>
<guid><%= h "#{channel.url}?date=#{item.date.iso8601}" %></guid>
</item>
</channel>
</rss>
EOF
puts template.result(binding)
Nothing overly fancy here. I use open-uri to fetch the page, pull the date out of the Last-Modified header (if it exists), and shoehorn the whole lot into an ERB template for the RSS.
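One wrinkle worth spelling out: the page text gets escaped once by h on its way into the description element, but the feed reader then treats the result as HTML, so a stray < or & in the forecast text could still trip it up. Here is a minimal sketch of escaping in two layers. The helper name pre_body and the sample text are made up for illustration, and this is a possible refinement, not what the script above does.

```ruby
require 'erb'

# Hypothetical helper: HTML-escape the raw page text, then wrap it in <pre>.
def pre_body(text)
  "<pre>" + ERB::Util.h(text) + "</pre>"
end

sample = "WINDS < 10 MPH & GUSTY"   # made-up forecast text
body = pre_body(sample)             # HTML layer: what the feed reader should render
xml  = ERB::Util.h(body)            # XML layer: what goes inside <description>

puts body
puts xml
```

In the script, that would mean building item.body with the inner escape and leaving the template alone, since the template already applies the outer one.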
In this case I just made it executable, slapped a Content-Type header (plus the blank line that has to follow it) before the output, and call it as a CGI script. You could just as well run a cron job to update a file on disk (in which case, remove the Content-Type header from the template).
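For the cron flavor, something along these lines might do. It is only a sketch under a few assumptions: the trimmed-down template, the write_feed name, and the file path are all hypothetical, and result_with_hash needs Ruby 2.5 or newer. The template has no Content-Type line because nothing is speaking HTTP here, and the file is written to a temporary name first so a reader never sees a half-written feed.

```ruby
require 'erb'

# Trimmed-down template for illustration: same idea as the full one, minus the CGI header.
FEED_TEMPLATE = ERB.new(<<~EOF)
  <?xml version="1.0"?>
  <rss version="2.0">
  <channel>
  <title><%= ERB::Util.h title %></title>
  </channel>
  </rss>
EOF

# Hypothetical helper: render the feed and swap it into place in one step.
def write_feed(path, title)
  xml = FEED_TEMPLATE.result_with_hash(title: title)
  tmp = "#{path}.tmp"
  File.write(tmp, xml)      # write the new feed next to the old one
  File.rename(tmp, path)    # then replace the old one
  path
end
```

A cron entry would then run the script on whatever schedule suits you, with something like write_feed("/some/web/dir/afd.xml", "EPZAFD") at the bottom (that path is a placeholder).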
Once I found the plain-text version of the AFD, it was just a matter of slapping it between <pre> tags, but if you had some actual screen scraping to do you might want to look at Hpricot, which makes that really easy. In particular, I could have used the URL http://www.crh.noaa.gov/product.php?site=NWS&issuedby=EPZ&product=AFD&format=txt&version=1&glossary=1 and done:
...
require 'hpricot'
...
item.body = (doc/"#content").to_html
which is in fact how I started out. But this page doesn’t have a Last-Modified header, which means the item date falls back to Time.now and my feed reader would always show it as a new item (every time the cron job updated, or every time I hit the CGI script, either way). Luckily I found the text-only URL that doesn’t have this problem.
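To see why the header matters, here is the guid logic pulled out on its own. This is a sketch: the item_guid name and the example.com URL are hypothetical, and the fallback to the current time mirrors what the script does when there is no Last-Modified.

```ruby
require 'time'

# Hypothetical helper: same guid format as the template above.
# With no Last-Modified value, fall back to the current time.
def item_guid(url, last_modified, now = Time.now)
  date = last_modified || now
  "#{url}?date=#{date.iso8601}"
end

modified = Time.utc(2008, 10, 20, 12, 0, 0)   # made-up Last-Modified value

# Same Last-Modified, same guid: the reader sees one item however often it polls.
puts item_guid("http://example.com/afd", modified)

# No Last-Modified: the guid tracks the clock, so each fetch looks like a new item.
puts item_guid("http://example.com/afd", nil, Time.utc(2008, 10, 20, 13, 0, 0))
puts item_guid("http://example.com/afd", nil, Time.utc(2008, 10, 20, 14, 0, 0))
```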