The Official Google Reader Blog recently announced a new feature that will let users watch any page for updates. The way this works is that you add individual URLs to your Google Reader account. Just as with regular RSS feeds, when an update is detected, a new entry is added to that subscription.
My thinking is that this could be a really useful tool for archivists charged with preserving websites that change gradually over time, especially those fairly static sites that change infrequently with little or no notice of upcoming changes. If a web page were archived and then added to a dedicated Google Reader account, the archivist could scan their list of watched pages daily or weekly. Any change could then trigger the creation of a fresh snapshot of the site.
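For the do-it-yourself inclined, the watch-then-snapshot workflow described above can be sketched in a few lines of Python. This is a minimal sketch, not any service's actual mechanism: it assumes you are polling the page yourself rather than relying on Google Reader, and all of the function names, file names, and the choice of a SHA-256 hash for change detection are my own illustrative assumptions.

```python
# Minimal sketch: poll a page, detect changes by hashing its content,
# and save a timestamped snapshot whenever the content differs from
# the last copy we saw. Hypothetical names throughout.
import hashlib
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def page_changed(content: bytes, state_file: Path) -> bool:
    """Compare the page's hash to the last recorded one; remember the new hash."""
    digest = hashlib.sha256(content).hexdigest()
    previous = state_file.read_text() if state_file.exists() else None
    if digest == previous:
        return False  # nothing new since the last check
    state_file.write_text(digest)
    return True


def snapshot_if_changed(url: str, archive_dir: Path):
    """Fetch the page; if it changed, save a timestamped copy and return its path."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp:
        content = resp.read()
    if not page_changed(content, archive_dir / "last_hash.txt"):
        return None
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = archive_dir / f"snapshot-{stamp}.html"
    path.write_bytes(content)
    return path
```

Run on a schedule (say, a daily cron job), this gives you roughly what the Reader feature promises: a record of when a watched page changed, plus a local copy of each version. A real setup would want to ignore trivial churn such as ads or timestamps before hashing.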
I will admit that there have been services out there for a while that do something similar to what Google has just rolled out. I personally have used Dapper.net to take a standard web page and generate an RSS feed based on updates to the page (sound familiar?). One Dapper.net feed that I created and follow is for the news archive page for the International Red Cross and can be found here. What is funny is that now they actually have an official RSS feed for their news that includes exactly what my Dapper.net feed harvested off their news archive page – but when I built that Dapper feed there was no other way for me to watch for those news updates.
There are lots of different tools out there that aim to archive websites. Archive-It is a subscription-based service run by the Internet Archive that targets institutions and will archive sites on demand or on a regular schedule. The Internet Archive also has an open source crawler called Heritrix for those who are comfortable dealing with the code. Other institutions are building their own software to tackle this too. Harvard University has its own Web Archive Collection Service (WAX). The LiWA (Living Web Archives) Project is based in Germany and aims to “extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.” One could even use something as simple as PDFmyURL.com, an online service that turns any URL into a PDF (be sure to play with the advanced options to make sure you get a wide enough snapshot). I know there are many more possibilities; these just scratch the surface.
What I like about my idea is that it isn’t meant to replace these services but rather to work in tandem with them. The Internet Archive does an amazing job crawling and archiving many web pages, but it can’t archive everything, and its crawl frequency may not match up with real-world updates to a website. This approach certainly wouldn’t scale well for huge websites, where you would need to watch for changes on many pages. I am picturing this technique as being useful for small organizations or individuals who just need to make sure that a county government website makeover or a community organization’s website update doesn’t get lost in the shuffle. I like the idea of finding clever ways to leverage free services and tools to support those who want to protect a particular niche of websites from being lost.