SAA 2006: Research Libraries Group Roundtable – Internet Archiving

Late in the afternoon on Thursday, August 3rd, I attended the Research Libraries Group Roundtable at SAA 2006. It was an opportunity for RLG to share information with the archival community about their latest products and services. This session included presentations on the Internet Archive, Archive-It and the Web Archives Workbench.

After some brief business related to the SAA 2007 program committee and the rapid election of Brian Stevens from NYU Archives as the new chair of the group, Anne Van Camp spoke about the period of transition as RLG merges with OCLC. In the interest of blending cultures, she told a bar joke (as all OCLC meetings apparently begin). She explained that RLG products and services will be integrated into the OCLC product line. RLG programs will continue as RLG becomes the research arm for the joined interest areas of libraries, archives and museums. This has not existed before and they believe it will be a great chance to explore things in ways that RLG hasn’t had the opportunity to do in the past.

The initiatives on their agenda:

archival gateways: convened 2 meetings recently. The first to see if there is a way to be interactive with international archive databases and the second to bring regional archives together to see how they can work together.
web archiving: started looking at it from a service point of view, but also some community issues that have to be worked out around web archiving. Looking at big problems that will need community involvement – issues like metadata and selection.
standards: continuing to support EAD, pursuing rigorous agenda regarding EAC
OCLC has a whole group of people who work on registries (where you put information about organizations). RLG has talked about building a registry of US archives on top of ArchiveGrid.

In her introduction, Merrilee (frequent poster on hangingtogether.org) highlighted that there are lots of questions about the intellectual side of web archiving (vs the technical challenges) such as:

what to archive?
what metadata and description is appropriate for it?
what would end users of web archives need? How would they use a web archive?
what about collaborative collection development? It is expensive to archive the web – how does an institution say “I am archiving this corner of the web – this deep – this often”. This information should be publicly available for others doing research and others archiving the web.

She pointed out that RLG is happy about their work with Internet Archive – they are doing work to make the technical side easier but they understand that there is a lot for the archival community to sort out.

Next up was Kristine Hanna of the Internet Archive giving her presentation ‘Archiving and Preserving the Web’. The Internet Archive has been working with RLG this year and they need information from the users in the RLG community. They are looking into how they are going to work with OCLC and have applied for an NDIIPP grant.

The Internet Archive (IA), founded by Brewster Kahle in 1996, is built on open source principles and dedicated to Open Source software.

What do they collect in the archive? Over 2 billion pages a month in 21 languages. It is free and the largest archive on the web including 55 billion pages from 55 million sites and supporting 60,000 unique users per day.

Why try to collect it all? They don’t feel comfortable making the choices about appraisal. And at-risk websites and collections are disappearing all the time. The average lifespan of a web page is 100 days. They did a case study of crawling websites associated with the Nigerian election – 6 months after the election 70% of the crawled sites were gone, but they live on in the archive.

How do they collect? They use these components and tools:

Heritrix – web crawler
Wayback Machine – access tools for rendering and viewing files
Nutch – search engine
ARC file – archival file format used for preservation

How do they preserve it? They keep multiple copies at different digital repositories (CA, Alexandria (Egypt), France, Amsterdam) using over 1300 server machines.

IA also does targeted archiving for partners. Institutions that want to create specific online collections or curated domain crawls can work with IA. These archives start at 100+ million documents and are based on crawls run by IA crawl engineers. The Library of Congress has arranged for an assortment of targeted archives including archives of US National Elections 2000, September 11 and the War in Iraq (not accessible yet – marked March 2003 – Ongoing). Australia arranged for archiving of the entire .au domain. Also see Purpose, Pragmatism and Perspective – Preserving Australian Web Resources at the National Library of Australia by Paul Koerbin of the National Library of Australia and published in February of 2006.

What’s Next for Internet Archive?

collaboration and partnerships
OCA – open content alliance
Multiple copies around the world

Next, Dan Avery of IA gave a 9-minute version of his 35-minute presentation on Archive-It. Archive-It is a web-based annual subscription service provided by IA to permit the capture of up to 10 million pages. Kristine gave some examples of those using Archive-It during her presentation:

Indiana University – web sites
North Carolina State Archives – Government Agencies, Occupational Licensing Boards and commissions.
Library of Virginia – Jamestown 2007 commemoration and Governor Mark Warner’s last year in office. When Mark Warner was listed by the New York Times as a possible presidential candidate, this archive got lots of hits. (This brings up interesting questions of watching content that is being purposefully preserved to get an idea of what some expect for the future. Don’t be surprised by a post on this idea all by itself later. Need to think about it some more!)

He highlighted the different elements and techniques used in Archive-It: crawling, web user interface, storage, playback, text indexing and integration.

Crawling/Browsing:
- Heritrix:
  - open source Java
  - Archival-quality (they preserve exactly what they get back from the server)
  - Highly configurable
- Wayback Machine:
  - lets you surf the web as it was
  - in Archive-It – each customer has their own Wayback Machine
  - not open source yet; that is a work in progress
The user interface is a web application:
- collects all the info they need to do the crawling the customer requests
- schedule (monthly, daily, weekly, quarterly… etc)
- seed URLs (the starting point for archive web crawls)
- crawl parameters
NutchWAX
- extension of Nutch which is built on Lucene
- full text search plus link analysis
- can search by date instead of relevance – useful for individual archives
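As a rough illustration of the kind of information the web application gathers (a schedule, seed URLs and crawl parameters), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `CrawlRequest` class, its field names, the `max_documents` and `max_hops` parameters and the list of schedules are hypothetical and are not Archive-It's actual data model or interface.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical set of schedules, based on the frequencies mentioned in the talk.
SCHEDULES = {"daily", "weekly", "monthly", "quarterly"}


@dataclass
class CrawlRequest:
    """Illustrative record of what a customer might ask a crawler to do."""
    collection_id: str          # hypothetical unique collection identifier
    schedule: str               # how often the crawl should run
    seed_urls: list[str]        # starting points for the crawl
    max_documents: int = 10_000  # assumed crawl parameter, not a real default
    max_hops: int = 3            # assumed crawl parameter, not a real default

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if self.schedule not in SCHEDULES:
            problems.append(f"unknown schedule: {self.schedule}")
        if not self.seed_urls:
            problems.append("at least one seed URL is required")
        for url in self.seed_urls:
            parts = urlparse(url)
            # A usable seed needs an http(s) scheme and a host name.
            if parts.scheme not in ("http", "https") or not parts.netloc:
                problems.append(f"not a valid seed URL: {url}")
        return problems


request = CrawlRequest(
    collection_id="demo-collection",
    schedule="monthly",
    seed_urls=["http://example.org/"],
)
print(request.validate())
```

Checking the request before handing it to a crawler means a mistyped seed or schedule is caught when the collection is set up rather than after a crawl has run.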

While there are public collections in Archive-It, logging in gives you access to your own personal pages. These show the total documents archived (and more) and let you check your list of active collections and set up a new collection (which includes a unique collection identifier). He showed some screen shots of the interface and examples (this was the first time there wasn’t a network available for his presentation – he was amused that the paranoia that forced him to always bring screen captures finally paid off!).

It was interesting seeing this presentation back to back with the general Internet Archive overview. There is a lot of overlap in tools and approaches between them – but Archive-It definitely has its own unique requirements. It puts the tools for managing large-scale web crawling in the hands of archivists (or more likely information managers of some sort) – rather than the technical staff of IA.

The final presentation of the roundtable was by Judy Cobb – a Product Manager from OCLC. She gave an overview of the Web Archives Workbench. (I hunted for a good link to this – but the best I came up with was an acknowledgments document and the login page.) The inspiration for the creation of Workbench was the challenge of collecting from the web. The Internet is a big place. It is hard to define the scope of what to archive.

Workbench is a discovery tool that will permit its users to investigate what domains should be included when crawling a website for archiving. It will ask you which domains should be included. For example, you can tell it not to crawl Adobe.com just because there is a link to it to let people download Acrobat.
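The idea of deciding which domains are in scope can be sketched in a few lines of Python. This is only an illustration of domain-based scoping in general, not how Workbench itself works: the `in_scope` function, its parameters and the example.gov URLs are all hypothetical.

```python
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def in_scope(url, included_domains, excluded_domains=frozenset()):
    """Decide whether a discovered link should be crawled.

    A URL matches a domain if its host is that domain or a subdomain of it.
    Exclusions win over inclusions. (Hypothetical rule, for illustration.)
    """
    host = host_of(url)

    def matches(domain):
        return host == domain or host.endswith("." + domain)

    if any(matches(d) for d in excluded_domains):
        return False
    return any(matches(d) for d in included_domains)


# Links found while crawling a made-up agency site.
links = [
    "http://www.example.gov/reports/",
    "http://www.adobe.com/products/acrobat/",
    "http://example.gov/about",
]
for link in links:
    print(link, in_scope(link, {"example.gov"}, {"adobe.com"}))
```

Matching on the host name and its subdomains, rather than on a plain substring, keeps a look-alike host such as notexample.gov from slipping into scope.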

Workbench will let you set metadata for your collection based on the domains you said were in scope. It will then let you appraise and rank the entities/domains being harvested, leaving you with a list of organizations or entities in scope and ranked by importance. Next it will translate a site map of what is going to be crawled, define parts of the map as series and put the harvested content and related metadata into a repository. Other configuration options permit setting how frequently you harvest various series, choosing to only get new content and requesting notification if the sitemap changes.

Workbench is currently in beta and is still under development. The 3rd phase will add support for Richard Pearce-Moses’s Arizona Model for Web Preservation and Access. The focus of the Arizona Model is curation, not technology. It strives to find a solution somewhere between manual harvesting and bulk harvesting that is based on standard archival theories. Workbench will be open source and funded by LOC.

I wasn’t sure what to expect from the roundtable – but I was VERY glad that I attended. The group was very enthusiastic – cramming in everything they could manage to share with those in the room. The Internet Archive, Archive-It and the Web Archives Workbench represent the front of the pack of software tools intended to support archiving the web. It was easy to see that if the Workbench is integrated with Archive-It, it should permit archivists to start paying more attention to identifying what should be archived rather than figuring out how to do the actual archiving.