internet archiving | Spellbound Blog

Book Review: Digital Preservation

January 18, 2007

In my quest for information about archiving geospatial data last term, I got my hands on a copy of Marilyn Deegan and Simon Tanner’s Digital Preservation (part of the Digital Futures Series). This excellent volume consists of nine chapters each written by different authors who are leaders in their respective fields (shown in order of their respective chapters):

David Holdsworth: known for his work on the CEDARS and CAMiLEON projects
Robin Wendler: metadata analyst at the Harvard University Library Office for Information Studies
Julien Masanès: co-founder of the European Archive
Elisa Mason: maintains the Forced Migration Current Awareness Blog
Brian F. Lavoie: a research scientist at OCLC
Stephen Chapman: Preservation Librarian for Digital Initiatives in the Weissman Preservation Center, Harvard University Library
Peter McKinney: research officer for the espida project at the University of Glasgow
Jasmine Kelly: a former research assistant at the Centre for Computing in the Humanities, King’s College London

This fabulous band of writers and researchers was led by Marilyn Deegan and Simon Tanner, both based at King’s College London.

Published in 2006, this is one of the most comprehensive and up-to-date books I found on the subject. The book starts out with two chapters addressing the basic issues related to digital preservation. Subsequent chapters present information about all kinds of metadata, web archiving, the costs of digital preservation and an overview of European approaches. The final chapter presents an extensive series of case studies – complete with URLs to give you plenty of information online to explore.

This book gave me a great foundation from which to explore the details of various geospatial data and GIS archiving efforts. For those faced with the challenge of planning for digital preservation, the two chapters on costs should be very useful. So many articles talk about how it will be so expensive to ensure proper digital preservation, but don’t give people in the field any practical advice in planning for the costs – this book is different. The exploration of existing approaches being used at major institutions throughout Europe gives a good sense of evolving standards and best practices.

If you are looking for a way to get a handle on the issues involved in digital preservation – this is a great starting point. The final chapter on case studies alone could keep you busy for a month as you explore all the websites of projects from around the world. While the book has a decidedly European focus, the concepts are applicable the world over. If you are responsible for ensuring that digital records (either digitized or born digital) are protected and preserved – this book explains the basics and explores various strategies. The authors don’t oversimplify things – but take the time to explain things well. They are honest about those questions that aren’t answered yet… and they point to as many resources, standards and examples as they can. While Digital Preservation cannot provide a formula for everyone to follow, it can help you start asking the right questions and begin to understand the possibilities.

Interesting Interface for exploring E-mail: TrampolineSystem SONAR Platform

October 26, 2006 1 Comment

Conversations and articles about the problem of archiving and accessing e-mail are often accompanied by the wringing of hands or the shrugging of shoulders. It has often seemed to me that figuring out how to archive and facilitate access to e-mail is a challenge that most people would rather ignore because it seems so difficult (and because there are plenty of other things that need work too).

“In October 2003 the US Federal Energy Regulatory Commission placed 200,000 of Enron’s internal emails from 1999-2002 into the public domain as part of its ongoing investigations.” So says TrampolineSystems on their fascinating website that lets you explore those 200,000 public domain e-mails using their SONAR platform (that stands for Social Networks and Relevance). I would highly recommend taking a look and browsing around the Enron e-mails.

It appears that SONAR somehow tags the emails without human intervention – though they do not state this specifically one way or the other. The implication from the SONAR PR page is that you plug in the platform – and you instantly have this new access to your information. It is my impression that this works for either a fixed collection of e-mails (as is the case with the Enron emails) – or for an active live e-mail collection that is changing over time.

I like the social network Visualizer and the way it shows you how people are related to one another as represented by their e-mail correspondence. I like the theme and people tag clouds. I like the ease with which I can search for and read emails. I like how clearly they specify what you searched on at the top of your e-mail result list – and how many e-mails, people and themes the list represents.

On the other hand, there are a number of things I wish I could do. I wish that it was clear to me what order the emails are listed in when I do a search on a term. I searched on the word ‘pager’ – and received 2012 emails in no obvious order (most likely relevance – but that is not at all clear). I would like to be able to re-sort the results (by date for example). I would like to be able to add together multiple tags and people to get a scoped list of emails between two people on a specific set of themes.
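To make that wish list concrete, here is a minimal sketch of the kind of scoped, date-sorted query I have in mind. It is not SONAR's API: the `Email` record, its field names and the `scoped_emails` function are all hypothetical, and it assumes each e-mail already carries a set of theme tags.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Email:
    # Hypothetical record; the field names are assumptions, not SONAR's data model.
    sender: str
    recipients: list[str]
    sent: date
    themes: set[str] = field(default_factory=set)
    subject: str = ""


def scoped_emails(emails, person_a, person_b, themes):
    """Return e-mails exchanged between two people that carry every requested theme, oldest first."""
    wanted = set(themes)

    def between(e):
        # True when one of the two people wrote to the other, in either direction.
        return (e.sender == person_a and person_b in e.recipients) or (
            e.sender == person_b and person_a in e.recipients
        )

    matches = [e for e in emails if between(e) and wanted <= e.themes]
    return sorted(matches, key=lambda e: e.sent)
```

Sorting by the `sent` date is only one choice; swapping the sort key would give the re-sorting by other fields that I was wishing for.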

Just as in traditional archival collections – there is some non-unique information in the mix. I found a generic Hotwire promotional email while looking at the theme The Insider (4th hit on the list). While I suppose spam and legitimate e-mail ads (ie, ones you asked for) are interesting – perhaps software considering e-mail to retain permanently could block some of these somehow.

I like clicking on things in the Visualizer and seeing the social networks hidden within the e-mails – but that gets old quickly unless you are looking for something very specific. I found myself wanting more context. Who are these people? What are their jobs? How are they ‘officially’ related in the corporate hierarchy? How do these e-mails compare with a timeline of events? What about the content of attachments (they don’t seem to be part of this interface)? All of this information could be linked into this interface in such a way as to improve an outsider’s understanding of this amazing landscape of 200,000 e-mails.

All in all I think it is an excellent starting point and I applaud them for trying to find an answer to the email question rather than just ignoring the problem.

(Thanks to Boing Boing for the pointer to this site.)

The Yahoo! Time Capsule

October 13, 2006 4 Comments

Yahoo! is creating a time capsule. The first paragraph of the Yahoo! Time Capsule Overview concludes by claiming “This is the first time that digital data will be gathered and preserved for historical purposes”. Excuse me? What has the Internet Archive been doing since 1996? What are the Hurricane Digital Memory Bank and The September 11 Digital Archive doing? And that is just off the top of my head – the list could go on and on.

I think that what they are doing (collecting digital content from around the world for 30 days, then giving the timecapsule to the Smithsonian Folkways Recordings in Washington, DC) is great. I am not sure what the bit about being “beamed along a path of laser light into space” is all about – but it sounds sort of cool. To add an entry, it must be put under one of 10 themes: Love, Anger, Fun, Sorrow, Faith, Beauty, Past, Now, Hope or You. It seems like an interesting attempt at organizing what could otherwise be just an endless stream of images. At the time of this post, they had 15,564 contributions over the course of the first 3 days. I even explored some of what they have – it is pretty. It reminded me a bit of the America 24/7 project from a few years back – though with more types of media and an aim to record a snapshot of the world, not just America.

They have another ridiculous claim on the main time capsule page: “This first-ever collection of electronic anthropology captures the voices, images and stories of the online global community.”

Go ahead and make a fabulous digital archive of contributions from around the world Yahoo!, but please stop claiming that you invented the idea. I can’t be the only person who is frustrated by the way they are presenting this. Please tell me I am not alone!

My New Daydream: A Hosting Service for Digitized Collections

September 20, 2006 3 Comments

In her post Predictions over on hangingtogether.org, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.

And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist’s perspective (also having just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.

The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.

Being the optimist I am I of course imagine this service as providing easy to use software with the flexibility for building custom DTDs for metadata and security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web based solution that supports everything needed by the archival community. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.

I think the biggest difference between this idea and the wikipedias of the world is that there would be some step required for an institution to ‘join’ such that they could use this service. The service wouldn’t control the content (in fact would need to be super careful about security and the like considering all the issues related to privacy and copyright) – rather it would provide the tools to support the work of others. While I know that some institutions would not be willing to let ‘control’ of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.

There would still be a place for the Archons and the Archivists’ Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists’ interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collection and provide access easily without being forced to invent a new wheel along the way.

If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.

Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-IT.org. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”

So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”

SAA 2006: Research Library Group Roundtable – Internet Archiving

August 29, 2006

Late in the afternoon on Thursday August 3rd I attended the Research Library Group Roundtable at SAA 2006. It was an opportunity for RLG to share information with the archival community about their latest products and services. This session included presentations on the Internet Archive, Archive-It and the Web Archives Workbench.

After some brief business related to the SAA 2007 program committee and the rapid election of Brian Stevens from NYU Archives as the new chair of the group, Anne Van Camp spoke about the period of transition as RLG merges with OCLC. In the interest of the blending of cultures – she told a bar joke (as all OCLC meetings apparently begin). She explained that RLG products and services will be integrated into the OCLC product line. RLG programs will continue as RLG becomes the research arm for the joined interest areas of libraries, archives and museums. This has not existed before and they believe it will be a great chance to explore things in ways that RLG hasn’t had the opportunity to do in the past.

The initiatives on their agenda:

archival gateways: convened 2 meetings recently. The first to see if there is a way to be interactive with international archive databases and the second to bring regional archives together to see how they can work together.
web archiving: started looking at it from a service point of view, but also some community issues that have to be worked out around web archiving. Looking at big problems that will need community involvement – issues like metadata and selection.
standards: continuing to support EAD, pursuing rigorous agenda regarding EAC
OCLC has a whole group of people who work on registries (where you put information about organizations). RLG has talked about building a registry on top of Archive Grid of US archives.

In her introduction, Merrilee (frequent poster on hangingtogether.org) highlighted that there are lots of questions about the intellectual side of web archiving (vs the technical challenges) such as:

what to archive?
what metadata and description is appropriate for it?
what would end users of web archives need? How would they use a web archive?
what about collaborative collection development? It is expensive to archive the web – how does an institution say “I am archiving this corner of the web – this deep – this often”. This information should be publicly available for others doing research and others archiving the web.

She pointed out that RLG is happy about their work with Internet Archive – they are doing work to make the technical side easier but they understand that there is a lot for the archival community to sort out.

Next up was Kristine Hanna of the Internet Archive giving her presentation ‘Archiving and Preserving the Web’. The Internet Archive has been working with RLG this year and they need information from the users in the RLG community. They are looking into how they are going to work with OCLC and have applied for an NDIIPP grant.

The Internet Archive (IA), founded by Brewster Kahle in 1996, is built on open source principles and dedicated to Open Source software.

What do they collect in the archive? Over 2 billion pages a month in 21 languages. It is free and the largest archive on the web including 55 billion pages from 55 million sites and supporting 60,000 unique users per day.

Why try to collect it all? They don’t feel comfortable making the choices about appraisal. And at risk websites and collections are disappearing all the time. The average lifespan of a web page is 100 days. They did a case study of crawling websites associated with the Nigerian election – 6 months after the election 70% of the crawled sites were gone, but they live on in the archive.

How do they collect? They use these components and tools:

Heritrix – web crawler
Wayback Machine – access tools for rendering and viewing files
Nutch – search engine
Arc File – archival file format used for preservation

How do they preserve it? They keep multiple copies at different digital repositories (CA, Alexandria (Egypt), France, Amsterdam) using over 1300 server machines.

IA also does targeted archiving for partners. Institutions that want to create specific online collections or curated domain crawls can work with IA. These archives start at 100+ million documents and are based on crawls run by IA crawl engineers. The Library of Congress has arranged for an assortment of targeted archives including archives of US National Elections 2000, September 11 and the War in Iraq (not accessible yet – marked March 2003 – Ongoing). Australia arranged for archiving of the entire .au domain. Also see Purpose, Pragmatism and Perspective – Preserving Australian Web Resources at the National Library of Australia by Paul Koerbin of the National Library of Australia and published in February of 2006.

What’s Next for Internet Archive?

collaboration and partnerships
OCA – open content alliance
Multiple copies around the world

Next, Dan Avery of IA gave a 9 minute version of his 35 minute presentation on Archive-It. Archive-It is a web based annual subscription service provided by IA to permit the capture of up to 10 million pages. Kristine gave some examples of those using Archive-It during her presentation:

Indiana University – web sites
North Carolina State Archives – Government Agencies, Occupational Licensing Boards and commissions.
Library of Virginia – Jamestown 2007 commemoration and Governor Mark Warner’s last year in office. When Mark Warner was listed by the New York Times as a possible presidential candidate, this archive got lots of hits. (This brings up interesting questions of watching content that is being purposefully preserved to get an idea of what some expect for the future. Don’t be surprised by a post on this idea all by itself later. Need to think about it some more!)

He highlighted the different elements and techniques used in Archive-It: crawling, web user interface, storage, playback, text indexing and integration.

Crawling/Browsing:
- Heritrix :
  - open source java
  - Archival-quality (they preserve exactly what they get back from the server)
  - Highly configurable
- Wayback Machine :
  - lets you surf the web as it was
  - in Archive-It – each customer has their own wayback machine
  - not open source yet.. that is a work in progress
The user interface is a web application:
- collects all the info they need to do the crawling the customer requests
- schedule (monthly, daily, weekly, quarterly… etc)
- seed URLs (the starting point for archive web crawls)
- crawl parameters
NutchWAX
- extension of Nutch which is built on Lucene
- full text search plus link analysis
- can search by date instead of relevance – useful for individual archives
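The crawl request that the web application collects (a schedule, seed URLs and crawl parameters) can be pictured as a small record. The following is only a sketch under my own assumptions: `CrawlRequest`, its field names and the set of allowed schedules are hypothetical and are not taken from Archive-It itself.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit

# Assumed set of frequencies, based on the ones mentioned in the talk.
ALLOWED_SCHEDULES = {"daily", "weekly", "monthly", "quarterly"}


@dataclass
class CrawlRequest:
    # Hypothetical structure for illustration only.
    collection_id: str
    schedule: str
    seed_urls: list[str]
    parameters: dict[str, object] = field(default_factory=dict)

    def problems(self):
        """Return a list of readable problems; an empty list means the request looks usable."""
        found = []
        if self.schedule not in ALLOWED_SCHEDULES:
            found.append(f"unknown schedule: {self.schedule}")
        if not self.seed_urls:
            found.append("at least one seed URL is required")
        for url in self.seed_urls:
            parts = urlsplit(url)
            # A seed must be an absolute web address so the crawler knows where to start.
            if parts.scheme not in ("http", "https") or not parts.netloc:
                found.append(f"not an absolute http(s) URL: {url}")
        return found
```

For example, `CrawlRequest("c1", "weekly", ["http://example.org/"]).problems()` returns an empty list, while a request with no seeds reports what is missing.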

While there are public collections in Archive-It, logging in gives you access to personal sites: shows the total documents archived (and more), lets you check your list of active collections and set up a new collection (includes unique collection identifier). He showed some screen shots of the interface and examples (this was the first time there wasn’t a network available for his presentation – he was amused that his paranoia that forced him to always bring screen captures finally paid off!).

It was interesting seeing this presentation back to back with the general Internet Archive overview. There is a lot of overlap in tools and approaches between them – but Archive-It definitely has its own unique requirements. It puts the tools for managing large-scale web crawling in the hands of archivists (or more likely information managers of some sort) – rather than the technical staff of IA.

The final presentation of the roundtable was by Judy Cobb – a Product Manager from OCLC. She gave an overview of the Web Archives Workbench. (I hunted for a good link to this – but the best I came up with was an acknowledgments document and the login page.) The inspiration for the creation of Workbench was the challenge of collecting from the web. The Internet is a big place. It is hard to define the scope of what to archive.

Workbench is a discovery tool that will permit its users to investigate what domains should be included when crawling a website for archiving. It will ask you which domains should be included. For example, you can tell it not to crawl Adobe.com just because there is a link to it to let people download Acrobat.
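The scoping decision described above boils down to checking each discovered link against a list of domains. Here is a minimal sketch of that idea; it is not how Workbench is implemented, and the function names and the `example.gov` domain are my own inventions for illustration.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """True when host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def in_scope(url, include_domains, exclude_domains=()):
    """Decide whether a discovered link belongs in the crawl."""
    host = (urlsplit(url).hostname or "").lower()
    # Exclusions win, so a link out to a download site is never followed.
    if any(host_matches(host, d) for d in exclude_domains):
        return False
    return any(host_matches(host, d) for d in include_domains)


# A hypothetical agency site that links to Adobe so visitors can download a reader.
print(in_scope("http://www.example.gov/reports", ["example.gov"], ["adobe.com"]))
print(in_scope("http://www.adobe.com/products/acrobat/", ["example.gov"], ["adobe.com"]))
```

Matching on the host name rather than on the raw URL text keeps a look-alike such as `notexample.gov` from slipping into scope.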

Workbench will let you set metadata for your collection based on the domains you said were in scope. It will then let you appraise and rank the entities/domains being harvested, leaving you with a list of organizations or entities in scope and ranked by importance. Next it will translate a site map of what is going to be crawled, define parts of the map as series and put the harvested content and related metadata into a repository. Other configuration options permit setting how frequently you harvest various series, choosing to only get new content and requesting notification if the sitemap changes.

Workbench is currently in beta and is still under development. The 3rd phase will add the support for Richard Pearce-Moses’s Arizona Model for Web Preservation and Access. The focus of the Arizona Model is curation, not technology. It strives to find a solution somewhere between manual harvesting and bulk harvesting that is based on standard archival theories. Workbench will be open source and funded by LOC.

I wasn’t sure what to expect from the roundtable – but I was VERY glad that I attended. The group was very enthusiastic – cramming in everything they could manage to share with those in the room. The Internet Archive, Archive-It and the Web Archives Workbench represent the front of the pack of software tools intended to support archiving the web. It was easy to see that if the Workbench is integrated in with Archive-It, that it should permit archivists to start paying more attention to the identification of what should be archived rather than figuring out how to do the actual archiving.

Session 510: Digital History and Digital Collections (aka, a fan letter for Roy and Dan)

August 6, 2006

There were lots of interesting ideas in the talks given by Dan Cohen and Roy Rosenzweig during their SAA session Archives Seminar: Possibilities and Problems of Digital History and Digital Collections (session 510).

Two big ideas were discussed: the first about historians and their relationship to internet archiving and the second about using the internet to create collections around significant events. These are not the same thing.

In his article Scarcity or Abundance? Preserving the Past in a Digital Era, Roy talks extensively about the dual challenges of losing information as it disappears from the net before being archived and the future challenge to historians faced with a nearly complete historical record. This assumes we get the internet archiving thing right in the first place. It assumes those in power let the multitude of voices be heard. It assumes corporately sponsored sites providing free services for posting content survive, are archived and do the right thing when it comes to preventing censorship.

The Who Built America CD-ROM, released in 1993 and bundled with Apple computers for K-12 educational use, covered the history of America from 1876 to 1914. It came under fire in the Wall Street Journal for including discussions of homosexuality, birth control and abortion. Fast forward to now when schools use filtering software to prevent ‘inappropriate’ material from being viewed by students – in much the same way that Google China filters search results. He shared with us the contrast of the search results from Google Images for ‘Tiananmen square’ vs the search results from Google Images China for ‘Tiananmen square’. Something so simple makes you appreciate the freedoms we often forget here in the US.

It makes me look again at the DOPA (Deleting Online Predators Act) legislation recently passed by the House of Representatives. In the ALA’s analysis of DOPA, they point out all the basics as to why DOPA is a rotten idea. Cool Cat Teacher Blog has a great point by point analysis of What’s Wrong with DOPA. There are many more rants about this all over the net – and I don’t feel the need to add my voice to that throng – but I can’t get it out of my head that DOPA’s being signed into law would be a huge step BACK for freedom of speech and learning and internet innovation in the USA. How crazy is it that at the same time that we are fighting to get enough funding for our archivists, librarians and teachers – we should also have to fight initiatives such as this that would not only make their jobs harder but also siphon away some of those precious resources in order to enforce DOPA?

In the category of good things for historians and educators is the great progress of open source projects of all sorts. When I say Open Source I don’t just mean software – but also the collection and communication of knowledge and experience in many forms. Wikipedia and YouTube are not just fun experiments – but sources of real information. I can only imagine the sorts of insights a researcher might glean from the specific clips of TV shows selected and arranged as music videos by TV show fans (to see what I am talking about, take a look at some of the videos returned from a search on gilmore girls music video – or the name of your favorite pop TV characters). I would even venture to say that YouTube has found a way to provide a method of responding to TV, perhaps starting down a path away from TV as the ultimate passive one way experience.

Roy talked about ‘Open Sources’ being the ultimate goal – and gave a final plug to fight to increase budgets of institutions that are funding important projects.

Dan’s part of the session addressed that second big idea I listed – using the internet to document major events. He presented an overview of the work of ECHO: Exploring and Collecting History Online. ECHO had been in existence for a year at the time of 9/11 and used 9/11 as a test case for their research to that point. The Hurricane Digital Memory Bank is another project launched by ECHO to document stories of Katrina, Rita and Wilma.

He told us the story behind the creation of the 9/11 digital archive – how they decided they had to do something quickly to collect the experiences of people surrounding the events of September 11th, 2001. They weren’t quite sure what they were doing – if they were making the best choices – but they just went for it. They keep everything. There was no ‘appraisal’ phase to creating this ‘digital archive’. He actually made a point a few minutes into his talk to say he would stop using the word archive, and use the term collection instead, in the interest of not having tomatoes thrown at him by his archivist audience.

The lack of appraisal issue brought a question at the end of the session about where that leaves archivists who believe that appraisal is part of the foundation of archival practice? The answer was that we have the space – so why not keep it all? Dan gave an example of a colleague who had written extensively based on research done using World War II rumors they found in the Library of Congress. These easily could have been discarded as not important – but you never know how information you keep can be used later. He told a story about how they noticed that some people are using the 9/11 digital archive as a place to research teen slang because it has such a deep collection of teen narratives submitted to be part of the archive.

This reminded me of a story that Prof. Bruce Ambacher told us during his Archival Principles, Practices and Programs course at UMD. During the design phase for the new National Archives building in College Park, MD, the Electronic Records division was approached to find out how much room they needed for future records. Their answer was none. They believed that the speed at which the space required to store digital data was shrinking was faster than the rate of growth of new records coming into the archive. One of the driving forces behind the strong arguments for the need for appraisal in US archives was born out of the sheer bulk of records that could not possibly be kept. While I know that I am oversimplifying the arguments for and against appraisal (Jenkinson vs Schellenberg, etc) – at the same time it is interesting to take a fresh look at this in the light of removing the challenges of storage.

Dan also addressed some interesting questions about the needs of ‘digital scholarship’. They got zip codes from 60% of the submissions for the 9/11 archive – they hope to increase the accuracy and completeness of GIS information in the hurricane archive by using Google Maps’ new feature to permit pinpointing latitude and longitude based on an address or intersection. He showed us some interesting analysis made possible by pulling slices of data out of the 9/11 archive and placing it as layers on a Google Map. In the world of mashups, one can see this as an interesting and exciting new avenue for research. I will update this post with links to his promised details to come on his website about how to do this sort of analysis with Google Maps. There will soon be a researcher’s interface of some kind available at the 9/11 archive (I believe in sync with the 5 year anniversary of September 11).

Near the end of the session a woman took a moment to thank them for taking the initiative to create the 9/11 archive. She pointed out that much of what is in archives across the US today is the result of individuals choosing to save and collect things they believed to be important. The woman who had originally asked about the place of appraisal in a ‘keep everything digital world’ was clapping and nodding and saying ‘she’s right!’ as the full room applauded.

So – keep it all. Snatch it up before it disappears (there were fun stats like the fact that most blogs remain active for 3 months, most email addresses last about 2 years and inactive Yahoo Groups are deleted after 6 months). There is likely a place for ‘curatorial views’ of the information created by those who evaluate the contents of the archive – but why assume that something isn’t important? I would imagine that as computers become faster and programming becomes smarter – if we keep as much as we can now, we can perhaps automate the sorting it out later with expert systems that follow very detailed rules for creating more organized views of the information for researchers.

This panel had so many interesting themes that crossed over into other panels throughout the conference. The Maine Archivist talking about ‘stopping the bleeding’ of digital data loss in his talk about the Maine GeoArchives. The panel on blogging (that I will write more about in a future post). The RLG Roundtable with presentations from people over at the Internet Archive and their talks about archiving everything (ALSO deserves its own future post).

I feel guilty for not managing to touch on everything they spoke about – it really was one of the best sessions I attended at the conference. I think that having voices from outside the archival profession represented is both a good reality check and great for the cross-pollination of ideas. Roy and Dan have recently published a book titled Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web – definitely on my ‘to be read’ list.

Thoughts on Archiving Web Sites

July 26, 2006

Shortly after my last post, a thread surfaced on the Archives Listserv asking the best way to crawl and record the top few layers of a website. This led to many posts suggesting all sorts of software geared toward this purpose. This post shares some of my thinking on the subject.
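The ‘top few layers’ request from the listserv thread boils down to a breadth-first crawl that stops following links at a fixed depth. Here is a minimal sketch of that idea in Python, not a stand-in for any of the tools discussed in this post. The `crawl` function, its `fetch` parameter and the three-page in-memory `SITE` are all made up for illustration, so the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl that stays on the starting host.

    fetch is any callable mapping a URL to its HTML (or None if missing).
    Returns {url: html} for every page within max_depth links of start_url.
    """
    host = urlparse(start_url).netloc
    captured = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html
        if depth == max_depth:
            continue  # record this page but do not follow its links
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href).split("#")[0]  # drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return captured


# Hypothetical three-page site held in memory (an assumption for the demo).
SITE = {
    "http://example.org/": '<a href="/about">About</a> '
                           '<a href="http://other.example/">Away</a>',
    "http://example.org/about": '<a href="/staff">Staff</a>',
    "http://example.org/staff": "<p>Staff list</p>",
}

pages = crawl("http://example.org/", SITE.get, max_depth=1)
print(sorted(pages))
```

With `max_depth=1` the sketch records the home page and the page it links to on the same host, and leaves the staff page (two links away) and the off-site link alone. A real capture would swap `SITE.get` for something that actually downloads pages, and would also need to respect robots.txt and pace its requests.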

Adobe Acrobat can capture a website and convert it into a PDF. As pointed out in the thread above, that would lose the original source HTML – yet there are more issues than that alone. It would also lose any interaction other than links to other pages. It is not clear to me what would happen to a video or Flash interface on a site being ‘captured’ by Acrobat. Quoting a lesson for Acrobat 7 titled Working with the Web: “Acrobat can download HTML pages, JPEG, PNG, SWF, and GIF graphics (including the last frame of animated GIFs), text files, image maps and form fields. HTML pages can include tables, links, frames, background colors, text colors, and forms. Cascading Stylesheets are supported. HTML links are turned into Web links, and HTML forms are turned into PDF forms.”

I looked at a few website HTML capture programs such as Heritrix, Teleport Pro, HTTrack Web and the related ProxyTrack. I hope to take the time to compare each of these options and discover what it does when confronted with something more complicated than HTML, images or cascading style sheets. It also got me thinking about HTML and versions of browsers. I think it is safe to say that most people who browse the internet with any regularity have had the experience of viewing a page that just didn’t look right. Not looking right might be anything from strange alignment or odd fonts all the way to a page that is completely illegible. If you are a bit of a geek (like me) you might have gotten clever and tried another browser to see if it looked any better. Sometimes it does – sometimes it doesn’t. Some sites make you install something special (Flash or some other type of plugin or local program).

Where does this leave us when archiving websites? A website is much more than just its text. If the text were all we worried about I am sure you could crawl and record (or screen scrape) just the text and links and call it a day, being fairly confident that text stored as a plain ASCII file (with some special notation for links) would continue to be readable even if browsers disappeared from the world. While keeping the words is useful, it also loses a lot of the intended meaning. Have you read full text journal articles online that don’t have the images? I have – and I hate it. I am a very visually oriented person. It doesn’t help me to know there WAS a diagram after the 3rd paragraph if I can’t actually see it. Keeping all the information on a webpage is clearly important. The full range of content (all the audio, video, images and text on a page) is important to viewing the information in its original context.
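To make the ‘just the text and links’ idea concrete, here is a small sketch using only Python’s standard library. The `[-> url]` marker is a notation I invented for this example, and `TextAndLinks` and `html_to_text` are illustrative names rather than part of any archiving tool.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Reduces HTML to plain text, writing each link as 'label [-> url]'."""

    SKIP = {"script", "style"}  # tag contents that are not readable text

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None    # URL of the <a> tag we are inside, if any
        self.skipping = 0   # depth of nested script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        elif tag == "a":
            self.href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1
        elif tag == "a" and self.href:
            self.parts.append(f"[-> {self.href}]")
            self.href = None

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.parts.append(" ".join(data.split()))  # collapse whitespace


def html_to_text(html):
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = ('<p>See the <a href="http://example.org/map">map</a> below.</p>'
          '<img src="diagram.png">')
print(html_to_text(sample))
```

The sample prints `See the map [-> http://example.org/map] below.` and the diagram silently vanishes, which is exactly the loss described above: the words and links survive, the image does not.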

Archivists who work with non-print media records that require equipment for access are already in the practice of saving old machines hoping to ensure access to their film, video and audio records. I know there are recommendations for retaining older computers and software to ensure access to data ‘trapped’ in ‘dead’ programs (I will define a dead program here as one which is no longer sold, supported or upgraded – often one that is only guaranteed to run on a dead operating system). My fear is for the websites that ran beautifully on specific old browsers. Are we keeping copies of old browsers? Will the old browsers even run on newer operating systems? The internet and its content are constantly changing – even just keeping the HTML may not be enough. What about those plugins – what about the streaming video or audio? Do the crawlers pull and store that data as well?
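One modest way to start answering that last question for a given capture is to list everything a page points to beyond its own HTML, then check which of those files the crawler actually stored. The sketch below is only an illustration: the `RESOURCE_ATTRS` table is my own short, incomplete list of tags, and it will miss anything loaded by scripts or from inside a plugin.

```python
from html.parser import HTMLParser

# Tag -> attribute that points at an external file.
# An illustrative, incomplete list (an assumption, not a standard).
RESOURCE_ATTRS = {
    "img": "src",
    "script": "src",
    "embed": "src",
    "iframe": "src",
    "object": "data",
    "link": "href",
}


class ResourceInventory(HTMLParser):
    """Records (tag, url) for every external resource a page references."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        wanted = RESOURCE_ATTRS.get(tag)
        if wanted:
            value = dict(attrs).get(wanted)
            if value:
                self.resources.append((tag, value))


def list_resources(html):
    parser = ResourceInventory()
    parser.feed(html)
    parser.close()
    return parser.resources


page = ('<link rel="stylesheet" href="site.css">'
        '<img src="logo.gif">'
        '<object data="intro.swf"></object>'
        '<embed src="clip.mov">')
for tag, url in list_resources(page):
    print(tag, url)
```

Comparing a list like this against the files in the captured copy shows quickly whether the Flash movie or the video clip made it into the archive along with the HTML.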

One of the most interesting things about reading old newspapers can be the ads. What was being advertised at the time? How much was the sale price for laundry detergent in 1948? With the internet customizing itself to individuals or simply generating random ads, how would that sort of snapshot of products and prices be captured? I wonder if there is a place for advertising statistics as archival records. What Google ads were most popular on a specific day? Google already has interesting graphs to show the correspondence between specific keyword searches and news stories that Google perceives as related to the event. The Internet Archive (IA) could be another interesting source for statistical analysis of advertising for those sites that permit crawling.

What about customization? Only I (or someone looking over my shoulder) can see my MyYahoo page. And it changes each time I view it. It is a conglomeration of the latest travel discounts, my favorite comics, what is on my favorite TV and cable channels tonight, the headlines of the newspapers/blogs I follow and a snapshot of my stock portfolio. Take even a corporate portal inside an intranet. Often a slightly less moving target – but still customizable to the individual. Is there a practical way to archive these customized pages – even if only for a specific user of interest? Would it be worthwhile to be archiving the personalized portal pages of an ‘important’ or ‘interesting’ person on a daily basis – such that their ‘view’ of the world via a customized portal could be examined by researchers later?

A wealth of information can be found on the website for the Joint Workshop on Future-proofing Institutional Websites from January 2006. The one thing most of these presentations agree upon is that ‘future-proofing’ is something that institutions should think about at the time of website design and creation. Standards for creating future-proof websites directs website creators to use and validate against open standards. Preservation Strategies for institutional website content shows insight into NARA’s approach for archiving US government sites, the results of which can be viewed at http://www.webharvest.gov/. A summary of the issues they found can be read in the tidy 11-page web harvesting survey.

I definitely have more work ahead of me to read through all the information available from the International Internet Preservation Consortium and the National Library of Australia’s Preserving Access to Digital Information (PADI). More posts on this topic as I have time to read through their rich resources.

All around, a lot to think about. Interesting challenges for researchers in the future. The choices archivists face today often will depend on the type of site they are archiving. Best practices are evolving both for ‘future-proofing’ sites and for harvesting sites for archiving. Unfortunately, not everyone building a website that may be worth archiving is particularly concerned with validating their sites against open standards. Institutions that KNOW that they want to archive their sites are definitely a step ahead. They can make choices in their design and development to ensure success in archiving at a later date. It is the wild west fringe of the internet that is likely to present the greatest challenge for archivists and researchers.
