
Chapter 4: Link Rot, Reference Rot and the Thorny Problems of Legal Citation by Ellie Margolis

The fourth chapter in Partners for Preservation is ‘Link Rot, Reference Rot and the Thorny Problems of Legal Citation’ by Ellie Margolis. Links that no longer work and pages that have been updated since they were referenced are an issue that everyone online has struggled with. In this chapter, Margolis gives us insight into why these challenges are particularly pernicious for those working in the legal sphere.

This passage touches on the heart of the problem.

Fundamentally, link and reference rot call into question the very foundation on which legal analysis is built. The problem is particularly acute in judicial opinions because the common law concept of stare decisis means that subsequent readers must be able to trace how the law develops from one case to the next. When a source becomes unavailable due to link rot, it is as though a part of the opinion disappears. Without the ability to locate and assess the sources the court relied on, the very validity of the court’s decision could be called into question. If precedent is not built on a foundation of permanently accessible sources, it loses its authority.

While working on this blog post, I found a WordPress Plugin called Broken Link Checker. It does exactly what you expect – scans through all your blog posts to check for broken URLs. In my 201 published blog posts (consisting of just shy of 150,000 words), I have 3002 unique URLs. The plugin checked them all and found 766 broken links! Interestingly, the plugin updates the styling of all broken links to show them with strikethroughs – see the strikethrough in the link text of the last link in the image below.

For each of the broken URLs it finds, you can click on “Edit Link”. You then have the option of updating it manually or using a suggested link to a Wayback Machine archived page – assuming it can find one.
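
For the curious, here is a rough sketch in Python of what a checker like this has to do: test each URL, and for the broken ones ask the Wayback Machine’s public Availability API (https://archive.org/wayback/available) whether an archived copy exists. The helper names and example URLs are my own illustration, not the plugin’s actual code.

```python
import requests

def check_link(url, timeout=10):
    """Return True if the URL still resolves to a successful response."""
    try:
        # HEAD keeps traffic light; some servers reject it (405), so fall back to GET.
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 405:
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False

def wayback_snapshot(url):
    """Ask the Wayback Machine Availability API for the closest archived copy."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=10)
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

# Hypothetical list standing in for the URLs harvested from blog posts.
urls = ["https://example.com/a-page-that-moved", "https://example.com/"]
for url in urls:
    if not check_link(url):
        archived = wayback_snapshot(url)
        print(f"BROKEN: {url} -> suggested replacement: {archived or 'no archive found'}")
```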

It is no secret that link rot is a widespread issue. Back in 2013, the Internet Archive announced an initiative to fix broken links on the Internet – including the creation of the Broken Link Checker plugin I found. Three years later, on the Wikipedia blog, they announced that over a million broken outbound links on English Wikipedia had been fixed. Fast forward to October of 2018 and an Internet Archive blog post announced that More than 9 million broken links on Wikipedia are now rescued.

I particularly love this example because it combines proactive work and repair work. This quote from the 2018 blog post explains the approach:

For more than 5 years, the Internet Archive has been archiving nearly every URL referenced in close to 300 wikipedia sites as soon as those links are added or changed at the rate of about 20 million URLs/week.

And for the past 3 years, we have been running a software robot called IABot on 22 Wikipedia language editions looking for broken links (URLs that return a ‘404’, or ‘Page Not Found’). When broken links are discovered, IABot searches for archives in the Wayback Machine and other web archives to replace them with.

There are no silver bullets here – just the need for consistent attention to the problem. The examples of issues being faced by the law community, and their various approaches to prevent or work around them, can only help us all move forward toward a more stable web of internet links.

Ellie Margolis

Bio:
Ellie Margolis is a Professor of Law at Temple University, Beasley School of Law, where she teaches Legal Research and Writing, Appellate Advocacy, and other litigation skills courses. Her work focuses on the effect of technology on legal research and legal writing. She has written numerous law review articles, essays and textbook contributions. Her scholarship is widely cited in legal writing textbooks, law review articles, and appellate briefs.

Image credit: Image from page 235 of “American spiders and their spinningwork. A natural history of the orbweaving spiders of the United States, with special regard to their industry and habits” (1889)

Encouraging Participation in the Census

While smart folks over at NARA are thinking about the preservation strategy for digitized 2010 census forms, I got inspired to take a look at what we have preserved from past censuses. Specifically, I wanted to look at posters, photos and videos that give us a glimpse into how we encouraged and documented the activity of participation in the past.

There is a dedicated Census History area on the Census website, as well as a section of the 2010 website called The Big Count Archive. While I like the wide range of 2010 Census Posters – the 1940 census poster shown here (thank you Library of Congress) is just so striking.

I also loved the videos I found, especially when I realized that they were all available on YouTube – uploaded by a user named JasonGCensus. I am not clear on the relationship between JasonGCensus and the official U.S. Census Bureau’s Channel (which seems focused on 2010 Census content), but there are some real gems posted there.

For example, in the 1970 Census PSA shown below we learn about the privacy of our census data: “Our separate identities will be lost in the process which is concerned only with what we say, not who said it”. We are shown technology details – complete with old school beeping and blooping computer sounds. (NOTE: this video is also available on Census.gov, but I saw no way to embed that video here – hence my cheer at finding the same video on YouTube)

For the 1960 census, a PSA explains the new FOSDIC technology which removed the need for punch-cards. With the tagline ‘Operation Rollcall, USA’, the ad presents our part in “this enterprise” as cooperation with the enumerators. In the 1980 PSA the tagline is ‘Answer the Census: We’re counting on you!’ and the ad stresses that the census is kept confidential and is used to provide services to communities. By the time you get to the 1990 and 2000 PSAs, we see more stress on the benefits to communities that fill out the census and less stress on how the census is actually recorded.

I also found some lovely census images in the Library of Congress Prints and Photographs catalog, including the image shown here.

Exploring the area of Census.gov dedicated to the 2010 census made me wonder what was available online for the 2000 census.

Wayback Machine to the rescue! They have what appears to be a fairly deep crawl of the 2000 Census.gov site dating from March of 2000. For example – the posters section seems to include all the images and PDFs of the originals. I even found functional Quicktime videos in the Video Zone, like this one: How America Knows What America Needs.

The ten year interval makes for a nice way to get a sense of the country from the PR perspective. What did the Census Bureau think was the right way to appeal to the American public? Were we more intrigued by the latest technology or worried about our privacy? Did they need to communicate what the census is used for? Or was it okay to simply express it as an American’s duty? I appreciate the ease with which I can find and share the resources above. Great fun.

And for those of you in the United States, please consider this my personal encouragement to fill out your census forms!

Update: The Washington Post has an interesting article about the ‘Snapshot of America’ series of promotional videos for the 2010 census. Definitely an interesting contrast to the videos I reviewed for this post.

Leveraging Google Reader’s Page Change Tracking for Web Page Preservation

The Official Google Reader Blog recently announced a new feature that will let users watch any page for updates. The way this works is that you add individual URLs to your Google Reader account. Just as with regular RSS feeds, when an update is detected – a new entry is added to that subscription.

My thinking is that this could be a really useful tool for archivists charged with preserving websites that change gradually over time, especially those fairly static sites that change infrequently with little or no notice of upcoming changes. If a web page was archived and then added to a dedicated Google Reader account, the archivist could scan their list of watched pages daily or weekly. Changes could then trigger the creation of a fresh snapshot of the site.
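
To make that watch-then-snapshot loop concrete, here is a rough sketch in Python. It is not tied to Google Reader at all – it assumes you keep your own list of watched URLs, fingerprints each page with a hash, and triggers the Wayback Machine’s Save Page Now endpoint (https://web.archive.org/save/) when something changes. The state file and function names are hypothetical.

```python
import hashlib
import json
import requests

STATE_FILE = "watched_pages.json"  # hypothetical local store of page fingerprints

def fingerprint(url):
    """Fetch a page and reduce it to a hash we can compare between visits."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return hashlib.sha256(resp.content).hexdigest()

def snapshot(url):
    """Ask the Wayback Machine's Save Page Now endpoint to archive the page."""
    requests.get(f"https://web.archive.org/save/{url}", timeout=120)

def check_watched_pages(urls):
    try:
        with open(STATE_FILE) as f:
            seen = json.load(f)
    except FileNotFoundError:
        seen = {}
    for url in urls:
        current = fingerprint(url)
        if seen.get(url) != current:   # first visit, or the page changed
            snapshot(url)
            seen[url] = current
    with open(STATE_FILE, "w") as f:
        json.dump(seen, f, indent=2)

check_watched_pages(["https://example.gov/county-records"])
```

One caveat with this naive version: hashing the raw HTML will flag trivial changes (rotating ads, embedded timestamps), so a real tool would want to normalize the page before fingerprinting it.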

I will admit that there have been services out there for a while that do something similar to what Google has just rolled out. I personally have used Dapper.net to take a standard web page and generate an RSS feed based on updates to the page (sound familiar?). One Dapper.net feed that I created and follow is for the news archive page for the International Red Cross and can be found here. What is funny is that now they actually have an official RSS feed for their news that includes exactly what my Dapper.net feed harvested off their news archive page – but when I built that Dapper feed there was no other way for me to watch for those news updates.

There are lots of different tools out there that aim to archive websites. Archive-It is a subscription based service run by Internet Archive that targets institutions and will archive sites on demand or on a regular schedule. Internet Archive also has an open source crawler called Heritrix for those who are comfortable dealing with the code. Other institutions are building their own software to tackle this too. Harvard University has their own Web Archive Collection Service (WAX). The LiWA (Living Web Archives) Project is based in Germany and aims to “extend the current state of the art and develop the next generation of Web content capture, preservation, analysis, and enrichment services to improve fidelity, coherence, and interpretability of web archives.” One could even use something as simple as PDFmyURL.com – an online service that turns any URL into a PDF (be sure to play with the advanced options to make sure you get a wide enough snapshot). I know there are many more possibilities – these just scratch the surface.

What I like about my idea is that it isn’t meant to replace these services but rather work in tandem with them. The Internet Archive does an amazing job crawling and archiving many web pages – but they can’t archive everything and their crawl frequency may not match up with real world updates to a website. This approach certainly wouldn’t scale well for huge websites for which you would need to watch for changes on many pages. I am picturing this technique as being useful for small organizations or individuals who just need to make sure that a county government website makeover or a community organization’s website update doesn’t get lost in the shuffle. I like the idea of finding clever ways to leverage free services and tools to support those who want to protect a particular niche of websites from being lost.

Image Credit: The RSS themed image above is by Matt Forsythe.

After The Games Are Over: Olympic Archival Records

What does an archivist ponder after she turns off the Olympics? What happens to all the records of the Olympics after the closing ceremonies? Who decides what to keep? Not knowing any Olympic Archivists personally, I took to the web to see what I could find.

Olympics.org uses the tag line “Official Website of the Olympic Movement” and includes information about The International Olympic Committee’s Historical Archives. They even have an Olympic Medals Database with all the results from all the games.

The most detailed list of Olympics archives that I could find is the Olympic Studies International Directory listing of Archives & Olympic Heritage sites. It is from this page that I found my way to records from the Sydney Olympic Park Authority.

The Olympic Television Archive Bureau (OTAB) website explains that this UK based company “has over 30,000 hours of the most sensational sports footage ever seen, uniquely available in one library”  and aims to provide “prompt fulfilment of your Olympic footage requirements”.

Then I thought to dig into the Internet Archive. What a great treasure trove for all sorts of interesting Olympic bits!

First I found a Universal Newsreel from the 1964 Olympics in Tokyo (embedded below).

I also found a 2002 Computer Chronicles episode Computer Technology and the Olympics which explores the “high-tech innovations that ran the 2002 Winter Olympic Games” (embedded below).

Other fun finds included a digitized copy of a book titled The Olympic games, Stockholm, 1912 and the oldest snapshot of the Beijing 2008 website (from December of 2006). Seeing the 2008 Summer Games pages in the archive made me curious. I found the old site of the official Athens summer games from 2004 which kindly states: “The site is no longer available, please visit http://www.olympic.org or http://en.beijing2008.com/”. The Internet Archive has a bit more than that on the athens2004.com archive page – though some clicking through definitely made it clear that not all of the site was crawled. Lucky for us we can still see the Athens 2004 Olympics E-Cards you could send!

Then I turned to explore NARA‘s assorted web resources. I found a few photos on the Digital Vaults website (search on the keyword Olympics).  A search in the Archival Research Catalog (ARC) generates a long list – including footage of the US National Rifle Team in the 1960 Olympics in Italy.

My favorite items from NARA’s collections are in the Access to Archival Databases (AAD). First I found this telegram from the American Embassy in Ottawa to the Secretary of State in Washington DC (Document ID # 1975OTTAWA02204) sent in June 1975:

 1. EMBASSY APPRECIATES DEPARTMENT’S EFFORTS TO ASSIST CONGEN IN CARING FOR VIPS WHO CERTAINLY WILL ARRIVE FOR 1976 OLYMPIC GAMES WITHOUT TICKETS OR LODGING. HAS DEPARTMENT EXPLORED POSSIBILITY OF OBTAINING 4,000 TICKETS ON CONSIGNMENT BASIS FROM MONTGOMERY WARD, WITH UNDERSTANDING THAT, AS TICKETS ARE SOLD, PROCEEDS WILL BE REMITTED? PERHAPS SUCH AN ARRANGEMENT COULD BE WORKED OUT WITH FURTHER UNDERSTANDING THAT UNSOLD TICKETS BE RETURNED TO MONTGOMERY WARD AT SOME SPECIFIED DATE PRIOR TO BEGINNING OF EVENTS.

2. EMBASSY WILL FURNISH AMOUNT REQUIRED TO RESERVE SIX DOUBLE ROOMS FOR PERIOD OF GAMES. AT PRESENT HOTEL OWNERS AND OLYMPIC OFFICIALS ARE IN DISAGREEMENT AS TO AMOUNTS THAT MAY BE CHARGED FOR ROOMS DURING OLYMPIC PERIOD. NEGOTIATIONS ARE CURRENTLY BEING CARRIED OUT AND AS SOON AS ROOM RATES HAVE BEEN ESTABLISHED, QUEEN ELIZABETH HOTEL MANAGER WILL ADVISE US OF THEIR REQUIREMENTS TO RESERVE THE SIX DOUBLE ROOMS.

Immediately beneath that one, I found this telegram from October 1975 (Document Number 1975STATE258427):

SUBJECT:INVITATION TO PRESIDENT FORD AND SECRETARY
KISSINGER TO ATTEND OLYMPIC GAMES IN AUSTRIA,
FEBRUARY 4-15, 1976

THE EMBASSY IS REQUESTED TO INFORM THE GOA THAT MUCH TO THE PRESIDENT’S AND THE SECRETARY’S REGRET, THE DEMANDS ON THEIR SCHEDULES DURING THAT PERIOD WILL NOT MAKE IT POSSIBLE FOR THEM TO ATTEND THE WINTER GAMES. KISSINGER

There are definitely a lot of moving parts to Olympic Archival Records. So many nations participate, and each new host country has the option to handle records however it sees fit. I explored this whole question two years ago and came up against the fact that control over the archival records produced by each Olympics was really in the hands of the hosting committee and their country. A quick glance down the list of Archives & Olympic Heritage sites I mentioned above gives you an idea of all the different corners of the world in which one can find Olympic Archival Records in both government and independent repositories. Given that clearly not all Olympic Games are represented in that list, it makes me wonder what we will see on this front from China now that the closing ceremony is complete.

I also suspect that with each Olympic Games we increase the complexity of the electronic records being generated. Would it be worthwhile to create an online collection for each games – as has been done for the Hurricane Digital Memory Bank or The September 11 Digital Archive, but extended to include access to Olympic electronic records data sets? The sheer quantity of information is likely overwhelming – but I suspect there is a lot of interesting information that people would love to examine.

Update: For those of you (like me) who wondered what Montgomery Ward had to do with Olympic Tickets – take a look at Tickets For The ’76 Olympics Go On Sale Shortly At Montgomery Ward over in the Sports Illustrated online SI Vault. Sports Illustrated’s Vault is definitely another interesting source of information about the Olympic Games. If my post above has made you nostalgic for Olympics gone by – definitely take a look at the current Summer Games feature on their front page. I couldn’t figure out a permanent link to this feature, but if I ever do I will update this post later.

New Skills for a Digital Era: Official Proceedings Now Available

New Skills for a Digital Era LogoFrom May 31st through June 2nd of 2006, The National Archives, the Arizona State Library and Archives, and the Society of American Archivists hosted a colloquium to consider the question “What are the practical, technical skills that all library and records professionals must have to work with e-books, electronic records, and other digital materials?”. The website for the New Skills for a Digital Era colloquium already includes links to the eleven case studies considered over the course of the three days of discussion as well as a list of additional suggested readings. As mentioned over on The Ten Thousand Year Blog, the pre-print of the proceedings has been available since August, 2007.

As announced in SAA’s online newsletter, the Official Proceedings of the New Skills for a Digital Era Colloquium, edited by Richard Pearce-Moses and Susan E. Davis, is now available for free download. Published under Creative Commons Attribution, this document is 143 pages long and includes all the original case studies. I have a lot of reading to do!

The meat of the proceedings consists of a 32 page ‘Knowledge and Skills Inventory’ and a page and a half of reflections – both co-authored by Richard Pearce-Moses and Susan E. Davis. The Keynote Address by Margaret Hedstrom titled ‘Are We Ready for New Skills Yet?’ is also included.

I am very pleased with how much access has been provided to these materials. These topics are clearly of interest to many beyond the 60 individuals who were able to take part in the original gathering. As an archival studies student it has often been a great source of frustration that so few of the archives related conferences publish proceedings of any kind. It is part of what has driven me to attempt to assemble exhaustive session summaries for those sessions I have personally attended at the past two SAA Annual meetings (see SAA2006 and SAA2007). I think that the Unofficial Conference Wiki for SAA2007 was also a big step in the right direction and I hope it will continue to evolve and improve for the upcoming SAA2008 annual meeting in San Francisco.

The course I elected to take this term is dedicated to studying Communities of Practice. This announcement about the New Skills for a Digital Era’s proceedings has me thinking about the community of practice that seems to currently be taking form across the library, archives and records management communities. I will share more thoughts on this as I sort through them myself.

Finally, a question for anyone reading this post who attended the colloquium: Are you still discussing the case studies with others from that session two years ago? If not, do you wish you were?

Image Credit: The image at the top of this post is from the New Skills for a Digital Era website.

Blog Action Day: A Look At Earth Day as Archived Online

In honor of this year’s Blog Action Day theme of discussing the environment, I decided to see what records the Internet had available about the history of Earth Day.

I started by simply Googling Earth Day. In a new browser window I opened the Internet Archive’s Wayback Machine. These were to be my two main avenues for unearthing the way that Earth Day was represented on the internet over the years.

Wikipedia’s first version of an Earth Day page was created on December 16th, 2002. This is the current Earth Day page as of the creation of this post – last updated about a week ago.

The current home page for the Earthday Network appears identical to the most recent version stored in the Wayback Machine, dated June 29, 2007 – until you notice that the featured headline on the link to http://www.earthdaynetwork.tv is different.

The site that claims to be ‘The Official Site of International Earth Day’ is EarthSite.org. The oldest version from the Wayback Machine is from December of 1996. This version shows a web visitor counter perpetually set to 1,671. Earth Day ten years ago was scheduled for March 20th, 1997. If you scroll down a bit on the What’s New page you can read the 1997 State of the World Message By John McConnell (attributed as the founder of Earth Day).

The U.S. Government portal for Earth Day was first archived in the Internet Archive on April 6, 2003. The site, EarthDay.gov, hasn’t changed much in the past 4 years. The EPA has an Earth Day page of its own that was first archived in early 1999. There is no clear way to know if that actually means that the EPA’s Earth Day page is older or if it was just found earlier by the Internet Archive’s ambitious web crawlers.
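
As an aside, one way to answer the “when was this first crawled?” question programmatically today is the Wayback Machine’s CDX server, which lists captures in timestamp order. A minimal sketch (the epa.gov URL is just an example):

```python
import requests

def earliest_capture(url):
    """Return the timestamp of the first Wayback Machine capture of a URL."""
    resp = requests.get("https://web.archive.org/cdx/search/cdx",
                        params={"url": url, "output": "json", "limit": 1},
                        timeout=30)
    rows = resp.json()
    if len(rows) < 2:          # first row is the field-name header
        return None
    header, first = rows[0], rows[1]
    return first[header.index("timestamp")]

# Prints a 14-digit timestamp (YYYYMMDDhhmmss) for the oldest capture, if any.
print(earliest_capture("epa.gov/earthday"))
```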

Envirolink.org, with the tagline “The Online Environmental Community”, was first archived back in 1996. You can see on the Wayback Machine page for Envirolink.org that it has a fairly full ten years’ worth of web page archiving.

Next I wanted to explore what the world of government records might produce on the subject. A quick stop over at Footnote.com to search for “Earth Day” didn’t yield a terribly promising list of results (no surprise there – most of their records date to before the time period we are looking for). Next I tried searching in Archival Research Catalog (ARC) over on the U.S. National Archives website. I got 15 hits – all fairly interesting looking… but none of them linked to digitized content. A search in Access to Archival Databases (AAD) system found 2 hits – one to some sort of contract between the EPA and a Fairfax Virginia company named EARTH DAY XXV from 1995 and the other a State Department telegram including this passage:

THIS NATION IS COMMITTED TO STRIVING FOR AN ENVIRONMENT THAT NOT ONLY SUSTAINS LIFE, BUT ALSO ENRICHES THE LIVES OF PEOPLE EVERYWHERE – – HARMONIZING THE WORKS OF MAN AND NATURE. THIS COMMITMENT HAS RECENTLY BEEN REINFORCED BY MY PROCLAMATION, PURSUANT TO A JOINT RESOLUTION OF THE CONGRESS, DESIGNATING MARCH 21, 1975 AS EARTH DAY, AND ASKING THAT SPECIAL ATTENTION BE GIVEN TO EDUCATIONAL EFFORTS DIRECTED TOWARD PROTECTING AND ENHANCING OUR LIFE-GIVING ENVIRONMENT.

I also thought to check the Government Printing Office’s (GPO) website for the Public Papers of the Presidents of the United States. Currently it only permits searching back through 1991 online – but my search for “Earth Day” did bring back 50 speeches, proclamations and other writings by the various presidents.

Frustrated by the total scattering of documents without any big picture, I headed back to Google – this time to search the Google News Archive for articles including “Earth Day” published before 1990. The timeline display showed me articles mostly from TIME, the Washington Post and the New York Times – some of which claimed I would need to pay in order to read.

Back again to do one more regular Google search – this time for earth day archive. This yielded an assortment of hits – and just above the fold I found my favorite snapshot of Earth Day history. The TIME Earth Day Archive Collection is a selection of the best covers, quotes and articles about Earth Day – from February 2, 1970 to the present. This is the gold mine for getting perspective on Earth Day as it has been perceived and celebrated in the United States. The covers are brilliant! If I had started this post early enough, I would have requested permission to include some here.

With the passionate title Fighting to Save the Earth from Man, the first article in the TIME Earth Day Collection begins by quoting then President Nixon’s first State of the Union Address:

The great question of the seventies is, shall we surrender to our surroundings, or shall we make our peace with nature and begin to make reparations for the damage we have done to our air, to our land, and to our water?

Fast forward to the recent awarding of the Nobel Peace Prize for 2007 to the Intergovernmental Panel on Climate Change (IPCC) and Al Gore, and I have to imagine that the answer to that long-ago question of whether we were ready to make peace with nature was ‘Not Yet’.

Overall, this was an interesting experiment. The hunt for ‘old’ (such as it is in the fast moving world of the Internet) data about a topic online is a strange and frustrating experience. Even with the Wayback Machine, I often found myself with only part of the picture. Often the pages I tried to view were missing images or other key elements. Sometimes I found a link to something tantalizing, only to realize that the target page was not archived (or is so broken as to be of no use). The search through government records and old newspaper stories did produce some interesting results – but again seemed to fail to produce any sense of the big picture of Earth Day over the years.

The TIME Collection about Earth Day was assembled by humans and arranged nicely for examination by those interested in the subject. It is properly named a ‘collection’ (in the archival sense) because it is not the pure output of activities surrounding Earth Day, but rather a selected snapshot of related articles and images that share a common topic. That said, it is my fervent hope that websites such as these appear more and more. I suspect that the lure of attracting more readers to their websites with existing content will only encourage more content creators with a long history to join in the fun. If others do it as well as TIME seems to have done in this case, it will be a win/win situation for everyone.

Thoughts on Digital Preservation, Validation and Community

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.


Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s – it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65-page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
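
The closest analogue to regression testing for a store of fixed digital objects is probably a scheduled fixity audit: record a checksum for every file at ingest, then periodically recompute and compare. Here is a minimal sketch of that idea, with a hypothetical manifest file and archive path:

```python
import hashlib
import json
from pathlib import Path

MANIFEST = Path("fixity_manifest.json")  # hypothetical checksums recorded at ingest

def sha256_of(path):
    """Stream a file through SHA-256 so large objects need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(archive_root):
    """Compare every stored file against the checksum recorded at ingest."""
    expected = json.loads(MANIFEST.read_text())
    failures = []
    for rel_path, digest in expected.items():
        actual = sha256_of(Path(archive_root) / rel_path)
        if actual != digest:
            failures.append(rel_path)   # corrupted (or silently altered) object
    return failures

for bad in audit("/data/dark-archive"):
    print("FIXITY FAILURE:", bad)
```

Run on a schedule, this catches exactly the dark-archive failure mode the JISC passage warns about: corruption that otherwise goes unnoticed until the point of use.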

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long-term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “…replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
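
For illustration only, here is a drastically simplified version of the audit-and-repair idea: compare digests across peers, trust the majority, and overwrite disagreeing copies. The actual LOCKSS protocol uses rate-limited sampled voting precisely because a naive majority vote like this one is vulnerable to attackers; the names here are my own.

```python
import hashlib
from collections import Counter

def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()

def audit_and_repair(replicas):
    """replicas: dict mapping peer name -> bytes it holds for one archival unit.

    Peers agreeing with the most common digest are presumed good; any
    disagreeing peer has its copy replaced from an agreeing one.
    """
    votes = Counter(digest(content) for content in replicas.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        raise RuntimeError("no majority -- manual intervention needed")
    good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
    repaired = {}
    for peer, content in replicas.items():
        if digest(content) != winning_digest:
            repaired[peer] = good_copy   # overwrite the damaged replica
    return repaired

damaged = audit_and_repair({
    "peer-a": b"the archived unit",
    "peer-b": b"the archived unit",
    "peer-c": b"the archived unjt",   # bit rot on one peer
})
print("repaired:", list(damaged))
```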

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three are written virtually simultaneously they are all equal in authority. I think it is very interesting that they rely on writing to tapes. There was a mention that it is cheaper – yet due to many issues they might still switch to hard drives.
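
That “all three writes must complete before processing can continue” rule is easy to picture in code. A sketch under the assumption of three mounted storage targets (the paths are invented stand-ins for the two local tape units and the remote one in Tallahassee):

```python
import shutil
from pathlib import Path

# Hypothetical mount points standing in for the three tape targets.
TARGETS = [Path("/mnt/tape-local-1"), Path("/mnt/tape-local-2"),
           Path("/mnt/tape-remote-tallahassee")]

def store_with_replication(source: Path):
    """Copy a file to every target; succeed only if every copy lands.

    On any failure, roll back the partial copies so no target is left
    holding an unverified, half-replicated master.
    """
    written = []
    try:
        for target in TARGETS:
            dest = target / source.name
            shutil.copy2(source, dest)
            written.append(dest)
    except OSError:
        for dest in written:             # roll back partial replication
            dest.unlink(missing_ok=True)
        raise
    return written

store_with_replication(Path("aip-0001.tar"))
```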

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born'”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have thought not only about the technicalities of how users may access the objects in the future, but consideration of users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

Digital Archiving Articles – netConnect Spring 2007

Thanks to Jessamyn West’s blog post, I found my way to a series of articles in the Spring 2007 edition of netConnect.

“Saving Digital History” is the longest of the three and is a nice survey of many of the issues found at the intersection of archiving, born digital records and the wild world of the web. I especially love the extensive Link List at the end of the article – there are lots of interesting related resources. This is the sort of list of links I wish were available with ALL articles online!

I can see the evolution of some of the ideas she and her co-speakers touched on in their session at SAA 2006: Everyone’s Doing It: What Blogs Mean for Archivists in the 21st Century. I hope we continue to see more of these sorts of panels and articles. There is a lot to think about related to these issues – and there are no easy answers to the many hard questions.

Update: Here is a link to Jessamyn’s presentation from the SAA session mentioned above: Capturing Collaborative Information News, Blogs, Librarians, and You.

Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of four numbers between 0 and 255 and look something like 192.39.188.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).
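
Mechanically, this kind of redaction amounts to masking the final octet so the address names a block of 256 hosts rather than a single machine. A sketch of one plausible reading of the plan (the exact method Google will use was not announced):

```python
import ipaddress

def redact_last_octet(ip: str) -> str:
    """Zero the final octet so the address identifies a /24 block of 256
    hosts, not one computer -- one plausible reading of Google's plan."""
    addr = ipaddress.IPv4Address(ip)
    redacted = ipaddress.IPv4Address(int(addr) & 0xFFFFFF00)
    return str(redacted)

print(redact_last_octet("192.39.188.3"))   # -> 192.39.188.0
```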

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is… or if it doesn’t go far enough (plenty of others are doing that; see articles at EFF and Wired: 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist’s tenets is to never do anything to a record that you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder whether the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer). Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean)?

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on, and we had documentation telling us what the values meant, and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gazillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, which would not have gotten into trouble for shredding records if it had done so according to its own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

The Archives and Archivists Listserv: hoping for a stay of execution

There has been a lot of discussion (both on the Archives & Archivists (A&A) Listserv and in blog posts) about the SAA‘s recent decision to not preserve the A&A listserv posts from 1996 through 2006 when they are removed from the listserv’s old hosting location at Miami University of Ohio.

Most of the outcry against this decision has fallen into two camps:

  • Those who don’t understand how the SAA task force assigned to appraise the listserv archives could decide it does not have informational value – lots of discussion about how the listserv reflects the move of archivists into the digital age as well as its usefulness for students
  • Those who just wish it wouldn’t go away because they still use it to find old posts. Some mentioned that there are scholarly papers that reference posts in the listserv archives as their primary sources.

I added this suggestion on the listserv:

I would have thought that the Archives Listserv would be the ideal test case for developing a set of best practices for archiving an organization’s web based listserv or bboard.

Perhaps a graduate student looking for something to work on as an independent project could take this on? Even if they only got permission for working with posts from 2001 onward [post 2001 those who posted had to agree to ‘terms of participation’ that reduce issues with copyright and ownership] – I suspect it would still be worthwhile.

I have always found that you can’t understand all the issues related to a technical project (like the preservation of a listserv) until you have a real life case to work on. Even if SAA doesn’t think we need to keep the data forever – here is the perfect set of data for archivists to experiment with. Any final set of best practices would be meant for archivists to use in the future – and would be all the easier to comprehend if they dealt with a listserv that many of them are already familiar with.

Another question: couldn’t the listserv posts still be considered ‘active records’? Many current listserv posters claim they still access the old list’s archives on a regular basis. I would be curious what the traffic for the site is. That is one nice side effect of this being on a website – it makes the usage of records quantifiable.

There are similar issues in the analog world when records people still want to use lose their physical home and are disposed of but, as others have also pointed out, digital media is getting cheaper and smaller by the day. We are not talking about paying rent on a huge warehouse or a space that needs serious temperature and humidity control.

I was glad to see Rick Prelinger’s response on the current listserv that simply reads:

The Internet Archive is looking into this issue.

I had already checked when I posted my response to the listserv yesterday – having found my way to the A&A old listserv page in the Wayback Machine. For now all that is there is the list of links to each week’s worth of postings – nothing beyond that has been pulled in.

I have my fingers crossed that enough of the right people have become aware of the situation to pull the listserv back from the brink of the digital abyss.