
Category: electronic records

After The Games Are Over: Olympic Archival Records

What does an archivist ponder after she turns off the Olympics? What happens to all the records of the Olympics after the closing ceremonies? Who decides what to keep? Not knowing any Olympic Archivists personally, I took to the web to see what I could find.

Olympics.org uses the tagline “Official Website of the Olympic Movement” and includes information about The International Olympic Committee’s Historical Archives. They even have an Olympic Medals Database with all the results from all the games.

The most detailed list of Olympics archives that I could find is the Olympic Studies International Directory listing of Archives & Olympic Heritage sites. It is from this page that I found my way to records from the Sydney Olympic Park Authority.

The Olympic Television Archive Bureau (OTAB) website explains that this UK based company “has over 30,000 hours of the most sensational sports footage ever seen, uniquely available in one library”  and aims to provide “prompt fulfilment of your Olympic footage requirements”.

Then I thought to dig into the Internet Archive. What a great treasure trove for all sorts of interesting Olympic bits!

First I found a Universal Newsreel from the 1964 Olympics in Tokyo (embedded below).

I also found a 2002 Computer Chronicles episode Computer Technology and the Olympics which explores the “high-tech innovations that ran the 2002 Winter Olympic Games” (embedded below).

Other fun finds included a digitized copy of a book titled The Olympic games, Stockholm, 1912 and the oldest snapshot of the Beijing 2008 website (from December of 2006). Seeing the 2008 Summer Games pages in the archive made me curious. I found the old official site of the Athens 2004 Summer Games, which kindly states: “The site is no longer available, please visit http://www.olympic.org or http://en.beijing2008.com/”. The Internet Archive has a bit more than that on the athens2004.com archive page – though some clicking through definitely made it clear that not all of the site was crawled. Lucky for us we can still see the Athens 2004 Olympics E-Cards you could send!

Then I turned to explore NARA’s assorted web resources. I found a few photos on the Digital Vaults website (search on the keyword Olympics). A search in the Archival Research Catalog (ARC) generates a long list – including footage of the US National Rifle Team in the 1960 Olympics in Italy.

My favorite items from NARA’s collections are in the Access to Archival Databases (AAD). First I found this telegram from the American Embassy in Ottawa to the Secretary of State in Washington DC (Document ID # 1975OTTAWA02204) sent in June 1975:

 1. EMBASSY APPRECIATES DEPARTMENT’S EFFORTS TO ASSIST CONGEN IN CARING FOR VIPS WHO CERTAINLY WILL ARRIVE FOR 1976 OLYMPIC GAMES WITHOUT TICKETS OR LODGING. HAS DEPARTMENT EXPLORED POSSIBILITY OF OBTAINING 4,000 TICKETS ON CONSIGNMENT BASIS FROM MONTGOMERY WARD, WITH UNDERSTANDING THAT, AS TICKETS ARE SOLD, PROCEEDS WILL BE REMITTED? PERHAPS SUCH AN ARRANGEMENT COULD BE WORKED OUT WITH FURTHER UNDERSTANDING THAT UNSOLD TICKETS BE RETURNED TO MONTGOMERY WARD AT SOME SPECIFIED DATE PRIOR TO BEGINNING OF EVENTS.

2. EMBASSY WILL FURNISH AMOUNT REQUIRED TO RESERVE SIX DOUBLE ROOMS FOR PERIOD OF GAMES. AT PRESENT HOTEL OWNERS AND OLYMPIC OFFICIALS ARE IN DISAGREEMENT AS TO AMOUNTS THAT MAY BE CHARGED FOR ROOMS DURING OLYMPIC PERIOD. NEGOTIATIONS ARE CURRENTLY BEING CARRIED OUT AND AS SOON AS ROOM RATES HAVE BEEN ESTABLISHED, QUEEN ELIZABETH HOTEL MANAGER WILL ADVISE US OF THEIR REQUIREMENTS TO RESERVE THE SIX DOUBLE ROOMS.

Immediately beneath that one, I found this telegram from October 1975 (Document Number 1975STATE258427):

SUBJECT:INVITATION TO PRESIDENT FORD AND SECRETARY
KISSINGER TO ATTEND OLYMPIC GAMES IN AUSTRIA,
FEBRUARY 4-15, 1976

THE EMBASSY IS REQUESTED TO INFORM THE GOA THAT MUCH TO THE PRESIDENT’S AND THE SECRETARY’S REGRET, THE DEMANDS ON THEIR SCHEDULES DURING THAT PERIOD WILL NOT MAKE IT POSSIBLE FOR THEM TO ATTEND THE WINTER GAMES. KISSINGER

There are definitely a lot of moving parts to Olympic Archival Records. So many nations participate, and each new host country has the option to handle records however it sees fit. I explored this whole question two years ago and came up against the fact that control over the archival records produced by each Olympics was really in the hands of the hosting committee and their country. A quick glance down the list of Archives & Olympic Heritage sites I mentioned above gives you an idea of all the different corners of the world in which one can find Olympic Archival Records in both government and independent repositories. Given that clearly not all Olympic Games are represented in that list, it makes me wonder what we will see on this front from China now that the closing ceremony is complete.

I also suspect that with each Olympic Games we increase the complexity of the electronic records being generated. Would it be worthwhile to create an online collection for each games – as has been done for the Hurricane Digital Memory Bank or The September 11 Digital Archive – but extended to include access to Olympic electronic records data sets? The sheer quantity of information is likely overwhelming – but I suspect there is a lot of interesting information that people would love to examine.

Update: For those of you (like me) who wondered what Montgomery Ward had to do with Olympic Tickets – take a look at Tickets For The ’76 Olympics Go On Sale Shortly At Montgomery Ward over in the Sports Illustrated online SI Vault. Sports Illustrated’s Vault is definitely another interesting source of information about the Olympic Games. If my post above has made you nostalgic for Olympics gone by – definitely take a look at the current Summer Games feature on their front page. I couldn’t figure out a permanent link to this feature, but if I ever do I will update this post.

New Skills for a Digital Era: Official Proceedings Now Available

From May 31st through June 2nd of 2006, The National Archives, the Arizona State Library and Archives, and the Society of American Archivists hosted a colloquium to consider the question “What are the practical, technical skills that all library and records professionals must have to work with e-books, electronic records, and other digital materials?”. The website for the New Skills for a Digital Era colloquium already includes links to the eleven case studies considered over the course of the three days of discussion as well as a list of additional suggested readings. As mentioned over on The Ten Thousand Year Blog, the pre-print of the proceedings has been available since August 2007.

As announced in SAA’s online newsletter, the Official Proceedings of the New Skills for a Digital Era Colloquium, edited by Richard Pearce-Moses and Susan E. Davis, is now available for free download. Published under a Creative Commons Attribution license, this document is 143 pages long and includes all the original case studies. I have a lot of reading to do!

The meat of the proceedings consists of a 32 page ‘Knowledge and Skills Inventory’ and a page and a half of reflections – both co-authored by Richard Pearce-Moses and Susan E. Davis. The Keynote Address by Margaret Hedstrom titled ‘Are We Ready for New Skills Yet?’ is also included.

I am very pleased with how much access has been provided to these materials. These topics are clearly of interest to many beyond the 60 individuals who were able to take part in the original gathering. As an archival studies student it has often been a great source of frustration that so few of the archives related conferences publish proceedings of any kind. It is part of what has driven me to attempt to assemble exhaustive session summaries for those sessions I have personally attended at the past two SAA Annual meetings (see SAA2006 and SAA2007). I think that the Unofficial Conference Wiki for SAA2007 was also a big step in the right direction and I hope it will continue to evolve and improve for the upcoming SAA2008 annual meeting in San Francisco.

The course I elected to take this term is dedicated to studying Communities of Practice. This announcement about the New Skills for a Digital Era’s proceedings has me thinking about the community of practice that seems to currently be taking form across the library, archives and records management communities. I will share more thoughts on this as I sort through them myself.

Finally, a question for anyone reading this post who attended the colloquium: Are you still discussing the case studies with others from that session two years ago? If not, do you wish you were?

Image Credit: The image at the top of this post is from the New Skills for a Digital Era website.

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
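Purely as an illustration of that modular idea (my own sketch, not Dioscuri's actual design or API), an emulated machine can be treated as nothing more than a particular combination of component modules:

    # Illustrative sketch only -- hypothetical names, not the Dioscuri API.
    class Module:
        """Base class for an emulated hardware component."""
        def step(self):
            pass  # advance this component by one cycle; default is to do nothing

    class Memory(Module):
        def __init__(self, size):
            self.data = bytearray(size)

    class CPU(Module):
        def __init__(self, memory):
            self.memory = memory
        def step(self):
            pass  # fetch, decode and execute one instruction from self.memory

    class VideoAdapter(Module):
        def step(self):
            pass  # refresh the emulated display

    class Emulator:
        """An emulated machine is just a list of modules stepped together."""
        def __init__(self, modules):
            self.modules = modules
        def run(self, cycles):
            for _ in range(cycles):
                for module in self.modules:
                    module.step()

    # Two different 'machines' assembled from the same parts:
    ram = Memory(1024 * 1024)
    full_pc = Emulator([CPU(ram), ram, VideoAdapter()])
    headless_pc = Emulator([CPU(ram), ram])
    full_pc.run(100)

The appeal for preservation is that swapping one module for another changes the emulated configuration without touching the rest of the machine.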

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation I found the following simple answer to the question ‘Why not just migrate?’:

  • Loss of information — e.g. word edits

  • Loss of fidelity — e.g. WordPerfect to Word isn’t very good

  • Loss of authenticity — users of migrated document need access to original to verify authenticity

  • Not always possible — closed proprietary formats

  • Not always feasible — costs may be too high

  • Emulation may be necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well thought out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records, and I kept finding that the available records of the research didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.

Will Crashed Hard Drives Ever Equal Unlabeled Cardboard Boxes?

Photo of Crashed Hard Drive - wonderferret on Flickr

How many of us have an old hard drive hanging around? I am talking about the one you were told was unfixable. The one that has 3 bad sectors. The one they replaced and handed to you in one of those distinctive anti-static bags. You know the ones I mean – the steely grey translucent plastic ones that look like they should contain space food.

I have more than one ‘dead’ hard drive. I can’t quite bring myself to throw them out – but I have no immediate plans to try and reclaim their files.

I know that there are services and techniques for pulling data off otherwise inaccessible hard drives. You hear about it in court cases and see it on TV shows. A quick Google search on hard drive rescue turns up businesses like Disk Data Recovery.

Do archivists already make it a policy to hunt not just for computers, but for discarded and broken hard drives lurking in filing cabinets and desk drawers? Compare this to a carton of documents appraised as valuable, yet needing special treatment before the records it contains can be accessed. If the treatment required were within budgetary and time constraints – it would be performed. Mold, bugs, rusty staples, photos that are stuck together… archivists generally know where to get the answers they need to tackle these sorts of problems. I suspect that a hard drive advertised or discovered to be broken would be treated more like an empty box than a moldy box.

For now I would place this challenge near the bottom of the list – below archiving digital records that we can access easily but that depend on old hardware or software – but I can imagine a time when standard hard drive rescue techniques will need to be a tool for the average archivist.

Thoughts on Digital Preservation, Validation and Community

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.


Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program, which permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s – it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world, I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65-page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
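To make the analogy concrete, here is a minimal sketch (my own illustration, not drawn from any of the projects mentioned in this post) of what a ‘regression test’ for stored content could look like: record a checksum for every file at ingest, then recompute the checksums on a schedule and flag anything that has silently changed:

    # Sketch of periodic fixity checking; file layout and names are hypothetical.
    import hashlib
    import json
    from pathlib import Path

    MANIFEST = Path("manifest.json")  # maps relative path -> SHA-256 recorded at ingest

    def sha256(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def record_ingest(archive_root):
        """Run once at ingest: remember a checksum for every file."""
        manifest = {str(p.relative_to(archive_root)): sha256(p)
                    for p in Path(archive_root).rglob("*") if p.is_file()}
        MANIFEST.write_text(json.dumps(manifest, indent=2))

    def audit(archive_root):
        """Run on a schedule: list every file whose bits no longer match ingest."""
        manifest = json.loads(MANIFEST.read_text())
        failures = [rel for rel, expected in manifest.items()
                    if sha256(Path(archive_root) / rel) != expected]
        return failures  # an empty list means the 'regression suite' passed

In a truly dark archive the audit runs against storage no user ever touches, which is exactly why it has to be automated.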

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “…replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
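Setting the actual voting protocol aside, that philosophy – keep several copies, compare them regularly, repair whatever disagrees with the consensus – can be sketched in a few lines. This is an illustration of the general idea only, with made-up peer data, not LOCKSS’s rate-limited sampled voting:

    # Toy 'audit and repair' across replicas; not the LOCKSS protocol itself.
    import hashlib
    from collections import Counter

    def digest(content):
        return hashlib.sha256(content).hexdigest()

    def audit_and_repair(replicas):
        """replicas: dict of peer name -> bytes held for one preserved object.
        Any copy that disagrees with the majority is overwritten with a good copy."""
        votes = Counter(digest(content) for content in replicas.values())
        winning_digest = votes.most_common(1)[0][0]
        good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
        repaired = []
        for peer, content in list(replicas.items()):
            if digest(content) != winning_digest:
                replicas[peer] = good_copy  # repair the damaged replica
                repaired.append(peer)
        return repaired

    # Three peers hold the same object; one copy has quietly rotted.
    peers = {"peer-a": b"original bits", "peer-b": b"original bits", "peer-c": b"original bitz"}
    print(audit_and_repair(peers))  # -> ['peer-c']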

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master: since all three copies are written virtually simultaneously, they are all equal in authority. I think it is very interesting that they rely on writing to tapes. They mention that tape is cheaper – yet due to various issues they might still switch to hard drives.
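The ‘all writes must complete before processing can continue’ rule is easy to picture in code. The sketch below is my own illustration with hypothetical destination names, not the DAITSS implementation:

    # Sketch: ingest only succeeds if every copy of the package is written.
    class CopyFailed(Exception):
        pass

    def write_copy(destination, package):
        """Stand-in for writing one copy of an Archival Information Package."""
        print(f"writing {package} to {destination}")
        return True  # a real implementation would return False or raise on error

    def ingest(package, destinations=("local tape 1", "local tape 2", "remote tape")):
        results = [write_copy(dest, package) for dest in destinations]
        if not all(results):
            raise CopyFailed(f"not every copy of {package} was written; halting ingest")
        return "all copies written - processing may continue"

    print(ingest("AIP-0001"))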

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.
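As a toy illustration of a reformatting-based strategy (my own sketch, not FCLA’s actual rules), the amount of treatment a file receives could hinge on whether its detected format is on the supported list:

    # Toy decision table loosely inspired by the twelve supported formats above.
    # The bit-level fallback is an assumption made for illustration only.
    FULL_TREATMENT_FORMATS = {
        "AIFF", "AVI", "JPEG", "JP2", "JPX", "PDF", "plain text",
        "QuickTime", "TIFF", "WAVE", "XML", "XML DTD",
    }

    def treatment_for(detected_format):
        """Return the level of treatment a file would get in this toy model."""
        if detected_format in FULL_TREATMENT_FORMATS:
            return "full preservation: normalize or migrate the format as needed"
        return "bit-level preservation only: keep the original bytes and their fixity"

    print(treatment_for("PDF"))
    print(treatment_for("Flash"))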

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born'”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have thought not only about the technicalities of how users may access the objects in the future, but consideration of users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

Book Review: Dreaming in Code (a book about why software is hard)

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg

Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).

With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).

Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.

If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?

I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it stands repeating: If you think preserving records originating from standardized packages of off-the-shelf software is hard, then please consider that really understanding the meaning of all the data (and business rules surrounding its creation) in custom built software systems is harder still by a factor of 10 (or a 100).

It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.

My fear that in many cases we won’t know that we don’t have all the pieces we need to decipher the data until many years down the road leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first-hand experience working with large custom-built databases. Often (back in my life as a software consultant) I would be assigned to fix or add on to a program I had not written myself. This often feels like trying to crawl into someone else’s brain.

Imagine being told you must finish a 20 page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18 page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic chosen by the original author by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the person who originally was writing this was trying to say. You find 1/2 finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but now it is too late to start over, you just have to get it done.

So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.

Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.

Digital Archiving Articles – netConnect Spring 2007

Thanks to Jessamyn West’s blog post, I found my way to a series of articles in the Spring 2007 edition of netConnect:

“Saving Digital History” is the longest of the three and is a nice survey of many of the issues found at the intersection of archiving, born digital records and the wild world of the web. I especially love the extensive Link List at the end of the article — there are lots of interesting related resources. This is the sort of list of links I wish were available with ALL articles online!

I can see the evolution of some of the ideas she and her co-speakers touched on in their session at SAA 2006: Everyone’s Doing It: What Blogs Mean for Archivists in the 21st Century. I hope we continue to see more of these sorts of panels and articles. There is a lot to think about related to these issues – and there are no easy answers to the many hard questions.

Update: Here is a link to Jessamyn’s presentation from the SAA session mentioned above: Capturing Collaborative Information News, Blogs, Librarians, and You.

Copyright Law: Archives, Digital Materials and Section 108

I just found my way today to Copysense (obviously I don’t have enough feeds to read as it is!). Their current clippings post highlighted part of the following quote as their Quote of the Week.

“[L]egislative changes to the copyright law are needed. First, we need to amend the law to give the Library of Congress additional flexibility to acquire the digital version of a work that best meets the Library’s future needs, even if that edition has not been made available to the public. Second, section 108 of the law, which provides limited exceptions for libraries and archives, does not adequately address many of the issues unique to digital media—not from the perspective of copyright owners; not from the perspective of libraries and archives.” – Marybeth Peters, Register of Copyrights, March 20, 2007

Marybeth Peters was speaking to the Subcommittee on Legislative Branch of the Committee on Appropriations about the Future of Digital Libraries.

Copysense makes some great points about the quote:

Two things strike us as interesting about Ms. Peters’ quote. First, she makes the quote while The Section 108 Study Group continues to work through some very thorny issues related to the statutes application in the digital age […] Second, while Peters’ quote articulates what most information professionals involved in copyright think is obvious, her comments suggest that only recently is she acknowledging the effect of copyright law on this nation’s de facto national library. […] [S]omehow it seems that Ms. Peters is just now beginning to realize that as the Library of Congress gets involved in the digitization and digital work so many other libraries already are involved in, that august institution also may be hamstrung by copyright.

I did my best to read through Section 108 of the Copyright Law – subtitled “Limitations on exclusive rights: Reproduction by libraries and archives”. I found it hard to get my head around … definitely stiff going. There are 9 different subsections (‘a’ through ‘i’), each with its own numbered exceptions or requirements. Anxious to get a grasp on what this all really means – I found LLRX.com and their Library Digitization Projects and Copyright page. This definitely was an easier read and helped me get further in my understanding of the current rules.

Next I explored the website for the Section 108 Study Group that is hard at work figuring out what a good new version of Section 108 would look like. I particularly like the overview on the About page. They have a 32 page document titled Overview of the Libraries and Archives Exception in the Copyright Act: Background, History, and Meaning for those of you who want the full 9 years on what has gotten us to where we are today with Section 108.

For a taste of current opinions – go to the Public Comments page which provides links to all the written responses submitted to the Notice of public roundtable with request for comments. There are clear representatives from many sides of the issue. I spotted responses from SAA, ALA and ARL as well as from MPAA, AAP and RIAA. All told there are 35 responses (and no, I didn’t read them all). I was more interested in all the different groups and individuals that took the time to write and send comments (and a lot of time at that – considering the complicated nature of the original request for comments and the length of the comments themselves). I was also intrigued to see the wide array of job titles of the authors. These are leaders and policy makers (and their lawyers) making sure their organizations’ opinions are included in this discussion.

Next stop – the Public Roundtables page with its links to transcripts from the roundtables – including the most recent one held January 31, 2007. Thanks to the magic of Victoria’s Transcription Services, the full transcripts of the roundtables are online. No, I haven’t read all of these either. I did skim through a bit of them to get a taste of the discussions – and there is some great stuff here. Lots of people who really care about the issues carefully and respectfully exploring the nitty-gritty details to try and reach good compromises. This is definitely on my ‘bookmark to read later’ list.

Karen Coyle has a nice post over on Coyle’s InFormation that includes all sorts of excerpts from the transcripts. It gives you a good flavor of what some of these conversations are like – so many people in the same room with such different frames of reference.

This is not easy stuff. There is no simple answer. It will be interesting to see what shape the next version of Section 108 takes with so many people with very different priorities pulling in so many directions.

The good news is that there are people with the patience and dedication to carefully gather feedback, hold roundtables and create recommendations. Hurrah for the hard working members of the Section 108 Study Group – all 19 of them!