
Category: preservation

Phoenix DVD destined for Mars

Hubble's Sharpest View Of Mars

When the Phoenix Mars Mission launches (possibly as early as this Friday, August 3rd, 2007), it will have something unusual on board. The Planetary Society has created what they call the Phoenix DVD.

In late May of 2007 they proudly announced that their special DVD was ready for launch:

… the silica glass mini-DVD with a quarter million names on it (including all Planetary Society members) has been installed on the Phoenix spacecraft and is ready to go to Mars!

In addition to the names, the disc also contains Visions of Mars, a collection of literature and art about the Red Planet. The names and Visions of Mars were written to the silica mini-DVD by the company Plasmon OMS using a special technique. The resulting archival disc should last at least hundreds of years on the Martian surface, ready to be picked up by future explorers.

After the disc was written, a special label was applied to the disc to identify it for future explorers.

The page about Visions of Mars describes it as follows:

Visions of Mars is a message from our world to future human inhabitants of Mars. It will launch on its way to the Red Planet in the summer of 2007 aboard the spacecraft Phoenix. Along with personal messages from leading space visionaries of our time, Visions of Mars includes a priceless collection of Mars literature and art, and a list of hundreds of thousands of names of space enthusiasts from around the world. The entire collection will be encoded on a mini-DVD provided by The Planetary Society, which will be affixed to the spacecraft.

All this has been inscribed on a silica mini-DVD that bears the phrase “Attention Astronauts: Take This With You” in bright red letters on the front. I hate to be cynical (and those of you who read this blog know that it is not my nature to be so) but where will those ‘future human inhabitants of Mars’ find a DVD player to watch this DVD? I know I am not the first to doubt their plan – but I couldn’t resist. Given my suspicion of the whole affair I thought I would at least look into the company that created this very special disc.

Plasmon has an extensive website with all sorts of interesting tidbits. They explain their trademarked Ultra Density Optical (UDO) technology. They feature two PDFs – one called Archiving Defined and another labeled Plasmon Archive Solution. It looks very interesting. My VERY oversimplified summary is that they have combined a RAID approach with a very durable and secure WORM (Write Once, Read Many) flavor of DVD and packaged it into a solution for companies who need to ensure their data remains safe.

I have been meaning to learn more about the latest and greatest in hardware and material solutions aimed at digital preservation in the corporate world – and Boing Boing’s post Mars Library of books, DVDs, and database is now ready for launch just gave me a great excuse to start to scratch the surface.

I have also been following the blog StorageSwitched! for a while. It is written by the CEO of StorageSwitch LLC (“a technology provider for the fixed content data storage market with a multitude of gateway and utility products and services”). I have found it interesting to take a look at the business and technology side of preserving information. I plan more posts in this vein as I learn more about what is out there and how it is being used.

Photo Credit: David Crisp and the WFPC2 Science Team (Jet Propulsion Laboratory/California Institute of Technology)

Thoughts on Digital Preservation, Validation and Community

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.


Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s – it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world, I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65-page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
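To make the regression-testing analogy concrete, here is a minimal sketch of what such an automated audit might look like for a file-based archive: re-hash every stored object and compare it against the checksum recorded at ingest. This is not drawn from any of the projects discussed below – the manifest structure and function names are invented purely for illustration.

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def audit_archive(manifest: dict, root: Path) -> list:
    """Re-check every stored file against the checksum recorded at ingest.

    `manifest` maps a relative file path to the hex digest recorded when
    the object entered the archive. The returned list of missing or
    altered paths is the set of 'failing tests' in this regression run.
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.exists() or sha256(target) != expected:
            failures.append(rel_path)
    return failures

# A scheduled job could run audit_archive() every month and alert staff on
# failures, so corruption in a dark archive is noticed before the point of use.
```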

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “…replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
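The actual LOCKSS mechanism – rate-limited sampled voting among peers – is considerably more subtle than anything I could show here (the paper explains why naive majority voting is vulnerable to attackers). Still, a toy sketch of the bare “audit replicas regularly and repair any damage” idea may help; every name in it (peer_digests, fetch_from_peer) is hypothetical.

```python
from collections import Counter

def audit_and_repair(object_id, local_digest, peer_digests, fetch_from_peer):
    """Toy replicate/audit/repair loop (NOT the LOCKSS voting protocol).

    `peer_digests` maps a peer's name to the digest that peer reports for
    the same object. If a clear majority of peers disagrees with our local
    copy, assume the local replica is damaged and repair it from a peer
    that holds the majority version.
    """
    votes = Counter(peer_digests.values())
    consensus_digest, count = votes.most_common(1)[0]
    if count > len(peer_digests) / 2 and consensus_digest != local_digest:
        donor = next(p for p, d in peer_digests.items() if d == consensus_digest)
        return fetch_from_peer(donor, object_id)   # replacement content
    return None   # local copy matches the consensus; nothing to repair
```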

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three copies are written virtually simultaneously, they are all equal in authority. I think it is very interesting that they rely on writing to tape. They mention that tape is cheaper – yet due to a number of issues they might still switch to hard drives.
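As a rough illustration of the “all writes must complete before processing can continue” policy quoted above (and only that – this is not DAITSS code, and the destination objects are hypothetical), the ingest step might look something like this:

```python
def store_aip(aip_bytes: bytes, destinations) -> None:
    """Write one Archival Information Package to every storage destination.

    Mirrors the policy quoted above: ingest does not continue until all
    copies (say, two local tape copies and one remote copy) have been
    written. Any failure aborts the ingest and removes partial copies,
    so no lone copy is ever silently treated as 'the' master.
    """
    written = []
    for dest in destinations:           # hypothetical destination objects
        try:
            dest.write(aip_bytes)
            written.append(dest)
        except OSError:
            for ok in written:
                ok.delete()             # roll back partial copies
            raise RuntimeError("ingest aborted: not all copies completed")
```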

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born'”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have not only thought about the technicalities of how users may access the objects in the future, but also considered users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

International Environmental Data Rescue Organization: Rescuing At Risk Weather Records Around the World

In the middle of my crazy spring semester a few months back, I got a message about volunteer opportunities at the International Environmental Data Rescue Organization (IEDRO). I get emails from VolunteerMatch.org every so often because I am always curious about virtual volunteer projects (i.e., ways you can volunteer via your computer while in your pajamas). I filed the message away for when I actually had more time to take a closer look and it has finally made it to the top of my list.

A non-profit organization, IEDRO states their vision as being “…to find, rescue, and digitize all historical environmental data and to make those data available to the world community.” They go on to explain on their website:

Old weather records are indeed worth the paper they are written on…actually tens of thousands times that value. These historic data are of critical importance to the countries within which they were taken, and to the world community as well. Yet, millions of these old records have already perished with the valuable information contained within, lost forever. These unique records, some dating back to the 1500s, now reside on paper at great risk from mold, mildew, fire, vermin, and old age (paper and ink deteriorate) or being tossed away because of lack of storage space. Once these data are lost, they are lost forever. There are no back up sources; nothing in reserve.

Why are these weather records valuable? IEDRO gives lots of great examples. Old weather records can:

  • inform the construction and engineering community about maximum winds recorded, temperature extremes, rainfall and floods
  • let farmers know the true frequency of drought, flood, extreme temperatures and, in some areas, the amount of sunshine, enabling them to better plan crop varieties and irrigation or drainage systems, increasing their food production and helping to alleviate hunger.
  • assist in explaining historical events such as plague and famine, movement of cultures, and insect movements (e.g., locusts in Africa), and support epidemiological studies.
  • provide our global climate computer models with baseline information enabling them to better predict seasonal extremes. This provides more accurate real-time forecasts and warnings and a better understanding of global change and validation of global warming.

The IEDRO site includes excellent scenarios in which accurate historical weather data can help save lives. You can read about the subsistence farmer who doesn’t understand the frequency of droughts well enough to make good choices about the kind of rice he plants, the way that weather impacts the vectorization models of diseases such as malaria and about the computer programs that need historical weather data to accurately predict floods. I also found this Global Hazards and Extremes page on the NCDC’s site – and I wonder what sorts of maps they could make about the weather one or two hundred years ago if all the historical climate data records were already available.

There was additional information available on IEDRO’s VolunteerMatch page. Another activity they list for their organization is: “Negotiating with foreign national meteorological services for IEDRO access to their original observations or microfilm/microfiche or magnetic copies of those observations and gaining their unrestricted permission to make copies of those data”.

IEDRO is making it their business to coordinate efforts in multiple countries to find and take digital photos of at risk weather records. They include information on their website about their data rescue process. I love their advice about being tenacious and creative when considering where these weather records might be found. Don’t only look at the national meteorological services! Consider airports, military sites, museums, private homes and church archives. The most unusual location logged so far was a monastery in Chile.

Once the records are located, each record is photographed with a digital camera. They have a special page showing examples of bad digital photos to help those taking the digital photos in the field, as well as a guidelines and procedures document available in PDF (and therefore easy to print and use as reference offline).

The digital images of the rescued records are then sent to NOAA’s National Climatic Data Center (NCDC) in Asheville, North Carolina. The NCDC is part of the National Environmental Satellite, Data and Information Service (NESDIS) which is in turn under the umbrella of the National Oceanic and Atmospheric Administration (NOAA). The NCDC’s website claims they have the “World’s Largest Archive of Climate Data”. The NCDC has people contracted to transcribe the data and ensure the preservation of the digital image copies. Finally, the data will be made available to the world.

IEDRO already lists these ten countries as locations where activities are underway: Kenya, Malawi, Mozambique, Niger, Senegal, Zambia, Chile, Uruguay, Dominican Republic and Nicaragua.

I am fascinated by this organization. On a personal level it brings together a lot of things I am interested in – archives, the environment, GIS data, temporal data and an interesting use of technology. This is such a great example of records that might seem unimportant – but turn out to be crucial to improving lives in the here and now. It shows the need for international cooperation, good technical training and being proactive. I know that a lot of archivists would consider this more of a scientific research mission (the goal here is to get that data for the purposes of research), but no matter what else these are – they are still archival records.

Book Review: Dreaming in Code (a book about why software is hard)

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg

Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).

With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).

Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.

If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?

I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it bears repeating: If you think preserving records originating from standardized packages of off-the-shelf software is hard, then please consider that really understanding the meaning of all the data (and the business rules surrounding its creation) in custom built software systems is harder still by a factor of 10 (or 100).

It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.

My fear is that in many cases we won’t know that we don’t have all the pieces we need to decipher the data until many years down the road – and that leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first-hand experience in working with large custom built databases. Often (back in my life as a software consultant) I would be assigned to fix or add on to a program I had not written myself. This often feels like trying to crawl into someone else’s brain.

Imagine being told you must finish a 20-page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18-page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic chosen by the original author by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the person who was originally writing this was trying to say. You find half-finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but now it is too late to start over; you just have to get it done.

So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.

Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.

Copyright Law: Archives, Digital Materials and Section 108

I just found my way today to Copysense (obviously I don’t have enough feeds to read as it is!). Their current clippings post highlighted part of the following quote as their Quote of the Week.

Marybeth Peters (photo from http://www.copyright.gov/about.html)

“[L]egislative changes to the copyright law are needed. First, we need to amend the law to give the Library of Congress additional flexibility to acquire the digital version of a work that best meets the Library’s future needs, even if that edition has not been made available to the public. Second, section 108 of the law, which provides limited exceptions for libraries and archives, does not adequately address many of the issues unique to digital media—not from the perspective of copyright owners; not from the perspective of libraries and archives.” – Marybeth Peters, Register of Copyrights, March 20, 2007

Marybeth Peters was speaking to the Subcommittee on Legislative Branch of the Committee on Appropriations about the Future of Digital Libraries.

Copysense makes some great points about the quote:

Two things strike us as interesting about Ms. Peters’ quote. First, she makes the quote while The Section 108 Study Group continues to work through some very thorny issues related to the statutes application in the digital age […] Second, while Peters’ quote articulates what most information professionals involved in copyright think is obvious, her comments suggest that only recently is she acknowledging the effect of copyright law on this nation’s de facto national library. […] [S]omehow it seems that Ms. Peters is just now beginning to realize that as the Library of Congress gets involved in the digitization and digital work so many other libraries already are involved in, that august institution also may be hamstrung by copyright.

I did my best to read through Section 108 of the Copyright Law – subtitled “Limitations on exclusive rights: Reproduction by libraries and archives”. I found it hard to get my head around … definitely stiff going. There are 9 different subsections (‘a’ through ‘i’), each with their own numbered exceptions or requirements. Anxious to get a grasp on what this all really means – I found LLRX.com and their Library Digitization Projects and Copyright page. This definitely was an easier read and helped me get further in my understanding of the current rules.

Next I explored the website for the Section 108 Study Group that is hard at work figuring out what a good new version of Section 108 would look like. I particularly like the overview on the About page. They have a 32-page document titled Overview of the Libraries and Archives Exception in the Copyright Act: Background, History, and Meaning for those of you who want the full 9 years on what has gotten us to where we are today with Section 108.

For a taste of current opinions – go to the Public Comments page which provides links to all the written responses submitted to the Notice of public roundtable with request for comments. There are clear representatives from many sides of the issue. I spotted responses from SAA, ALA and ARL as well as from MPAA, AAP and RIAA. All told there are 35 responses (and no, I didn’t read them all). I was more interested in all the different groups and individuals that took the time to write and send comments (and a lot of time at that – considering the complicated nature of the original request for comments and the length of the comments themselves). I was also intrigued to see the wide array of job titles of the authors. These are leaders and policy makers (and their lawyers) making sure their organizations’ opinions are included in this discussion.

Next stop – the Public Roundtables page with its links to transcripts from the roundtables – including the most recent one held January 31, 2007. Thanks to the magic of Victoria’s Transcription Services, the full transcripts of the roundtables are online. No, I haven’t read all of these either. I did skim through a bit of it to get a taste of the discussions – and there is some great stuff here. Lots of people who really care about the issues carefully and respectfully exploring the nitty-gritty details to try and reach good compromises. This is definitely on my ‘bookmark to read later’ list.

Karen Coyle has a nice post over on Coyle’s InFormation that includes all sorts of excerpts from the transcripts. It gives you a good flavor of what some of these conversations are like – so many people in the same room with such different frames of reference.

This is not easy stuff. There is no simple answer. It will be interesting to see what shape the next version of Section 108 takes with so many people with very different priorities pulling in so many directions.

The good news is that there are people with the patience and dedication to carefully gather feedback, hold roundtables and create recommendations. Hurrah for the hard working members of the Section 108 Study Group – all 19 of them!

Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of sets of numbers and look something like 192.39.228.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.
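Google has not published the exact mechanics of the redaction, but the commonly reported interpretation – dropping the final octet so that a logged address maps to a block of 256 possible computers rather than one – is easy to sketch. The function below is purely illustrative and is not Google’s code.

```python
import ipaddress

def redact_ip(logged_ip: str) -> str:
    """Drop the last octet of an IPv4 address by zeroing it.

    The redacted value identifies only a /24 block of 256 possible
    addresses rather than a single computer.
    """
    network = ipaddress.IPv4Network(f"{logged_ip}/24", strict=False)
    return str(network.network_address)

# Example: redact_ip("192.39.228.57") returns "192.39.228.0"
```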

I am not going to talk about how good an idea this is … or if it doesn’t go far enough (plenty of others are doing that, see articles at EFF and Wired: 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the tenets of archival practice is to never do anything to a record that you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer). Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean)?

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gazillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, who would not have gotten into trouble for shredding their records if they had done so according to their own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

The Archives and Archivists Listserv: hoping for a stay of execution

There has been a lot of discussion (both on the Archives & Archivists (A&A) Listserv and in blog posts) about the SAA‘s recent decision to not preserve the A&A listserv posts from 1996 through 2006 when they are removed from the listserv’s old hosting location at Miami University of Ohio.

Most of the outcry against this decision has fallen into two camps:

  • Those who don’t understand how the SAA task force assigned to appraise the listserv archives could decide it does not have informational value – lots of discussion about how the listserv reflects the move of archivists into the digital age as well as its usefulness for students
  • Those who just wish it wouldn’t go away because they still use it to find old posts. Some mentioned that there are scholarly papers that reference posts in the listserv archives as their primary sources.

I added this suggestion on the listserv:

I would have thought that the Archives Listserv would be the ideal test case for developing a set of best practices for archiving an organization’s web based listserv or bboard.

Perhaps a graduate student looking for something to work on as an independent project could take this on? Even if they only got permission for working with posts from 2001 onward [post 2001 those who posted had to agree to ‘terms of participation’ that reduce issues with copyright and ownership] – I suspect it would still be worthwhile.

I have always found that you can’t understand all the issues related to a technical project (like the preservation of a listserv) until you have a real life case to work on. Even if SAA doesn’t think we need to keep the data forever – here is the perfect set of data for archivists to experiment with. Any final set of best practices would be meant for archivists to use in the future – and would be all the easier to comprehend if they dealt with a listserv that many of them are already familiar with.

Another question: couldn’t the listserv posts still be considered ‘active records’? Many current listserv posters claim they still access the old list’s archives on a regular basis. I would be curious what the traffic for the site is. That is one nice side effect of this being on a website – it makes the usage of records quantifiable.

There are similar issues in the analog world when records people still want to use lose their physical home and are disposed of but, as others have also pointed out, digital media is getting cheaper and smaller by the day. We are not talking about paying rent on a huge warehouse or a space that needs serious temperature and humidity control.

I was glad to see Rick Prelinger’s response on the current listserv that simply reads:

The Internet Archive is looking into this issue.

I had already checked when I posted my response to the listserv yesterday – having found my way to the A&A old listserv page in the Wayback Machine. For now all that is there is the list of links to each week’s worth of postings – nothing beyond that has been pulled in.

I have my fingers crossed that enough of the right people have become aware of the situation to pull the listserv back from the brink of the digital abyss.

NARA’s Electronic Records Archives in West Virginia

“WVU, NATIONAL ARCHIVES PARTNER” from http://wvutoday.wvu.edu/news/page/5419/

In a press release dated February 28, 2007, the National Archives and Records Administration of the United States (NARA) and West Virginia University (WVU) declared they had signed “a Memorandum of Understanding to establish a 10-year research and educational partnership in the study of electronic records and the promotion of civic awareness of the use of electronic records as educational resources.” It goes on to say that the two organizations “will engage in collaborative research and associated educational activities” including “research in the preservation and long-term access to complex electronic records and engineering design documentation.” WVU will receive “test collections” of electronic records from NARA to support their research and educational activities.

This sounded interesting. I stumbled across this on NARA’s website while looking for something else. No blog chatter or discussions about what this means for electronic records research (thinking of course of the big Footnote.com announcement and all the back and forth discussion that it inspired). So I went hunting to see if I could find the actual Memorandum of Understanding. No sign of it. I did find WVU’s press release which included the photo above. This next quote is in the press release as well:

The new partnership complements NARA’s establishment of the Electronic Records Archives Program operations at the U.S. Navy’s Allegany Ballistics Laboratory in Rocket Center near Keyser in Mineral County.

Googling Allegany Ballistics Laboratory got me information about how it is a Superfund site that is in the late or final stages of cleanup. It also led me to an article from Senator Byrd about how pleased he was in October of 2006 about a federal spending bill that included funds for projects at ABL – including a sentence mentioning how NARA “will use the Mineral County complex for its electronic records archive program.” No mention of this on the Electronic Records Archive (ERA) website or on their special press release page. I don’t see any info about any NARA installations in West Virginia on their Locations webpage.

Then I found the WVU newspaper The Daily Athenaeum and an article titled “National Archives, WVU join forces” dated March 1, 2007. (If the link gives you trouble – just search on the Athenaeum site for NARA and it should come right up.) The following quote is from the article:

”This is a tremendous opportunity for WVU. The National Archives has no other agreements like this with anyone else,” said John Weete, WVU’s vice president for research and economic development.

The University will help the NARA develop the next generation of technologies for the Electronic Records Archives. WVU will also assist in the management of NARA’s tremendous amount of data, Weete said.

”This is a great opportunity for students. The Archives will look for students who are masters at handling records and who care about the documents (for future job opportunities), ” said WVU President David Hardesty.

WVU students and faculty will hopefully soon have access to the Rocket Center archives, and faculty will be overseeing the maintenance of such records, Hardesty said.

Perhaps I am reading more into this than was intended, but I am confused. I was unable to find any information on the WVU website about an MLS or Archival Studies program there. I checked in both the ALA’s LIS directory and the SAA’s Directory of Archival Education to confirm there are no MLS or Archives degree programs in West Virginia. So where are the “students who are masters at handling records” going to come from? I work daily in the world of software development and I can imagine Computer Scientists who are interested in electronic records and their preservation. But as I have discovered many times over during my archives coursework there are a lot of important and unique ideas to learn in order to understand everything that is needed for the archival preservation of electronic records for “the life of the republic” (as NARA’s ERA project is so fond of saying).

I am pleased for WVU to have made such a landmark agreement with NARA to study and further research into the preservation and educational use of electronic records. Unfortunately I am also suspicious of this barely mentioned bit about the Rocket Center archives and ABL and how WVU is going to help NARA manage their data.

Has anyone else heard more about this?

Update (03/07/07):

Thanks to Donna in the comments for suggesting that WVU’s program is in ‘Public History’ (a term I had not thought to look under). This is definitely more reassuring.

WVU appears to offer both a Certificate in Cultural Resource Management and a M.A. in Public History – both described here on the Cultural Resource Management and Public History Requirements page.

The page listing History Department graduate courses included the two ‘public history’ courses listed below:

412 Introduction to Public History. 3 hr. Introduction to a wide range of career possibilities for historians in areas such as archives, historical societies, editing projects, museums, business, libraries, and historic preservation. Lectures, guest speakers, field trips, individual projects.

614 Internship in Public History. 6 hr. PR: HIST 212 and two intermediate public history courses. A professional internship at an agency involved in a relevant area of public history. Supervision will be exercised by both the Department of History and the host agency. Research report of finished professional project required.

Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to address issues related to the freedom of information and struggling to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They have many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.

I spent most of the 1990s designing and building custom databases, many for federal government agencies. There are an almost inconceivable number of person hours that go into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in it – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part of me had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder, if I set up a wiki for people to populate with this information, whether people would share what they already know.

I would like to imagine a future world in which all this stuff is online and you can login and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of decrypting the meaning of the records.

Archivists and Journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

Should we be archiving fonts?

I am a fan of beautiful fonts. This is why I find myself on the mailing list of MyFonts.com. I recently received their Winter 2007 newsletter featuring the short article titled ‘A cast-iron investment’. It starts out with:

Of all the wonderful things about fonts, there’s one that is rarely mentioned by us font sellers. It’s this: fonts last for a very long time. Unlike almost all the other software you may have bought 10 or 15 years ago, any fonts you bought are likely still working well, waiting to be called back into action when you load up that old newsletter or greetings card you made!

Interesting. The article goes on to point out:

But, of course, foundries make updates to their fonts every now and then, with both bug fixes and major upgrades in features and language coverage.

All this leaves me wondering if there is a place in the world for a digital font archive. A single source of digital font files for use by archives around the world. Of course, there would be a number of hurdles:

  1. How do you make sure that the fonts are only available for use in documents that used the fonts legally?
  2. How do you make sure that the right version of the font is used in the document to show us how the document appeared originally?

You could say this is made moot by using something like Adobe’s PDF/A format. It is also likely that we won’t be running the original word processing program that used the fonts a hundred years from now.

Hurdles aside, somehow it feels like a clever thing to do. We can’t know how we might enable access to documents that use fonts in the future. What we can do is keep the font files so we have the option to do clever things with them in the future.
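If such an archive were ever built, the second hurdle above (matching a document to the right version of a font) suggests that each deposited font would need a descriptive record keyed by something stronger than its name – for example, a checksum of the font file itself. Here is a hypothetical sketch of such a record; all of the field names are my own invention, not any existing standard.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path

@dataclass
class FontRecord:
    """One deposited font file in a hypothetical digital font archive."""
    family: str         # e.g. "Garamond Premier Pro"
    version: str        # version string released by the foundry
    foundry: str        # who created and released the font
    license_terms: str  # addresses hurdle #1: who may use the archived copy
    sha256: str         # lets a document cite this exact file (hurdle #2)

def register_font(path: Path, family: str, version: str,
                  foundry: str, license_terms: str) -> FontRecord:
    """Build the archival record for a deposited font file."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return FontRecord(family, version, foundry, license_terms, digest)
```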

I would even make a case for the fact that fonts are precious in their own right and deserve to be preserved. My mother spent many years as a graphic designer. From her I inherited a number of type specimen books – including one labeled “Adcraft Typographers, Inc”. Google led me to two archival collections that include font samples from Adcraft.

Another great reason for a digital font archive is the surge in individual foundries creating new fonts every day. What once was an elite craft now has such a low point of entry that anyone can download some software and hang out their shingle as a font foundry. Take a look around MyFonts.com. Read about selling your fonts on MyFonts.com.

While looking for a good page about type foundries I discovered the site for Precision Type which shows this on their only remaining page:

For the last 12 years, Precision Type has sought to provide our customers with convenient access to a large and diverse range of font software products. Our business grew as a result of the immense impact that digital technology had in the field of type design. At no other time in history had type ever been available from so many different sources. Precision Type was truly proud to play a part in this exciting evolution.

Unfortunately however, sales of font software for Precision Type and many others companies in the font business have been adversely affected in recent years by a growing supply of free font software via the Internet. As a result, we have decided to discontinue our Precision Type business so that we can focus on other business opportunities.

I have to go back to May 23, 2004 in the Internet Archive Wayback Machine to see what Precision Type’s site used to look like.

There are more fonts than ever before. Amateurs are driving professionals out of business. Definitely sounds like digital fonts and their history are a worthy target for archival preservation.