Category: future-proofing

New Skills for a Digital Era: Official Proceedings Now Available

From May 31st through June 2nd of 2006, The National Archives, the Arizona State Library and Archives, and the Society of American Archivists hosted a colloquium to consider the question “What are the practical, technical skills that all library and records professionals must have to work with e-books, electronic records, and other digital materials?” The website for the New Skills for a Digital Era colloquium already includes links to the eleven case studies considered over the course of the three days of discussion as well as a list of additional suggested readings. As mentioned over on The Ten Thousand Year Blog, the pre-print of the proceedings has been available since August 2007.

As announced in SAA’s online newsletter, the Official Proceedings of the New Skills for a Digital Era Colloquium, edited by Richard Pearce-Moses and Susan E. Davis, is now available for free download. Published under Creative Commons Attribution, this document is 143 pages long and includes all the original case studies. I have a lot of reading to do!

The meat of the proceedings consists of a 32 page ‘Knowledge and Skills Inventory’ and a page and a half of reflections – both co-authored by Richard Pearce-Moses and Susan E. Davis. The Keynote Address by Margaret Hedstrom titled ‘Are We Ready for New Skills Yet?’ is also included.

I am very pleased with how much access has been provided to these materials. These topics are clearly of interest to many beyond the 60 individuals who were able to take part in the original gathering. As an archival studies student it has often been a great source of frustration that so few of the archives related conferences publish proceedings of any kind. It is part of what has driven me to attempt to assemble exhaustive session summaries for those sessions I have personally attended at the past two SAA Annual meetings (see SAA2006 and SAA2007). I think that the Unofficial Conference Wiki for SAA2007 was also a big step in the right direction and I hope it will continue to evolve and improve for the upcoming SAA2008 annual meeting in San Francisco.

The course I elected to take this term is dedicated to studying Communities of Practice. This announcement about the New Skills for a Digital Era’s proceedings has me thinking about the community of practice that seems to currently be taking form across the library, archives and records management communities. I will share more thoughts on this as I sort through them myself.

Finally, a question for anyone reading this post who attended the colloquium: Are you still discussing the case studies with others from that session two years ago? If not, do you wish you were?

Image Credit: The image at the top of this post is from the New Skills for a Digital Era website.

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
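To make the modular idea a little more concrete, here is a minimal sketch in Python. It is only an illustration of the design described above, not Dioscuri's actual code or API: the class names (Module, Memory, Cpu, Emulator) and the memory sizes are my own assumptions.

```python
from abc import ABC, abstractmethod


class Module(ABC):
    """A hypothetical emulated hardware component (not a Dioscuri class)."""

    name: str

    @abstractmethod
    def reset(self) -> None:
        """Return the component to its power-on state."""


class Memory(Module):
    name = "memory"

    def __init__(self, size: int) -> None:
        self.size = size
        self.cells = bytearray(size)

    def reset(self) -> None:
        self.cells = bytearray(self.size)


class Cpu(Module):
    name = "cpu"

    def __init__(self) -> None:
        self.program_counter = 0

    def reset(self) -> None:
        self.program_counter = 0


class Emulator:
    """Assembles a machine from whichever modules a configuration supplies."""

    def __init__(self, modules: list[Module]) -> None:
        self.modules = {module.name: module for module in modules}

    def reset(self) -> None:
        for module in self.modules.values():
            module.reset()


# Two different "machines" built from the same module classes.
small_pc = Emulator([Cpu(), Memory(size=640 * 1024)])
large_pc = Emulator([Cpu(), Memory(size=16 * 1024 * 1024)])
```

The point of the sketch is that a new hardware configuration is a new list of modules rather than a new program.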

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation I found the following simple answer to the question ‘Why not just migrate?’:

  • Loss of information — e.g. word edits

  • Loss of fidelity — e.g. WordPerfect to Word isn’t very good

  • Loss of authenticity — users of migrated document need access to original to verify authenticity

  • Not always possible — closed proprietary formats

  • Not always feasible — costs may be too high

  • Emulation may be necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well thought out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.
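As a rough illustration of why a small virtual processor helps, here is a toy stack machine in Python. It is not based on Olonys itself (I only have the slides to go on), and the three-instruction set is entirely made up. The idea it shows is that programs are written once against the virtual instruction set, and only the short run function would need to be ported to a new platform.

```python
def run(program: list[tuple]) -> list[int]:
    """Interpret a tiny, hypothetical stack-machine program; return the final stack."""
    stack: list[int] = []
    for instruction in program:
        opcode, *operands = instruction
        if opcode == "push":
            stack.append(operands[0])
        elif opcode == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack


# (2 + 3) * 4, expressed only in terms of the virtual instruction set.
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
result = run(program)
```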

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records and I kept finding that the records of the research provided didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I am to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.

Will Crashed Hard Drives Ever Equal Unlabeled Cardboard Boxes?

Photo of Crashed Hard Drive - wonderferret on Flickr

How many of us have an old hard drive hanging around? I am talking about the one you were told was unfixable. The one that has 3 bad sectors. The one they replaced and handed to you in one of those distinctive anti-static bags. You know the ones I mean – the steely grey translucent plastic ones that look like they should contain space food.

I have more than one ‘dead’ hard drive. I can’t quite bring myself to throw them out – but I have no immediate plans to try and reclaim their files.

I know that there are services and techniques for pulling data off otherwise inaccessible hard drives. You hear about it in court cases and see it on TV shows. A quick Google search on hard drive rescue turns up businesses like Disk Data Recovery.

Do archivists already make it a policy to hunt not just for computers, but for discarded and broken hard drives lurking in filing cabinets and desk drawers? Compare this to a carton of documents that needs special treatment before the records it contains can be accessed, and yet is appraised as valuable. If the treatment required were within budgetary and time constraints – it would be performed. Mold, bugs, rusty staples, photos that are stuck together… archivists generally know where to get the answers they need to tackle these sorts of problems. I suspect that a hard drive advertised or discovered to be broken would be treated more like an empty box than a moldy box.

For now I would rank this challenge near the bottom of the list, below archiving digital records that we can still access easily but that depend on old hardware or software. Still, I can imagine a time when standard hard drive rescue techniques will need to be a tool for the average archivist.

Preserving Virtual Worlds – TinyMUD to SecondLife

A recent press release from the Library of Congress, Digital Preservation Program Makes Awards to Preserve American Creative Works, describes the newly funded project aimed at the preservation of ‘virtual worlds’:

The Preserving Virtual Worlds project will explore methods for preserving digital games and interactive fiction. Major activities will include developing basic standards for metadata and content representation and conducting a series of archiving case studies for early video games, electronic literature and Second Life, an interactive multiplayer game. Second Life content participants include Life to the Second Power, Democracy Island and the International Spaceflight Museum. Partners: University of Maryland, Stanford University, Rochester Institute of Technology and Linden Lab.

This has gotten a fair amount of coverage from the gaming and humanities sides of the world, but I learned about it via Professor Matthew Kirschenbaum’s blog post Just Funded: Preserving Virtual Worlds.

The How They Got Game 2 post Library of Congress announces grants for preservation of digital games gives a more in depth summary of the Preserving Virtual Worlds project goals:

The main goal of the project is to help develop generalizable mechanisms and methods for preserving digital games and interactive fiction, and to begin to test these mechanism through the archiving of selected test cases. Key deliverables include the development of metadata schema and wrapper recommendations, and the long-term curation of archived cases.

I take this all a bit more personally than most might. I was a frequent denizen of an online virtual world known as TinyMUD (now usually referred to as TinyMUD Classic). TinyMUD was a text based, online, multi-player game that existed for seven months beginning in August of 1989. In practice it was sort of a cross between a chat room and a text based adventure. The players could build new parts of the MUD as they went – in many ways it was an early example of crowdsourcing. There was a passionate core of players who were constantly building new areas for others to explore and experience – not unlike what is currently the case in SecondLife. These types of text based games still exist – see MudMagic for listings.

Apparently August 20, 2007 will be TinyMUD’s 18th Annual Brigadoon Day. It will be celebrated by putting TinyMUD Classic online for access. The page includes careful notes about finding and using a MUD Client to access TinyMUD. The existence of an ongoing MUD community of users has kept software like this alive and available almost 20 years later.

With projects like Preserving Virtual Worlds getting grants and gaining momentum it seems more plausible with each passing day that 18 years from now, parts of 2007’s SecondLife will still be available for people to experience. I am thankful to know that a copy of the TinyMUD world I helped build is still out there. I am even more thankful to know that the technology still exists to permit users to access it even if it is only once a year.

Update: The 20th Anniversary of TinyMUD Brigadoon Day is set for Thursday, August 20, 2009.

Phoenix DVD destined for Mars

Hubble's Sharpest View Of Mars

When the Phoenix Mars Mission launches (possibly as early as this Friday, August 3rd, 2007), it will have something unusual on board. The Planetary Society has created what they call the Phoenix DVD.

In late May of 2007 they proudly announced that their special DVD was ready for launch:

… the silica glass mini-DVD with a quarter million names on it (including all Planetary Society members) has been installed on the Phoenix spacecraft and is ready to go to Mars!

In addition to the names, the disc also contains Visions of Mars, a collection of literature and art about the Red Planet. The names and Visions of Mars were written to the silica mini-DVD by the company Plasmon OMS using a special technique. The resulting archival disk should last at least hundreds of years on the Martian surface, ready to be picked up by future explorers.

After the disc was written, a special label was applied to the disc to identify it for future explorers.

The page about Visions of Mars describes it as follows:

Visions of Mars is a message from our world to future human inhabitants of Mars. It will launch on its way to the Red Planet in the summer of 2007 aboard the spacecraft Phoenix. Along with personal messages from leading space visionaries of our time, Visions of Mars includes a priceless collection of Mars literature and art, and a list of hundreds of thousands of names of space enthusiasts from around the world. The entire collection will be encoded on a mini-DVD provided by The Planetary Society, which will be affixed to the spacecraft.

All this has been inscribed on a silica mini-DVD – and has the phrase “Attention Astronauts: Take This With You” in bright red letters on the front. I hate to be cynical (and those of you who read this blog know that it is not my nature to be so) but where will those ‘future human inhabitants of Mars’ find a DVD player to watch this DVD? I know I am not the first to doubt their plan – but I couldn’t resist. Given my suspicion of the whole affair I thought I would at least look into the company that created this very special disk.

Plasmon has an extensive website with all sorts of interesting tidbits. They explain their trademarked Ultra Density Optical (UDO) technology. They feature two PDFs – one called Archiving Defined and another labeled Plasmon Archive Solution. It looks very interesting. My VERY oversimplified summary is that they have combined a RAID approach with a very durable and secure WORM (Write Once, Read Many) flavor of DVD and packaged it into a solution for companies who need to ensure their data remains safe.

I have been meaning to learn more about the latest and greatest in hardware and material solutions aimed at digital preservation in the corporate world – and Boing Boing’s post Mars Library of books, DVDs, and database is now ready for launch just gave me a great excuse to start to scratch the surface.

I have also been following the blog StorageSwitched! for a while. It is written by the CEO of StorageSwitch LLC (“a technology provider for the fixed content data storage market with a multitude of gateway and utility products and services”). I have found it interesting to take a look at the business and technology side of preserving information. I plan more posts in this vein as I learn more about what is out there and how it is being used.

Photo Credit: David Crisp and the WFPC2 Science Team (Jet Propulsion Laboratory/California Institute of Technology)

Thoughts on Digital Preservation, Validation and Community

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.


Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s, it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65 page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
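One way to picture "regression testing" for stored data is a fixity check: record a checksum for every file when it enters the archive, then re-run the comparison on a schedule so corruption is caught before the point of use. The sketch below is my own illustration in Python and is not taken from any of the projects discussed here; the function names and the temporary-directory demo are assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose content changed."""
    problems = []
    for relative, expected in manifest.items():
        target = root / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return problems


# Demo: a stand-in "archive" in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp)
    (archive / "report.txt").write_text("original content")
    manifest = build_manifest(archive)
    clean_audit = audit(archive, manifest)      # nothing has changed yet
    (archive / "report.txt").write_text("silently corrupted")
    damaged_audit = audit(archive, manifest)    # the change is detected
```

A check like this only tells you that the bits changed. It says nothing about whether a future user could still open or understand the file, which is the harder half of the problem.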

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “..replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
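The general "replicate, audit, repair" philosophy can be sketched in a few lines of Python. To be clear, this is a deliberately naive majority vote of my own, not the rate-limited sampled voting protocol the paper describes, and it ignores the attacker model entirely. The peer names and contents are made up.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Overwrite replicas that disagree with a strict majority; return the repaired peers."""
    votes = Counter(digest(content) for content in replicas.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        raise ValueError("no strict majority; cannot decide which copy is good")
    good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
    repaired = []
    for peer, content in replicas.items():
        if digest(content) != winning_digest:
            replicas[peer] = good_copy  # repair from an agreeing peer
            repaired.append(peer)
    return repaired


peers = {
    "peer_a": b"article text",
    "peer_b": b"article text",
    "peer_c": b"article t3xt",  # simulated damage to one replica
}
repaired_peers = audit_and_repair(peers)
```

Even this toy version shows why "lots of copies" matters: with only two replicas that disagree, there is no way to tell which one is damaged.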

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three are written virtually simultaneously they are all equal in authority. I think it is very interesting that they rely on writing to tapes. There was a mention that it is cheaper – yet due to many issues they might still switch to hard drives.
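The "all three writes must complete" rule is easy to picture in code. The following Python sketch is my own and has nothing to do with how DAITSS is actually implemented: local directories stand in for the tape units, and the function name and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_all_or_fail(data: bytes, destinations: list[Path]) -> str:
    """Write data to every destination and verify each copy before returning.

    Any failed write or checksum mismatch raises, so the caller cannot
    continue processing until all copies exist and agree.
    """
    expected = hashlib.sha256(data).hexdigest()
    for destination in destinations:
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(data)
    for destination in destinations:
        actual = hashlib.sha256(destination.read_bytes()).hexdigest()
        if actual != expected:
            raise IOError(f"copy at {destination} failed verification")
    return expected


# Demo: three directories standing in for two local tape units and one remote.
with tempfile.TemporaryDirectory() as tmp:
    copies = [Path(tmp) / name / "aip-file.bin" for name in ("local_1", "local_2", "remote")]
    checksum = write_all_or_fail(b"archival payload", copies)
    all_written = all(copy.read_bytes() == b"archival payload" for copy in copies)
```

Because no copy is written "from" another, none of them is the master, which matches the ‘multiple masters’ idea described above.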

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born’”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have thought not only about the technicalities of how users may access the objects in the future, but consideration of users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

Considering Historians, Archivists and Born Digital Records

I think I renamed this post at least 12 times. My original intention was to consider the impact of born digital records on the skills needed for the historians and researchers of the future. In addition, I found myself exploring the dividing lines among a number of possible roles in ensuring access to the information written in the 1s and 0s of our born digital records.

After my last post about the impact of anonymization of Google Logs, a friend directed me to the work of Dr. Latanya Sweeney. Reading through the information about her research I found Trail Re-identification: Learning Who You are From Where You Have Been. Given enough data to work with, algorithms can be written that often can re-identify the individuals who performed the original searches. Carnegie Mellon University’s Data Privacy Lab includes the Trails Learning Project with the goal of answering the question “How can people be identified to the trail of seemingly innocent and anonymous data they leave behind at different locations?”. So it seems that there may be a lot of born digital records that start out anonymous but that may permit ‘re-identification’ – given the application of the right tools or techniques. That is fine – historians have often needed to become detectives. They have spent years developing techniques for the analysis of paper documents to support ‘re-identification’. Who wrote this letter? Is this document real or a forgery? Who is the ‘Mildred’ referenced in this record?

The field of diplomatics studies the authenticity and provenance of documents by looking at everything from the paper they were written on to the style of writing to the ink used. I like the idea of using the term ‘digital diplomatics’ for the ever increasing process of verifying and validating born digital records. Google found me the Digital Diplomatics conference that took place earlier this year in Munich. Unfortunately it was more geared toward investigating how the use of computers can enhance traditional diplomatic approaches rather than how to authenticate the provenance of born digital records.

In the March 2007 issue of Scientific American I found the article A Digital Life. It talks primarily about the Microsoft Research project MyLifeBits. A team at Microsoft Research has spent the last six years creating what they call a ‘digital personal archive’ of team member Gordon Bell. This archive hopes to “record all of Bell’s communications with other people and machines, as well as the images he sees, the sounds he hears and the Web sites he visits–storing everything in a personal digital archive that is both searchable and secure.”

They are not blind to the long term challenges of preserving the data itself in some accessible format:

Digital archivists will have to constantly convert their files to the latest formats, and in some cases they may need to run emulators of older machines to retrieve the data. A small industry will probably emerge just to keep people from losing information because of format evolution.

The article concludes:

Digital memories will yield benefits in a wide spectrum of areas, providing treasure troves of information about how people think and feel. By constantly monitoring the health of their patients, future doctors may develop better treatments for heart disease, cancer and other illnesses. Scientists will be able to get a glimpse into the thought processes of their predecessors, and future historians will be able to examine the past in unprecedented detail. The opportunities are restricted only by our ability to imagine them.

Historians will have at least these two types of digital artifacts to explore – those gathered purposefully (such as the digital personal archives described above) and those generated as a byproduct of other activity (such as the Google search logs). Might these be the future parallels to the ‘manuscript’ and ‘corporate’ archives of today?

So we have both the ideas of the Digital Archivist and the Digital Historian. What about a Digital Archaeologist? I am not the first to ponder the possible future job of Digital Archaeologist. A bit of googling of the term led me to Dark Star Gazette and Dear Digital Archaeologist. Back in February of 2007 they pondered:

Will there be digital archaeologists, people who sift through our society’s discarded files and broken web links, carefully brushing away revisions and piecing together antiquated file formats? Will a team of grad students working on their PhDs a thousand, or two thousand, years from now be digging through old blog entries, still archived online in some remote descendant of the Wayback Machine or a copy of Google’s backup tapes?

I can only imagine a world in which this is in fact the case. Given that premise, at what point does the historian get too far from the primary source? If the historian does not understand exactly what a computer program does to extract the information they want from logs or ‘digital memory repositories’ – are they no longer working with the primary source?

Imagine any field in which historians do research. Music? Accounting? Science? In order to examine and interpret primary source records a historian becomes something of an expert in that field. Consider the historian documenting the life of a famous scientist based partly on their lab notebooks. That historian would be best served by being taught how to interpret the notebooks themselves. The historian must be fluent in the language of the record in order to gain the most direct access to the information.

Ah – but if there really are Digital Archaeologists in the far future, perhaps they would be the connection between the primary source born digital records and the historians who wish to study them. Or perhaps the Digital Archivist, in a new take on ‘arranging records’, would transform digital chaos into meaningful records for use by researchers? The field of expertise on the historian’s part would need only be in the content of the records – not exactly how they were rescued from the digital abyss.

Would a Digital Historian be someone who only considers the history of the digital landscape or a historian especially well versed in the interpretation of digital records? In Daniel Cohen and Roy Rosenzweig’s book Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web they seem to use the term in the present tense to refer to historians who use computers and technology to support and expand the reach of their research. Yet, in his essay Scarcity or Abundance? Preserving the Past in a Digital Era, Roy Rosenzweig proposes:

Future graduate programs will probably have to teach such social-scientific and quantitative methods as well as such other skills as “digital archaeology” (the ability to “read” arcane computer formats), “digital diplomatics” (the modern version of the old science of authenticating documents), and data mining (the ability to find the historical needle in the digital hay). In the coming years, “contemporary historians” may need more specialized research and “language” skills than medievalists do.

What is my imagined skill set for the historian of our digital world? A willingness to dig into the rich and chaotic world of born digital records. The ability to use tools and find partners to assist in the interpretation of those records. Equal comfort working at tables covered in dusty boxes and in the virtual domain of glowing computer terminals. And of course – the same curiosity and sense of adventure that has always drawn people to the path of being a historian.

We cannot predict the future – we can only do our best to adapt to what we see before us. I suspect the prefixing of every job title with the word ‘digital’ will disappear over time – much as the prefixing of everything with the letter ‘e’ to let you know that something was electronic or online has ebbed out of popular culture. As the historians and archivists of today evolve into the historians and archivists of tomorrow they will have to deal with born digital records – no matter what job title we give them.