born digital records | Spellbound Blog

Capa’s Found Images and Thoughts on Digital Photographers’ Sketchbooks

February 1, 2008 2 Comments

In the Washington Post article There Are No Black-and-White Answers in War — Then Lost Negatives Turn Up (February 1, 2008), we learn that three cardboard boxes of negatives were recently delivered to the International Center of Photography (ICP) – possibly including as many as 4,000 images. This collection of black-and-white film, consisting predominately of photos shot by Robert Capa during the Spanish Civil War, was long thought lost during World War II and will join the already existing Robert Capa Archives. The boxes also contain negatives from two other famed photographers associated with Capa, Gerda Taro and David Seymour (known by the pseudonym Chim – pronounced shim).

There are many reasons these boxes are exciting for historians and Capa researchers. They hold the promise of answering some long-standing questions. Were certain famous photos staged? Are the current credits given for various photos correct? But what caught my eye in this article was the following quote from ICP curator Brian Wallis:

“Capa was really adept at creating a whole story in one day: Here are the characters, here is the beginning, the action shots, the end, and the effect on civilians. If you look at his work not as great individual shots, but as stories, you get a completely different picture of him, and I think a more accurate and valuable picture.

“These negatives will further amplify that story, not just a few stories but dozens of stories that went out. It is like a sketchbook — he was trying out various ideas, and some worked and some didn’t.”

What About Digital Photographers’ Sketchbooks?

If you have ever used a digital camera, you have almost certainly enjoyed the instant gratification of being able to preview your photos on the tiny screen. The next temptation is to click the delete button. Sometimes you delete because the photo is clearly not what you were after – other times you delete to make room for some much more crucial photo.

I don’t know what the standard best practices are for professional photographers. Part of me hopes that they keep everything – at least until they can view the images on a big screen. But there is clearly a much easier path to deleting the ideas that didn’t work out. It leaves me wondering what the scholars of the future will be missing by not being able to see the failed experiments. The ‘sketchbooks’ of digital photographers could easily be perceived as at-risk records. That said, many creative individuals (artists, architects, photographers, etc.) do not care to share their failed experiments with the outside world. One of the issues facing those preserving digital records of the design community is the strong desire of designers to not share their work in progress and only share the final product (see my post SAA2007: Preserving Born Digital Records of the Design Community (Session 106) for more thoughts on this).

Image Overload

Of course there will be the photographers who do keep everything. Hard drive space is getting cheaper with every passing day. Perhaps my fears are misplaced and instead we should be worrying more about the flood of photographs that will overwhelm archivists and researchers. The time needed to discover the ‘good’ and ‘important’ photographs in a collection of thousands of images could be extreme.

I shoot all my photos digitally now. I no longer live in a world where there are only 36 shots on a single roll – I don’t need to choose each photo carefully. I cheerfully tell my friends “Photos are free!”. Even that doesn’t stop me from deleting the ones that I really dislike. Sometimes the 2 GB card in my camera gets full before an event is done, so on-the-spot weeding of photos occurs as well. But when I compare the number of ‘good’ photos that I have uploaded to share online (currently 5,000+ and counting) with the number of photos I have on my hard drive (20,000+) it is clear to me that I am keeping plenty of ‘sketch photos’. It is also interesting to note that I will often realize that there are photos I really like now that I didn’t appreciate immediately after they were taken. While something at the time made me NOT include it as a photo to share, now I see something in the image that catches my eye in a new way. The more this happens, the less I delete as I download, organize and tag my photos.

Metadata and the Exchangeable Image File Format (EXIF)

Of course the situation with digital photographs is not all bad. When digital cameras record a photo, they also record a set of metadata in the Exchangeable Image File Format (EXIF). The metadata recorded usually includes camera make and model, date, time, and camera settings. Some cameras can even record GPS-generated location information. Because there is no way to know the time zone (at least without location information), the value of the time setting is more useful for relating photos from within a set to one another than in establishing the actual time a photo was taken.
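To make the point about relative time concrete, here is a minimal sketch that takes EXIF-style date/time strings (the "YYYY:MM:DD HH:MM:SS" form, which carries no time zone) and reports how many seconds after the first photo each one was taken. The file names and timestamps are made up for illustration, and I am assuming the strings have already been read out of the files with whatever tool you prefer.

```python
from datetime import datetime

# EXIF date/time values are plain strings with no time zone attached.
EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def parse_exif_datetime(value):
    """Turn an EXIF-style date/time string into a naive datetime."""
    return datetime.strptime(value, EXIF_FORMAT)


def elapsed_within_set(timestamps):
    """Return (name, seconds since the earliest photo), sorted by capture time."""
    parsed = {name: parse_exif_datetime(value) for name, value in timestamps.items()}
    first = min(parsed.values())
    pairs = [(name, (taken - first).total_seconds()) for name, taken in parsed.items()]
    return sorted(pairs, key=lambda pair: pair[1])


# Hypothetical sample data, not taken from any real collection.
sample = {
    "IMG_0003.JPG": "2008:01:30 14:05:10",
    "IMG_0001.JPG": "2008:01:30 14:02:00",
    "IMG_0002.JPG": "2008:01:30 14:02:45",
}
relative = elapsed_within_set(sample)
```

The absolute clock value may be off by hours if the camera was never set to local time, but the gaps between shots survive intact, which is what matters for rebuilding a sequence.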

Adobe has contributed their own proprietary metadata format called Extensible Metadata Platform (XMP).

The most common metadata tags recorded in XMP data are those from the Dublin Core Metadata Initiative, which include things like title, description, creator, and so on. The standard is designed to be extensible, allowing users to add their own custom types of metadata into the XMP data. (Wikipedia Entry: Extensible Metadata Platform)
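For a sense of what those Dublin Core tags look like in practice, here is a small sketch that pulls the title and creator out of a simplified XMP packet using only Python's standard library. The sample packet is hand-written for illustration (real packets are usually wrapped in extra processing instructions and carry many more fields), and the helper name dublin_core_values is my own invention.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by RDF and Dublin Core inside XMP.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written, simplified sample packet (hypothetical values).
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title><rdf:Alt><rdf:li xml:lang="x-default">Harbor at dusk</rdf:li></rdf:Alt></dc:title>
      <dc:creator><rdf:Seq><rdf:li>Example Photographer</rdf:li></rdf:Seq></dc:creator>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def dublin_core_values(xmp_text, field):
    """Collect the text of every rdf:li entry under the given dc: field."""
    root = ET.fromstring(xmp_text)
    # Values sit inside a container element (rdf:Alt, rdf:Seq or rdf:Bag),
    # so match any single child between the field and its rdf:li items.
    return [li.text for li in root.findall(f".//dc:{field}/*/rdf:li", NS)]


title = dublin_core_values(SAMPLE_XMP, "title")
creator = dublin_core_values(SAMPLE_XMP, "creator")
```

A field that is absent from the packet simply comes back as an empty list, which is a reasonable signal that the photographer never filled it in.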

The magic of both XMP and EXIF is that the metadata is embedded in the file itself. There is no chance of losing the connection between a photo and the information about it – it is akin to writing on the back of an analog photograph. Embedded metadata provides the greatest tool for rediscovering the original order in which a series of photographs were taken, as well as providing access to metadata entered by the photographer at the image level.

The archivist of today accessioning born digital images must be comfortable with tools for viewing and updating embedded metadata. I mention updating because any information that is currently known about an image that could be added to the embedded metadata is more information that cannot later become accidentally separated from the images in question. This of course assumes that we will still have the proper technology in the future with which to access all this embedded metadata.

Embedded metadata can be updated before it reaches the controlled environment of an archive. Data found as embedded metadata must be evaluated in the same manner that any information about photographs would be evaluated. For example, it would be a lot easier to modify metadata on digital photos to make the images appear to have been taken in a different order than it would be to do the same change with a strip of analog negatives. If this in fact was done – the fact that it was would likely be as interesting to researchers as the original order (assuming of course that you could ever figure out that such a modification had been made!).
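One rough way to spot the kind of reordering described above is to compare the order implied by camera-assigned file names with the order implied by the embedded timestamps. The sketch below assumes the file names still carry the camera's sequence numbers and uses invented sample data; a mismatch is only a prompt for a closer look (a reset camera clock or a batch rename would trigger it too), not proof of tampering.

```python
from datetime import datetime

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"


def order_discrepancies(photos):
    """Return file names whose position by name differs from their position by time.

    photos is a list of (filename, exif_datetime_string) pairs.
    """
    by_name = [name for name, _ in sorted(photos, key=lambda p: p[0])]
    by_time = [
        name
        for name, _ in sorted(photos, key=lambda p: datetime.strptime(p[1], EXIF_FORMAT))
    ]
    return [name for position, name in enumerate(by_name) if by_time[position] != name]


# Hypothetical set in which the second and third frames appear swapped in time.
suspect_set = [
    ("IMG_0001.JPG", "2008:01:30 14:00:00"),
    ("IMG_0002.JPG", "2008:01:30 14:10:00"),
    ("IMG_0003.JPG", "2008:01:30 14:05:00"),
]
flagged = order_discrepancies(suspect_set)
```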

Not all methods of organizing photos result in embedded metadata, so there is plenty of room for the standard challenges of old software that you can’t get to run but that holds the key to information about a hard drive of thousands of images. Sophisticated photograph management tools often now include workflow features that could also provide insight into the decision making and processing steps taken by a photographer. Much of this type of information is very unlikely to be embedded in the photos themselves – but still would represent interesting digital records related to the everyday work that a professional photographer performs.

Final Thoughts

I do feel that something is being lost via the ease with which one may delete experimental ‘sketchbook’ photos, but I suspect that the lure of virtually infinite hard drive space, image organization/tagging software tools and the clues provided by embedded metadata will balance the scales. Those who study photographers and their work will certainly have more to say about this far in the future. There will be hard choices over the next decades – what can we do to guarantee access in that distant time to the full digital bodies of work of the Capas of today? I think the answers start with building strong lines of communication between prominent digital photographers and archivists. I know that this is just a special case of the challenges we see with digital records across professions, but each field adds its own special issues that must be sorted through and figured out one at a time. So, are there archivists out there working with professional digital photographers?

For more images related to this story, see the New York Times’ slideshow about Robert Capa’s Lost Negatives. [UPDATE: New images available in the slideshow Inside the Mexican suitcase, posted April 29, 2009]

Image credit: Photographer Robert Capa during the Spanish civil war, May 1937. Photo by Gerda Taro. If the logic on this Wikimedia Commons page is to be believed, this photo is in the public domain in the United States because the photographer died in 1937 (i.e., more than 70 years ago).

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

December 25, 2007 2 Comments

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
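As a toy illustration of why the modular approach matters (this is my own sketch in Python, not Dioscuri's actual Java design, and every class name here is hypothetical), an emulator that is assembled from interchangeable component objects can describe different machines without a separate program for each combination:

```python
class Module:
    """Hypothetical base class for one emulated hardware component."""

    kind = "generic"

    def describe(self):
        return self.kind


class Cpu(Module):
    kind = "cpu"


class Memory(Module):
    kind = "memory"

    def __init__(self, size_kb):
        self.size_kb = size_kb

    def describe(self):
        return f"memory ({self.size_kb} KB)"


class Emulator:
    """Assembles a machine from whichever modules it is handed."""

    def __init__(self, modules):
        # One module per kind; a later module of the same kind replaces an earlier one.
        self.modules = {module.kind: module for module in modules}

    def configuration(self):
        return sorted(module.describe() for module in self.modules.values())


# Two different hardware configurations built from the same parts list.
small_pc = Emulator([Cpu(), Memory(640)])
larger_pc = Emulator([Cpu(), Memory(16384)])
```

Swapping the memory module changes the machine being mimicked while the rest of the code stays untouched, which is the property the paragraph above describes.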

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation I found the following simple answer to the question ‘Why not just migrate?’:

Loss of information — e.g. word edits
Loss of fidelity — e.g. WordPerfect to Word isn’t very good
Loss of authenticity — users of migrated document need access to original to verify authenticity
Not always possible — closed proprietary formats
Not always feasible — costs may be too high
Emulation may necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well thought out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records and I kept finding that the records of the research provided didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.

The MemoryArchive Affiliate Program: A Wiki Engine for Collecting Memoirs

November 14, 2007 2 Comments

A Beautiful WWW posted A Review of MemoryArchive.org. MemoryArchive, founded by historian Marshall Poe, is a new MediaWiki based website aimed at collecting first person accounts that they term ‘memoirs’. In sharp contrast with the communal authorship approach of most wikis, MemoryArchive locks down edits of each entry after a format review.

What sorts of memoirs are they looking for? In their FAQ they say they want “pretty much anything you remember that someone else might conceivably find interesting, now or in 500 years”.

I spent some time exploring. I read a very moving memorial titled Death by Aids The Goodbye Party, 1992, by Jay Blotcher (ed note: Jay emailed me with the correct title for this memoir). I wandered through some 9/11 memories. Eventually something dawned on me. Maybe it is the fact that I am spending most of my days lately thinking deep thoughts about metadata and classification — or maybe my archives course work is to blame — whatever the reason, I realized that I wanted more information about the storytellers. Right now it appears that each memoir includes Who, What, When and Where data – to whatever degree the contributors choose to furnish such information. Categories are also available and seem to be frequently employed.

But I want to know more about the individuals who are telling the stories. I appreciate that some posts will be made more powerful through anonymity, but for those cases that an individual is willing to share additional biographic information it would be great to have an easy place for that information to be captured.

I think the most interesting aspect of the Memory Archive to the archives community is the Memory Archive Affiliate Program. The theory behind this program is to support the collection and archiving of personal histories online. It is described as being of interest to the following types of organizations:

historical societies (urban, state, or national)
institutions interested in recording their own history (a club, society, or military unit)
educational institutions teaching history (high school or college)
public history projects (oral history gathering, or document collection)

This is a powerful idea. Any time you can accumulate a critical mass of a single type of information on the web (in this case, memoirs) you have the chance of becoming a destination. There is also the added benefit of enabling smaller organizations to launch online memoir collection initiatives without needing to worry about the technology, costs and people-power that would usually be required.

There does need to be an easy way for the Memory Archive Affiliates to download these born digital memoirs for offline use and preservation purposes. This could be accomplished by an ‘export’ or ‘format for printing’ button on each memoir page, or perhaps some form of bulk download for all memoirs collected for a single affiliate’s project. I will say that the default print format isn’t bad. It seems to already do some special reformatting (such as displaying URL links in their entirety). I still also would want more metadata, though perhaps the definition of attributes to be collected could be customized per project.

I am curious to see the overall quality of the memoirs a year from now. I suspect that memoirs collected in association with a topically focused program may be more compelling than the average ‘man-on-the-net’ first person narratives. That isn’t to say that there is no value in the memories of someone who feels compelled to share their story – but a collection created around a theme would have the additional power of that common thread. The affiliate program memoirs would also be more likely to come with some contextual background explaining the source and origin of the solicited accounts. I am a fan of the existing thematic memory sites, such as The April 16 Archive and the Hurricane Digital Memory Bank. I love that the Omeka software used to create these two example sites is open source and free. Unfortunately, I don’t think the average small historical society or public history project is likely to have the resources to build and support a site like this even with free software. I think that a program like the Memory Archive Affiliate Program (or something like it) could bridge the gap for these smaller organizations and make the creation of online memoir collection projects a reality.

SAA2007: Preserving Born Digital Records of the Design Community (Session 106)

September 8, 2007 9 Comments

The official title for SAA2007 Session 106 is Constructing Sustainability: Real-World Implementations of Preservation Standards for Born-Digital Design Documentation, but I think it might have been better served to include the word Architecture somewhere in its title. Sponsored by the Architectural Records Roundtable, this session considered issues related to preserving born digital records of “the design community”. The design community in question includes both architects and landscape designers.

Each panelist gave a 5 minute brief about the way in which they are working toward preserving these design community records – and the rest of the session was opened up to Q&A. David Read, the session chair, mentioned how they used a wiki to collect questions and ideas for the session, gave an introduction to each of the panelists and helped guide the Question and Answer portion of the session.

Who was on the panel?

David Read (Session Chair, Information Resources Manager, DiMella Shaffer)
Phil Bernstein (Autodesk, Architect and Technologist)
Carissa Kowalski Dougherty (Art Institute of Chicago, Department of Architecture and Design)
Annemarie van Roessel (Columbia University, Avery Architectural and Fine Arts Library)
Dennis Newman (general manager at PFS Corporation, member of PDF standards working group of ISO)

What is being done?

Phil Bernstein kicked off the 5 minute summaries with a quick history of design technology. He explained how currently there is a shift in progress. Hundreds of years of paper drawings were followed by ten to fifteen years of electronic drawings. The latest development is use of Building Information Modelling (BIM). BIM relies on a database that generates ‘reports’ that are in fact ‘drawings’. These are sometimes referred to as Building Development Information Models. Digital printers can produce physical models directly from the stored BIM data with no need to step through generation of an actual drawing outside the computer.

Phil showed Yale School of Architecture design examples from the BIM world. These were fantastical organically shaped creations that looked more like strange undiscovered plants from under the sea than traditional buildings!

The good news is that the data in the BIM databases are all just text. The bad news is that the generated ‘design artifacts’ are based on the text data and can lead to digitally printed artifacts. There has been an explosion in the various means of representation. The architecture world is catching up to other industries (such as the auto industry) that have been doing this for 25+ years.

Current architects are application agnostic – they don’t care what they use to create their outputs. All the paths and platforms will only grow – what is driving the design process will be increasing in complexity. The building industry is making a fundamental shift from electronic drawing to the Building Information Modeling approach – but there is an unlimited environment for representation. He hoped to discuss the intersection between the archival/record keeping issues and the problems facing the architecture world.

Carissa Kowalski Dougherty’s overview covered the Digital Archive for Architecture (DAArch) project out of the Art Institute of Chicago . The project was based on the 2004 study Collecting, Archiving, and Exhibiting Digital Design Data. They considered how Architecture and Design firms are using software tools to produce and design – but examined these questions from a museum and curatorial perspective.

The recommendation is a two-tiered collection approach.

First tier: Native files – like AutoCAD files – these are going to be preserved at the bit level – but there is no commitment to ensuring access to these files
Second tier: Output formats – only PDF and TIF files
PDF: line drawings, vector-based graphic files, text documents, PowerPoint presentations
TIF: renderings, digital photographs

The second tier outputs are what they are committing to “functionally preserve”.
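The two-tier rule is simple enough to sketch as a lookup on file extension. The extension lists below are my own assumption, drawn from the formats named in this post rather than from DAArch itself, and a real accession workflow would surely look at more than the file name.

```python
from pathlib import PurePath

# Assumed extension lists for illustration only.
NATIVE_EXTENSIONS = {".dwg", ".dxf", ".dgn"}      # first tier: bit-level preservation only
OUTPUT_EXTENSIONS = {".pdf", ".tif", ".tiff"}     # second tier: functionally preserved


def collection_tier(filename):
    """Return 1 for native files, 2 for output formats, None if unrecognized."""
    extension = PurePath(filename).suffix.lower()
    if extension in OUTPUT_EXTENSIONS:
        return 2
    if extension in NATIVE_EXTENSIONS:
        return 1
    return None


# Hypothetical file names.
tiers = {name: collection_tier(name) for name in ["plan.DWG", "render.tif", "notes.txt"]}
```

Anything that comes back as None would need a human decision, which is probably where most of the appraisal work actually lives.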

Carissa presented an example of what they accessioned from the Garofalo Architects’ Manilow Residence (2001-2003) project. A lot of what they got were files that no one (including the small architectural firm itself) could still open: the software is gone. Another major challenge was poor naming conventions for the files themselves. The final project archive included over 200 native vector 2D files (.dxf, .dgn, .dwg), 145 PDFs, and more.

From the UrbanLab they sought to preserve their Visitor Information Center Competition Entry from 2001. This was a project that was never built and therefore has little physical output. They mostly used AutoCAD (2D), Maya (3D), FormZ (3D) and Adobe Illustrator (layout).

The DAArch Software highlights:

browser based
DSpace as back end
Dublin Core augmented with CDWA and custom metadata to support architecture data and digital materials
authority records
group and item level cataloging
will be available open source with BSD license via SourceForge (this was a requirement of the funder – that it be open source)

Final lessons and challenges from the DAArch project:

file naming and organization – the biggest challenges at the smaller firms – need outreach to these firms
metadata for digital objects – there is not a lot out there for 3D digital images
software and migration tools – can we/should we preserve the software dependent first tier files? or just the PDF/TIF outputs?
three-dimensional objects, BIM, animations, etc

Annemarie van Roessel discussed Columbia’s major Manhattanville project. Their goal is to make digital records last as long as steel and glass. The Avery Architectural and Fine Arts Library is feeling the pressure to be a leader, so how does Avery document this project? Manhattanville is a 30 year planning, design and build project targeted to be completed in 2030. It will cover 17 acres northwest of the main Columbia campus.

There are many building blocks to the digital design archives: AutoCAD, project management records, collaborative environments (Microsoft SharePoint), images, presentations, websites and movies (i.e., more than just “scary CAD drawings”). They are planning staged preservation points. The Avery is committed to developing capacity for digital archiving by 2009. For their metadata they use at minimum the mandatory DACS elements mapped to Dublin Core elements.

Dennis Newman was the final panelist. He has clients who need to preserve/archive finished drawings – such as the documents being sent along to regulatory agencies for final approval. PDF/A-1 was based on ‘electronic paper’ – you lose lots of data when you ‘cut back’ to PDF/A. PDF/E is in its first draft/generation, being submitted for version 1. PDF/A didn’t address 3D, complex metadata or moving images. PDF/E is based on Acrobat version 7. Adobe has handed PDF over to the ISO community. Dennis believes that the final ‘as-built’ drawing is what should be the archived version.

He pointed out that Stage I responders need more information than the regulatory commissions need. Since 9-11 the state requirements have changed about what needs to be in the ‘record’.

As an IT professional he was asked “what can we do” and his answer is “how much do you want to spend?”. IT can do anything – but it takes time and money.

Questions and Answers

Keep in mind throughout this section that I was summarizing the questions and their answers as best I could. Please do not take any statements attributed to the session speakers as full and complete quotes. In cases where I missed too much of the question or answer I generally skipped including it in the list below. If you are anxious to know exactly what someone said, you would need to buy and listen to the conference recordings for this session.

QUESTION: Could a neutral exchange format such as International Alliance for Interoperability’s (IAI) Industry Foundation Classes (IFC) be the foundation or a piece of the next step in preservation of born digital design documentation? Text + data model that could be read by different software (import/export of data). You can do this now with AutoCAD – you can dump into IFC.

Phil: Is a neutral exchange format the answer to the archiving problems? Software is changing so fast that there is no way that a standard could keep up with it. Also – even if all the data in the world could be put in XML – you still need something to ‘read and do something’ with the data. He put the business process diagram on screen from his talk and pointed out that all the different tools and their outputs exist within the CONTEXT of the business process itself.

Carissa (?): IFC is a recommendation of the Art Institute of Chicago

QUESTION: William Reilly from the FACADE project started to ask about the challenges inherent in the fact that the IFC standard only gives you the geometry. There was some back and forth about this idea with voices noting that IFC can capture more than that, but not everything.

Kristine Fallon: The idea of doing a neutral format for complex information is a complicated thing. Going back probably 20 years, the people working on data exchange standards for engineering … the different software won’t perfectly talk to each other – but what they can do is exchange ‘model views’. The IFC data model is capable of a fairly comprehensive set of model views.

QUESTION : Who is going to keep it up in 20 years? Are the software producers going to keep it up?

Phil: Autodesk spent 5 million dollars in building the IFCs. If the archivists align their needs with the business needs then the business will pay for it and the archivists will get what they need.

Annemarie: The archivists don’t have the money and resources.. even at Columbia they don’t have the money to buy generation after generation of the software to read all the different file formats. Maybe the MIT approach of emulation is a better approach.

David : Will there ever be a day that I will have an emulator on my desktop? That makes me more curious about exporting pure text.. I can get my head around preservation of that.

Annemarie: The Manhattanville project is the first step for Columbia in collecting digital data. Archivists need to reach out to organizations now to explain that they want to preserve what they are creating. I am being honest about the chaos coming down the track when we start getting the data from the 90s.

QUESTION (from the audience): The function of IFC is not for archiving.. it is for different software products to communicate with one another. How do you figure out which artifacts of the design process to keep? How do you extract the ‘important’ parts to keep from what is ‘less’ important?

Phil : What about when there are physical digital models, analytical models and more.. how do you understand all of it?

Carissa: The architectural firms need to be able to get to all of this too. It isn’t just archivists who should be caring about access to all these models. There are legal ramifications and the possibility of renovations later… this needs to come out of the architecture profession.

QUESTION (I asked the following question): Given how challenging it is to preserve the final products, are there any thoughts about trying to preserve the process? With paper, it is easier to preserve the evolution of a design.

Annemarie: In the Manhattanville project one of the big challenges is the architect who does lots of self editing. In many cases they don’t want the world to see their interim choices during the design process.

Phil: Digital tools can encourage you to explore useless ideas. Keep in mind that the journal file for the Building Information Model keeps track of every change. It will tell you that on Tuesday at 4:10 pm someone moved this door 5 inches to the left.

Carissa : At the Art Institute, architects and archivists need to work together to figure out what is worth capturing.

David: Two different schools of thought. Archiving the final product or archiving the process. File formats are preserving the final product.

QUESTION : There is danger in keeping everything – the goal of archiving is to keep the best final version. The big hulking databases of the world open the door to keeping an overwhelming set of unimportant data.

Annemarie : The needs of all their different consumers are so broad. Perhaps taking a snapshot should happen more often – thinner slices.

Carissa : 2D snapshots are not going to capture the fullness of a 3D object. But it isn’t capturing as much as it might.

Phil : There could be an interactive digital simulation that generates 3D models.. there could be no ‘final’ product. Can we have an impact on how info is kept 4, 10, 30 years from now – for the future? In a world where you can borrow (or pay for) processing time… someone will keep all the versions of AutoCAD.. you will pay for the 15 seconds of rendering time in AutoCAD 14 from some 3rd party.

Kristine Fallon: There is a real business purpose to sorting this out… the IAI work is very real world.. defining model views can help support business.. but they can also support the goals of archivists.

Kristine Fallon’s Question : Was PDF-E designed to be an archival format?

Dennis: No.. it was designed to be a data interchange format. People who don’t want to give lots of proprietary data to another vendor – they still need to give them a bunch of data to work with them.. that is where PDF-E came out of.

My Thoughts

As seems to be the case with all born digital records, there are no easy answers. While events like 9-11 have had impacts on the types of final products that regulatory agencies and first responders need to evaluate and have easy access to, the speed of innovation and evolution in building design is stunning. It should come as no surprise that architects are more concerned with finding the best tools for their trade than they are with how to preserve the artifacts of their ultimate creations. They will change the tools they use when they find a better tool to manifest their vision.

The most promising option seems to be having archivists get involved in discussions with the software developers, the architects, the builders and government early in the design process. The traditional model of archivists receiving the final products of business processes years after they were completed does not appear to be an answer on which we can depend. I suspect that proactive efforts to plan for preservation from the start will pay off – both for those trying to use the records 10 years from now and for those who want to preserve some subset of the records of the design community for future generations.

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Preserving Virtual Worlds – TinyMUD to SecondLife

August 17, 2007 3 Comments

A recent press release from the Library of Congress, Digital Preservation Program Makes Awards to Preserve American Creative Works, describes the newly funded project aimed at the preservation of ‘virtual worlds’:

The Preserving Virtual Worlds project will explore methods for preserving digital games and interactive fiction. Major activities will include developing basic standards for metadata and content representation and conducting a series of archiving case studies for early video games, electronic literature and Second Life, an interactive multiplayer game. Second Life content participants include Life to the Second Power, Democracy Island and the International Spaceflight Museum. Partners: University of Maryland, Stanford University, Rochester Institute of Technology and Linden Lab.

This has gotten a fair amount of coverage from the gaming and humanities sides of the world, but I learned about it via Professor Matthew Kirschenbaum‘s blog post Just Funded: Preserving Virtual Worlds.

The How They Got Game 2 post Library of Congress announces grants for preservation of digital games gives a more in depth summary of the Preserving Virtual Worlds project goals:

The main goal of the project is to help develop generalizable mechanisms and methods for preserving digital games and interactive fiction, and to begin to test these mechanism through the archiving of selected test cases. Key deliverables include the development of metadata schema and wrapper recommendations, and the long-term curation of archived cases.

I take this all a bit more personally than most might. I was a frequent denizen of an online virtual world known as TinyMUD (now usually referred to as TinyMUD Classic). TinyMUD was a text based, online, multi-player game that existed for seven months beginning in August of 1989. In practice it was sort of a cross between a chat room and a text based adventure. The players could build new parts of the MUD as they went – in many ways it was an early example of crowdsourcing. There was a passionate core of players who were constantly building new areas for others to explore and experience – not unlike what is currently the case in SecondLife. These types of text based games still exist – see MudMagic for listings.

Apparently August 20, 2007 will be TinyMUD’s 18th Annual Brigadoon Day. It will be celebrated by putting TinyMUD classic online for access. The page includes careful notes about finding and using a MUD Client to access TinyMUD. The existence of an ongoing MUD community of users has kept software like this alive and available almost 20 years later.

With projects like Preserving Virtual Worlds getting grants and gaining momentum it seems more plausible with each passing day that 18 years from now, parts of 2007’s SecondLife will still be available for people to experience. I am thankful to know that a copy of the TinyMUD world I helped build is still out there. I am even more thankful to know that the technology still exists to permit users to access it even if it is only once a year.

Update: The 20th Anniversary of TinyMUD Brigadoon Day is set for Thursday, August 20, 2009.

Thoughts on Digital Preservation, Validation and Community

July 6, 2007 2 Comments

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.

Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s – it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65 page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
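One way to picture "regression testing" for a dark archive is a periodic fixity check: record a checksum for every file at ingest, then re-run the comparison on a schedule so corruption is noticed long before the point of use. The following is only a minimal sketch of that idea, not a description of how any of the projects discussed here work. The function names and the in-memory manifest are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (run once, at ingest)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or no longer match."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Run on a schedule, an empty list from audit() plays the role of a passing test suite. Note that a checksum only shows that the bits are unchanged. It says nothing about whether anyone can still open and understand the file.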

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “..replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
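To make the "replicate, audit, repair" philosophy concrete, here is a deliberately simplified sketch. It is not the LOCKSS protocol: the paper's rate-limited sampled voting is far more careful about attackers and about limiting how often peers poll one another. This toy version assumes every peer is honest and uses a plain majority vote over checksums. The peer names and the function name are hypothetical.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Return the SHA-256 hex digest of one replica's content."""
    return hashlib.sha256(content).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Compare the copies held by different peers and overwrite the outliers.

    replicas maps a (hypothetical) peer name to that peer's copy.
    Returns the names of the peers whose copies were repaired.
    Raises ValueError when no version is held by a strict majority.
    """
    if not replicas:
        return []
    votes = Counter(digest(content) for content in replicas.values())
    winning_digest, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        raise ValueError("no majority; cannot tell which copy is good")
    good_copy = next(c for c in replicas.values() if digest(c) == winning_digest)
    repaired = []
    for peer, content in replicas.items():
        if digest(content) != winning_digest:
            replicas[peer] = good_copy  # repair the damaged replica in place
            repaired.append(peer)
    return repaired
```

With three peers where one copy has silently changed, the call returns that one peer's name and leaves all three copies identical. With only two peers that disagree there is no way to tell which copy is damaged, which is one reason "lots of copies" matters.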

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three are written virtually simultaneously they are all equal in authority. I think it is very interesting that they rely on writing to tapes. There was a mention that it is cheaper – yet due to many issues they might still switch to hard drives.
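The "all three writes must complete before processing can continue" rule can be pictured with a small sketch. This is not DAITSS code: it writes to ordinary file paths rather than tape units, and the read-back verification step is my own assumption rather than something the document describes. The function name and destination paths are hypothetical.

```python
import hashlib
from pathlib import Path


def write_all_copies(data: bytes, destinations: list[Path]) -> str:
    """Write data to every destination, then verify each copy.

    Returns the SHA-256 hex digest only if every write succeeded and every
    copy reads back identical. Any failure raises, so the caller cannot
    carry on with fewer copies than it asked for.
    """
    expected = hashlib.sha256(data).hexdigest()
    for destination in destinations:
        destination.parent.mkdir(parents=True, exist_ok=True)
        destination.write_bytes(data)  # an OSError here stops processing
    for destination in destinations:
        actual = hashlib.sha256(destination.read_bytes()).hexdigest()
        if actual != expected:
            raise IOError(f"copy at {destination} failed verification")
    return expected
```

Because no copy is written "from" another, none of them is the master. Each is an equal first-generation write of the same bytes.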

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born'”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we see thought given not only to the technicalities of how users may access the objects in the future, but also to users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

Digital Archiving Articles – netConnect Spring 2007

April 19, 2007 1 Comment

Thanks to Jessamyn West’s blog post, I found my way to a series of articles in the Spring 2007 edition of netConnect:

Funding the Past – and Future by Francine Fialkoff
Saving Digital History by Jessamyn West
LC Needs Digital Support by Norman Oder

“Saving Digital History” is the longest of the three and is a nice survey of many of the issues found at the intersection of archiving, born digital records and the wild world of the web. I especially love the extensive Link List at the end of the articles – there are lots of interesting related resources. This is the sort of list of links I wish were available with ALL articles online!

I can see the evolution of some of the ideas she and her co-speakers touched on in their session at SAA 2006: Everyone’s Doing It: What Blogs Mean for Archivists in the 21st Century. I hope we continue to see more of these sorts of panels and articles. There is a lot to think about related to these issues – and there are no easy answers to the many hard questions.

Update: Here is a link to Jessamyn’s presentation from the SAA session mentioned above: Capturing Collaborative Information News, Blogs, Librarians, and You.

Copyright Law: Archives, Digital Materials and Section 108

April 12, 2007 2 Comments

I just found my way today to Copysense (obviously I don’t have enough feeds to read as it is!). Their current clippings post highlighted part of the following quote as their Quote of the Week.

“[L]egislative changes to the copyright law are needed. First, we need to amend the law to give the Library of Congress additional flexibility to acquire the digital version of a work that best meets the Library’s future needs, even if that edition has not been made available to the public. Second, section 108 of the law, which provides limited exceptions for libraries and archives, does not adequately address many of the issues unique to digital media—not from the perspective of copyright owners; not from the perspective of libraries and archives.” Marybeth Peters , Register of Copyrights, March 20, 2007

Marybeth Peters was speaking to the Subcommittee on Legislative Branch of the Committee on Appropriations about the Future of Digital Libraries.

Copysense makes some great points about the quote:

Two things strike us as interesting about Ms. Peters’ quote. First, she makes the quote while The Section 108 Study Group continues to work through some very thorny issues related to the statutes application in the digital age […] Second, while Peters’ quote articulates what most information professionals involved in copyright think is obvious, her comments suggest that only recently is she acknowledging the effect of copyright law on this nation’s de facto national library. […] [S]omehow it seems that Ms. Peters is just now beginning to realize that as the Library of Congress gets involved in the digitization and digital work so many other libraries already are involved in, that august institution also may be hamstrung by copyright.

I did my best to read through Section 108 of the Copyright Law – subtitled “Limitations on exclusive rights: Reproduction by libraries and archives”. I found it hard to get my head around … definitely stiff going. There are 9 different subsections (‘a’ through ‘i’) each with their own numbered exceptions or requirements. Anxious to get a grasp on what this all really means – I found LLRX.com and their Library Digitization Projects and Copyright page. This definitely was an easier read and helped me get further in my understanding of the current rules.

Next I explored the website for the Section 108 Study Group that is hard at work figuring out what a good new version of Section 108 would look like. I particularly like the overview on the About page. They have a 32 page document titled Overview of the Libraries and Archives Exception in the Copyright Act: Background, History, and Meaning for those of you who want the full 9 yards on what has gotten us to where we are today with Section 108.

For a taste of current opinions – go to the Public Comments page which provides links to all the written responses submitted to the Notice of public roundtable with request for comments. There are clear representatives from many sides of the issue. I spotted responses from SAA, ALA and ARL as well as from MPAA, AAP and RIAA. All told there are 35 responses (and no, I didn’t read them all). I was more interested in all the different groups and individuals that took the time to write and send comments (and a lot of time at that – considering the complicated nature of the original request for comments and the length of the comments themselves). I was also intrigued to see the wide array of job titles of the authors. These are leaders and policy makers (and their lawyers) making sure their organizations’ opinions are included in this discussion.

Next stop – the Public Roundtables page with its links to transcripts from the roundtables – including the most recent one held January 31, 2007. Thanks to the magic of Victoria’s Transcription Services, the full transcripts of the roundtables are online. No, I haven’t read all of these either. I did skim through a bit of it to get a taste of the discussions – and there is some great stuff here. Lots of people who really care about the issues carefully and respectfully exploring the nitty-gritty details to try and reach good compromises. This is definitely on my ‘bookmark to read later’ list.

Karen Coyle has a nice post over on Coyle’s InFormation that includes all sorts of excerpts from the transcripts. It gives you a good flavor of what some of these conversations are like – so many people in the same room with such different frames of reference.

This is not easy stuff. There is no simple answer. It will be interesting to see what shape the next version of Section 108 takes with so many people with very different priorities pulling in so many directions.

The good news is that there are people with the patience and dedication to carefully gather feedback, hold roundtables and create recommendations. Hurrah for the hard working members of the Section 108 Study Group – all 19 of them!

Supporting Appraisal of Digital Records

March 28, 2007

In his recent post to the A+A Listserv, Richard Pearce-Moses explores some really interesting ideas related to the appraisal of a listserv. The notions that particularly caught my imagination were in these passages:

We could take advantage of the fact that the list is in electronic format and conceivably use some AI filters to do some weeding. But at what cost? Is this truly feasible? And what are the implications on the integrity of the collection if only a portion are saved?

I was particularly interested in the number of people who said they searched the lists’ archives. Although demonstrated use can be used to justify preservation, what is sufficient use and how do we measure it? Are there use patterns that suggest these messages are inactive, with use falling off over time in a pattern that suggests they not be kept permanently? (To my knowledge the server logs are not accessible.)

What sort of infrastructure could archivists work toward putting in place to support automated weeding of listserv postings? If the postings were not sent via email but rather posted via some other interface, I can imagine a choice being presented at the time the post was written ‘Keep’ vs ‘Discard After 6 months’. There is something like this already in place for some government email systems – the sender indicates if the message is ‘permanent’ when the message is sent. Of course that presents a whole series of new problems. Someone in one of my classes mentioned that U.S. White House staffers had taken to marking EVERYTHING as permanent because emails that were marked not permanent were being scrutinized NOW. I wish I could find a source for this story online – but all I am finding today is the latest hubbub about White House staffers using non-government email accounts to communicate when they didn’t want to worry about it being preserved (or at least that seems to be the current allegation). Luckily for this discussion we aren’t worried about people hiding posts.

Of course some of this can be implemented via those who post to the list. If everyone (anyone?) used standard post title prefixes, the appraising archivist of the future could easily filter out entire subsets of posts. SAA has this posted on the Terms of Participation Page for the Archives and Archivists List:

In order to maintain a highly informative and focused professional forum, SAA strongly encourages list participants to use the following labels at the beginning of all subject lines. This will allow others to filter list messages via mail rules and automatically select those types of information according to their individual needs and preferences.

“Calls:” (Calls for papers, survey participation, etc.)

“Disc:” (Discussion on various topics)

“Event:” (Conference, seminar, workshop announcements, etc.)

“FF:” (“Friday funnies,” see below)

“FYI:” (General announcements and information)

“Job:” (Job announcements)

“Media:” (Links to archives and archivists in the news)

“Qs:” (Questions)

“Pubs:” (Announcements re: books, chapters, papers, dissertations, and reviews)

There is a link to this page at the bottom of EVERY post to the listserv, but I have rarely seen anyone use these prefixes in post titles. ‘Job’ is the one that gets the most use, and usually only for a short time after someone politely asks everyone creating job posts to make sure they have good titles.

The idea of examining ‘usage’ patterns is also an interesting one. If we could easily capture and examine the view and search logs of posts we could build an understanding of what types of posts really are re-examined over time. But then what do we do with that information? Does past interest in a topic translate into permanent informational value? Just because someone didn’t look at it again yet – does that mean we assume that no one will ever be interested in its content?
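If posters did use the labels, the filtering an appraising archivist would need is close to trivial. Here is a minimal sketch, assuming subject lines are available as plain strings. The label list is copied from the SAA terms quoted above, while the function names and the handling of "Re:" reply markers are my own assumptions.

```python
from collections import defaultdict
from typing import Optional

# Labels copied from the SAA Terms of Participation list quoted above.
PREFIXES = ("Calls", "Disc", "Event", "FF", "FYI", "Job", "Media", "Qs", "Pubs")


def label_of(subject: str) -> Optional[str]:
    """Return the SAA label a subject line starts with, or None."""
    cleaned = subject.strip()
    # Skip any number of leading reply markers such as "Re:".
    while cleaned.lower().startswith("re:"):
        cleaned = cleaned[3:].lstrip()
    for prefix in PREFIXES:
        if cleaned.lower().startswith(prefix.lower() + ":"):
            return prefix
    return None


def group_by_label(subjects: list[str]) -> dict[str, list[str]]:
    """Group subject lines by label; unlabeled posts go under 'unlabeled'."""
    groups = defaultdict(list)
    for subject in subjects:
        groups[label_of(subject) or "unlabeled"].append(subject)
    return dict(groups)
```

The size of the 'unlabeled' group is a quick measure of how much manual appraisal is still needed, and from what I have seen on the list it would be most of the posts.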

My instinct (when wearing my techie hat) is to vote for the ‘keep it all – disk space is cheap’ approach. That said, I know that the expense of the space on that first hard drive you save your records on is just the tip of the iceberg in terms of digital preservation expenses.

Thinking about what you want to keep before you turn on any software system is always going to make things easier. I know that as the laws in the US continue to evolve to demand the retention of specific types of data the software will also continue to evolve to make it easy to keep ONLY what needs to be kept. Private sector companies are usually quite intent on sticking to the letter of the law in that regard – they never want to keep more than they must (or so their lawyers like to tell them). It is also in the best interest of the software companies to ensure that all the required records are being kept.

Another driving force to generate systems that know how to filter and keep the ‘right’ records (whatever that means) could be individual users. In a universe of digital cameras where you can take 1000 photos as cheaply as 100 – I wonder if there is a place for software that intelligently archives your most frequently accessed (and tagged and shared) photos. The flip side of that could be auto-weeding (perhaps with a quick review option) every year. This would be the same approach some take to cleaning out their clothes closets – if I haven’t touched it in 2 years, then I should get rid of it.
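As a rough sketch of what that auto-weeding pass might look like, assuming the software tracks a last-accessed date plus tag and share counts for each photo (the Photo fields, the function name and the sample photos are all made up for illustration):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Photo:
    name: str
    last_accessed: date
    tags: int = 0    # number of tags applied (assumed to be tracked)
    shares: int = 0  # number of times shared (assumed to be tracked)

def weeding_candidates(photos, today, idle_years=2):
    """Return photos untouched for idle_years that were never tagged or shared."""
    cutoff = today - timedelta(days=365 * idle_years)
    return [
        p for p in photos
        if p.last_accessed < cutoff and p.tags == 0 and p.shares == 0
    ]

photos = [
    Photo("beach.jpg", date(2005, 6, 1)),
    Photo("wedding.jpg", date(2005, 6, 1), tags=3, shares=2),
    Photo("blurry.jpg", date(2007, 1, 15)),
]

# The list is only a set of suggestions for the 'quick review' step;
# nothing is deleted here.
for photo in weeding_candidates(photos, today=date(2008, 2, 1)):
    print(photo.name)
```

The function only proposes candidates and leaves the actual decision to a person, which keeps the closet-cleaning rule from quietly throwing away something that matters.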

While doing my research last term into the appraisal of Digital GIS records, I was amazed by how much of what was currently being done could only be accomplished by brute force. Frequently the work is being done through the sheer will of a small group of very dedicated people using tools not particularly suited to the task. I need to do more research into the realm of electronic record management – I want to understand what standard tools are being supplied (or not supplied). Are there tools for those who manage large repositories of electronic records where there is an acknowledged goal of supporting records scheduling and permanent preservation?

In our increasingly digital world, I think there will always be cases of born digital records that must be considered for appraisal without all the answers to all our questions. I am just fascinated at the notion of building the tools we need into the software systems from the start. At the end of a record’s active life cycle we would then be able to make and implement appraisal choices more easily. Imagine that – planning ahead for appraisal!

Considering Historians, Archivists and Born Digital Records

March 23, 2007 3 Comments

I think I renamed this post at least 12 times. My original intention was to consider the impact of born digital records on the skills needed for the historians and researchers of the future. In addition I found myself exploring the dividing lines among a number of possible roles in ensuring access to the information written in the 1s and 0s of our born digital records.

After my last post about the impact of anonymization of Google Logs, a friend directed me to the work of Dr. Latanya Sweeney. Reading through the information about her research I found Trail Re-identification: Learning Who You are From Where You Have Been. Given enough data to work with, algorithms can be written that often can re-identify the individuals who performed the original searches. Carnegie Mellon University’s Data Privacy Lab includes the Trails Learning Project with the goal of answering the question “How can people be identified to the trail of seemingly innocent and anonymous data they leave behind at different locations?”. So it seems that there may be a lot of born digital records that start out anonymous but that may permit ‘re-identification’ – given the application of the right tools or techniques. That is fine – historians have often needed to become detectives. They have spent years developing techniques for the analysis of paper documents to support ‘re-identification’. Who wrote this letter? Is this document real or a forgery? Who is the ‘Mildred’ referenced in this record?

The field of diplomatics studies the authenticity and provenance of documents by looking at everything from the paper they were written on to the style of writing to the ink used. I like the idea of using the term ‘digital diplomatics’ for the ever increasing process of verifying and validating born digital records. Google found me the Digital Diplomatics conference that took place earlier this year in Munich. Unfortunately it was more geared toward investigating how the use of computers can enhance traditional diplomatic approaches rather than how to authenticate the provenance of born digital records.

In the March 2007 issue of Scientific American I found the article A Digital Life. It talks primarily about the Microsoft Research project MyLifeBits. A team at Microsoft Research has spent the last six years creating what they call a ‘digital personal archive’ of team member Gordon Bell. This archive hopes to “record all of Bell’s communications with other people and machines, as well as the images he sees, the sounds he hears and the Web sites he visits–storing everything in a personal digital archive that is both searchable and secure.”

They are not blind to the long term challenges of preserving the data itself in some accessible format:

Digital archivists will have to constantly convert their files to the latest formats, and in some cases they may need to run emulators of older machines to retrieve the data. A small industry will probably emerge just to keep people from losing information because of format evolution.

The article concludes:

Digital memories will yield benefits in a wide spectrum of areas, providing treasure troves of information about how people think and feel. By constantly monitoring the health of their patients, future doctors may develop better treatments for heart disease, cancer and other illnesses. Scientists will be able to get a glimpse into the thought processes of their predecessors, and future historians will be able to examine the past in unprecedented detail. The opportunities are restricted only by our ability to imagine them.

Historians will have at least these two types of digital artifacts to explore – those gathered purposefully (such as the digital personal archives described above) and those generated as a byproduct of other activity (such as the Google search logs). Might these be the future parallels to the ‘manuscript’ and ‘corporate’ archives of today?

So we have both the ideas of the Digital Archivist and the Digital Historian. What about a Digital Archaeologist? I am not the first to ponder the possible future job of Digital Archaeologist. A bit of googling of the term led me to Dark Star Gazette and Dear Digital Archaeologist. Back in February of 2007 they pondered:

Will there be digital archaeologists, people who sift through our society’s discarded files and broken web links, carefully brushing away revisions and piecing together antiquated file formats? Will a team of grad students working on their PhDs a thousand, or two thousand, years from now be digging through old blog entries, still archived online in some remote descendant of the Wayback Machine or a copy of Google’s backup tapes?

I can only imagine a world in which this is in fact the case. Given that premise, at what point does the historian get too far from the primary source? If the historian does not understand exactly what a computer program does to extract the information they want from logs or ‘digital memory repositories’ – are they no longer working with the primary source?

Imagine any field in which historians do research. Music? Accounting? Science? In order to examine and interpret primary source records a historian becomes something of an expert in that field. Consider the historian documenting the life of a famous scientist based partly on their lab notebooks. That historian would be best served by being taught how to interpret the notebooks themselves. The historian must be fluent in the language of the record in order to gain the most direct access to the information.

Ah – but if there really are Digital Archaeologists in the far future, perhaps they would be the connection between the primary source born digital records and the historians who wish to study them. Or perhaps the Digital Archivist, in a new take on ‘arranging records’, would transform digital chaos into meaningful records for use by researchers? The field of expertise on the historian’s part would need only be in the content of the records – not exactly how they were rescued from the digital abyss.

Would a Digital Historian be someone who only considers the history of the digital landscape or a historian especially well versed in the interpretation of digital records? In Daniel Cohen and Roy Rosenzweig’s book Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web they seem to use the term in the present tense to refer to historians who use computers and technology to support and expand the reach of their research. Yet, in his essay Scarcity or Abundance? Preserving the Past in a Digital Era, Roy Rosenzweig proposes:

Future graduate programs will probably have to teach such social-scientific and quantitative methods as well as such other skills as “digital archaeology” (the ability to “read” arcane computer formats), “digital diplomatics” (the modern version of the old science of authenticating documents), and data mining (the ability to find the historical needle in the digital hay). In the coming years, “contemporary historians” may need more specialized research and “language” skills than medievalists do.

What is my imagined skill set for the historian of our digital world? A willingness to dig into the rich and chaotic world of born digital records. The ability to use tools and find partners to assist in the interpretation of those records. Equal comfort working at tables covered in dusty boxes and in the virtual domain of glowing computer terminals. And of course – the same curiosity and sense of adventure that has always drawn people to the path of being a historian.

We cannot predict the future – we can only do our best to adapt to what we see before us. I suspect the prefixing of every job title with the word ‘digital’ will disappear over time – much as the prefixing of everything with the letter ‘e’ to let you know that something was electronic or online has ebbed out of popular culture. As the historians and archivists of today evolve into the historians and archivists of tomorrow they will have to deal with born digital records – no matter what job title we give them.

Category: born digital records