Month: August 2012

CURATEcamp Processing 2012

CURATEcamp Processing 2012 was held the day after the annual Digital Preservation meeting sponsored by the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA).

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote a title and short description on a piece of paper and taped it to the wall. These were then reviewed, rearranged on the schedule, and combined where appropriate until we had our final schedule. More than half the sessions on the schedule link through to notes from the session. There were four session slots, plus a noon lunch slot of lightning talks.

Session I: At Risk Records in 3rd Party Systems
This was the session I had proposed, combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created: one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion, including these assorted points:

  • Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
  • The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak of preserving every season’s programs since 1842.
  • There are consequences to not having and following a ‘file plan’. Following the rules has to be part of people’s jobs.
  • What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
  • “Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
  • When should archivists be in the process? At the start – before content is created, before systems are created?
  • Keep the original data AND keep updated data. Document everything: data sources and the processes applied.

Session II: Automating Review for Restrictions?
This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that the challenge of reviewing records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘easy’ category was for well-defined content that follows rules we could imagine teaching a computer to identify: things like US social security numbers, passport numbers or credit card numbers. The ‘hard’ category was for restrictions that involve more human judgement. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required, and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so, drop a comment below please!
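To make the ‘easy’ category a bit more concrete, here is a minimal sketch in Python of what combinable detection modules might look like. This is my own illustration, not something presented at the session: the names (`find_ssns`, `find_card_numbers`, `review`) are hypothetical, the SSN pattern only checks the familiar 3-2-4 digit shape, and the card detector assumes 13 to 19 digits that pass the Luhn checksum. A real module would need far more care about false positives and formats.

```python
import re
from typing import Callable, Dict, List

# A detector is any function that takes text and returns the suspicious
# substrings it found. (Hypothetical interface, for illustration only.)
Detector = Callable[[str], List[str]]

# Shape-only check for US social security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def find_ssns(text: str) -> List[str]:
    """Return substrings that have the shape of a US SSN."""
    return SSN_PATTERN.findall(text)


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> List[str]:
    """Return digit runs that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            hits.append(match.group())
    return hits


def review(text: str, detectors: Dict[str, Detector]) -> Dict[str, List[str]]:
    """Run the chosen set of detectors and report only those with hits."""
    report = {}
    for name, detector in detectors.items():
        hits = detector(text)
        if hits:
            report[name] = hits
    return report


if __name__ == "__main__":
    sample = (
        "Applicant SSN 123-45-6789, card 4111 1111 1111 1111, "
        "order 1234 5678 9012 3456."
    )
    chosen = {"us_ssn": find_ssns, "credit_card": find_card_numbers}
    print(review(sample, chosen))
```

Because each detector is just a function, an archive could pick whichever set its restrictions require and pass that dictionary to `review`. The ‘hard’ category, where human judgement is involved, is exactly what a sketch like this cannot cover.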

Lunchtime Lightning Talks
At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a means for Preservation
In this session we discussed the question posed in the session proposal: “How can we leverage virtualization for large-scale, robust preservation?” Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important it is to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization
This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts
The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many of those there were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki), but it is my impression that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.