
Category: electronic records

Chapter 3: The Rise of Computer-Assisted Reporting by Brant Houston

The third chapter in Partners for Preservation is ‘The Rise of Computer-Assisted Reporting: Challenges and Successes’ by Brant Houston. A chapter on this topic has been at the top of my list of chapter ideas from the very start of this project. Back in February of 2007, Professor Ira Chinoy from the University of Maryland, College Park’s Journalism Department spoke to my graduate school Archival Access class. His presentation and the related class discussion led to my blog post Understanding Born-Digital Records: Journalists And Archivists With Parallel Challenges. Elements of this blog post even inspired a portion of the book’s introduction.

The photo above is from the 1967 Detroit race riots. Fifty years ago, the first article recognized to have used computer-assisted reporting was awarded the 1968 Pulitzer Prize for Local General or Spot News Reporting “For its coverage of the Detroit riots of 1967, recognizing both the brilliance of its detailed spot news staff work and its swift and accurate investigation into the underlying causes of the tragedy.” In his chapter, Brant starts here and takes us through the evolution of computer-assisted reporting from 1968 to the present day, looking forward to the future.

As the third chapter in Part 1: Memory, Privacy, and Transparency, it continues to weave these three topics together. Balancing privacy and the goal of creating documentation to preserve memories of all that is going on around us is not easy. Transparency and a strong commitment to ethical choices underpin the work of both journalists and archivists.

This is one of my favorite passages:

“As computer-assisted reporting has become more widespread and routine, it has given rise to discussion and debate over the issues regarding the ethical responsibilities of journalists. There have been criticisms over the publishing of data that was seen as intrusive and violating the privacy of individuals.”

I learned so much in this chapter about the long road journalists had to travel as they sought to use computers to support their reporting. It never occurred to me, as someone who has always had access to the computing power I needed through school or work, that getting the tools journalists needed for their computational analysis often meant negotiating for time on newspaper mainframes or seeking partners outside of the newsroom. It took tenacity and the advent of personal computers to make computer-assisted reporting feasible for the broader community of journalists around the world.

Journalists have sought the help of archivists on projects for many years – seeking archival records as part of the research for their reporting. Now journalists are also taking steps to preserve their field’s born-digital content. Given the high percentage of news articles that exist exclusively online – projects like the Journalism Digital News Archive are crucial to the survival of these articles. I look forward to all the ways that our fields can learn from each other and work together to tackle the challenges of digital preservation.

Bio

Brant Houston

Brant Houston is the Knight Chair in Investigative Reporting at the University of Illinois at Urbana-Champaign, where he works on projects and research involving the use of data analysis in journalism. He is co-founder of the Global Investigative Journalism Network and the Institute for Nonprofit News. He is the author of Computer-Assisted Reporting: A Practical Guide and co-author of The Investigative Reporter’s Handbook, and a contributor to books on freedom of information acts and open government. Before joining the University of Illinois, he was executive director of Investigative Reporters and Editors at the University of Missouri after being an award-winning investigative journalist for 17 years.


UNESCO/UBC Vancouver Declaration

In honor of the 2012 Day of Digital Archives, I am posting a link to the UNESCO/UBC Vancouver Declaration. This is the product of the recent Memory of the World in the Digital Age conference, and they are looking for feedback on this declaration by October 19th, 2012 (see the link on the conference page for sending in feedback).

To give you a better sense of the aim of this conference, here are the ‘conference goals’ from the programme:

The safeguard of digital documents is a fundamental issue that touches everyone, yet most people are unaware of the risk of loss or the magnitude of resources needed for long-term protection. This Conference will provide a platform to showcase major initiatives in the area while scaling up awareness of issues in order to find solutions at a global level. Ensuring digital continuity of content requires a range of legal, technological, social, financial, political and other obstacles to be overcome.

The declaration itself is only four pages long and includes recommendations to UNESCO, member states and industry. If you are concerned with digital preservation and/or digitization, please take a few minutes to read through it and send in your feedback by October 19th.

CURATEcamp Processing 2012

CURATEcamp Processing 2012 was held the day after the National Digital Information Infrastructure and Preservation Program (NDIIPP) and the National Digital Stewardship Alliance (NDSA) sponsored Digital Preservation annual meeting.

The unconference was framed by this idea:

Processing means different things to an archivist and a software developer. To the former, processing is about taking custody of collections, preserving context, and providing arrangement, description, and accessibility. To the latter, processing is about computer processing and has to do with how one automates a range of tasks through computation.

The first hour or so was dedicated to mingling and suggesting sessions. Anyone with an idea for a session wrote down a title and short description on a piece of paper and taped it to the wall. These were then reviewed, rearranged on the schedule and combined where appropriate until we had our full final schedule. More than half the sessions on the schedule have links through to notes from the session. There were four session slots, plus a noon lunch slot of lightning talks.

Session I: At Risk Records in 3rd Party Systems This was the session I had proposed, combined with a proposal from Brandon Hirsch. My focus was on identification and capture of the records, while Brandon started with capture and continued on to questions of data extraction vs emulation of the original platforms. Two sets of notes were created – one by me on the Wiki and the other by Sarah Bender in Google Docs. Our group had a great discussion including these assorted points:

  • Can you mandate use of systems we (archivists) know how to get content out of? Consensus was that you would need some way to enforce usage of the mandated systems. This is rare, if not impossible.
  •  The NY Philharmonic had to figure out how to capture the new digital program created for the most recent season. Either that, or break their streak for preserving every season’s programs since 1842.
  • There are consequences to not having and following a ‘file plan’. Part of people’s jobs has to be following the rules.
  • What are the significant properties? What needs to be preserved – just the content you can extract? Or do you need the full experience? Sometimes the answer is yes – especially if the new format is a continuation of an existing series of records.
  • “Collecting Evidence” vs “Archiving” – maybe “collecting evidence” is more convincing to the general public
  • When should archivists be in the process? At the start – before content is created, before systems are created?
  • Keep the original data AND keep updated data. Document everything, data sources, processes applied.

Session II: Automating Review for Restrictions? This was the session that I would have suggested if it hadn’t already been on the wall. The notes from the session are online in a Google Doc. It was so nice to realize that the challenge of reviewing records for restricted information is being felt in many large archives. It was described as the biggest roadblock to the fast delivery of records to researchers. The types of restrictions were categorized as ‘easy’ or ‘hard’. The ‘Easy’ category was for well defined content that follows rules we could imagine teaching a computer to identify — things like US social security numbers, passport numbers or credit card numbers. The ‘Hard’ category was for restrictions that involved more human judgement. The group could imagine modules coded to spot the easy restrictions. The modules could be combined to review for whatever set was required – and carry with them some sort of community blessing that was legally defensible. The modules should be open source. The hard category likely needs us as a community to reach out to the eDiscovery specialists from the legal realm, the intelligence community and perhaps those developing autoclassification tools. This whole topic seems like a great seed for a Community of Practice. Anyone interested? If so – drop a comment below please!
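
To make the ‘easy’ category concrete, here is a minimal sketch of what one of those restriction-spotting modules might look like. This is my own illustration rather than anything built or proposed at the session, and the patterns (a simple US social security number and credit card check) are deliberately simplistic – a legally defensible module would need checksums, surrounding context and far more rigorous testing.

```python
import re

# Illustrative patterns only -- real modules would need more robust rules
# (checksums, surrounding context, format variants) to be defensible.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def find_restrictions(text):
    """Return a list of (restriction_type, matched_text) hits found in a document."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits

if __name__ == "__main__":
    sample = "Contact 123-45-6789 or use card 4111 1111 1111 1111."
    print(find_restrictions(sample))
```

Each ‘easy’ restriction could live in its own small open source module like this, be tested against known corpora, and be combined into whatever review profile a given body of records requires.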

Lunchtime Lightning Talks: At five minutes each, these talks gave the attendees a chance to highlight a project or question they would like to discuss with others. While all the talks were interesting, there was one that really stuck with me: Harvard University’s Zone 1 project which is a ‘rescue repository’. I would love to see this model spread! Learn more in the video below.

Session III: Virtualization as a means for Preservation In this session we discussed the question posed in the session proposal “How can we leverage virtualization for large-scale, robust preservation?”. Notes are available on the conference wiki. Our discussion touched on the potential to save snapshots of virtualized systems over time, the challenges of all the variables that go into making a specific environment, and the ongoing question of how important it is to view records in their original environment (vs examining the extracted ‘content’).

Session IV: Accessible Visualization This session quickly turned into a cheerful show and tell of visualization projects, tools and platforms – most made it into a list on the Wiki.

Final Thoughts

The group assembled for this unconference definitely included a great cross-section of archivists and those focused on the tech of electronic records and archives. I am not sure how many were exclusively software developers or IT folks. We did go around the room for introductions and hand raising for how people self-identified (archivists? developers? both? other?). I was a bit distracted during the hand raising (I was typing the schedule into the wiki) – but my impression is that there were many more archivists and archivist/developers than there were ‘just developers’. That said, the conversations were productive and definitely solidly in the technical realm.

One cross-cutting theme I spotted was the value of archivists collaborating with those building systems or selecting tech solutions. While archivists may not have the option to enforce (through carrots or sticks) adherence to software or platform standards, any amount of involvement further up the line than the point of turning a system off will decrease the risks of losing records.

So why the picture of the abandoned factory at the top of this post? I think a lot of the challenges of preservation of born digital records tie back to the fact that archivists often end up walking around in the abandoned factory equivalent of the system that created the records. The workers are gone and all we have left is a shell and some samples of the product. Maybe having just what the factory produced is enough. Would it be a better record if you understood how it moved through the factory to become what it is in the end? Also, for many born digital records you can’t interact with them or view them unless you have the original environment (or a virtual one) in which to experience them. Lots to think about here.

If this sounds like a discussion you would like to participate in, there are more CURATEcamps on the way. In fact – one is being held before SAA’s annual meeting tomorrow!

Image Credit: abandoned factory image from Flickr user sonyasonya.

Day of Digital Archives

To be honest, today was a half day of digital archives, due to personal plans taking me away from computers this afternoon. In light of that, my post is more accurately my ‘week of digital archives’.

The highlight of my digital archives week was the discovery of the Digital Curation Exchange. I promptly joined and began to explore their space for ‘all things digital curation’. This led me to a fabulous list of resources, including a set of syllabi for courses related to digital curation. Each link brought me to an extensive reading list, some with full slide decks from weekly in-class presentations. My ‘to read’ list has gotten much longer – but in a good way!

On other days recently I have found myself involved in all of the following:

  • review of metadata standards for digital objects
  • creation of internal guidelines and requirements documents
  • networking with those at other institutions to help coordinate site visits of other digitization projects
  • records management planning and reviews
  • learning about the OCR software available to our organization
  • contemplation of the web archiving efforts of organizations and governments around the world
  • reviewing my organization’s social media policies
  • listening to the audio of online training available from PLANETS (Preservation and Long-term Access through NETworked Services)
  • contemplation of the new Journal of Digital Media Management and their recent call for articles

My new favorite quote related to digital preservation comes from What we reckon about keeping digital archives: High level principles guiding State Records’ approach, from the State Records folks in New South Wales, Australia, which reads:

We will keep the Robert De Niro principle in mind when adopting any software or hardware solutions: “You want to be makin moves on the street, have no attachments, allow nothing to be in your life that you cannot walk out on in 30 seconds flat if you spot the heat around the corner” (Heat, 1995)

In other words, our digital archives technology will be designed to be sustainable given our limited resources so it will be flexible and scalable to allow us to utilise the most appropriate tools at a given time to carry out actions such as creation of preservation or access copies or monitoring of repository contents, but replace these tools with new ones easily and with minimal cost and with minimal impact.

I like that this speaks to the fact that no plan can perfectly accommodate the changes in technology coming down the line. Being nimble and assuming that change will be the only constant are key to ensuring access to our digital assets in the future.

Rescuing 5.25″ Floppy Disks from Oblivion

This post is a careful log of how I rescued data trapped on 5 1/4″ floppy disks, some dating back to 1984 (including those pictured here). While I have tried to make this detailed enough to help anyone who needs to try this, you will likely have more success if you are comfortable installing and configuring hardware and software.

I will break this down into a number of phases:

  • Phase 1: Hardware
  • Phase 2: Pull the data off the disk
  • Phase 3: Extract the files from the disk image
  • Phase 4: Migrate or Emulate

Phase 1: Hardware

Before you do anything else, you actually need a 5.25″ floppy drive of some kind connected to your computer.  I was lucky – a friend had a floppy drive for us to work with. If you aren’t that lucky, you can generally find them on eBay for around $25 (sometimes less). A friend had been helping me by trying to connect the drive to my existing PC – but we could never get the communications working properly. Finally I found Device Side Data’s 5.25″ Floppy Drive Controller which they sell online for $55. What you are purchasing will connect your 5.25 Floppy Drive to a USB 2.0 or USB 1.1 port. It comes with drivers for connection to Windows, Mac and Linux systems.

If you don’t want to mess around with installing the disk drive into your computer, you can also purchase an external drive enclosure and a tabletop power supply. Remember, you still need the USB controller too.

Update: I just found a fantastic step-by-step guide to the hardware installation of Device Side’s drive controller from the Maryland Institute for Technology in the Humanities (MITH), including tons of photographs, which should help you get the hardware install portion done right.

Phase 2: Pull the data off the disk

The next step, once you have everything installed, is to extract the bits (all those ones and zeroes) off those floppies. I found that creating a new folder for each disk I was extracting made things easier. In each folder I store the disk image, a copy of the extracted original files and a folder named ‘converted’ in which to store migrated versions of the files.
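
If you have a stack of disks to work through, it helps to set these folders up in advance. Here is a small convenience script of my own (not part of any of the tools mentioned in this post) that builds the structure described above, assuming you give each disk a simple label:

```python
from pathlib import Path

def make_disk_folders(base_dir, disk_labels):
    """Create one folder per disk, each containing an empty 'converted' subfolder."""
    for label in disk_labels:
        converted = Path(base_dir) / label / "converted"
        converted.mkdir(parents=True, exist_ok=True)

# Example: three disks labeled disk01..disk03 under a 'rescued' directory
make_disk_folders("rescued", ["disk01", "disk02", "disk03"])
```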

Device Side provides software they call ‘Disk Image and Browse’. You can see an assortment of screenshots of this software on their website, but this is what I see after putting a floppy in my drive and launching USB Floppy -> Disk Image and Browse:

You will need to select the ‘Disk Type’ and indicate the destination in which to create your disk image. Make sure you create the destination directory before you click on the ‘Capture Disk File Image’ button. This is what it may look like in progress:

Fair warning that this won’t always work. At least the developers of the software that comes with Device Side Data’s controller had a sense of humor. This is what I saw when one of my disk reads didn’t work 100%:

If you are pressed for time and have many disks to work your way through, you can stop here and repeat this step for all the disks you have on hand.

Phase 3: Extract the files from the disk image

Now that you have a disk image of your floppy, how do you interact with it? For this step I used a free tool called Virtual Floppy Drive. Once I had it installed properly, my disk images were associated with this program. Double clicking on the Floppy Image icon opens the floppy in a view like the one shown below:

It looks like any other removable disk drive. Now you can copy any or all of the files to anywhere you like.
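
If you would rather script this step than click through a GUI, the mtools package can usually copy files straight out of an image of a DOS-formatted floppy. This is an alternative I did not use myself, and it assumes the disk image is a standard FAT-formatted image and that mtools is installed; a rough sketch:

```python
import subprocess

def extract_image(image_path, output_dir):
    """Recursively copy every file out of a FAT floppy image using mtools' mcopy."""
    # '::' refers to the root of the image supplied with -i; -s recurses, -n skips
    # overwrite prompts. Assumes the 'mcopy' command is available on your PATH.
    subprocess.run(
        ["mcopy", "-s", "-n", "-i", image_path, "::", output_dir],
        check=True,
    )

extract_image("disk01/disk01.img", "disk01/")
```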

Phase 4: Migrate or Emulate

The last step is finding a way to open your files. Your choice for this phase will depend on the file formats of the files you have rescued. My files were almost all WordStar word processing documents. I found a list of tools for converting WordStar files to other formats.

The best one I found was HABit version 3.

It converts WordStar files into text or HTML and even keeps the spacing reasonably well if you choose that option. If you are interested in the content more than the layout, then not retaining spacing is the better choice, because it will not put artificial spaces in the middle of sentences to preserve indentation. In a perfect world I think I would capture it both with layout and without.
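
HABit is a Windows GUI tool, but if all you need is readable text it is worth knowing that much of WordStar’s odd appearance comes from the high bit the format sets on certain characters, plus the control codes it embeds for formatting. The sketch below is my own rough approximation of that idea, not what HABit does internally; expect it to lose bold, underline and other layout information, and note that the file names are just placeholders:

```python
def wordstar_to_text(raw_bytes):
    """Crude WordStar-to-ASCII conversion: clear the high bit WordStar uses for
    its own markers, keep printable characters and line breaks, and drop the
    embedded control codes used for formatting (bold, underline, etc.)."""
    out = []
    for b in raw_bytes:
        b &= 0x7F                      # strip WordStar's high-bit flags
        if b in (0x09, 0x0A, 0x0D):    # keep tabs, newlines, carriage returns
            out.append(chr(b))
        elif 0x20 <= b < 0x7F:         # keep printable ASCII
            out.append(chr(b))
        # everything else is a WordStar control code -- drop it
    return "".join(out)

# Hypothetical file names for illustration
with open("LETTER.WS", "rb") as infile:
    text = wordstar_to_text(infile.read())
with open("converted/LETTER.txt", "w") as outfile:
    outfile.write(text)
```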

Summary

So my rhythm of working with the floppies after I had all the hardware and software installed was as follows:

  • create a new folder for each disk, with an empty ‘converted’ folder within it
  • insert floppy into the drive
  • run DeviceSide’s Disk Image and Browse software (found on my PC running Windows under Start -> Programs -> USB Floppy)
  • paste the full path of the destination folder
  • name the disk image
  • click ‘Capture Disk Image’
  • double click on the disk image and view the files via vfd (virtual floppy drive)
  • copy all files into the folder for that disk
  • convert files to a stable format (I was going from WordStar to ASCII text) and save the files in the ‘converted’ folder

These are the detailed instructions I tried to find when I started my own data rescue project. I hope this helps you rescue files currently trapped on 5 1/4″ floppies. Please let me know if you have any questions about what I have posted here.

Update: Another great source of information is Archive Team’s wiki page on Rescuing Floppy Disks.

Career Update


I have some lovely news to share! In early July, I will join the Library and Archives of Development at the World Bank as an Electronic Records Archivist. This is a very exciting step for me. Since the completion of my MLS back in 2009, I have mostly focused on work related to metadata, taxonomies, search engine optimization (SEO) and web content management systems. With this new position, I will finally have the opportunity to put my focus on archival issues full time while still keeping my hands in technology and software.

I do have a request for all of you out there in the blogosphere: If you had to recommend a favorite book or journal article published in the past few years on the topic of electronic records, what would it be? Pointers to favorite reading lists are also very welcome.

DH2009: Digital Lives and Personal Digital Archives

Session Title: Digital Lives: How people create, manipulate and store their personal digital archives
Speaker: Peter Williams, UCL

Digital lives is a joint project of UCL, British Library and University of Bristol

What? We need a better understanding of how people manage digital collections on their laptops, PDAs and home computers. This is important due to the transition from paper-based personal collections to digital collections. The hope is to help people manage their digital archives before the content gets to the archives.

How? Talk to people through in-depth narrative interviews. Ask people about their very first memories of information technology. When did they first use a computer? Do they have anything from that computer? How did they move the content from that computer? People enjoyed giving this narrative digital history of their lives.

Who? 25 interviewees – both established and emerging people whose works would or might be of interest to repositories of the future.

Findings?

  • They created a detailed flowchart of users’ reported process of document manipulation.
  • Common patterns in use of email showed that people used email across all these platforms and environments. Preserving email is not just a case of saving one account’s messages:
    • work email
    • Gmail/Yahoo
    • mails via Facebook
    • Twitter
  • Documented personal information styles that relate a skills dimension to a data security dimension.

The one question I caught was from someone who asked if they thought people would stop using folders to organize emails and digital files with the advent of easy search across documents. The speaker answered by mentioning the revelations in the paper Don’t Take My Folders Away!. People like folders.

My Thoughts

This session got me to think again about the SAA2008 session that discussed the challenges that various archivists are facing with hybrid literary collections. Matthew Kirschenbaum also pointed me to MITH’s white paper: Approaches to Managing and Collecting Born-Digital Literary Materials for Scholarly Use.

I am very interested to see how ideas about preserving personal digital records evolve. For example, what happens to the idea of a ‘draft’ in a world that auto-saves and versions documents every few minutes such as Google Documents does?

With born digital photos we run into all sorts of issues. Photos are simultaneously kept on cameras, hard drives, web based repositories (flickr, smugmug, etc) and off-site backup (like mozy.com). Images are deleted and edited differently across environments as well. A while back I wrote a post considering the impact of digital photography on the idea of photographic negatives as the ‘photographers’ sketchbooks’: Capa’s Found Images and Thoughts on Digital Photographers’ Sketchbooks.

I really liked the approach of this project in that it looked at general patterns of behavior rather than attempting to extrapolate from archivists’ experiences with individual collections. This sort of research takes a lot of energy, but I am hopeful that creating these general user profiles will lead to best practices for preserving personal digital collections that can be applied easily as needed.

As is the case with all my session summaries from DH2009, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Archivists and New Technology: When Do The Records Matter?

Navigating the rapidly changing landscape of new technology is a major challenge for archivists. As quickly as new technologies come to market, people adopt them and use them to generate records. Businesses, non-profits and academic institutions constantly strive to find ways to be more efficient and to cut their budgets. New technology often offers the promise of cost reductions. In this age of constantly evolving software and technological innovation, how do archivists know when a new technology is important or established enough to take note of? When do the records generated by the latest and greatest technology matter enough to save?

Below I have included two diagrams that seek to illustrate the process of adopting new technology. I think they are both useful in aiding our thinking on this topic.

The first is the “Hype Cycle”, as proposed by analyst Jackie Fenn at Gartner Group. It breaks down the phases that new technologies move through as they progress from their initial concept through to broad acceptance in the marketplace. The generic version of the Hype Cycle diagram below is from the Wikipedia entry on hype cycle.

Gartner Hype Cycle (Wikipedia)

Each summer, Gartner comes out with a new update on Where Are We In The Hype Cycle? Last summer, microblogging was just entering the ‘Peak of Inflated Expectations’, public virtual worlds were sliding down into the ‘Trough of Disillusionment’ and location aware applications were climbing back up the ‘Slope of Enlightenment’. There is even a book about it: Mastering the Hype Cycle: How to Choose the Right Innovation at the Right Time.

The other diagram is the Technology Adoption Lifecycle from Geoffrey Moore’s Crossing the Chasm. This perspective on the technology cycle is from the perspective of bringing new technology to market. How do you cross the chasm between early adopters and the general population?

Technology Adoption Lifecycle (Wikipedia)

Archivists need to consider new technology from two different perspectives: when to use it to further their own goals as archivists, and when to address the need to preserve records being generated by new technology. A fair bit of attention has been focused on figuring out how to get archivists up to speed on new web technology. In August 2008, ArchivesNext posted about hunting for Web 2.0 related sessions at SAA2008 and Friends Told Me I Needed A Blog posted about SAA and the Hype Cycle shortly thereafter.

But how do we know when a technology is ‘important enough’ to start worrying about the records it generates? Do we focus our energy on technology that has crossed the chasm and been adopted by the ‘early majority’? Do we watch for signs of adoption by our target record creators?

I expect that the answer (such as there can be one answer!) will be community specific. As I learned in the 2007 SAA session about preserving digital records of the design community, waiting for a single clear technology or software leader to appear can lead to lost or inaccessible records. Archivists working with similar records already come together to support one another through round tables, mailing lists and conference sessions. I have noticed that I often find the most interesting presentations are those that discuss the challenges a specific user community is facing in preserving their digital records. The 2008 SAA session about hybrid analog/digital literary collections discussed issues related to digital records from authors. Those who worry about records captured in geographic information systems (GIS) were trying to sort out how to define a single GIS electronic record when last I dipped my toes into their corner of the world in the Fall of 2006.

It is not feasible to imagine archivists staying ahead of every new type of technology and attempting to design a method for archiving every possible type of digital record being created. What we can do is make it a priority for a designated archivist within every ‘vertical’ community (government, literary, architecture… etc) to keep their ear to the ground about the use of technology within that community. This could be a community of practice of its own: a group that shares information about the latest trends its members are seeing, along with best practices for handling the newest types of records.

The good news is that archivists aren’t the only ones who want to be able to preserve access to born digital records. Consider Twitter, which only provides easy access to recent tweets. A whole raft of third-party tools built to archive data from Twitter are already out there, answering the demand for a way to backup people’s tweets.

I don’t think archivists always have the luxury of waiting for technology to be adopted by the majority of people and to reach the ‘Plateau of Productivity’. If you are an archivist who works with a community that uses cutting edge technology, you owe it to your community to stay in the loop with how they do their work now. Just because most people don’t use a specific technology doesn’t mean that an individual community won’t pick it up and use it to the exclusion of more common tools.

The design community mentioned above spoke of working with those creating the tools for their community to ensure easy archiving down the line. In our fast paced world of innovation, a subset of archivists need to stay involved with the current business practices of each vertical being archived. This group can work together to identify challenges, brainstorm solutions, build relationships with the technology communities and then disseminate best practices throughout the archives community. I did find a web page for the SAA’s Technology Best Practices Task Force and its document Managing Electronic Records and Assets: A Working Bibliography, but I think that I am imagining something more ongoing, more nimble and more tied into each of the major communities that archivists must support. Am I describing something that already exists?

SAA2008: Preservation and Experimentation with Analog/Digital Hybrid Literary Collections (Session 203)

floppy disks

The official title of Session 203 was Getting Our Hands Dirty (and Liking It): Case Studies in Archiving Digital Manuscripts. The session chair, Catherine Stollar Peters from the New York State Archives and Records Administration, opened the session with a high level discussion of the “Theoretical Foundations of Archiving Digital Manuscripts”. The focus of this panel was preserving hybrid collections of born digital and paper based literary records. The goal was to review new ways to apply archival techniques to digital records. The presenters were all archivists without IT backgrounds who are building on others’ work … and experimenting. She also mentioned that this impacts researchers, historians, and journalists. For each of the presenters, I have listed below the top challenges and recommendations. If you attended the sessions, you can skip forward to my thoughts.

Norman Mailer’s Electronic Records

Challenges & Questions:

  • 3 laptops and nearly 400 disks of correspondence
  • While the letters might have been dictated or drafted by Mailer, all the typing, organization and revisions done on the computer were done by his assistant Judith McNally. This brings into question issues of who should be identified as the record creator. How do they represent the interaction between Mailer & McNally? Who is the creator? Co-Creators?
  • All the laptops and disks were held by Judith McNally. When she died, all of her possessions were seized by county officials. All the disks from her apartment were eventually recovered over a year later – but this raises issues of provenance. There is no way to know who might have viewed or changed the records.

Revelations and Recommendations:

What is accessioning and processing when dealing with electronic records? What needs to be done?

  • gain custody
  • gather information about creator’s (or creators’) use of the electronic records. In March 2007 they interviewed Mailer to understand the process of how they worked together. They learned that the computers were entirely McNally’s domain.
  • number disks, computers (given letters), other digital media
  • create disk catalog – to reflect physical information of the disk. Include color of ink.. underlining..etc. At this point the disk has never been put into a computer. This captures visual & spatial information
  • gather this info from each disk: file types, directory structure & file names

The ideal for future collections of this type is archivist involvement earlier – the earlier the better.

Papers of Peter Ganick

  • Speaker: Melissa Watterworth
  • Featured Collection: Papers of Writer and Small Press Publisher Peter Ganick, Thomas J Dodd Research Center, University of Connecticut

Challenges & Questions:

  • What are the primary sources of our modern world?
  • How do we acquire and preserve born digital records as trusted custodians?
  • How do we preserve participatory media – maybe we can learn from those who work on performance art?
  • How do we incrementally build our collections of electronic records? Should we be preserving the tools?
  • Timing of acquisition: How actively should we be pursuing personal archives? How can we build trust with creators and get them to understand the challenges?
  • Personal papers are very contextual – order matters. Does this hold true for born digital personal archives? What does the networking aspect of electronic records mean – how does it impact the idea of order?
  • First attempt to accession one of Peter Ganick’s laptops and the archivist found nothing she could identify as files.. she found fragments of text – hypertext work and lots of files that had questionable provenance (downloaded from a mailing list? his creations?). She had to sit down next to him and learn about how he worked.
  • He didn’t understand at first what her challenges were. He could get his head around the idea of metadata and issues of authenticity. He had trouble understanding what she was trying to collect.
  • How do we arrange and keep context in an online environment?
  • Biggest tech challenge: are we holding on for too long to ideas of original order and context?
  • Is there a greater challenge in collecting earlier in the cycle? What if the creator puts restrictions on groupings or chooses to withdraw them?
  • Do we want to create contracts with donors? Is that practical?

Revelations and Recommendations:

  • Collect materials that had high value as born digital works but were at a high risk of loss.
  • Build infrastructure to support preservation of born digital records.
  • Go back to the record creator to learn more about his creative process. They used to acquire records from Ganick every few years.. that wasn’t frequent enough. He was changing the tools he used and how he worked very quickly. She made sure to communicate that the past 30 years of policy wasn’t going to work anymore. It was going to have to evolve.
  • Created a ‘submission agreement’ about what kinds of records should be sent to the archive. He submitted them in groupings that made sense to him. She reviewed the records to make sure she understood what she was getting.
  • Considering using PDF/A to capture snapshots of virtual texts.
  • Looked to model of ‘self archiving’ – common in the world of professors to do ongoing accruals.
  • What about ‘embedded archivists’? There is a history of this in the performing arts and NGOs and it might be happening more and more.

George Whitmore Papers

Challenges & Questions:

  • How do you establish identity in a way that is complete and uncorrupted? How do you know it is authentic? How do you make an authentic copy? Are these requirements unreasonable and unachievable?

Revelations and Recommendations:

  • Refresh and replicate files on a regular schedule (see the fixity sketch after this list).
  • They have had good success using Quick View Plus to enable access to many common file formats. On the downside, it doesn’t support everything and since it is proprietary software there are no long term guarantees.
  • In some cases they had to send CP/M files to a 3rd party to have them converted into WordStar and have the ASCII normalized.
  • Varied acquisition notes.. and accession records.. loan form with the 3rd party who did the conversion that summarized the request.. they did NOT provide information about what software was used to convert from CP/M to DOS. This would be good information to capture in the future.
  • Proposed an expansion of the standards to include how electronic records were migrated in the <processinfo> processing notes.
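
On the ‘refresh and replicate’ recommendation above: the usual way to confirm that copies remain uncorrupted between refresh cycles is to record a checksum for every file and re-compute it on each pass. Here is a minimal sketch of that kind of fixity check – my own illustration, not something described in the session, and the directory names are placeholders:

```python
import hashlib
from pathlib import Path

def checksum_tree(root):
    """Return {relative_path: sha256_digest} for every file under root."""
    sums = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            sums[str(path.relative_to(root))] = hashlib.sha256(path.read_bytes()).hexdigest()
    return sums

# Compare a master copy against a replicated copy and flag any differences
baseline = checksum_tree("master_copy")
replica = checksum_tree("replica_copy")
changed = [p for p, digest in baseline.items() if replica.get(p) != digest]
print("Files needing attention:", changed)
```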

Questions & Answers

Question: As part of a writers community, what do we tell people who want to know what they can DO about their records? They want technical information.. they want to know what to keep. Current writers are aware they are creating their legacy.

Answer: Michael: The single best resource is the interPARES 2 Creator Guidelines. The Beinecke has adapted them to distribute to authors. Melissa: Go back to your collection development policies and make sure to include functions you are trying to document (like process.. distribution networks). Also, communities of practice (acid free bits) are talking about formats and guidelines like that. Gabriela: People often want to address ‘value’. Right now we don’t know how to evaluate the value of electronic drafts – it is up to authors.

Question: Cal Lee: Not a question so much as an idea: the world of digital forensics and security and the ‘order of volatility’ dictates that everyone should always be making a full disk copy, bit by bit, before doing anything else.

Comment: Comment on digital forensic tools – there is a lot of historical and editing information about documents preserved in the software… also deleted files are still there.

Question: Have you seen examples of materials coming into the archive where the digital materials are working drafts for a final paper version? This is in contrast to others that are electronic experiments.

Answer: Yes, they do think about this. It can affect arrangement and how the records are described. The formats also impact how things are preserved.

Question: Access issues? Are you letting people link to them from the finding aids? How is the documents’ authenticity protected?

Answer: DSpace gives you a new version anytime you want it (the original bitstream) .. lots of cross linking supports people finding things from more than one path. In some cases documents (even electronic) can only be accessed from within the on site reading room.

Question: What is your relationship like with your IT folks?

Answer: Gabriela: Our staff has been very helpful. We use ‘legacy’ machines to access our content. They build us computers. They are also not archivists, so there is a little divide about priorities and the kind of information that I am interested in.. but it has been a very productive conversation.

Question: (For Melissa) Why didn’t you accept Peter’s email (Melissa had said they refused a submission of email from Peter because it didn’t have research value)?

Answer: The emails that included personal medical content were rejected. The agreement with Peter didn’t include an option to selectively accept (or weed) what was given.

Question: In terms of gathering information from the creators.. do you recommend a formal/recorded interview? Or a more informal arrangement in which you can contact them anytime on an ongoing basis?

Answer: Melissa: We do have more formal methods – ‘documentation study’ style approaches. We might do literature reviews.. Ultimately the submission agreement is the most formal document we have. Gabriela: It depends on what the author is open to.. formal documentation is best.. but if they aren’t willing to be recorded, then you take what you can get!

My Thoughts

I am very curious to see how best practices evolve in this arena. I wonder how stories written using something like Google Documents, which auto-saves and preserves all versions for future examination, will impact how scholars choose to evaluate the evolution of documents. There have already been interesting examinations of the evolution of collaborative documents. Consider this visual overview of the updates to the Wikipedia entry for Sarah Palin created by Dan Cohen and discussed in his blog post Sarah Palin, Crowdsourced. Another great example of this type of visual experience of a document being modified was linked to in the comments of that post: Heavy Metal Umlaut: The Movie. If you haven’t seen this before – take a few minutes to click through and watch the screencast which actually lets you watch as a Wikipedia page is modified over time.

While I can imagine that there will be many things to sort out if we try to start keeping these incredibly frequent snapshot save logs (disk space? quantity of versions? authenticity? author preferences to protect the unpolished versions of their work?) – I think that being able to watch the creative process this way will still be valuable in some situations. I also believe that over time new tools will be created to automate the generation of document evolution visualizations and movies (like the two I link to above) that make it easy for researchers to harness this sort of information.

Perhaps there will be ways for archivists to keep only certain parts of the auto-save versioning. I can imagine an author who does not want anyone to see early drafts of their writing (as is apparently also the case with architects and early drafts of their designs) – but who might be willing for the frequency of updates to be stored. This would let researchers at least understand the rhythm of the writing – if not the low level details of what was being changed.

I love the photo I found for the top of this post. I admit to still having stacks of 3 1/2″ floppy disks. I have email from the early days of BITNET. I have poems, unfinished stories, old resumes and SQL scripts. For the moment my disks live in a box on the shelf labeled ‘Old Media’. Lucky me – I at least still have a computer with a floppy drive that can read them!

Image Credit: oh messy disks by Blude via flickr.

As is the case with all my session summaries from SAA2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.