Month: March 2007

Considering Historians, Archivists and Born Digital Records

I think I renamed this post at least 12 times. My original intention was to consider the impact of born digital records on the skills needed by the historians and researchers of the future. In addition, I found myself exploring the dividing lines among a number of possible roles in ensuring access to the information written in the 1s and 0s of our born digital records.

After my last post about the impact of anonymization of Google Logs, a friend directed me to the work of Dr. Latanya Sweeney. Reading through the information about her research I found Trail Re-identification: Learning Who You are From Where You Have Been. Given enough data to work with, algorithms can often be written that re-identify the individuals who performed the original searches. Carnegie Mellon University’s Data Privacy Lab includes the Trails Learning Project with the goal of answering the question “How can people be identified to the trail of seemingly innocent and anonymous data they leave behind at different locations?” So it seems that there may be a lot of born digital records that start out anonymous but that may permit ‘re-identification’ – given the application of the right tools or techniques. That is fine – historians have often needed to become detectives. They have spent years developing techniques for the analysis of paper documents to support ‘re-identification’. Who wrote this letter? Is this document real or a forgery? Who is the ‘Mildred’ referenced in this record?

The field of diplomatics studies the authenticity and provenance of documents by looking at everything from the paper they were written on to the style of writing to the ink used. I like the idea of using the term ‘digital diplomatics’ for the ever increasing process of verifying and validating born digital records. Google found me the Digital Diplomatics conference that took place earlier this year in Munich. Unfortunately it was more geared toward investigating how the use of computers can enhance traditional diplomatic approaches rather than how to authenticate the provenance of born digital records.

In the March 2007 issue of Scientific American I found the article A Digital Life. It talks primarily about the Microsoft Research project MyLifeBits. A team at Microsoft Research has spent the last six years creating what they call a ‘digital personal archive’ of team member Gordon Bell. This archive hopes to “record all of Bell’s communications with other people and machines, as well as the images he sees, the sounds he hears and the Web sites he visits–storing everything in a personal digital archive that is both searchable and secure.”

They are not blind to the long term challenges of preserving the data itself in some accessible format:

Digital archivists will have to constantly convert their files to the latest formats, and in some cases they may need to run emulators of older machines to retrieve the data. A small industry will probably emerge just to keep people from losing information because of format evolution.

The article concludes:

Digital memories will yield benefits in a wide spectrum of areas, providing treasure troves of information about how people think and feel. By constantly monitoring the health of their patients, future doctors may develop better treatments for heart disease, cancer and other illnesses. Scientists will be able to get a glimpse into the thought processes of their predecessors, and future historians will be able to examine the past in unprecedented detail. The opportunities are restricted only by our ability to imagine them.

Historians will have at least these two types of digital artifacts to explore – those gathered purposefully (such as the digital personal archives described above) and those generated as a byproduct of other activity (such as the Google search logs). Might these be the future parallels to the ‘manuscript’ and ‘corporate’ archives of today?

So we have both the ideas of the Digital Archivist and the Digital Historian. What about a Digital Archaeologist? I am not the first to ponder the possible future job of Digital Archaeologist. A bit of googling of the term led me to Dark Star Gazette and Dear Digital Archaeologist. Back in February of 2007 they pondered:

Will there be digital archaeologists, people who sift through our society’s discarded files and broken web links, carefully brushing away revisions and piecing together antiquated file formats? Will a team of grad students working on their PhDs a thousand, or two thousand, years from now be digging through old blog entries, still archived online in some remote descendant of the Wayback Machine or a copy of Google’s backup tapes?

I can only imagine a world in which this is in fact the case. Given that premise, at what point does the historian get too far from the primary source? If the historian does not understand exactly what a computer program does to extract the information they want from logs or ‘digital memory repositories’ – are they no longer working with the primary source?

Imagine any field in which historians do research. Music? Accounting? Science? In order to examine and interpret primary source records a historian becomes something of an expert in that field. Consider the historian documenting the life of a famous scientist based partly on their lab notebooks. That historian would be best served by being taught how to interpret the notebooks themselves. The historian must be fluent in the language of the record in order to gain the most direct access to the information.

Ah – but if there really are Digital Archaeologists in the far future, perhaps they would be the connection between the primary source born digital records and the historians who wish to study them. Or perhaps the Digital Archivist, in a new take on ‘arranging records’, would transform digital chaos into meaningful records for use by researchers? The historian’s expertise would then need to cover only the content of the records – not exactly how they were rescued from the digital abyss.

Would a Digital Historian be someone who only considers the history of the digital landscape or a historian especially well versed in the interpretation of digital records? In Daniel Cohen and Roy Rosenzweig’s book Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web they seem to use the term in the present tense to refer to historians who use computers and technology to support and expand the reach of their research. Yet, in his essay Scarcity or Abundance? Preserving the Past in a Digital Era, Roy Rosenzweig proposes:

Future graduate programs will probably have to teach such social-scientific and quantitative methods as well as such other skills as “digital archaeology” (the ability to “read” arcane computer formats), “digital diplomatics” (the modern version of the old science of authenticating documents), and data mining (the ability to find the historical needle in the digital hay). In the coming years, “contemporary historians” may need more specialized research and “language” skills than medievalists do.

What is my imagined skill set for the historian of our digital world? A willingness to dig into the rich and chaotic world of born digital records. The ability to use tools and find partners to assist in the interpretation of those records. Equal comfort working at tables covered in dusty boxes and in the virtual domain of glowing computer terminals. And of course – the same curiosity and sense of adventure that has always drawn people to the path of being a historian.

We cannot predict the future – we can only do our best to adapt to what we see before us. I suspect the prefixing of every job title with the word ‘digital’ will disappear over time – much as the prefixing of everything with the letter ‘e’ to let you know that something was electronic or online has ebbed out of popular culture. As the historians and archivists of today evolve into the historians and archivists of tomorrow they will have to deal with born digital records – no matter what job title we give them.

Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of four numbers, each between 0 and 255, and look something like 192.0.2.3. To see how well an IP address can pinpoint the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).
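To make the arithmetic concrete, here is a minimal Python sketch of that kind of redaction. It assumes that ‘erasing part of the IP address’ means zeroing the last of the four numbers, which is consistent with the 256-computer figure in the AP article (8 bits allow 256 values) but is my assumption, not Google’s documented method. The function name `mask_last_octet` and the sample addresses (taken from a range reserved for documentation) are hypothetical.

```python
import ipaddress


def mask_last_octet(address: str) -> str:
    """Zero the final 8 bits of an IPv4 address.

    Assumption: this is one plausible reading of "erasing part of the
    IP address"; it is not taken from any Google documentation.
    """
    # strict=False lets us pass a host address and get its enclosing /24 block.
    network = ipaddress.ip_network(f"{address}/24", strict=False)
    return str(network.network_address)


# Two different (hypothetical) machines in the same block of addresses.
first = mask_last_octet("192.0.2.17")
second = mask_last_octet("192.0.2.200")

# After masking, the two log entries can no longer be told apart.
print(first, second, first == second)

# Size of the group that each masked address now stands for.
print(ipaddress.ip_network("192.0.2.0/24").num_addresses)
```

Once the last number is overwritten there is nothing left in the record from which to recover it, which is exactly what makes this kind of redaction irreversible.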

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is, or whether it goes far enough (plenty of others are doing that, see articles at EFF and Wired: 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist’s tenets is to never do anything to the record you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer)? Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean)?

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, which would not have gotten into trouble for shredding its records if it had done so according to its own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

The Archives and Archivists Listserv: hoping for a stay of execution

There has been a lot of discussion (both on the Archives & Archivists (A&A) Listserv and in blog posts) about the SAA’s recent decision to not preserve the A&A listserv posts from 1996 through 2006 when they are removed from the listserv’s old hosting location at Miami University of Ohio.

Most of the outcry against this decision has fallen into two camps:

  • Those who don’t understand how the SAA task force assigned to appraise the listserv archives could decide it does not have informational value – lots of discussion about how the listserv reflects the move of archivists into the digital age as well as its usefulness for students
  • Those who just wish it wouldn’t go away because they still use it to find old posts. Some mentioned that there are scholarly papers that reference posts in the listserv archives as their primary sources.

I added this suggestion on the listserv:

I would have thought that the Archives Listserv would be the ideal test case for developing a set of best practices for archiving an organization’s web based listserv or bboard.

Perhaps a graduate student looking for something to work on as an independent project could take this on? Even if they only got permission for working with posts from 2001 onward [post 2001 those who posted had to agree to ‘terms of participation’ that reduce issues with copyright and ownership] – I suspect it would still be worthwhile.

I have always found that you can’t understand all the issues related to a technical project (like the preservation of a listserv) until you have a real life case to work on. Even if SAA doesn’t think we need to keep the data forever – here is the perfect set of data for archivists to experiment with. Any final set of best practices would be meant for archivists to use in the future – and would be all the easier to comprehend if they dealt with a listserv that many of them are already familiar with.

Another question: couldn’t the listserv posts still be considered ‘active records’? Many current listserv posters claim they still access the old list’s archives on a regular basis. I would be curious what the traffic for the site is. That is one nice side effect of this being on a website – it makes the usage of records quantifiable.

There are similar issues in the analog world when records that people still want to use lose their physical home and are disposed of. But, as others have also pointed out, digital media is getting cheaper and smaller by the day. We are not talking about paying rent on a huge warehouse or a space that needs serious temperature and humidity control.

I was glad to see Rick Prelinger’s response on the current listserv that simply reads:

The Internet Archive is looking into this issue.

I had already checked when I posted my response to the listserv yesterday – having found my way to the A&A old listserv page in the Wayback Machine. For now all that is there is the list of links to each week’s worth of postings – nothing beyond that has been pulled in.

I have my fingers crossed that enough of the right people have become aware of the situation to pull the listserv back from the brink of the digital abyss.

NARA’s Electronic Records Archives in West Virginia

“WVU, NATIONAL ARCHIVES PARTNER” from http://wvutoday.wvu.edu/news/page/5419/

In a press release dated February 28, 2007, the National Archives and Records Administration of the United States (NARA) and West Virginia University (WVU) declared they had signed “a Memorandum of Understanding to establish a 10-year research and educational partnership in the study of electronic records and the promotion of civic awareness of the use of electronic records as educational resources.” It goes on to say that the two organizations “will engage in collaborative research and associated educational activities” including “research in the preservation and long-term access to complex electronic records and engineering design documentation.” WVU will receive “test collections” of electronic records from NARA to support their research and educational activities.

This sounded interesting. I stumbled across this on NARA’s website while looking for something else. No blog chatter or discussions about what this means for electronic records research (thinking of course of the big Footnote.com announcement and all the back and forth discussion that inspired). So I went hunting to see if I could find the actual Memorandum of Understanding. No sign of it. I did find WVU’s press release which included the photo above. This next quote is in the press release as well:

The new partnership complements NARA’s establishment of the Electronic Records Archives Program operations at the U.S. Navy’s Allegany Ballistics Laboratory in Rocket Center near Keyser in Mineral County.

Googling Allegany Ballistics Laboratory got me information about how it is a Superfund site that is in the late or final stages of cleanup. It also led me to an article from Senator Byrd about how pleased he was in October of 2006 about a federal spending bill that included funds for projects at ABL – including a sentence mentioning how NARA “will use the Mineral County complex for its electronic records archive program.” No mention of this on the Electronic Records Archive (ERA) website or on their special press release page. I don’t see any info about any NARA installations in West Virginia on their Locations webpage.

Then I found the WVU newspaper The Daily Athenaeum and an article titled “National Archives, WVU join forces” dated March 1, 2007. (If the link gives you trouble – just search on the Athenaeum site for NARA and it should come right up.) The following quote is from the article:

“This is a tremendous opportunity for WVU. The National Archives has no other agreements like this with anyone else,” said John Weete, WVU’s vice president for research and economic development.

The University will help the NARA develop the next generation of technologies for the Electronic Records Archives. WVU will also assist in the management of NARA’s tremendous amount of data, Weete said.

“This is a great opportunity for students. The Archives will look for students who are masters at handling records and who care about the documents (for future job opportunities),” said WVU President David Hardesty.

WVU students and faculty will hopefully soon have access to the Rocket Center archives, and faculty will be overseeing the maintenance of such records, Hardesty said.

Perhaps I am reading more into this than was intended, but I am confused. I was unable to find any information on the WVU website about an MLS or Archival Studies program there. I checked in both the ALA’s LIS directory and the SAA’s Directory of Archival Education to confirm there are no MLS or Archives degree programs in West Virginia. So where are the “students who are masters at handling records” going to come from? I work daily in the world of software development and I can imagine Computer Scientists who are interested in electronic records and their preservation. But as I have discovered many times over during my archives coursework there are a lot of important and unique ideas to learn in order to understand everything that is needed for the archival preservation of electronic records for “the life of the republic” (as NARA’s ERA project is so fond of saying).

I am pleased for WVU to have made such a landmark agreement with NARA to study and further research into the preservation and educational use of electronic records. Unfortunately I am also suspicious of this barely mentioned bit about the Rocket Center archives and ABL and how WVU is going to help NARA manage their data.

Has anyone else heard more about this?

Update (03/07/07):

Thanks to Donna in the comments for suggesting that WVU’s program is in ‘Public History’ (a term I had not thought to look under). This is definitely more reassuring.

WVU appears to offer both a Certificate in Cultural Resource Management and a M.A. in Public History – both described here on the Cultural Resource Management and Public History Requirements page.

The page listing History Department graduate courses included the two ‘public history’ courses listed below:

412 Introduction to Public History. 3 hr. Introduction to a wide range of career possibilities for historians in areas such as archives, historical societies, editing projects, museums, business, libraries, and historic preservation. Lectures, guest speakers, field trips, individual projects.

614 Internship in Public History. 6 hr. PR: HIST 212 and two intermediate public history courses. A professional internship at an agency involved in a relevant area of public history. Supervision will be exercised by both the Department of History and the host agency. Research report of finished professional project required.