2007 | Spellbound Blog

Epidemiological Research and Archival Records: Source of Records Used for Research Fails to Make the News

April 26, 2007 2 Comments

In early April, Reuters ran an article that was picked up by YahooNews titled Closing Schools reduced flu deaths in 1918. I was immediately convinced that archival records must have supported this research – even though no mention of that was included in the article. The article did tell me that it was Dr. Richard Hatchett of the National Institute of Allergy and Infectious Diseases (NIAID) who led the research.

I sent him an email asking about where the data for his research came from. Did the NIH have a set of data from long ago? Here is an excerpt from his kind reply:

Unfortunately, nobody kept track of data like this and you can see the great lengths we went to to track it down. Many of the people we thank in our acknowledgment at the end of the paper tracked down and provided information in local or municipal archives. For Baltimore, I came up and spent an entire day in the library going through old newspapers on microfilm. Some of the information had been gathered by previous historians in works on the epidemic in individual cities (Omaha — an unpublished Master’s thesis — and Newark are examples). Gathering the information was extremely arduous and probably one of the reasons no one had looked at this systematically before. Fortunately, several major newspapers (the NYTimes, Boston Globe, Washington Post, Atlanta Journal-Constitution, etc.) now have online archives going back at least until 1918 that facilitated our search.

Please let me know if you have any other questions. We were amateurs and pulling the information together took a lot longer than we would ever have imagined.

He also sent me a document titled “Supporting Information Methods”. This turned out to be 37 pages of detailed references found to support their research. They were hunting for three types of information: first reported flu cases, amplifying events (such as Liberty Loan Parades ) and interventions (such as quarantines, school closings and bands on public gatherings).

Many of the resources cited are newspapers (see The Baltimore Sun’s 1918 flu pandemic timeline for examples of what can be found in newspapers), but I was more intrigued by the wide range of non-newspaper records used to support this research. A few examples:

Chicago (First reported case): Robertson JD. Report and handbook of the Department of Health of the City of Chicago for the years 1911 to 1918 inclusive. Chicago, 1919.
Cleveland (School closings): The City Record of the Cleveland City Council, October 21, 1918, File No. 47932, citing promulgation of health regulations by Acting Commissioner of Health H.L. Rockwood.
New Orleans (Ban on public gatherings): Parish of Orleans and City of New Orleans. Report of the Board of Health, 1919, p. 131.
Seattle (Emergency Declaration): Ordinance No. 38799 of the Seattle City Council, signed by Mayor Hanson October 9, 1918.

The journal article referenced in the Reuter’s story, Public health interventions and epidemic intensity during the 1918 influenza pandemic, was published in the Proceedings of the National Academy of Sciences (PNAS) and is available online.

The good news here is that the acknowledgment that Dr. Hatchett mentions in his email includes this passage:

The analysis presented here would not have been possible without the contributions of a large number of public health and medical professionals, historians, librarians, journalists, and private citizens […followed by a long list of individuals].

The bad news is that the use of archival records is not mentioned in the news story.

We frequently hear about how little money there is at most archives. Cutbacks in funding are the norm. Every few weeks we hear of archives forced to cut their hours, staff or projects. Public understanding of the important ways that archival records are used can only help to reverse this trend.

Maybe we need a bumper sticker to hand out to new researchers. Something catchy and a little pushy – something that says “Tell the world how valuable our records are!” – only shorter.

If You Use Archival Records – Go On The Record
Put Primary Sources in the Spotlight
Archivists for Footnotes: Keep the paper trail alive
Archives Remember: Don’t Forget Them

I don’t love any of these – anyone else feeling wittier and willing to share?

(For more images of the 1918 Influenza Epidemic, visit the National Museum of Health and Medicine’s Otis Historical Archives’ Images from the 1918 Influenza Epidemic.)

Digital Archiving Articles – netConnect Spring 2007

April 19, 2007 1 Comment

Thanks to Jessamyn West’s blog post, I found my way to a series of articles in the Spring 2007 edition of netConnect:

Funding the Past – and Future by Fancine Fialkoff
Saving Digital History by Jessamyn West
LC Needs Digital Support by Norman Oder

“Saving Digital History” is the longest of the three and is a nice survey of many of the issues found at the interseciton of archiving, born digital records and the wild world of the web. I especially love the extensive Link List at the end of the articles — there are lots of interesting related resources. This is the sort of list of links I wish were available with ALL articles online!

I can see the evolution of some of the ideas she and her co-speakers touched on in their session at SAA 2006: Everyone’s Doing It: What Blogs Mean for Archivists in the 21st Century. I hope we continue to see more of these sorts of panels and articles. There is a lot to think about related to these issues – and there are no easy answers to the many hard questions.

Update: Here is a link to Jessamyn’s presentation from the SAA session mentioned above: Capturing Collaborative Information News, Blogs, Librarians, and You.

Copyright Law: Archives, Digital Materials and Section 108

April 12, 2007 2 Comments

I just found my way today to Copysense (obviously I don’t have enough feeds to read as it is!). Their current clippings post highlighted part of the following quote as their Quote of the Week.

“[L]egislative changes to the copyright law are needed. First, we need to amend the law to give the Library of Congress additional flexibility to acquire the digital version of a work that best meets the Library’s future needs, even if that edition has not been made available to the public. Second, section 108 of the law, which provides limited exceptions for libraries and archives, does not adequately address many of the issues unique to digital media—not from the perspective of copyright owners; not from the perspective of libraries and archives.” Marybeth Peters , Register of Copyrights, March 20, 2007

Marybeth Peters was speaking to the Subcommittee on Legislative Branch of the Committee on Appropriations about the Future of Digital Libraries.

Copysense makes some great points about the quote:

Two things strike us as interesting about Ms. Peters’ quote. First, she makes the quote while The Section 108 Study Group continues to work through some very thorny issues related to the statutes application in the digital age […] Second, while Peters’ quote articulates what most information professionals involved in copyright think is obvious, her comments suggest that only recently is she acknowledging the effect of copyright law on this nation’s de facto national library. […] [S]omehow it seems that Ms. Peters is just now beginning to realize that as the Library of Congress gets involved in the digitization and digital work so many other libraries already are involved in, that august institution also may be hamstrung by copyright.

I did my best to read through Section 108 of the Copyright Law – subtitled “Limitations on exclusive rights: Reproduction by libraries and archives”. I found it hard to get my head around … definitely stiff going. There are 9 different subsections (‘a’ through ‘i’) each with there own numbered exceptions or requirements. Anxious to get a grasp on what this all really means – I found LLRX.com and their Library Digitization Projects and Copyright page. This definitely was an easier read and helped me get further in my understanding of the current rules.

Next I explored the website for the Section 108 Study Group that is hard at work figuring out what a good new version of Section 108 would look like. I particularly like the overview on the About page. They have a 32 page document titled Overview of the Libraries and Archives Exception in the Copyright Act: Background, History, and Meaning for those of you who want the full 9 years on what has gotten us to where we are today with Section 108.

For a taste of current opinions – go to the Public Comments page which provides links to all the written responses submitted to the Notice of public roundtable with request for comments. There are clear representatives from many sides of the issue. I spotted responses from SAA, ALA and ARL as well as from MPAA, AAP and RIAA. All told there are 35 responses (and no, I didn’t read them all). I was more interested in all the different groups and individuals that took the time to write and send comments (and a lot of time at that – considering the complicated nature of the original request for comments and the length of the comments themselves). I was also intrigued to see the wide array of job titles of the authors. These are leaders and policy makers (and their lawyers) making sure their organizations’ opinions are included in this discussion.

Next stop – the Public Roundtables page with it’s links to transcripts from the roundtables – including the most recent one held January 31, 2007. Thanks to the magic of Victoria’s Transcription Services, the full transcripts of the roundtables are online. No, I haven’t read all of these either. I did skim through a bit of it to get a taste of the discussions – and there is some great stuff here. Lots of people who really care about the issues carefully and respectfully exploring the nitty-gritty details to try and reach good compromises. This is definitely on my ‘bookmark to read later’ list.

Karen Coyle has a nice post over on Coyle’s InFormation that includes all sorts of excerpts from the transcripts. It gives you a good flavor of what some of these conversations are like – so many people in the same room with such different frames of reference.

This is not easy stuff. There is no simple answer. It will be interesting to see what shape the next version of Section 108 takes with so many people with very different priorities pulling in so many directions.

The good news is that there are people with the patience and dedication to carefully gather feedback, hold roundtables and create recommendations. Hurrah for the hard working members of the Section 108 Study Group – all 19 of them!

Visualizing Archival Collections

April 8, 2007 1 Comment

As I mentioned earlier, I am taking an Information Visualization class this term. For our final class project I managed to inspire two other classmates to join me in creating a visualization tool based on the structured data found in the XML version of EAD finding aids.

We started with the XML of the EAD finding aids from University of Maryland’s ArchivesUM and the Library of Congress Finding Aids. My teammates have written a parser that extracts various things from the XML such as title, collection size, inclusive dates and subjects. Our goal is to create an innovative way to improve the exploration and understanding of archival collections using an interactive visualization.

Our main targets right now are to use a combination of subjects, years and collection size to give users a better impression of the quantity of archival materials that fit various search criteria. I am a bit obsessed about using the collection size as a metric for helping users understand the quantity of materials. If you do a search for a book in a library’s catalog – getting 20 hits usually means that you are considering 20 books. If you consider archival collections – 20 hits could mean 20 linear feet (20 collections each of which is 1 linear foot in size) or it could mean 2000 linear feet (20 collections each of which is 100 linear feet in size). Understanding this difference is something that visualization can help us with. Rather than communicating only the number of results – the visualization will communicate the total size of collections assigned each of the various subjects.

I have uploaded 2 preliminary screen mockups one here and the second here trying to get at my ideas for how this might work.

Not reflected in the mock-ups is what could happen when a user clicks on the ‘related subject’ bars. Depending on where they click – one of two things could happen. If they click on the ‘related subject’ bar WITHIN the boundaries of the selected subject (in the case above, that would mean within the ‘Maryland’ box), then the search would filter further to only show those collections that have both the ‘Maryland’ and newly ‘added’ tag. The ‘related subjects’ list and displayed year distribution would change accordingly as well. If, instead, the user clicks on a ‘related subject’ bar OUTSIDE the boundary of the selected subject — then that subject would become the new (and only) selected subject and the displayed collections, related subjects and years would change accordingly.

So that is what we have so far. If you want to keep an eye on our progress, our team has a page up on our class wiki about this project. I have a ton of ideas of other things I would love to add to this (my favorite being a map of the world with indications of where the largest amount of archival materials can be found based on a keyword or subject search) – but we have to keep our feet on the ground long enough actually build something for our class project. This is probably a good thing. Smaller goals make for a greater chance of success.

Getting Your Toes Wet: Basic Principals of Design for the New Web

April 5, 2007

Ellyssa Kroski of InfoTangle has created a great overview of current trends in website and application design in her post Information Design for the New Web. If you are going to Computers in Libraries, you can see her present the ideas she discusses in her post in a session of the same name on Monday April 16.

She highlights 3 core principles with clear explanations and great examples:

Keep it Simple
Make it Social
Offer Alternate Navigation

As archives continue to dive into the deep end of the internet pool, more and more archivists will find themselves participating in discussions about website design choices. Understanding basic principals like those discussed in Kroski’s post will go a long way to making archivists feel more comfortable contributing to these sorts of discussions.

Don’t think that things like this should be left to the IT department or only the ‘techie archivists’ on your staff (if you have any). You all have a lot to contribute. You know your collections. You know the importance of archival principals of provenance, original order and context. There are lots of aspects of archival materials that traditional web designers might not consider important. Things that you know are very important if people are to understand your archives’ materials while browsing from the comfort of their homes.

So dip your toes in. Learn some buzz words, look at some fun websites and get comfortable with some innovative ideas. The water is just fine.

Ideas for SAA2007: Web Awards, Wikis and Blogs

April 4, 2007 5 Comments

Online since late March of this year, the new ArchivesNext blog is wasting no time in generating great ideas. First of all – I love the idea of awards for the best archives websites. How about ‘Best Archives Blog’, ‘Best Online Exhibit’ and ‘Best Archives Website’? It seems like barely a week goes by on the Archives and Archivists’ listserv between each announcement of a new archives website or online exhibition. I think an entire blog could be created just showing off the best of archives websites. I would love to see those making the greatest online contributions to the profession honored at the annual conference.

Another great ArchivesNext idea is a wiki for SAA2007 in Chicago. I was amazed at the conference last summer to see the table where you could buy audio recordings of the presentations. I live so much in the tech/geek world that I had assumed that of course SAA would have someone recording the sessions so they could be posted online. I assumed that there would be a handy place for presenters to upload their handouts and slides. A wiki would be a great way to support this sort of knowledge sharing. People come from all over the world for just a few days together at conferences like this. Many more can’t make the trip. I think it would go a long way to build more of an online archival community to have something beyond a listserv that let groups of like minded individuals build a collection of resources surrounding the topics discussed at the conference.

What about blogging the conference? Last year Foldering.com suggested we all use SAA2006 to tag our conference blog posts. Technorati shows 25 posts with that tag (and yes, a lot of those posts are mine). One major stumbling block was a lack of wireless in the hotel where the convention was held. Another was a combination of lack of interest and lack of coordination. Too few people were mobilized in time to plan coverage of the panels.

We could leverage a conference wiki to coordinate more effectively than we did last year. Simple signup sheets could help us ensure coverage of the panels and roundtables. I think it would be interesting to see if those who cannot attend the conference might express preferences about which talks should definitely be covered. If there are wiki pages for each panel and roundtable, those pages could eventually include links to the blog posts of bloggers covering those talks.

Blogging last August at SAA2006 was interesting for me. I had never attempted to blog at a conference (Spellboundblog was less than 1 month old last August). I took 37 pages of notes on my laptop. Yes, there was a lot of white space – but it was still 37 pages long. I found that I couldn’t bring myself to post in the informal ‘stream of consciousness style that I have often seen in ‘live blogging’ posts. I wanted to include links. I wanted to include my thoughts about each speaker I listened to. I wanted to draw connections among all the different panels I attended. I wanted someone who hadn’t been there to be able to really understand the ideas presented from reading my posts. That took time. I ended up with 10 posts about specific panels and round tables and another 2 about general conference ideas and impressions. Then I gave up. I got to the point where I felt burdened by the pages I had not transcribed. I had gotten far enough away from the conference that I didn’t always understand my own notes. I had new things I wanted to talk about, so I set aside my notes and moved on.

I hope we get more folks interested in blogging the conference this year. Feel free to email me directly at jeanne AT spellboundblog.com if you would like to be kept in the loop for any blogging coordination (though I will certainly post whatever final plan we come up with here).

SAA Members: make sure your ballot is in the mail

March 30, 2007

If you are a member of SAA and you haven’t filled out and mailed your ballot for the 2007 Elections, please consider this a friendly reminder that it must be post marked on or before April 2, 2007 in order to be counted. They have convenientely posted all the bios and statements from the candidates online.

Supporting Appraisal of Digital Records

March 28, 2007

In his recent post to the A+A Listserv, Richard Pearce-Moses explores some really interesting ideas related to the appraisal of a listserv. The notions that particularly caught my imagination were in these passages:

We could take advantage of the fact that the list is in electronic format and conceivably use some AI filters to do some weeding. But at what cost? Is this truly feasible? And what are the implications on the integrity of the collection if only a portion are saved?

I was particularly interested in the number of people who said they searched the lists’ archives. Although demonstrated use can be used to justify preservation, what is sufficient use and how do we measure it? Are there use patterns that suggest these messages are inactive, with use falling off over time in a pattern that suggests they not be kept permanently? (To my knowledge the server logs are not accessible.)

What sort of infrastructure could archivists work toward putting in place to support automated weeding of listserv postings? If the postings were not sent via email but rather posted via some other interface, I can imagine a choice being presented at the time the post was written ‘Keep’ vs ‘Discard After 6 months’. There is something like this already in place for some government email systems – the sender indicates if the message is ‘permanent’ when the message is sent. Of course that presents a whole series of new problems. Someone in one of my classes mentioned that U.S. White House staffers had taken to marking EVERYTHING as permanent because emails that were marked not permanent were being scrutinized NOW. I wish I could find a source for this story online – but all I am finding today is the latest hubbub about White House staffers using non-government email accounts to communicate when they didn’t want to worry about it being preserved (or at least that seems to be the current allegation ). Luckily for this discussion we aren’t worried about people hiding posts.

Of course some of this can be implemented via those who post to the list. If everyone (anyone?) used standard post title prefixes, the appraising archivist of the future could easily filter out entire subsets of posts. SAA has this posted on the Terms of Participation Page for the Archives and Archivists List:

In order to maintain a highly informative and focused professional forum, SAA strongly encourages list participants to use the following labels at the beginning of all subject lines. This will allow others to filter list messages via mail rules and automatically select those types of information according to their individual needs and preferences.

“Calls:” (Calls for papers, survey participation, etc.)

“Disc:” (Discussion on various topics)

“Event:” (Conference, seminar, workshop announcements, etc.)

“FF:” (“Friday funnies,” see below)

“FYI:” (General announcements and information)

“Job:” (Job announcements)

“Media:” (Links to archives and archivists in the news)

“Qs:” (Questions)

“Pubs:” (Announcements re: books, chapters, papers, dissertations, and reviews)

There is a link to this page at the bottom of EVERY post to the listserv, but I have rarely seen anyone use these prefixes in post titles. ‘Job’ is the one that gets the most use, and usually only for a short time after someone politely ask everyone creating job posts to make sure they have good titles.

The idea of examining ‘usage’ patterns is also an interesting one. If we could easily capture and examine the view and search logs of posts we could build an understanding of what types of posts really are re-examined over time. But then what do we do with that information? Does past interest in a topic translate into permanent informational value? Just because someone didn’t look at it again yet – does that mean we assume that no-one will every be interested in its content?

My instinct (when wearing my techie hat) is to vote for the ‘keep it all – disk space is cheap’ approach. That said, I know that the expense of the space on that first hard drive you save your records on is just the tip of the iceberg in terms of digital preservation expenses.

Thinking about what you want to keep before you turn on any software system is always going to make things easier. I know that as the laws in the US continue to evolve to demand the retention of specific types of data the software will also continue evolve to make it easy to keep ONLY what needs to be kept. Private sector companies are usually quite intent on sticking to the letter of the law in that regard – they never want to keep more than they must (or so their lawyers like to tell them). It is also in the best interest of the software companies to ensure that all the required records are being kept.

Another driving force to generate systems that know how to filter and keep the ‘right’ records (whatever that means) could be individual users. In a universe of digital cameras where you can take 1000 photos as cheaply as 100 – I wonder if there is a place for software that intelligently archives your most frequently accessed (and tagged and shared) photos. The flip side of that could be auto-weeding (perhaps with a quick review option) every year. This would be the same approach some take to cleaning out their clothes closets – if I haven’t touched it in 2 years, then I should get rid of it.

While doing my research last term into the appraisal of Digital GIS records, I was amazed by how much of what was currently being done could only be accomplished by brute force. Frequently the work is being done through the sheer will of a small group of very dedicated people using tools not particularly suited to the task. I need to do more research into the realm of electronic record management – I want to understand what standard tools are being supplied (or not supplied). Are there tools for those who manage large repositories of electronic records where there is an acknowledged goal of supporting records scheduling and permanent preservation?

In our increasingly digital world, I think there will always be cases of born digital records that must be considered for appraisal without all the answers to all our questions. I am just fascinated at the notion of building the tools we need into the software systems from the start. At the end of a record’s active life cycle we would then be able to make and implement appraisal choices more easily. Imagine that – planning ahead for appraisal!

Considering Historians, Archivists and Born Digital Records

March 23, 2007 3 Comments

I think I renamed this post at least 12 times. My original intention was was to consider the impact of born digital records on the skills needed for the historian/researchers of the future. In addition I found myself exploring the dividing lines among a number of possible roles in ensuring access to the information written in the 1s and 0s of our born digital records.

After my last post about the impact of anonymization of Google Logs, a friend directed me to the work of Dr. Latanya Sweeney. Reading through the information about her research I found Trail Re-identification: Learning Who You are From Where You Have Been. Given enough data to work with, algorithms can be written that often can re-identify the individuals who performed the original searches. Carnegie Mellon University‘s Data Privacy Lab includes the Trails Learning Project with the goal of answering the question “How can people be identified to the trail of seemingly innocent and anonymous data they leave behind at different locations?”. So it seems that there may be a lot of born digital records that start out anonymous but that may permit ‘re-identification’ – given the application of the right tools or techniques. That is fine – historians have often needed to become detectives. They have spent years developing techniques for the analysis of paper documents to support ‘re-identification’. Who wrote this letter? Is this document real or a forgery? Who is the ‘Mildred’ referenced in this record?

The field of diplomatics studies the authenticity and provenance of documents by looking at everything from the paper they were written on to the style of writing to the ink used. I like the idea of using the term ‘digital diplomatics’ for the ever increasing process of verifying and validating born digital records. Google found me the Digital Diplomatics conference that took place earlier this year in Munich. Unfortunately it was more geared toward investigating how the use of computers can enhance traditional diplomatic approaches rather than how to authenticate the provenance of born digital records.

In the March 2007 issue of Scientific American I found the article A Digital Life. It talks primarily about the Microsoft Research project MyLifeBits. A team at Microsoft Research has spent the last six years creating what they call a ‘digital personal archive’ of team member Gordon Bell. This archive hopes to “record all of Bell’s communications with other people and machines, as well as the images he sees, the sounds he hears and the Web sites he visits–storing everything in a personal digital archive that is both searchable and secure.”

They are not blind to the long term challenges of preserving the data itself in some accessible format:

Digital archivists will have to constantly convert their files to the latest formats, and in some cases they may need to run emulators of older machines to retrieve the data. A small industry will probably emerge just to keep people from losing information because of format evolution.

The article concludes:

Digital memories will yield benefits in a wide spectrum of areas, providing treasure troves of information about how people think and feel. By constantly monitoring the health of their patients, future doctors may develop better treatments for heart disease, cancer and other illnesses. Scientists will be able to get a glimpse into the thought processes of their predecessors, and future historians will be able to examine the past in unprecedented detail. The opportunities are restricted only by our ability to imagine them.

Historians will have at least these two types of digital artifacts to explore – those gathered purposefully (such as the digital personal archives described above) and those generated as a byproduct of other activity (such as the Google search logs). Might these be the future parallels to the ‘manuscript’ and ‘corporate’ archives of today?

So we have both the ideas of the Digital Archivist and the Digital Historian. What about a Digital Archaeologist? I am not the first to ponder the possible future job of Digital Archaeologist. A bit of googling of the term led me to Dark Star Gazette and Dear Digital Archaeologist. Back in February of 2007 they pondered:

Will there be digital archaeologists, people who sift through our society’s discarded files and broken web links, carefully brushing away revisions and piecing together antiquated file formats? Will a team of grad students working on their PhDs a thousand, or two thousand, years from now be digging through old blog entries, still archived online in some remote descendant of the Wayback Machine or a copy of Google’s backup tapes?

I can only imagine a world in which this is in fact the case. Given that premise, at what point does the historian get too far from the primary source? If the historian does not understand exactly what a computer program does to extract the information they want from logs or ‘digital memory repositories’ – are they no longer working with the primary source?

Imagine any field in which historians do research. Music? Accounting? Science? In order examine and interpret primary source records a historian becomes something of an expert in that field. Consider the historian documenting the life of a famous scientist based partly on their lab notebooks. That historian would be best served by being taught how to interpret the notebooks themselves. The historian must be fluent in the language of the record in order to gain the most direct access to the information.

Ah – but if there really are Digital Archaeologists in the far future, perhaps they would be the connection between the primary source born digital records and the historians who wish to study them. Or perhaps the Digital Archivist, in a new take on ‘arranging records’, would transform digital chaos into meaningful records for use by researchers? The field of expertise on the historians part would need only be in the content of the records – not exactly how they were rescued from the digital abyss.

Would a Digital Historian be someone who only considers the history of the digital landscape or a historian especially well versed in the interpretation of digital records? In Daniel Cohen and Roy Rosenzweig‘s book Digital History: A Guide to Gathering, Preserving, And Presenting the Past on the Web they seem to use the term in the present tense to refer to historians who uses computers and technology to support and expand the reach of their research. Yet, in his essay Scarcity or Abundance? Preserving the Past in a Digital Era, Roy Rosenzweig proposes:

Future graduate programs will probably have to teach such social-scientific and quantitative methods as well as such other skills as “digital archaeology”(the ability to “read” arcane computer formats), “digital diplomatics” (the modern version of the old science of authenticating documents), and data mining (the ability to find the historical needle in the digital hay). In the coming years, “contemporary historians” may need more specialized research and “language” skills than medievalists do.

What is my imagined skill set for the historian of our digital world? A willingness to dig into the rich and chaotic world of born digital records. The ability to use tools and find partners to assist in the interpretation of those records. Equal comfort working at tables covered in dusty boxes and in the virtual domain of glowing computer terminals. And of course – the same curiosity and sense of adventure that has always drawn people to the path of being a historian.

We cannot predict the future – we can only do our best to adapt to what we see before us. I suspect the prefixing of every job title with the word ‘digital’ will disappear over time – much as the prefixing of everything with the letter ‘e’ to let you know that something was electronic or online has ebbed out of popular culture. As the historians and archivists of today evolve into the historians and archivists of tomorrow they will have to deal with born digital records – no matter what job title we give them.

Google, Privacy, Records Managment and Archives

March 16, 2007 6 Comments

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy . Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of sets of numbers and look something like 192.39.288.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is.. or if it doesn’t go far enough (plenty of others are doing that, see articles at EFF and Wired: 27B Stroke 6 ). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist tenants is to never do anything to the record you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer)? Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP updates mean).

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we loose a lot by if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would loosing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Anderson who would not have gotten into trouble for shredding their records if they had done so according to their own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often request of this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seem destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.

Year: 2007