future-proofing | Spellbound Blog

The Archives and Archivists Listserv: hoping for a stay of execution

March 14, 2007 1 Comment

There has been a lot of discussion (both on the Archives & Archivists (A&A) Listserv and in blog posts) about the SAA’s recent decision to not preserve the A&A listserv posts from 1996 through 2006 when they are removed from the listserv’s old hosting location at Miami University of Ohio.

Most of the outcry against this decision has fallen into two camps:

Those who don’t understand how the SAA task force assigned to appraise the listserv archives could decide it does not have informational value – lots of discussion about how the listserv reflects the move of archivists into the digital age as well as its usefulness for students
Those who just wish it wouldn’t go away because they still use it to find old posts. Some mentioned that there are scholarly papers that reference posts in the listserv archives as their primary sources.

I added this suggestion on the listserv:

I would have thought that the Archives Listserv would be the ideal test case for developing a set of best practices for archiving an organization’s web based listserv or bboard.

Perhaps a graduate student looking for something to work on as an independent project could take this on? Even if they only got permission for working with posts from 2001 onward [post 2001 those who posted had to agree to ‘terms of participation’ that reduce issues with copyright and ownership] – I suspect it would still be worthwhile.

I have always found that you can’t understand all the issues related to a technical project (like the preservation of a listserv) until you have a real life case to work on. Even if SAA doesn’t think we need to keep the data forever – here is the perfect set of data for archivists to experiment with. Any final set of best practices would be meant for archivists to use in the future – and would be all the easier to comprehend if they dealt with a listserv that many of them are already familiar with.

Another question: couldn’t the listserv posts still be considered ‘active records’? Many current listserv posters claim they still access the old list’s archives on a regular basis. I would be curious what the traffic for the site is. That is one nice side effect of this being on a website – it makes the usage of records quantifiable.

There are similar issues in the analog world when records that people still want to use lose their physical home and are disposed of, but, as others have also pointed out, digital media is getting cheaper and smaller by the day. We are not talking about paying rent on a huge warehouse or a space that needs serious temperature and humidity control.

I was glad to see Rick Prelinger’s response on the current listserv that simply reads:

The Internet Archive is looking into this issue.

I had already checked when I posted my response to the listserv yesterday – having found my way to the A&A old listserv page in the Wayback Machine. For now all that is there is the list of links to each week’s worth of postings – nothing beyond that has been pulled in.

I have my fingers crossed that enough of the right people have become aware of the situation to pull the listserv back from the brink of the digital abyss.

NARA’s Electronic Records Archives in West Virginia

March 6, 2007 4 Comments

In a press release dated February 28, 2007, the National Archives and Records Administration of the United States (NARA) and West Virginia University (WVU) declared they had signed “a Memorandum of Understanding to establish a 10-year research and educational partnership in the study of electronic records and the promotion of civic awareness of the use of electronic records as educational resources.” It goes on to say that the two organizations “will engage in collaborative research and associated educational activities” including “research in the preservation and long-term access to complex electronic records and engineering design documentation.” WVU will receive “test collections” of electronic records from NARA to support their research and educational activities.

This sounded interesting. I stumbled across this on NARA’s website while looking for something else. No blog chatter or discussions about what this means for electronic records research (thinking of course of the big Footnote.com announcement and all the back and forth discussion that inspired). So I went hunting to see if I could find the actual Memorandum of Understanding. No sign of it. I did find WVU’s press release which included the photo above. This next quote is in the press release as well:

The new partnership complements NARA’s establishment of the Electronic Records Archives Program operations at the U.S. Navy’s Allegany Ballistics Laboratory in Rocket Center near Keyser in Mineral County.

Googling Allegany Ballistics Laboratory got me information about how it is a superfund site that is in late or final stages of cleanup. It also led me to an article from Senator Byrd about how pleased he was in October of 2006 about a federal spending bill that included funds for projects at ABL – including a sentence mentioning how NARA “will use the Mineral County complex for its electronic records archive program.” No mention of this on the Electronic Records Archive (ERA) website or on their special press release page. I don’t see any info about any NARA installations in West Virginia on their Locations webpage.

Then I found the WVU newspaper The Daily Athenaeum and an article titled “National Archives, WVU join forces” dated March 1, 2007. (If the link gives you trouble – just search on the Athenaeum site for NARA and it should come right up.) The following quote is from the article:

“This is a tremendous opportunity for WVU. The National Archives has no other agreements like this with anyone else,” said John Weete, WVU’s vice president for research and economic development.

The University will help the NARA develop the next generation of technologies for the Electronic Records Archives. WVU will also assist in the management of NARA’s tremendous amount of data, Weete said.

“This is a great opportunity for students. The Archives will look for students who are masters at handling records and who care about the documents (for future job opportunities),” said WVU President David Hardesty.

WVU students and faculty will hopefully soon have access to the Rocket Center archives, and faculty will be overseeing the maintenance of such records, Hardesty said.

Perhaps I am reading more into this than was intended, but I am confused. I was unable to find any information on the WVU website about an MLS or Archival Studies program there. I checked in both the ALA’s LIS directory and the SAA’s Directory of Archival Education to confirm there are no MLS or Archives degree programs in West Virginia. So where are the “students who are masters at handling records” going to come from? I work daily in the world of software development and I can imagine Computer Scientists who are interested in electronic records and their preservation. But as I have discovered many times over during my archives coursework there are a lot of important and unique ideas to learn in order to understand everything that is needed for the archival preservation of electronic records for “the life of the republic” (as NARA’s ERA project is so fond of saying).

I am pleased for WVU to have made such a landmark agreement with NARA to study and further research into the preservation and educational use of electronic records. Unfortunately I am also suspicious of this barely mentioned bit about the Rocket Center archives and ABL and how WVU is going to help NARA manage their data.

Has anyone else heard more about this?

Update (03/07/07):

Thanks to Donna in the comments for suggesting that WVU’s program is in ‘Public History’ (a term I had not thought to look under). This is definitely more reassuring.

WVU appears to offer both a Certificate in Cultural Resource Management and a M.A. in Public History – both described here on the Cultural Resource Management and Public History Requirements page.

The page listing History Department graduate courses included the two ‘public history’ courses listed below:

412 Introduction to Public History. 3 hr. Introduction to a wide range of career possibilities for historians in areas such as archives, historical societies, editing projects, museums, business, libraries, and historic preservation. Lectures, guest speakers, field trips, individual projects.

614 Internship in Public History. 6 hr. PR: HIST 212 and two intermediate public history courses. A professional internship at an agency involved in a relevant area of public history. Supervision will be exercised by both the Department of History and the host agency. Research report of finished professional project required.

Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

February 17, 2007 3 Comments

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to address issues related to the freedom of information and struggling to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They have many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or who claim that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.

I spent most of the 1990s designing and building custom databases, many for federal government agencies. There are an almost inconceivable number of person hours that go into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in it – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part of me had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder whether people would share what they already know if I set up a wiki for them to populate with this information.

I would like to imagine a future world in which all this stuff is online and you can login and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of decrypting the meaning of the records.

Archivists and Journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me to find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

Should we be archiving fonts?

February 9, 2007

I am a fan of beautiful fonts. This is why I find myself on the mailing list of MyFonts.com. I recently received their Winter 2007 newsletter featuring the short article titled ‘A cast-iron investment’. It starts out with:

Of all the wonderful things about fonts, there’s one that is rarely mentioned by us font sellers. It’s this: fonts last for a very long time. Unlike almost all the other software you may have bought 10 or 15 years ago, any fonts you bought are likely still working well, waiting to be called back into action when you load up that old newsletter or greetings card you made!

Interesting. The article goes on to point out:

But, of course, foundries make updates to their fonts every now and then, with both bug fixes and major upgrades in features and language coverage.

All this leaves me wondering if there is a place in the world for a digital font archive. A single source of digital font files for use by archives around the world. Of course, there would be a number of hurdles:

How do you make sure that the fonts are only available for use in documents that used the fonts legally?
How do you make sure that the right version of the font is used in the document to show us how the document appeared originally?

You could say this is made moot by using something like Adobe’s PDF/A format. It is also likely that we won’t be running the original word processing program that used the fonts a hundred years from now.

Hurdles aside, somehow it feels like a clever thing to do. We can’t know how we might enable access to documents that use fonts in the future. What we can do is keep the font files so we have the option to do clever things with them in the future.

I would even make a case for the fact that fonts are precious in their own right and deserve to be preserved. My mother spent many years as a graphic designer. From her I inherited a number of type specimen books – including one labeled “Adcraft Typographers, Inc”. Google led me to two archival collections that include font samples from Adcraft:

University of Delaware Library Special Collections: J. Ben Lieberman Papers – Series VII: Type Specimens and Commercial Type Directories, 1900s
University of Central Florida: Sol and Sadie Malkoff Papers – listed as “including over a hundered typography and font specimens”

Another great reason for a digital font archive is the surge in individual foundries creating new fonts every day. What once was an elite craft now has such a low point of entry that anyone can download some software and hang out their shingle as a font foundry. Take a look around MyFonts.com. Read about selling your fonts on MyFonts.com.

While looking for a good page about type foundries I discovered the site for Precision Type which shows this on their only remaining page:

For the last 12 years, Precision Type has sought to provide our customers with convenient access to a large and diverse range of font software products. Our business grew as a result of the immense impact that digital technology had in the field of type design. At no other time in history had type ever been available from so many different sources. Precision Type was truly proud to play a part in this exciting evolution.

Unfortunately however, sales of font software for Precision Type and many others companies in the font business have been adversely affected in recent years by a growing supply of free font software via the Internet. As a result, we have decided to discontinue our Precision Type business so that we can focus on other business opportunities.

I have to go back to May 23, 2004 in the Internet Archive Wayback Machine to see what Precision Type’s site used to look like.

There are more fonts than ever before. Amateurs are driving professionals out of business. Definitely sounds like digital fonts and their history are a worthy target for archival preservation.

Book Review: Digital Preservation

January 18, 2007

In my quest for information about archiving geospatial data last term, I got my hands on a copy of Marilyn Deegan and Simon Tanner’s Digital Preservation (part of the Digital Futures Series). This excellent volume consists of nine chapters each written by different authors who are leaders in their respective fields (shown in order of their respective chapters):

David Holdsworth: known for his work on the CEDARS and CAMiLEON projects
Robin Wendler: metadata analyst at the Harvard University Library Office for Information Studies
Julien Masanès: co-founder of the European Archive
Elisa Mason: maintains the Forced Migration Current Awareness Blog
Brian F. Lavoie: a research scientist at OCLC
Stephen Chapman: Preservation Librarian for Digital Initiatives in the Weissman Preservation Center, Harvard University Library
Peter McKinney: research officer for the espida project at the University of Glasgow
Jasmine Kelly: a former research assistant at the Centre for Computing in the Humanities, King’s College London

This fabulous band of writers and researchers was led by Marilyn Deegan and Simon Tanner, both based at King’s College London.

Published in 2006, this is one of the most comprehensive and up to date books I found on the subject. The book starts out with two chapters addressing the basic issues related to digital preservation. Subsequent chapters present information about all kinds of metadata, web archiving, the costs of digital preservation and an overview of European approaches. The final chapter presents an extensive series of case studies – complete with URLs to give you plenty of information online to explore.

This book gave me a great foundation from which to explore the details of various geospatial data and GIS archiving efforts. For those faced with the challenge of planning for digital preservation, the two chapters on costs should be very useful. So many articles talk about how it will be so expensive to ensure proper digital preservation, but don’t give people in the field any practical advice in planning for the costs – this book is different. The exploration of existing approaches being used at major institutions throughout Europe gives a good sense of evolving standards and best practices.

If you are looking for a way to get a handle on the issues involved in digital preservation – this is a great starting point. The final chapter on case studies alone could keep you busy for a month as you explore all the websites of projects from around the world. While the book has a decidedly European focus, the concepts are applicable the world over. If you are responsible for ensuring that digital records (either digitized or born digital) are protected and preserved – this book explains the basics and explores various strategies. The authors don’t oversimplify things – but take the time to explain things well. They are honest about those questions that aren’t answered yet… and they point to as many resources, standards and examples as they can. While Digital Preservation cannot provide a formula for everyone to follow, it can help you start asking the right questions and begin to understand the possibilities.

The Edges of the GIS Electronic Record

January 2, 2007 3 Comments

I spent a good chunk of the end of my fall semester writing a paper ultimately titled “Digital Geospatial Records: Challenges of Selection and Appraisal”. I learned a lot – especially with the help of archivists out there on the cutting edge who are trying to find answers to these problems. I plan on a number of posts with various ideas from my paper.

To start off, I want to consider the topic of defining the electronic record in the context of GIS. One of the things I found most interesting in my research was the fact that defining exactly what a single electronic record consists of is perhaps one of the most challenging steps.

If we start with the SAA’s glossary definition of the term ‘record’ we find the statement that “A record has fixed content, structure, and context.” The notes go on to explain:

Fixity is the quality of content being stable and resisting change. To preserve memory effectively, record content must be consistent over time. Records made on mutable media, such as electronic records, must be managed so that it is possible to demonstrate that the content has not degraded or been altered. A record may be fixed without being static. A computer program may allow a user to analyze and view data many different ways. A database itself may be considered a record if the underlying data is fixed and the same analysis and resulting view remain the same over time.

This idea presents some major challenges when you consider data that does not seem ‘fixed’. In the fast moving and collaborative world of the internet, Geographic Information Systems are changing over time – but the changes themselves are important. We no longer live in a world in which the way you access a GIS is via a CD which has a specific static version of the map data you are considering.
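One common way to “demonstrate that the content has not degraded or been altered”, as the glossary note puts it, is to record a checksum when a file is accessioned and recompute it later. The snippet below is only a minimal sketch of that idea in Python: the sample bytes and the function names are hypothetical, and it is not drawn from the SAA glossary or from any of the projects discussed here.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_digest: str) -> bool:
    """True if the content still matches the digest recorded earlier."""
    return sha256_of(data) == expected_digest


# Hypothetical record content, captured at accession time.
original = b"parcel_id,zoning\n1001,residential\n"
recorded_digest = sha256_of(original)

# Later on, the stored copy is checked against the recorded digest.
print(is_unchanged(original, recorded_digest))
```

A checksum like this only covers a fixed sequence of bytes. It says nothing about a system whose data is expected to keep changing, which is exactly the difficulty with a live GIS.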

One of the InterPARES 2 case studies I researched for my paper was the Preservation of the City of Vancouver GIS database (aka VanMap). Via a series of emails exchanged with the very helpful Evelyn McLellan (who is working on the case study) I learned that the InterPARES 2 researchers concluded that the entire VanMap system is a single record. This decision was based on the requirement of ‘archival bond’ to be present in order for a record to exist. I have included my two favorite definitions of archival bond from the InterPARES 2 dictionary below:

archival bond
n., The network of relationships that each record has with the records belonging in the same aggregation (file, series, fonds). [Archives]

n., The originary, necessary and determined web of relationships that each record has at the moment at which it is made or received with the records that belong in the same aggregation. It is an incremental relationship which begins when a record is first connected to another in the course of action (e.g., a letter requesting information is linked by an archival bond to the draft or copy of the record replying to it, and filed with it. The one gives meaning to the other). [Archives]

I especially appreciate the second definition above because its example gives me a better sense of what is meant by ‘archival bond’ – though I need to do more reading on this to get a better grasp of its importance.

Given the usage of VanMap by public officials and others, you can imagine that the state of the data at any specific time is crucial to determining the information used for making key decisions. Since a map may be created on the fly using multiple GIS layers but never saved or printed – it is only the knowledge that someone looked at the information at a particular time that would permit those down the road to look through the eyes of the decision makers of the past. Members of the VanMap team are now working with the Sustainable Archives & Library Technologies (SALT) lab at the San Diego Supercomputer Center (SDSC) to use data grid technology to permit capturing the changes to VanMap data over time. My understanding is that a proof of concept has been completed that shows how data from a specific date can be reconstructed.
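I do not know how the data grid proof of concept actually works, so the following is not a description of it. It is only a generic sketch, in Python, of what “reconstructing data from a specific date” can mean if every change is kept in a dated log. The change log, the feature ids and the function name are all made up for illustration.

```python
from datetime import date

# Hypothetical change log: (effective date, feature id, new values).
# A value of None means the feature was deleted on that date.
changes = [
    (date(2006, 1, 10), "parcel-1", {"zoning": "residential"}),
    (date(2006, 3, 2), "parcel-2", {"zoning": "commercial"}),
    (date(2006, 6, 15), "parcel-1", {"zoning": "mixed use"}),
    (date(2006, 9, 1), "parcel-2", None),
]


def state_as_of(change_log, as_of):
    """Replay changes in date order and return the layer as it stood on as_of."""
    state = {}
    for effective, feature_id, values in sorted(change_log, key=lambda c: c[0]):
        if effective > as_of:
            break  # everything after this point happened later
        if values is None:
            state.pop(feature_id, None)
        else:
            state[feature_id] = values
    return state


# What would a decision maker have seen on April 1, 2006?
print(state_as_of(changes, date(2006, 4, 1)))
```

The point of the sketch is that nothing is ever overwritten: the state on any given day is derived from the log, so a view that was never saved or printed can still be rebuilt.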

In contrast with this approach we can consider what is being done to preserve GIS data by the Archivist of Maine in the Maine GeoArchives. In his presentation titled “Managing GIS in the Digital Archives” delivered at the 2006 Joint Annual Meeting of NAGARA, COSA, and SAA on August 3, 2006, Jim Henderson explained their approach of appraising individual layers to determine if they should be accessioned in the archive. If it is determined that the layer should be preserved, then issues of frequency of data capture are addressed. They have chosen a pragmatic approach and are currently putting these practices to the test in the real world in an ambitious attempt to prevent data loss as quickly as is feasible.

My background is as a database designer and developer in the software industry. In my database life, a record is usually a row in a database table – but when designing a database using Entity-Relationship Modeling (and I will admit I am of the “Crow’s Feet” notation school and still get a smile on my face when I see the cover of the CASE*Method: Entity Relationship Modelling book) I have spent a lot of time translating what would have been a single ‘paper record’ into the combination of rows from many tables.

The current system I am working on includes information concerning legal contracts. Each of these exists as a single paper document outside the computers – but in our system we distribute information that is needed to ‘rebuild’ the contract into many different tables. One for contact information – one for standard clauses added to all the contracts of this type – another set of tables for defining financial formulas associated with the contract. If I then put on my archivist hat and I didn’t just choose to keep the paper agreement, I would of course draw my line around all these different records needed to rebuild the full contract. I see that there is a similar definition listed as the second definition on the InterPARES 2 Terminology Dictionary for the term ‘Record’:

n., In data processing, a grouping of interrelated data elements forming the basic unit of a file. A Glossary of Archival and Records Terminology (The Society of American Archivists)
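To make the contract example a little more concrete, here is a small sketch in Python using an in-memory SQLite database. The table names, columns and sample rows are invented for illustration and are far simpler than the real system (the financial formula tables are left out entirely). It only shows how one logical ‘record’ is assembled from rows in several tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (contact_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE contract (
    contract_id INTEGER PRIMARY KEY,
    contract_type TEXT,
    contact_id INTEGER REFERENCES contact(contact_id)
);
CREATE TABLE standard_clause (contract_type TEXT, clause_text TEXT);

INSERT INTO contact VALUES (1, 'Example Vendor');
INSERT INTO contract VALUES (100, 'license', 1);
INSERT INTO standard_clause VALUES ('license', 'Clause A');
INSERT INTO standard_clause VALUES ('license', 'Clause B');
INSERT INTO standard_clause VALUES ('lease', 'Clause C');
""")


def rebuild_contract(conn, contract_id):
    """Gather the rows from each table that together make up one contract."""
    row = conn.execute(
        "SELECT c.contract_type, p.name "
        "FROM contract c JOIN contact p ON p.contact_id = c.contact_id "
        "WHERE c.contract_id = ?",
        (contract_id,),
    ).fetchone()
    if row is None:
        return None  # no such contract
    contract_type, contact_name = row
    clauses = [
        r[0]
        for r in conn.execute(
            "SELECT clause_text FROM standard_clause "
            "WHERE contract_type = ? ORDER BY clause_text",
            (contract_type,),
        )
    ]
    return {
        "contract_id": contract_id,
        "type": contract_type,
        "contact": contact_name,
        "clauses": clauses,
    }


print(rebuild_contract(conn, 100))
```

Notice that the standard clauses are stored once per contract type, not once per contract. Preserving only the row in the contract table would lose most of what the paper document says.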

Just in this brief survey we can see three very different possible views on where to draw a line around what constitutes a single Geographic Information System electronic record. Is it the entire database, a single GIS layer or some set of data elements which create a logical record? Is it worthwhile trying to contrast the definition of a GIS record with the definition of a record when considering analog paper maps? I think the answer to all of these questions is ‘sometimes’.

What is especially interesting about coming up with standard approaches to archiving GIS data is that I don’t believe there is one answer. Saying ‘GIS data’ is about as precise as saying ‘database record’ or ‘entity’ – it could mean anything. There might be a best answer for collaborative online atlases.. and another best answer for a geographic information library managed by a state government.. and yet another best answer for corporations dependent on GIS data for doing their business.

I suspect that the right approach for archiving these born digital records will be determined by thorough analysis of the information stored in a GIS system, how it is/was created, how often it changes and how it was used. There are many archivists (and IT folks and map librarians and records managers) around the world who have a strong sense of panic over the imminent loss of geospatial data. As a result, people from many fields are trying different approaches to stem the loss. It will be interesting to consider these varying approaches (and their varying levels of success) over the next few years. We can only hope that a few best practices will rise to the top quickly enough that we can ensure access to vital geospatial records in the future.

DMCA Exemption Added That Supports Archivists

November 24, 2006

The Digital Millennium Copyright Act, aka DMCA (which made it illegal to create or distribute technology which can get around copyright protection technology) has six new classes of exemptions added today.

From the very long named Rulemaking on Exemptions from Prohibition on Circumvention of Technological Measures that Control Access to Copyrighted Works out of the U.S. Copyright Office (part of the Library of Congress) comes the addition of the following class of work that will not be “subject to the prohibition against circumventing access controls”:

Computer programs and video games distributed in formats that have become obsolete and that require the original media or hardware as a condition of access, when circumvention is accomplished for the purpose of preservation or archival reproduction of published digital works by a library or archive. A format shall be considered obsolete if the machine or system necessary to render perceptible a work stored in that format is no longer manufactured or is no longer reasonably available in the commercial marketplace.

This exemption remains valid from November 27, 2006 through October 27, 2009. Hmm.. three years? So what happens if this expires and doesn’t get extended (though one would imagine by then either we will have a better answer to this sort of problem OR the problem will be even worse than it is now)? When you look at the fact that places like NARA have fabulous mission statements for their Electronic Records Archives with phrases like “for the life of the republic” in them – three years sounds pretty paltry.

That said, how interesting to have archivists highlighted as beneficiaries of new legal rules. So now it will be legal (or at least not punishable under the DMCA) to create and share programs to access records created by obsolete software. I don’t know enough about the world of copyright and obsolete software to be clear on how much this REALLY changes what places like NARA’s ERA and other archives pondering the electronic records problem are doing, but clearly this exemption can only validate a lot of work that needs to be done.

Interesting Interface for exploring E-mail: TrampolineSystem SONAR Platform

October 26, 2006 1 Comment

Conversations and articles about the problem of archiving and accessing e-mail are often accompanied by the wringing of hands or the shrugging of shoulders. It has often seemed to me that figuring out how to archive and facilitate access to e-mail is a challenge that most people would rather ignore because it seems so difficult (and because there are plenty of other things that need work too).

“In October 2003 the US Federal Energy Regulatory Commission placed 200,000 of Enron’s internal emails from 1999-2002 into the public domain as part of its ongoing investigations.” So says TrampolineSystems on their fascinating website that lets you explore those 200,000 public domain e-mails using their SONAR platform (that stands for Social Networks and Relevance). I would highly recommend taking a look and browsing around the Enron e-mails.

It appears that SONAR somehow tags the emails without human intervention – though they do not state this specifically one way or the other. The implication from the SONAR PR page is that you plug in the platform – and you instantly have this new access to your information. It is my impression that this works for either a fixed collection of e-mails (as is the case with the Enron emails) – or for an active live e-mail collection that is changing over time.

I like the social network Visualizer and the way it shows you how people are related to one another as represented by their e-mail correspondence. I like the theme and people tag clouds. I like the ease with which I can search for and read emails. I like how clearly they specify what you searched on at the top of your e-mail result list – and how many e-mails, people and themes the list represents.

On the other hand, there are a number of things I wish I could do. I wish that it was clear to me what order the emails are listed in when I do a search on a term. I searched on the word ‘pager’ – and received 2012 emails in no obvious order (most likely relevance – but that is not at all clear). I would like to be able to re-sort the results (by date for example). I would like to be able to add together multiple tags and people to get a scoped list of emails between two people on a specific set of themes.
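The kind of scoping and re-sorting wished for here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Email class, its fields and the scoped_emails function are hypothetical names made up for this example, and nothing here reflects how SONAR actually stores or queries its data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Email:
    """A hypothetical, already-tagged e-mail (not SONAR's data model)."""
    sender: str
    recipients: frozenset
    themes: frozenset
    sent: date
    subject: str


def scoped_emails(emails, person_a, person_b, themes):
    """Return e-mails exchanged between two people that carry all the given
    themes, sorted oldest first."""
    wanted = frozenset(themes)

    def between(e):
        # True when one person wrote to the other, in either direction.
        return ((e.sender == person_a and person_b in e.recipients)
                or (e.sender == person_b and person_a in e.recipients))

    hits = [e for e in emails if between(e) and wanted <= e.themes]
    return sorted(hits, key=lambda e: e.sent)
```

Requiring every theme to match (the `wanted <= e.themes` subset test) narrows the list; swapping it for an intersection test would widen it to e-mails carrying any of the themes. Sorting by a different field only means changing the key function.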

Just as in traditional archival collections – there is some non-unique information in the mix. I found a generic Hotwire promotional email while looking at the theme The Insider (4th hit on the list). While I suppose spam and legitimate e-mail ads (i.e., ones you asked for) are interesting – perhaps software that selects e-mail for permanent retention could filter some of these out somehow.

I like clicking on things in the Visualizer and seeing the social networks hidden within the e-mails – but that gets old quickly unless you are looking for something very specific. I found myself wanting more context. Who are these people? What are their jobs? How are they ‘officially’ related in the corporate hierarchy? How do these e-mails compare with a timeline of events? What about the content of attachments (they don’t seem to be part of this interface)? All of this information could be linked into this interface in such a way as to improve an outsider’s understanding of this amazing landscape of 200,000 e-mails.

All in all I think it is an excellent starting point and I applaud them for trying to find an answer to the email question rather than just ignoring the problem.

(Thanks to Boing Boing for the pointer to this site.)

The Yahoo! Time Capsule

October 13, 2006 4 Comments

Yahoo! is creating a time capsule. The first paragraph of the Yahoo! Time Capsule Overview concludes by claiming “This is the first time that digital data will be gathered and preserved for historical purposes”. Excuse me? What has the Internet Archive been doing since 1996? What are the Hurricane Digital Memory Bank and The September 11 Digital Archive doing? And that is just off the top of my head – the list could go on and on.

I think that what they are doing (collecting digital content from around the world for 30 days, then giving the time capsule to the Smithsonian Folkways Recordings in Washington, DC) is great. I am not sure what the bit about being “beamed along a path of laser light into space” is all about – but it sounds sort of cool. To add an entry, it must be put under one of 10 themes: Love, Anger, Fun, Sorrow, Faith, Beauty, Past, Now, Hope or You. It seems like an interesting attempt at organizing what could otherwise be just an endless stream of images. At the time of this post, they had 15,564 contributions over the course of the first 3 days. I even explored some of what they have – it is pretty. It reminded me a bit of the America 24/7 project from a few years back – though with more types of media and an aim to record a snapshot of the world, not just America.

They have another ridiculous claim on the main time capsule page: “This first-ever collection of electronic anthropology captures the voices, images and stories of the online global community.”

Go ahead and make a fabulous digital archive of contributions from around the world Yahoo!, but please stop claiming that you invented the idea. I can’t be the only person who is frustrated by the way they are presenting this. Please tell me I am not alone!

My New Daydream: A Hosting Service for Digitized Collections

September 20, 2006 3 Comments

In her post Predictions over on hangingtogether.org, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.

And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist’s perspective (also having just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.

The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.

Being the optimist I am, I of course imagine this service as providing easy-to-use software with the flexibility for building custom DTDs for metadata and security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web based solution that supports everything needed by the archival community. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.

I think the biggest difference between this idea and the wikipedias of the world is that there would be some step required for an institution to ‘join’ such that they could use this service. The service wouldn’t control the content (in fact would need to be super careful about security and the like considering all the issues related to privacy and copyright) – rather it would provide the tools to support the work of others. While I know that some institutions would not be willing to let ‘control’ of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.

There would still be a place for the Archons and the Archivists’ Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists’ interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collection and provide access easily without being forced to invent a new wheel along the way.

If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.
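As a rough illustration of what ‘found quickly and easily’ might mean at its simplest, here is a sketch in Python that finds records within a given distance of a point using the haversine great-circle formula. The record layout (dictionaries with "lat" and "lon" keys) and the function names are assumptions made up for this example; a real service would more likely rely on a spatial index or a GIS-aware database than a linear scan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def records_near(records, lat, lon, radius_km):
    """Return the records whose coordinates fall within radius_km of a point.

    Each record is assumed to be a dict with "lat" and "lon" keys.
    """
    return [r for r in records
            if haversine_km(lat, lon, r["lat"], r["lon"]) <= radius_km]
```

Records from different institutions that share a location would then surface together in one query, regardless of which collection each was uploaded to.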

Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-IT.org. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”

So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”

Category: future-proofing