Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to issues related to freedom of information and the struggle to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They face many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. Some officials claim that the software they use will not even permit extraction of the data – or that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.
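To make the "decoder ring" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the column names (`stat_cd`, `disp_cd`), the code values, and the layout of the data dictionary do not come from any real agency system. It only shows how little a raw extract says on its own, and how gaps in the documentation surface as soon as you try to decode it.

```python
import csv
import io

# Hypothetical extract: two records as an agency might hand them over.
RAW_EXTRACT = """case_id,stat_cd,disp_cd
1001,3,A
1002,7,X
"""

# Hypothetical data dictionary (the "decoder ring"). Note that it does not
# document the disposition code "X", which mimics incomplete documentation.
DATA_DICTIONARY = {
    "stat_cd": {"label": "Case status", "codes": {"3": "Open", "7": "Closed"}},
    "disp_cd": {"label": "Disposition", "codes": {"A": "Approved"}},
}


def decode_row(row, dictionary):
    """Replace coded columns with readable labels and meanings."""
    decoded = {}
    for column, value in row.items():
        entry = dictionary.get(column)
        if entry is None:
            # Column is not described in the dictionary: keep it unchanged.
            decoded[column] = value
            continue
        # Flag codes the documentation does not explain instead of guessing.
        meaning = entry["codes"].get(value, f"UNDOCUMENTED CODE {value!r}")
        decoded[entry["label"]] = meaning
    return decoded


def decode_extract(text, dictionary):
    """Decode every row of a CSV extract held in a string."""
    reader = csv.DictReader(io.StringIO(text))
    return [decode_row(row, dictionary) for row in reader]


decoded_rows = decode_extract(RAW_EXTRACT, DATA_DICTIONARY)
for decoded in decoded_rows:
    print(decoded)
```

The second record comes back with its disposition flagged as undocumented, which is exactly the kind of question that has to go back to the people who ran the system.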

I spent most of the 1990s designing and building custom databases, many for federal government agencies. An almost inconceivable number of person-hours goes into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in them – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part of me had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder whether, if I set up a wiki for people to populate with this information, people would share what they already know.
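For what it is worth, the core of such a registry could be quite small. The sketch below is only one possible shape for it, written in Python. The class names, fields, agencies, package names and notes are all hypothetical placeholders, not a description of any existing catalog.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SoftwarePackage:
    """One COTS package or custom system (all fields are illustrative)."""
    name: str
    vendor: str                # vendor, or the contractor for custom systems
    is_custom: bool = False
    export_notes: str = ""     # how to extract data from the package
    redaction_notes: str = ""  # how to remove confidential fields


@dataclass
class Registry:
    """Tracks packages and which agencies are known to use them."""
    packages: Dict[str, SoftwarePackage] = field(default_factory=dict)
    usage: Dict[str, Set[str]] = field(default_factory=dict)  # agency -> names

    def add_package(self, package: SoftwarePackage) -> None:
        self.packages[package.name] = package

    def record_usage(self, agency: str, package_name: str) -> None:
        if package_name not in self.packages:
            raise KeyError(f"unknown package: {package_name}")
        self.usage.setdefault(agency, set()).add(package_name)

    def packages_for(self, agency: str) -> List[SoftwarePackage]:
        names = sorted(self.usage.get(agency, set()))
        return [self.packages[name] for name in names]

    def agencies_using(self, package_name: str) -> List[str]:
        return sorted(a for a, names in self.usage.items() if package_name in names)


# Hypothetical entries, invented purely to show the lookups.
registry = Registry()
registry.add_package(SoftwarePackage(
    name="ExamplePermitTracker",
    vendor="Example Vendor Inc.",
    export_notes="Hypothetical: Reports > Export > CSV",
    redaction_notes="Hypothetical: drop the applicant_ssn column",
))
registry.add_package(SoftwarePackage(
    name="CustomCaseSystem", vendor="Example Contractor LLC", is_custom=True,
))
registry.record_usage("Example County Permits Office", "ExamplePermitTracker")
registry.record_usage("Example City Clerk", "ExamplePermitTracker")
registry.record_usage("Example City Clerk", "CustomCaseSystem")

print(registry.agencies_using("ExamplePermitTracker"))
print([p.name for p in registry.packages_for("Example City Clerk")])
```

The two lookups answer the two questions raised above: which agencies use a given package, and what is known about extracting and redacting data from the software a given agency runs.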

I would like to imagine a future world in which all this stuff is online and you can log in and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of deciphering the meaning of the records.

Archivists and journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me to find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

3 Comments

Peter Van Garderen
February 20, 2007 at 2:18 pm

Hi Jeanne,

I love checking out your blog and watching the journey of someone with vision and a strong technical background go through an archival studies program.

You’ve just encountered the black hole of digital preservation: databases.

Dynamic information systems built on relational databases for data persistence are the bane of e-recordkeeping and digital preservation. I don’t have time to go into all the issues.

This ERPAnet report does a pretty good job of summarizing the issues and potential solutions.

Based on some work I’ve done previously, I’d say these are the options for dealing with the preservation of databases and information objects stored in database information systems (i.e. the stuff that the journalists want to request under Freedom of Information requests). Each of them requires that the contextual information (i.e. system documentation) is available and up-to-date (yeah, right 😉).

1) Technology preservation: maintaining all information within the live system or in a parallel system that mirrors a specific version of the system’s hardware, software and file format configurations.

2) Dumping data into delimited flat files, XML files or standardized SQL and maintaining server audit logs to track all user interactions and data creation, reading, updating and deletions within the system. This is a technique that is being experimented with on the InterPARES Project’s VanMap case study. As well, the Swiss National Archives has been developing a tool (SIARD) to preserve databases using standardized SQL.

3) Treating the entire system as one compound logical object and developing tools to reconstruct the system at certain points in time using a combination of backup data, audit logs and snapshots.

4) Using the data archiving functionality of commercial enterprise information systems (e.g. SAP) to extract and capture discrete ‘archive data objects’ from the system.

5) Outputting sets of system information to a more static document format (i.e. rendering database reports or screen captures to PDF format).
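
As a rough illustration of the flat-file half of option 2, the sketch below dumps every table of a database to a delimited file and saves the table definitions alongside. It is not the VanMap or SIARD approach. It assumes a SQLite database purely because Python ships with a driver for it, the `permits` table and its contents are invented, and the audit-log side of option 2 is not covered at all.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def dump_database(conn, out_dir):
    """Write schema.sql plus one CSV per table; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # SQLite keeps the original CREATE statements in sqlite_master.
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()

    # Save the table definitions so column names and types travel with the data.
    schema_path = out_dir / "schema.sql"
    schema_path.write_text(
        "".join(f"{sql};\n" for _, sql in tables), encoding="utf-8"
    )
    written = [schema_path]

    for name, _ in tables:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        cursor = conn.execute(f"SELECT * FROM {quoted}")
        csv_path = out_dir / f"{name}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(col[0] for col in cursor.description)  # header row
            writer.writerows(cursor)
        written.append(csv_path)
    return written


# Hypothetical database with a single invented table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (permit_id INTEGER PRIMARY KEY, status_cd TEXT)")
conn.executemany("INSERT INTO permits VALUES (?, ?)", [(1, "A"), (2, "X")])

out_dir = tempfile.mkdtemp()
written_files = dump_database(conn, out_dir)
permits_csv = (Path(out_dir) / "permits.csv").read_text(encoding="utf-8")
schema_sql = (Path(out_dir) / "schema.sql").read_text(encoding="utf-8")
print(permits_csv)
```

Even with the schema saved, the dump says nothing about what `status_cd` values mean or how the application used them, which is why the contextual documentation still has to travel with the files.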
