Menu Close

Category: database design

Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges

My most recent Archival Access class had a great guest speaker from the Journalism department. Professor Ira Chinoy is currently teaching a course on Computer-Assisted Reporting. In the first half of the session, he spoke about ways that archival records can fuel and support reporting. He encouraged the class to brainstorm about what might make archival records newsworthy. How do old records that have been stashed away for so long become news? It took a bit of time, but we got into the swing of it and came up with a decent list. He then went through his own list and gave examples of published news stories that fit each of the scenarios.

In the second half of class he moved on to address issues related to the freedom of information and struggling to gain access to born digital public records. Journalists are usually early in the food chain of those vying for access to and understanding of federal, state and local databases. They have many hurdles. They must learn what databases are being kept and figure out which ones are worth pursuing. Professor Chinoy relayed a number of stories about the energy and perseverance required to convince government officials to give access to the data they have collected. The rules vary from state to state (see the Maryland Public Information Act as an example) and journalists often must quote chapter and verse to prove that officials are breaking the law if they do not hand over the information. There are officials who deny that the software they use will even permit extractions of the data – or that there is no way to edit the records to remove confidential information. Some journalists find themselves hunting down the vendors of proprietary software to find out how to perform the extract they need. They then go back to the officials with that information in the hopes of proving that it can be done. I love this article linked to in Prof. Chinoy’s syllabus: The Top 38 Excuses Government Agencies Give for Not Being Able to Fulfill Your Data Request (And Suggestions on What You Should Say or Do).

After all that work – just getting your hands on the magic file of data is not enough. The data is of no use without the decoder ring of documentation and context.

I spent most of the 1990s designing and building custom databases, many for federal government agencies. There are an almost inconceivable number of person hours that go into the creation of most of these systems. Stakeholders from all over the organization destined to use the system participate in meetings and design reviews. Huge design documents are created and frequently updated … and adjustments to the logic are often made even after the system goes live (to fix bugs or add enhancements). The systems I am describing are built using complex relational databases with hundreds of tables. It is uncommon for any one person to really understand everything in it – even if they are on the IT team for the full development life cycle.

Sometimes you get lucky and the project includes people with amazing technical writing skills, but usually those talented people are aimed at writing documentation for users of the system. Those documents may or may not explain the business processes and context related to the data. They will rarely expose the relationship between a user’s actions on a screen and the data as it is stored in the underlying tables. Some decisions are only documented in the application code itself and that is not likely to be preserved along with the data.

Teams charged with the support of these systems and their users often create their own documents and databases to explain certain confusing aspects of the system and to track bugs and their fixes. A good analogy here would be to the internal files that archivists often maintain about a collection – the notes that are not shared with the researchers but instead help the archivists who work with the collection remember such things as where frequently requested documents are or what restrictions must be applied to certain documents.

So where does that leave those who are playing detective to understand the records in these systems? Trying to figure out what the data in the tables mean based on the understanding of end-users can be a fool’s errand – and that is if you even have access to actual users of the system in the first place. I don’t think there is any easy answer given the realities of how many unique systems of managing data are being used throughout the public sector.

Archivists often find themselves struggling with the same problems. They have to fight to acquire and then understand the records being stored in databases. I suspect they have even less chance of interacting with actual users of the original system that created the records – though I recall discussions in my appraisal class last term about all the benefits of working with the producers of records long before they are earmarked to head to the archives. Unfortunately, it appeared that this was often the exception rather than the rule – even if it is the preferred scenario.

The overly ambitious and optimistic part had the idea that what ‘we’ really need is a database that lists common commercial off-the-shelf (COTS) packages used by public agencies – along with information on how to extract and redact data from these packages. For those agencies using custom systems, we could include any information on what company or contractors did the work – that sort of thing can only help later. Or how about just a list of which agencies use what software? Does something like this exist? The records of what technology is purchased are public record – right? Definitely an interesting idea (for when I have all that spare time I dream about). I wonder if I set up a wiki for people to populate with this information if people would share what they already know.

I would like to imagine a future world in which all this stuff is online and you can login and download any public record you like at any time. You can get a taste of where we are on the path to achieving this dream on the archives side of things by exploring a single series of electronic records published on the US National Archives site. For example, look at the search screen for World War II Army Enlistment Records. It includes links to sample data, record group info and an FAQ. Once you make it to viewing a record – every field includes a link to explain the value. But even this extensive detail would not be enough for someone to just pick up these records and understand them – you still need to understand about World War II and Army enlistment. You still need the context of the events and this is where the FAQ comes in. Look at the information they provide – and then take a moment to imagine what it would take for a journalist to recreate a similar level of detailed information for new database records being created in a public agency today (especially when those records are guarded by officials who are leery about permitting access to the records in the first place).

This isn’t a new problem that has appeared with born digital records. Archivists and journalists have always sought the context of the information with which they are working. The new challenge is in the added obstacles that a cryptic database system can add on top of the already existing challenges of decrypting the meaning of the records.

Archivists and Journalists care about a lot of the same issues related to born digital records. How do we acquire the records people will care about? How do we understand what they mean in the context of why and how they were created? How do we enable access to the information? Where do we get the resources, time and information to support important work like this?

It is interesting for me find a new angle from which to examine rapid software development. I have spent so much of my time creating software based on the needs of a specific user community. Usually those who are paying for the software get to call the shots on the features that will be included. Certain industries do have detailed regulations designed to promote access by external observers (I am thinking of applications related to medical/pharmaceutical research and perhaps HAZMAT data) but they are definitely exceptions.

Many people are worrying about how we will make sure that the medium upon which we record our born digital records remains viable. I know that others are pondering how to make sure we have software that can actually read the data such that it isn’t just mysterious 1s and 0s. What I am addressing here is another aspect of preservation – the preservation of context. I know this too is being worried about by others, but while I suspect we can eventually come up with best practices for the IT folks to follow to ensure we can still access the data itself – it will ultimately be up to the many individuals carrying on their daily business in offices around the world to ensure that we can understand the information in the records. I suppose that isn’t new either – just another reason for journalists and archivists to make their voices heard while the people who can explain the relationships between the born digital records and the business processes that created them are still around to answer questions.

GIS, Access, Archives and Daydreams

Today in my Information Structure class, our topic was Entity Relationship Modeling. While this is a technique that I have used frequently over the many years I have been designing Oracle databases, it was interesting to see a slightly different spin on the ideas. The second half of class was an exercise to take a stab (as a class) at coming up with a preliminary data model for a mythical genealogical database system.

While deciding if we should model PLACE as an entity, a woman in our class who is a genealogy specialist told us that only one database she has ever worked with tries to do any validation of location – but that it is virtually impossible due to the scale of the problem. Since the borders and names of places on earth have changed so rapidly over time, and often with little remaining documentation, it is hard to correlate place names from archival records with fixed locations on the planet. Anyone who has waded through the fabulous ship records on the Ellis Island website hunting for information about their grandparents or great-grandparents has struggled with trying to understand how the place names on those records relate to the physical world we live in.

So – now to my daydream. Imagine if we could somehow work towards a consolidated GIS database that included place names and boundary information throughout history. Each GIS layer would relate to specific years or eras in time. Imagine if you could connect any set of archival records that contained location data to this GIS database and not only visualize the records via a map – but visualize the records with the ability to change the layers so you could see how the boundaries and place names changed. And view the relationship between records that have different place names on them from different eras – but are actually from the same location.

I poked around to see what people are already doing – and found all of this:

I know it is a daydream – but I believe in my heart of hearts that it will exist someday as computing power increases, the price of storing data decreases and more data sources converge. I do forsee another issue related to the challenges presented by different versions of borders and place names from the same time period – but there are ways to address that too. It could happen – believe with me!