ArchivesZ Data Challenges: Oregon State University Archives

The Oregon State University Archives has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of ArchivesZ. This is my first post in what will likely be a series of looks behind the scenes at the data-level challenges facing a project like ArchivesZ.

Version one of ArchivesZ only used finding aids from the University of Maryland and the Library of Congress. This was definitely a case of the path of least resistance: I attend the University of Maryland, and the Library of Congress has a very convenient page providing links to all their finding aid source XML files. A key aspect of creating version 2 of ArchivesZ is making sure that the scripts that pull data from EAD XML files are robust enough to handle the encoding practices of a very diverse range of institutions.

Please keep in mind that OSU is likely to bear the brunt of many basic data issues that I would have unearthed with whatever data sets I tried first!

There are 3 crucial data elements on which the visualizations of ArchivesZ depend: subject, inclusive dates, and collection size. Each element presents unique challenges. The script parsing issues I am uncovering with the OSU finding aids are currently worst for collection size. In order to make pretty charts which let people compare the quantity of materials in each collection (or record group – please forgive that I use the term ‘collection’ to mean any set of records for which a finding aid has been created), we need to be able to assign a single number to represent the size of each collection. Based on the values used in the LOC and UMD finding aids, we chose to go with linear feet as our standard unit of measurement. So the trick is to translate whatever archivists choose to put into the <physdesc> element of their finding aid into some number of linear feet.

These are the size conversion rules we implemented for version 1 of ArchivesZ:

1 microfilm reel = 1 linear foot
Collections represented only by a number of items will be represented as .25 linear feet
If size only specified in number of boxes, then 1 box = .5 linear feet
When the size is given in several different types of units, they are prioritized in the following order: linear feet > boxes > microfilm reels > items
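As a rough illustration, the four rules above could be expressed along these lines. This is a minimal Python sketch, not the project's Ruby script: the function name to_linear_feet, the regular expressions, and the decision to return None for unrecognized units are all assumptions made for the example.

```python
import re

# Conversion factors from the version 1 rules, listed in priority order.
# The regular expressions are illustrative assumptions, not the project's patterns.
RULES = [
    ("linear feet", re.compile(r"(\d*\.?\d+)\s*linear\s+(?:feet|foot|ft)", re.I), 1.0),
    ("boxes", re.compile(r"(\d*\.?\d+)\s*box(?:es)?", re.I), 0.5),
    ("microfilm reels", re.compile(r"(\d*\.?\d+)\s*(?:microfilm\s+)?reels?", re.I), 1.0),
]
ITEMS = re.compile(r"(\d+)\s*items?", re.I)
ITEMS_ONLY_SIZE = 0.25  # any item-count-only collection is shown as .25 linear feet


def to_linear_feet(description):
    """Return an estimated size in linear feet, or None if no known unit is found."""
    # The first unit in priority order that appears in the text wins.
    for _unit, pattern, factor in RULES:
        match = pattern.search(description)
        if match:
            return float(match.group(1)) * factor
    if ITEMS.search(description):
        return ITEMS_ONLY_SIZE
    return None
```

Under these assumptions, to_linear_feet("2 linear feet (4 boxes)") returns 2.0 because linear feet outrank boxes, and a description that uses only an unlisted unit such as cubic feet returns None.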

This works reasonably well when the physical description values are simple – it starts to fall apart when what is entered is more complicated. Here are some examples of the physical descriptions in the OSU finding aids:

Guide to the Phi Kappa Phi-OSU Chapter Records: The display in the ‘pretty’ version of the finding aid online shows this: 5.5 cubic feet (9 boxes, including 2 oversize boxes) (3 microfilm reels)

The version in the XML file is this:

<physdesc>
  <extent>5.5 cubic feet</extent>
  <extent>9 boxes, including 2 oversize boxes</extent>
  <extent>3 microfilm reels</extent>
</physdesc>

With the current algorithm, this finding aid would be marked as being 3 linear feet in size. At a bare minimum, I must add ‘cubic feet’ as another unit to be converted. More difficult to discern is if I should have a value of 5.5 linear feet (assuming 1 cubic foot = 1 linear foot for the purposes of these comparisons) or a value of 8.5 linear feet (5.5 + 3 linear feet for the 3 microfilm reels). There is never going to be a perfect answer here, but clearly my logic needs to be more sophisticated than it is now.
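To make the two candidate values concrete, here is a small Python sketch that reads the <physdesc> shown above and computes both. It assumes 1 cubic foot = 1 linear foot, as discussed, and the names candidate_sizes and first_number are hypothetical. Real EAD files may also declare an XML namespace, which this sketch ignores.

```python
import re
import xml.etree.ElementTree as ET

PHYSDESC = """<physdesc>
  <extent>5.5 cubic feet</extent>
  <extent>9 boxes, including 2 oversize boxes</extent>
  <extent>3 microfilm reels</extent>
</physdesc>"""

CUBIC = re.compile(r"(\d*\.?\d+)\s*cubic\s+(?:feet|foot)", re.I)
REELS = re.compile(r"(\d*\.?\d+)\s*microfilm\s+reels?", re.I)


def first_number(pattern, extents):
    """Return the first quantity matching pattern across the extents, else 0.0."""
    for text in extents:
        match = pattern.search(text)
        if match:
            return float(match.group(1))
    return 0.0


def candidate_sizes(physdesc_xml):
    """Compute both candidate sizes (in linear feet) discussed in the post."""
    root = ET.fromstring(physdesc_xml)
    extents = [(e.text or "") for e in root.iter("extent")]
    cubic = first_number(CUBIC, extents)  # assumption: 1 cubic foot = 1 linear foot
    reels = first_number(REELS, extents)  # version 1 rule: 1 reel = 1 linear foot
    return {"cubic_only": cubic, "cubic_plus_reels": cubic + reels}


print(candidate_sizes(PHYSDESC))
# prints {'cubic_only': 5.5, 'cubic_plus_reels': 8.5}
```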

Harvey L. McAlister Collection: The display in the pretty version of this finding aid online is this: 1 cubic foot, including 26 photographs (4 boxes, including 2 oversize boxes, and 1 map folder)

The version in the XML file is this:

<physdesc>
  <extent encodinganalog="300$a">1 cubic foot, including 26 photographs</extent>
  <extent encodinganalog="300$a">4 boxes, including 2 oversize boxes, and 1 map folder</extent>
</physdesc>

With the current algorithm, this finding aid would be marked as being 1 linear foot in size. From looking at these two examples, it would seem that this would be fine and in fact – for the purposes of calculating a comparable size – only looking at the first <extent> value might be the way to go – at least for OSU finding aids.
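The "first <extent> only" idea might look something like the following Python sketch. It is one possible reading of that strategy, not the existing script: the function name size_from_first_extent is hypothetical, and treating cubic feet as equivalent to linear feet is the same assumption floated earlier.

```python
import re
import xml.etree.ElementTree as ET

# Accept either linear or cubic feet; treating them as equal is an assumption.
FEET = re.compile(r"(\d*\.?\d+)\s*(?:linear|cubic)\s+(?:feet|foot)", re.I)


def size_from_first_extent(physdesc_xml):
    """Use only the first <extent> child; return None if it has no size in feet."""
    root = ET.fromstring(physdesc_xml)
    first = root.find("extent")
    if first is None or not first.text:
        return None
    match = FEET.search(first.text)
    return float(match.group(1)) if match else None


MCALISTER = """<physdesc>
  <extent encodinganalog="300$a">1 cubic foot, including 26 photographs</extent>
  <extent encodinganalog="300$a">4 boxes, including 2 oversize boxes, and 1 map folder</extent>
</physdesc>"""

print(size_from_first_extent(MCALISTER))
# prints 1.0
```

The box, folder, and photograph counts are ignored entirely here, which is the point of the strategy: the first value is taken as the summary and the rest as detail.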

There are some other, simpler issues relating to standardization in the way that certain values are entered. For example, after ingesting 173 finding aids from OSU (the number I got through before my script flat out choked on a size designation), I ended up with five different repositories added to my REPOSITORIES table. I had expected only one. Each of these was entered as a repository name, and I have included the length of each value to show how extra spaces are causing part of the problem:

Oregon State University Libraries – length 36
Oregon State University – length 23
Oregon State UniversityLibraries – length 32
Oregon State University Libraries – length 36
Oregon State University Libraries – length 33

Some of these I can handle by adding smarter trimming of trailing spaces – but in this case it is clear that typos and inconsistency are also a challenge. I checked, and each of these different <corpname> values within the <repository> element is used by at least 10 finding aids. Perhaps they have been inherited over time from a template?
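A short Python sketch shows how far whitespace cleanup alone would get. The exact placement of the extra spaces in raw_names is a guess, since only the lengths are known, and normalize_name is a hypothetical helper.

```python
import re
from collections import Counter


def normalize_name(raw):
    """Collapse runs of whitespace (spaces, tabs, newlines) and trim both ends."""
    return re.sub(r"\s+", " ", raw).strip()


# Five variants with the reported lengths; where the extra spaces sit is assumed.
raw_names = [
    "Oregon State University Libraries   ",   # length 36, trailing spaces
    "Oregon State University",                # length 23
    "Oregon State UniversityLibraries",       # length 32, missing space
    "Oregon State University    Libraries",   # length 36, extra inner spaces
    "Oregon State University Libraries",      # length 33
]

counts = Counter(normalize_name(name) for name in raw_names)
for name, count in counts.items():
    print(f"{count} x {name!r}")
```

In this sketch, normalization merges the three spacing variants into one name but leaves "Oregon State University" and "Oregon State UniversityLibraries" as separate repositories, so trimming cannot fix the typos on its own.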

I have considered creating a repository definition file that could be used when loading finding aids from one repository at a time. This would remove dependence on perfect replication of these sorts of values while still supplying the data needed to let people limit their searches by a named repository.
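One way such a definition file could work is sketched below in Python. The JSON layout, the field names, and the apply_definition function are all hypothetical. The canonical name used is only an example value, and no such file exists in the project yet.

```python
import json

# Hypothetical contents of a per-repository definition file.
DEFINITION_JSON = """{
  "code": "osu",
  "name": "Oregon State University Libraries"
}"""


def apply_definition(records, definition):
    """Stamp every record in a batch with the definition's repository name.

    The value found in the finding aid is kept as raw_repository for auditing.
    """
    return [
        dict(record,
             repository=definition["name"],
             raw_repository=record.get("repository"))
        for record in records
    ]


definition = json.loads(DEFINITION_JSON)
batch = [
    {"title": "Finding aid A", "repository": "Oregon State UniversityLibraries"},
    {"title": "Finding aid B", "repository": "Oregon State University"},
]
for record in apply_definition(batch, definition):
    print(record["title"], "->", record["repository"])
```

Because the name comes from the definition rather than from each <corpname>, a whole batch lands under a single repository no matter how the value was typed in individual files.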

The last issue is the most minor. There are many \n (newline) and \t (tab) characters throughout the XML documents. These I plan to simply strip out as the script parses the XML file.

A big thank you to Elizabeth Nielsen, Senior Staff Archivist at OSU Archives. Her response to my query about OSU’s comfort with my taking apart their finding aids in public on my blog was “Bring it on – we’re tough!”.

It is fascinating to dig into new finding aids and see how the parsing script handles what it finds. I plan to test the existing script on XML from more sources to see all the things that must be fixed. Then I get to wrap my head around code that someone else wrote (another member of the original ArchivesZ team wrote the version 1 Ruby script). For those of you who are not programmers, you can skim through my Book Review of Dreaming in Code to get a handle on why this can be harder than it sounds like it should be.

Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.

Image Credit: OSU Archives image above from the OSU Archives Home Page.

1 Comment

Michele Combs
February 24, 2009 at 12:01 pm

Nice to hear the project is proceeding apace. Looks like you’re building a nice big test data set (we sent you the link to our 1000+ finding aids back in October as well). Looking forward to more peeks behind the curtain and most especially to an online prototype to play with!
