The Oregon State University Archives has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of ArchivesZ. This is my first post in what will likely be a series of looks behind the scenes at the data-level challenges facing a project like ArchivesZ.
Version 1 of ArchivesZ only used finding aids from the University of Maryland and the Library of Congress. This was definitely a case of the path of least resistance: I attend the University of Maryland, and the Library of Congress has a very convenient page providing links to the source XML files for all their finding aids. A key aspect of creating version 2 of ArchivesZ is making sure that the scripts that pull data from EAD XML files are robust enough to handle the encoding practices of a very diverse range of institutions.
Please keep in mind that OSU is likely to bear the brunt of many basic data issues that I would have unearthed with whatever data sets I tried first!
There are 3 crucial data elements on which the visualizations of ArchivesZ depend: subject, inclusive dates, and collection size. Each element presents unique challenges. The script parsing issues I am uncovering with the OSU finding aids are currently worst for collection size. In order to make pretty charts which let people compare the quantity of materials in each collection (or record group – please forgive that I use the term ‘collection’ to mean any set of records for which a finding aid has been created), we need to be able to assign a single number to represent the size of each collection. Based on the values used in the LOC and UMD finding aids, we chose linear feet as our standard unit of measurement. So the trick is to translate whatever archivists choose to put into the <physdesc> element of their finding aid into some number of linear feet.
These are the size conversion rules we implemented for version 1 of ArchivesZ:
- 1 microfilm reel = 1 linear foot
- Collections described only by a number of items are represented as 0.25 linear feet
- If the size is specified only as a number of boxes, then 1 box = 0.5 linear feet
- When the size is given in several different types of units, they are prioritized in the following order: linear feet > boxes > microfilm reels > items
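To make the rules above concrete, here is a minimal Python sketch of how they could be applied. It is not the actual ArchivesZ script: the function name, the constant names, and the assumption that the quantities have already been parsed out of the finding aid into a dictionary are all mine, for illustration only.

```python
# Hypothetical names; only the conversion factors and the priority
# order come from the version 1 rules listed above.
PRIORITY = ["linear feet", "boxes", "microfilm reels", "items"]
FEET_PER_UNIT = {"linear feet": 1.0, "boxes": 0.5, "microfilm reels": 1.0}
ITEMS_ONLY_SIZE = 0.25  # flat size for collections described only by item count


def size_in_linear_feet(quantities):
    """Pick the highest-priority unit present and convert it to linear feet.

    `quantities` maps a unit name to the amount given in the finding aid,
    e.g. {"boxes": 4, "items": 120}. Returns None if no known unit is present.
    """
    for unit in PRIORITY:
        if unit in quantities:
            if unit == "items":
                # Items are only reached when no other unit is present.
                return ITEMS_ONLY_SIZE
            return quantities[unit] * FEET_PER_UNIT[unit]
    return None
```

For instance, `size_in_linear_feet({"linear feet": 2.0, "boxes": 10})` returns 2.0, because linear feet outrank boxes, while a collection described only as `{"items": 40}` comes back as 0.25.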
This works reasonably well when the physical description values are simple – it starts to fall apart when what is entered is more complicated. Here are some examples of the physical descriptions in the OSU finding aids:
Guide to the Phi Kappa Phi-OSU Chapter Records: The display in the ‘pretty’ version of the finding aid online shows this: 5.5 cubic feet (9 boxes, including 2 oversize boxes) (3 microfilm reels)
The version in the XML file is this:
<extent>5.5 cubic feet</extent>
<extent>9 boxes, including 2 oversize boxes</extent>
<extent>3 microfilm reels</extent>
With the current algorithm, this finding aid would be marked as being 3 linear feet in size. At a bare minimum, I must add ‘cubic feet’ as another unit to be converted. Harder to decide is whether I should use a value of 5.5 linear feet (assuming 1 cubic foot = 1 linear foot for the purposes of these comparisons) or a value of 8.5 linear feet (5.5 + 3 linear feet for the 3 microfilm reels). There is never going to be a perfect answer here, but clearly my logic needs to be more sophisticated than it is now.
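As a rough illustration of the two candidate answers, the following Python sketch reads the leading quantity from each <extent> element and computes both totals. The regular expressions, the function name, and the wrapping <physdesc> string are assumptions made for this example, as is the 1 cubic foot = 1 linear foot equivalence mentioned above. It also assumes the XML has no namespace, which will not hold for every EAD file.

```python
import re
import xml.etree.ElementTree as ET

# The three <extent> values from the example above, wrapped in <physdesc>.
PHYSDESC = """<physdesc>
<extent>5.5 cubic feet</extent>
<extent>9 boxes, including 2 oversize boxes</extent>
<extent>3 microfilm reels</extent>
</physdesc>"""

# Illustrative patterns for the unit that follows the leading number.
UNIT_PATTERNS = {
    "cubic feet": r"cubic (?:foot|feet)",
    "linear feet": r"linear (?:foot|feet)",
    "boxes": r"box(?:es)?",
    "microfilm reels": r"microfilm reels?",
    "items": r"items?",
}
NUMBER = r"\s*(\d+(?:\.\d+)?)\s+"


def leading_quantities(physdesc_xml):
    """Return {unit: amount} using only the first quantity in each <extent>."""
    found = {}
    for extent in ET.fromstring(physdesc_xml).iter("extent"):
        text = extent.text or ""
        for unit, pattern in UNIT_PATTERNS.items():
            match = re.match(NUMBER + pattern, text, re.IGNORECASE)
            if match:
                found[unit] = float(match.group(1))
                break
    return found


quantities = leading_quantities(PHYSDESC)

# Option 1: treat cubic feet as linear feet and ignore everything else.
cubic_only = quantities["cubic feet"] * 1.0
# Option 2: also add 1 linear foot per microfilm reel.
with_reels = cubic_only + quantities["microfilm reels"] * 1.0

print(cubic_only, with_reels)
```

Anything after the leading quantity, such as "including 2 oversize boxes", is deliberately ignored here; deciding what to do with those qualifiers is exactly the part that needs more thought.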
Harvey L. McAlister Collection: The display in the pretty version of this finding aid online is this: 1 cubic foot, including 26 photographs (4 boxes, including 2 oversize boxes, and 1 map folder)
The version in the XML file is this:
<extent encodinganalog="300$a">1 cubic foot, including 26 photographs</extent>
<extent encodinganalog="300$a">4 boxes, including 2 oversize boxes, and 1 map folder</extent>