In case you haven’t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions’ finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers. ...
I got a kind email today asking “Whither ArchivesZ?”. My reply was: “it is sleeping” (projects do need their rest) and “I just started a new job” (I am now a Metadata and Taxonomy Consultant at The World Bank) and “I need to find enthusiastic people to help me”. That final point brings me to this post.
I find myself in the odd position of having finished my Master’s Degree and not wanting to sign on for the long haul of a PhD. So I have a big project that was born in academia, initially as a joint class project and more recently as independent research with a grant-funded programmer, but I am no longer in academia. ...
Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA’s Archives and Special Collections finding aids in XML format.
With the current scripts, these are the fun tag stats:
- 1,684 total tags extracted
- 75% (1,266 tags) are associated with only one finding aid
- 3% (51 tags) are associated with 10 or more finding aids
235 out of the 253 collections ended up with a collection size of 0.
Consider the encoding of the collection size in the Guide to the Women’s Overseas Service League Records, 1910-2007:
<physdesc label="Extent:" encodinganalog="300$a"> 77 linear feet (approximately 44,000 items) </physdesc>
Contrast this with one of the examples where the size of the collection was extracted properly by the current script:
<physdesc label="Extent:" encodinganalog="300$a"> <extent>8.4 linear feet</extent> (14 boxes) </physdesc>
Sometimes it feels like a game of Where’s Waldo. In this case we are simply missing the set of <extent> tags from the first example. Off I went to the EAD tag descriptions to find the guidelines for use of the <physdesc> tag, where I found this overview of the tag:
A wrapper element for bundling information about the appearance or construction of the described materials, such as their dimensions, a count of their quantity or statement about the space they occupy, and terms describing their genre, form, or function, as well as any other aspects of their appearance, such as color, substance, style, and technique or method of creation. The information may be presented as plain text, or it may be divided into the <dimension>, <extent>, <genreform>, and <physfacet> subelements.
Bad news for my script logic – both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the <extent> tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the <physdesc> tags without enclosing <extent> or <dimension> tags.
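One way to cope with both valid encodings is to look for <extent> children first and fall back to the full text of <physdesc> when there are none. The sketch below is only an illustration, not the actual ArchivesZ script: the function name and the regular expression are assumptions, and it only recognizes sizes given in linear feet. It also assumes the elements carry no XML namespace.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pattern: a number followed by "linear feet" or "linear foot".
LINEAR_FEET = re.compile(r"(\d+(?:\.\d+)?)\s+linear\s+f(?:ee|oo)t", re.IGNORECASE)

def linear_feet_from_physdesc(physdesc_xml):
    """Return the size in linear feet, or None if no such measure is found."""
    physdesc = ET.fromstring(physdesc_xml)
    # Prefer explicit <extent> elements when the encoder supplied them.
    candidates = ["".join(e.itertext()) for e in physdesc.iter("extent")]
    # Fall back to all of the text inside <physdesc>, tagged or not.
    candidates.append("".join(physdesc.itertext()))
    for text in candidates:
        match = LINEAR_FEET.search(text)
        if match:
            return float(match.group(1))
    return None

without_extent = ('<physdesc label="Extent:" encodinganalog="300$a"> '
                  '77 linear feet (approximately 44,000 items) </physdesc>')
with_extent = ('<physdesc label="Extent:" encodinganalog="300$a"> '
               '<extent>8.4 linear feet</extent> (14 boxes) </physdesc>')

print(linear_feet_from_physdesc(without_extent))  # 77.0
print(linear_feet_from_physdesc(with_extent))     # 8.4
```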
Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of <unitdate> formats that my current script could not parse properly, including the examples below:
- 1917-1980 (bulk 1920-1945)
- 1876-1903, 1914-1919, 1940-2002
- 1940s, 1970s-1990s
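A forgiving way to handle date strings like these is to skip the structure entirely and just collect every year mentioned, then keep the earliest and latest. The sketch below is one possible approach under some loud assumptions: parenthetical bulk dates are ignored, a decade such as "1940s" is read as 1940 through 1949, and gaps between ranges are thrown away. The function name is hypothetical.

```python
import re

# A four-digit year from 1000 to 2099, optionally followed by "s" for a decade.
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})(s?)\b")

def year_span(unitdate_text):
    """Return (earliest, latest) year mentioned in the text, or None."""
    # Assumption: parenthetical notes such as "(bulk 1920-1945)" are dropped.
    text = re.sub(r"\([^)]*\)", "", unitdate_text)
    years = []
    for year, decade in YEAR.findall(text):
        start = int(year)
        years.append(start)
        if decade:
            # Assumption: "1940s" covers 1940 through 1949.
            years.append(start + 9)
    if not years:
        return None
    return min(years), max(years)

print(year_span("1917-1980 (bulk 1920-1945)"))       # (1917, 1980)
print(year_span("1876-1903, 1914-1919, 1940-2002"))  # (1876, 2002)
print(year_span("1940s, 1970s-1990s"))               # (1940, 1999)
```

Collapsing everything to a single span is lossy, so whether this is acceptable depends on how the visualization uses the dates.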
Another encoding approach that could not be parsed was the one used for the finding aid of the Church Women United of San Antonio Records. In this case the <unitdate> tag is within the <unittitle> tag as seen here:
<unittitle label="Title:" encodinganalog="245"> Church Women United of San Antonio Records, <unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate> </unittitle> ...
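A nested <unitdate> like this can be picked up by searching all descendants rather than only direct children. The sketch below uses Python's standard ElementTree; the enclosing <did> wrapper is added here so there is a parent to search from, and the snippet assumes no XML namespace.

```python
import xml.etree.ElementTree as ET

did = ET.fromstring(
    '<did><unittitle label="Title:" encodinganalog="245"> '
    'Church Women United of San Antonio Records, '
    '<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate> '
    '</unittitle></did>'
)

# find() with a bare tag name only looks at direct children of <did>,
# so the <unitdate> nested inside <unittitle> is missed.
print(did.find("unitdate"))  # None

# iter() walks every descendant, so nesting depth no longer matters.
dates = [el.text.strip() for el in did.iter("unitdate") if el.text]
print(dates)  # ['1961-2005']
```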
Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:
- Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like William E. Towell Papers, 1941 – 1988, my database ended up only with “William E Towell Papers,” based on this encoding: <titleproper>Inventory of the William E. Towell Papers, <date normal="1941/1988">1941 – 1988</date></titleproper>
- Need to handle a conversion factor for a size of “1 folder” (as found in the Inventory of the Biltmore Forest School Images, 1890 – 1988)
- My script chokes on the Inclusive Year format “1910 and 1931 – 1937” (as found in the Inventory of the Alfred Cunningham Papers, 1910 and 1931 – 1937)
- The presence of an <lb/> element within the <extent> tag, used to force a line break, is preventing my script from extracting any size information at all (as found in the Inventory of the DeWitt Nelson Papers, 1940 – 1976)
- Within the <abstract> tag, my script drops everything after an <emph render="doublequote"> tag (making for a very short abstract in the case of the Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 – 1947).
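Several of the issues in this list (titles, extents, and abstracts cut short at an embedded tag) look like the classic mixed-content problem: reading only an element's own leading text and missing what sits inside or after its children. The sketch below shows one way to gather all of the text, treating <lb/> as a space. The helper names are hypothetical, the attribute quotes are typed as plain ASCII quotes, and the <extent> sample is invented for illustration rather than taken from the DeWitt Nelson finding aid.

```python
import re
import xml.etree.ElementTree as ET

def _collect(element, parts):
    """Append the element's text, its children's text, and their tails, in order."""
    if element.tag == "lb":
        parts.append(" ")  # treat a forced line break as a space
    if element.text:
        parts.append(element.text)
    for child in element:
        _collect(child, parts)
        if child.tail:
            parts.append(child.tail)

def full_text(element):
    """Return all text inside an element with whitespace collapsed."""
    parts = []
    _collect(element, parts)
    return re.sub(r"\s+", " ", "".join(parts)).strip()

title = ET.fromstring(
    '<titleproper>Inventory of the William E. Towell Papers, '
    '<date normal="1941/1988">1941 – 1988</date></titleproper>'
)
print(repr(title.text))  # only the text before the first child tag
print(full_text(title))  # the whole title, date included

# Hypothetical <extent> containing a forced line break.
extent = ET.fromstring("<extent>2 linear feet<lb/>(4 boxes)</extent>")
print(full_text(extent))  # 2 linear feet (4 boxes)
```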
The most dramatic issue, seen across all the finding aids in this set, is that no subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of <list> and <item> tags as shown here:
<controlaccess> <head>Subject Headings</head> <list type="simple"> <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item> <item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item> <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item> ...
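If the working theory is right, a script that only inspects the direct children of <controlaccess> would find nothing here, while one that walks all descendants would find every heading. The sketch below illustrates the difference. The sample is trimmed to the three items shown above and closed off so that it parses, and the set of element names treated as headings is an assumption.

```python
import xml.etree.ElementTree as ET

# Element names treated as subject-like headings (an assumed, partial set).
HEADING_TAGS = {"subject", "persname", "corpname", "famname", "geogname", "genreform"}

controlaccess = ET.fromstring(
    '<controlaccess> <head>Subject Headings</head> <list type="simple"> '
    '<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item> '
    '<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item> '
    '<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item> '
    '</list> </controlaccess>'
)

# Direct children are only <head> and <list>, so nothing matches.
direct = [el.text for el in controlaccess if el.tag in HEADING_TAGS]
print(direct)  # []

# Walking all descendants finds the headings inside <list>/<item>.
nested = [(el.tag, el.text) for el in controlaccess.iter() if el.tag in HEADING_TAGS]
print(nested)
```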
Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD-encoded finding aids. These EAD 2.0 XML files are a product of a grant-funded project completed last year to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD Project.
These finding aids have helped me identify three types of ArchivesZ data challenges: ...
The title says it all. I won 2nd place in the “Smart Computers and Computing” section of the University of Maryland’s Graduate Research Interaction Day (GRID) for my poster ArchivesZ: Visualizing Archival Collections (what is in all those boxes?).
1st place in “Smart Computers and Computing” went to the fabulous Dave Levin for his presentation on TrInc: Small Trusted Hardware for Large Distributed Systems. ...
Come meet me and hear my 8 minute talk in front of a poster about ArchivesZ.
- When? April 13, 2009, 1:30-3pm
- What? University of Maryland’s Graduate Research Interaction Day (GRID)
- Where? University of Maryland’s Stamp Student Union
My ArchivesZ poster has been assigned to the “Smart Computers and Computer Science” theme. I will be with my poster in the Benjamin Banneker B room at UMD’s Stamp Student Union from 1:30 to 3pm. If you are attending GRID, please stop by and say hello!
Want a preview or can’t make it? Here is the poster in question: ...
- University Archives
- Public Policy Papers
- Manuscript Division
- Latin American Ephemera Collection
- Engineering Library
So onward to the data issues and what they mean for my ever growing ‘script fix to-do list’.
As we saw with the Oregon State University finding aids, the finding aids from Princeton University had a wide range of different values for repository names. In the list below we spot some issues. Some end in periods, some do not. One has extra space (probably a carriage return) in the middle. One does not include Princeton in the repository name. Once we have many repositories’ finding aids in ArchivesZ, a repository name of ‘Engineering Library’ does not tell the user enough about where those collections can be found. ...
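Most of these inconsistencies can be smoothed over with a small normalization step. The sketch below is one possible cleanup, not a tested fix: the function name, the sample inputs, and the choice to prefix the institution as "Institution. Name" are all assumptions made for illustration.

```python
import re

def normalize_repository(name, institution=None):
    """Collapse whitespace, drop a trailing period, and optionally add the institution."""
    # Collapse runs of whitespace (including stray carriage returns) to one space.
    cleaned = re.sub(r"\s+", " ", name).strip()
    # Drop trailing periods so "Papers." and "Papers" compare equal.
    cleaned = cleaned.rstrip(".").strip()
    # Assumption: prefix the institution when the name does not already mention it.
    if institution and institution.lower() not in cleaned.lower():
        cleaned = f"{institution}. {cleaned}"
    return cleaned

# Hypothetical raw values showing the kinds of variation described above.
print(normalize_repository("Public Policy\n      Papers."))
# Public Policy Papers
print(normalize_repository("Engineering Library", institution="Princeton University"))
# Princeton University. Engineering Library
```

Stripping every trailing period would also damage a name that legitimately ends in an abbreviation, so a real fix would need an exception list.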
The Syracuse University Special Collections Research Center has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.
My scripts extracted 6632 unique ‘tags’ from the subjects assigned to the finding aids. As part of the data parsing and loading of data for use in the visualizations, the script divides up compound subjects into tags. For example, in the subjects we find assigned to Syracuse University finding aids we find these values (number shown is number of finding aids to which that subject is assigned): ...
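The splitting of compound subjects into tags can be sketched roughly as follows. This is a guess at the general idea, not the actual ArchivesZ script: it assumes subdivisions are separated by a double hyphen, and the finding aid identifiers and subject lists are made up for illustration.

```python
from collections import Counter

def subject_to_tags(subject):
    """Split a compound subject such as 'A -- B' into its parts."""
    # Assumption: subdivisions are separated by a double hyphen.
    return [part.strip() for part in subject.split("--") if part.strip()]

# Hypothetical finding aids and their assigned subjects.
finding_aids = {
    "aid-1": ["Businessmen -- United States", "Audiotapes"],
    "aid-2": ["Forests and forestry -- United States"],
}

# Count how many finding aids each tag is associated with.
tag_counts = Counter()
for subjects in finding_aids.values():
    tags = {tag for subject in subjects for tag in subject_to_tags(subject)}
    tag_counts.update(tags)

print(subject_to_tags("Businessmen -- United States"))  # ['Businessmen', 'United States']
print(tag_counts["United States"])                      # 2
```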
The Oregon State University Archives has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of ArchivesZ. This is my first post in what will likely be a series of looks behind the scenes at the challenges facing a project like ArchivesZ on the data level.
Version one of ArchivesZ only used finding aids from the University of Maryland and the Library of Congress. This was definitely a case of the path of least resistance. I attend the University of Maryland and the Library of Congress has a very convenient page providing links to all their Finding Aids source XML files. A key aspect of creating version 2 of ArchivesZ is making sure that the scripts that pull data from EAD XML files are robust enough to handle the encoding practices of a very diverse range of institutions.
Please keep in mind that OSU is likely to bear the brunt of many basic data issues that I would have unearthed with whatever data sets I tried first!
There are 3 crucial data elements on which the visualizations of ArchivesZ depend: subject, inclusive dates, and collection size. Each element presents unique challenges. The script parsing issues I am uncovering with the OSU finding aids are currently worst for collection size. In order to make pretty charts which let people compare the quantity of materials in each collection (or record group – please forgive that I use the term ‘collection’ to mean any set of records for which a finding aid has been created), we need to be able to assign a single number to represent the size of each collection. Based on the values used in the LOC and UMD finding aids, we chose to go with linear feet as our standard unit of measurement. So the trick is to translate whatever archivists choose to put into the <physdesc> element of their finding aid into some number of linear feet.
These are the size conversion rules we implemented for version 1 of ArchivesZ:
- 1 microfilm reel = 1 linear foot
- Collections represented only by a number of items will be represented as .25 linear feet
- If size only specified in number of boxes, then 1 box = .5 linear feet
- When the size is given in several different types of units, they are prioritized in the following order: linear feet > boxes > microfilm reels > items
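Read literally, the rules above could be sketched like this. It is an illustration of the stated rules only, not the version 1 script itself: the regular expressions, the function name, and the choice to return None for unrecognized units are all assumptions.

```python
import re

# A number such as "8.4", "44,000" or ".5", followed by whitespace.
NUMBER = r"(\d[\d,]*(?:\.\d+)?|\.\d+)\s+"

# (unit pattern, linear feet per unit), listed in the stated priority order.
RULES = [
    (r"linear\s+f(?:ee|oo)t", 1.0),
    (r"box(?:es)?", 0.5),
    (r"microfilm\s+reels?", 1.0),
]

ITEMS = re.compile(r"\d[\d,]*\s+items?\b", re.IGNORECASE)

def to_linear_feet(physdesc_text):
    """Convert a physical description to linear feet, or None if nothing is recognized."""
    for unit, factor in RULES:
        match = re.search(NUMBER + unit, physdesc_text, re.IGNORECASE)
        if match:
            return float(match.group(1).replace(",", "")) * factor
    # Collections described only by an item count get a flat 0.25 linear feet.
    if ITEMS.search(physdesc_text):
        return 0.25
    return None

print(to_linear_feet("8.4 linear feet (14 boxes)"))   # 8.4
print(to_linear_feet("14 boxes"))                     # 7.0
print(to_linear_feet("3 microfilm reels"))            # 3.0
print(to_linear_feet("approximately 44,000 items"))   # 0.25
print(to_linear_feet("5.5 cubic feet"))               # None
```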
This works reasonably well when the physical description values are simple – it starts to fall apart when what is entered is more complicated. Here are some examples of the physical descriptions in the OSU finding aids:
Guide to the Phi Kappa Phi-OSU Chapter Records: The display in the ‘pretty’ version of the finding aid online shows this: 5.5 cubic feet (9 boxes, including 2 oversize boxes) (3 microfilm reels)
The version in the XML file is this:
<physdesc> <extent>5.5 cubic feet</extent> <extent>9 boxes, including 2 oversize boxes</extent> <extent>3 microfilm reels</extent> </physdesc>
With the current algorithm, this finding aid would be marked as being 3 linear feet in size. At a bare minimum, I must add ‘cubic feet’ as another unit to be converted. More difficult to discern is if I should have a value of 5.5 linear feet (assuming 1 cubic foot = 1 linear foot for the purposes of these comparisons) or a value of 8.5 linear feet (5.5 + 3 linear feet for the 3 microfilm reels). There is never going to be a perfect answer here, but clearly my logic needs to be more sophisticated than it is now.
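The two candidate values can be reproduced from the XML with a few lines of code. The sketch below only does the arithmetic for this one example, using the 1 cubic foot = 1 linear foot and 1 microfilm reel = 1 linear foot assumptions from the text; the helper name is hypothetical and the box count is deliberately left out of both totals.

```python
import re
import xml.etree.ElementTree as ET

physdesc = ET.fromstring(
    "<physdesc> <extent>5.5 cubic feet</extent> "
    "<extent>9 boxes, including 2 oversize boxes</extent> "
    "<extent>3 microfilm reels</extent> </physdesc>"
)

def leading_quantity(text, unit_pattern):
    """Return the number at the start of text if it is followed by the unit, else 0.0."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s+" + unit_pattern, text, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0

extents = [e.text for e in physdesc.findall("extent")]
cubic_feet = sum(leading_quantity(t, r"cubic\s+f(?:ee|oo)t") for t in extents)
reels = sum(leading_quantity(t, r"microfilm\s+reels?") for t in extents)

print(cubic_feet)          # 5.5 (cubic feet treated as linear feet)
print(cubic_feet + reels)  # 8.5 (adding 1 linear foot per microfilm reel)
```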
Harvey L. McAlister Collection: The display in the pretty version of this finding aid online is this: 1 cubic foot, including 26 photographs (4 boxes, including 2 oversize boxes, and 1 map folder)
The version in the XML file is this:
<physdesc> <extent encodinganalog="300$a">1 cubic foot, including 26 photographs</extent> <extent encodinganalog="300$a">4 boxes, including 2 oversize boxes, and 1 map folder</extent> </physdesc> ...