Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD-encoded finding aids. These EAD 2.0 XML files are the product of a grant-funded project, completed last year, to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD Project.
These finding aids have helped me identify three types of ArchivesZ data challenges:
- strange characters
- broad composite subjects
- determination of accurate collection size
Strange and mysterious characters!
These finding aids use a special character in the place of the standard Library of Congress double dash which normally appears between subsections of the subject heading.
An example subject from the Utah Government XML looks like this:
Viewing the same subject in a pure text editor (such as vi):
By the time it gets into my database and is pulled out via a query in MySQL Query Browser it looks like this:
Rather than just stripping out all instances of —, my plan is to replace them with the standard Library of Congress double dash. This will ensure that the existing code that breaks the subjects down to tags will still work.
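The replacement described above could be sketched roughly as follows. This is a minimal illustration, not my actual extraction code: it assumes the stray separator is an em dash (U+2014) or en dash (U+2013), and the function names and the sample subject are hypothetical.

```python
import re

# Assumption: the stray separator is an em dash (U+2014) or an en dash (U+2013),
# possibly with spaces around it. Adjust the pattern to the real character.
DASH_VARIANTS = re.compile(r"\s*[\u2013\u2014]\s*")


def normalize_subject(subject: str) -> str:
    """Replace dash variants with the Library of Congress double dash."""
    return DASH_VARIANTS.sub("--", subject).strip()


def split_subject(subject: str) -> list[str]:
    """Break a normalized subject into its subdivisions."""
    normalized = normalize_subject(subject)
    return [part.strip() for part in normalized.split("--") if part.strip()]


if __name__ == "__main__":
    # Hypothetical subject, written with an escaped em dash.
    print(split_subject("Water rights\u2014Utah"))
```

Normalizing first and splitting second means the existing double-dash logic does not need to know which repository a subject came from.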
Broad Composite Subjects
When I say “composite subject”, I mean a subject that combines several quite disparate terms. In a Library of Congress style subject, every part relates to the collection in question. These composite subjects instead group multiple distinct topics together for convenience.
This is a list of some of the most popular subjects for the Utah Gov collections:
- Politics, Government, and Law
- Business, Industry, Labor, and Commerce
- Science, Technology, and Health
- Arts, Humanities, and Social Sciences
These subjects throw a monkey wrench into my theories about decomposing subjects based on commas. The collections to which these subjects are assigned likely fit in only one of the component themes. For example, the “Inventory of Publications from Department of Technology Services, 1993-2008” is assigned the subject “Science, Technology, and Health”. If I divide this subject into 3 separate tags, the Science and Health tags would be quite misleading.
So that leaves me a bit trapped. If I divide subjects such as “Art, Cuban, 20th century” on commas, as I discuss in my Syracuse University post, then I also end up dividing these umbrella subjects, whose commas separate very divergent terms.
This issue goes on my list of reasons to add a repository configuration file for use by the data extraction script.
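One way such a repository configuration might work is sketched below. The structure, the `keep_whole` key, and the repository identifiers are all assumptions for illustration. The idea is simply that each repository can list umbrella subjects that should never be split on commas.

```python
# Hypothetical per-repository settings; the keys and structure are assumptions.
REPOSITORY_CONFIG = {
    "utah_state_archives": {
        "keep_whole": {
            "Politics, Government, and Law",
            "Business, Industry, Labor, and Commerce",
            "Science, Technology, and Health",
            "Arts, Humanities, and Social Sciences",
        },
    },
}


def subject_to_tags(subject, repository):
    """Split a subject on commas unless the repository lists it as an umbrella subject."""
    config = REPOSITORY_CONFIG.get(repository, {})
    if subject in config.get("keep_whole", set()):
        return [subject]
    return [term.strip() for term in subject.split(",") if term.strip()]


if __name__ == "__main__":
    print(subject_to_tags("Science, Technology, and Health", "utah_state_archives"))
    print(subject_to_tags("Art, Cuban, 20th century", "some_other_repository"))
```

A repository with no entry falls back to plain comma splitting, so the subjects from my earlier posts would behave as before.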
Accurate Collection Size
In my quest to convert all sizes to linear feet, sizes such as these are challenging:
- 0.20 cubic foot and 1 microfilm reel
- 0.35 cubic foot and 2 microfilm reels
I also have cases where sizes are specified in multiple sections of the finding aid. The Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991 has a collection-level size of “0.50 cubic foot and 2 microfilm reels”, but further down in this finding aid I see this:
series: ALERT Foundation records
- box 1, folder 1: Documentary: “Letters from our Children,” Motion picture film reel, 16mm
- box 1, folder 2: Documentary: “Letters from our Children,” VHS videocassette
- box 1, folder 3: Documentary: “Letters from our Children,” VHS videocassette
- box 1, folder 4: Documentary: “Letters from our Children,” VHS videocassette
When they said 2 microfilm reels, did they really mean a 16mm motion picture film reel and a VHS videocassette? Is there 1 VHS videocassette or 3? How sizes are specified in a specific repository’s finding aids is another possible candidate for a repository-level configuration file.
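As a first step toward handling composite sizes, the extent statement can at least be broken into quantity and unit pairs. The sketch below is only an illustration under assumptions: the `UNIT_ALIASES` table and the function name are hypothetical. It deliberately stops short of converting to linear feet, since I do not yet have conversion factors for units like microfilm reels.

```python
import re

# Matches a leading number followed by the unit text, e.g. "0.20 cubic foot".
PART = re.compile(r"^\s*(\d+(?:\.\d+)?)\s+(.+?)\s*$")

# Hypothetical singular forms for the units seen in these finding aids.
UNIT_ALIASES = {
    "cubic foot": "cubic foot",
    "cubic feet": "cubic foot",
    "microfilm reel": "microfilm reel",
    "microfilm reels": "microfilm reel",
}


def parse_extent(extent):
    """Return a list of (quantity, unit) pairs from a composite extent statement."""
    parts = []
    for chunk in re.split(r"\s+and\s+|,\s*", extent):
        match = PART.match(chunk)
        if match is None:
            continue  # leave unparseable chunks for manual review
        quantity = float(match.group(1))
        unit = match.group(2).lower()
        parts.append((quantity, UNIT_ALIASES.get(unit, unit)))
    return parts


if __name__ == "__main__":
    print(parse_extent("0.50 cubic foot and 2 microfilm reels"))
```

Keeping the units separate makes it possible to decide per repository which ones count toward the linear-feet total and which are ignored or flagged.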
Finally, here are a few tag stats:
- Only 31 tags (1.5% of all Utah Government tags) are associated with 10 or more collections
- 1404 tags (71.5%) are assigned to only a single collection
- 107 collections have been assigned only 1 tag
- 10 collections have no subjects
Of course these statistics are based on the current incarnation of the data extraction script. After I modify the script, there will be a greater number of tags and (hopefully) more overlap of tags across multiple collections. These types of statistics should help me gauge how well my data extraction logic is working.
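For comparing runs of the extraction script, statistics like the ones above could be recomputed with something along these lines. This is a sketch that assumes the tags are available as a mapping from a collection identifier to its set of tags. The function name, dictionary keys, and sample data are hypothetical.

```python
from collections import Counter


def tag_stats(collection_tags, popular_threshold=10):
    """Summarize tag overlap for a mapping of collection id -> set of tags."""
    # Count how many collections each tag is attached to.
    tag_counts = Counter(
        tag for tags in collection_tags.values() for tag in set(tags)
    )
    total_tags = len(tag_counts)
    singletons = sum(1 for n in tag_counts.values() if n == 1)
    return {
        "total_tags": total_tags,
        "popular_tags": sum(1 for n in tag_counts.values() if n >= popular_threshold),
        "single_collection_tags": singletons,
        "pct_single_collection": round(100 * singletons / total_tags, 1) if total_tags else 0.0,
        "collections_with_one_tag": sum(
            1 for tags in collection_tags.values() if len(set(tags)) == 1
        ),
        "collections_without_tags": sum(
            1 for tags in collection_tags.values() if not tags
        ),
    }


if __name__ == "__main__":
    # Made-up sample data, not the Utah numbers.
    sample = {"a": {"x", "y"}, "b": {"x"}, "c": set()}
    print(tag_stats(sample, popular_threshold=2))
```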