The Syracuse University Special Collections Research Center has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.
My scripts extracted 6632 unique ‘tags’ from the subjects assigned to the finding aids. As part of parsing and loading the data for use in the visualizations, the script divides compound subjects into tags. For example, the subjects assigned to Syracuse University finding aids include these values (the number shown is the number of finding aids to which that subject is assigned):
- Art — American — 20th century (1)
- Art — Cartoonists (68)
- Art — Cartoonists. (3)
- Art — Exhibitions. (1)
- Art — Illustrators (36)
- Art — Illustrators. (1)
- Art — Painters (77)
- Art — Philosophy. (1)
- Art — Sculpture (33)
There are also subjects whose components are separated by commas, such as these (the number listed is the total number of finding aids assigned that subject):
- Art, American (33)
- Art, American. (46)
- Art, American, 20th century (28)
- Art, American, 20th century. (31)
- Art, Cuban, 20th century (1)
- Art, Modern (1)
- Art, French, 20th century. (1)
The goal is to capture the core ideas – to capture the overlap in subject matter among diverse collections. All of the collections with any of these subjects are about Art. With the current script, the tag Art is associated with 179 collections from Syracuse University. You can see from this tiny subset of subjects that other themes would be revealed when these subjects were decomposed more completely – and this just scratches the surface.
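To make the decomposition concrete, here is a minimal sketch of splitting double dash subjects into tags and counting the collections attached to each tag. It is not the actual ArchivesZ script: the function names, the collection ids and the sample data are all made up for illustration, and it assumes the separator appears either as two hyphens or as the long dash that web pages often render in its place.

```python
import re
from collections import defaultdict

# Assumption: the separator is "--" or the long dash (U+2014) it is often rendered as.
SEPARATOR = re.compile(r"\s*(?:--|\u2014)\s*")


def split_subject(subject):
    """Split a compound subject into its component tags."""
    return [part for part in SEPARATOR.split(subject.strip()) if part]


def collections_per_tag(subjects_by_collection):
    """Map each tag to the set of collection ids that carry it."""
    tag_to_collections = defaultdict(set)
    for collection_id, subjects in subjects_by_collection.items():
        for subject in subjects:
            for tag in split_subject(subject):
                tag_to_collections[tag].add(collection_id)
    return tag_to_collections


# Hypothetical sample data, not the real Syracuse counts.
sample = {
    "coll-1": ["Art -- Cartoonists", "Art -- Illustrators"],
    "coll-2": ["Art -- Painters"],
    "coll-3": ["Art -- Cartoonists"],
}
counts = {tag: len(ids) for tag, ids in collections_per_tag(sample).items()}
```

Using a set per tag means a collection that carries both Art -- Cartoonists and Art -- Illustrators is still counted only once under Art.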
Out of the 6676 subjects, 5658 subjects are assigned to single collections. Out of the 6632 tags the current script extracted from those subjects, 5594 tags are assigned to single collections. Not much improvement with the current state of the script.
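Putting those four numbers side by side as percentages shows how small the gain is. The snippet below is just arithmetic on the counts quoted above; the helper name is my own.

```python
def singleton_share(singles, total):
    """Percentage of values that are assigned to exactly one collection."""
    return 100.0 * singles / total


subject_share = singleton_share(5658, 6676)  # subjects before splitting
tag_share = singleton_share(5594, 6632)      # tags after splitting

print(f"subjects: {subject_share:.1f}%  tags: {tag_share:.1f}%")
# prints: subjects: 84.8%  tags: 84.3%
```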
The script currently does a good job with the Library of Congress double dash separation pattern, but the Syracuse University data has shown me a number of other standard patterns that need to be handled, as the small sampling of art-related subjects above shows. The easy one is removing periods and stripping spaces from the end of subject values. The harder change will be to implement smart separation of subjects into tags based on commas. This would need the code to break up only <subject> values while leaving <persname> and <corpname> alone. I will also need to examine <geogname> values from across various institutions to decide if it is better to break them up or leave them be.
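A rough sketch of those two changes might look like the following. It is only an illustration under several assumptions: the element names carry no XML namespace, a plain comma split stands in for the "smart" separation (which would need more rules than this), and stripping trailing periods will also clip values that legitimately end in an abbreviation. The function names and the sample record, including the personal name, are invented.

```python
import xml.etree.ElementTree as ET

# Assumption: only <subject> is split on commas; <geogname> is left whole until examined.
SPLIT_ON_COMMA = {"subject"}


def normalize(value):
    """Collapse runs of whitespace and drop trailing periods and spaces."""
    return " ".join(value.split()).rstrip(". ")


def tags_from_element(element):
    """Return the tags for one access-point element."""
    text = normalize("".join(element.itertext()))
    if element.tag in SPLIT_ON_COMMA:
        return [part.strip() for part in text.split(",") if part.strip()]
    return [text] if text else []


# Hypothetical sample, not taken from a real finding aid.
sample = ET.fromstring(
    "<controlaccess>"
    "<subject>Art, American, 20th century. </subject>"
    "<persname>Smith, Jane, 1901-1980.</persname>"
    "</controlaccess>"
)
tags = [tag for child in sample for tag in tags_from_element(child)]
```

Here the subject yields three tags while the personal name stays in one piece, commas and all.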
Other than these subject issues, there are a few other script modifications that I will need to make based on scenarios the data in the Syracuse finding aids have shown me:
- Syracuse University uses an entity to populate the repository values – the current script does not handle this at all.
- Single item collections must be assigned a size of 0.25 linear feet.
- ‘Linear ft’ must be added as another recognized abbreviation for linear feet.
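The two extent items in the list above could be handled along these lines. This is a sketch, not the real parsing script: the function name is hypothetical, the spellings accepted for linear feet are my guess at a reasonable set, and I am assuming a single item collection states its extent as "1 item".

```python
import re

# Assumption: these are the spellings worth recognizing, including the new "linear ft".
LINEAR_FEET = re.compile(
    r"(\d+(?:\.\d+)?|\.\d+)\s*linear\s+(?:feet|foot|ft)\b",
    re.IGNORECASE,
)
# Assumption: a single item collection gives its extent as "1 item".
SINGLE_ITEM = re.compile(r"^\s*1\s+item\b", re.IGNORECASE)

SINGLE_ITEM_SIZE = 0.25  # linear feet assigned to a single item collection


def extent_in_linear_feet(extent):
    """Return the size in linear feet, or None if the extent is not recognized."""
    match = LINEAR_FEET.search(extent)
    if match:
        return float(match.group(1))
    if SINGLE_ITEM.match(extent):
        return SINGLE_ITEM_SIZE
    return None
```

Returning None for anything unrecognized keeps odd extent statements visible so they can be reviewed instead of silently becoming zero.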
All these issues are being added to my master ‘to do’ list for updating the EAD parsing script. Onward to the next data set.
Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.
Image Credit: Syracuse University image above from Syracuse University Special Collections Research Center home page.