- Spellbound Blog - https://www.spellboundblog.com -

ArchivesZ Data Challenges: Syracuse University Special Collections Research Center

The Syracuse University Special Collections Research Center [1] has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.

My scripts extracted 6632 unique ‘tags’ from the subjects assigned to the finding aids. As part of parsing and loading the data for use in the visualizations, the script divides compound subjects into tags. For example, the subjects assigned to Syracuse University finding aids include these values (the number shown is the number of finding aids to which that subject is assigned):

There are also subjects whose components are separated by commas, such as these (the number listed is the total number of finding aids assigned that subject):

The goal is to capture the core ideas – to capture the overlap in subject matter among diverse collections. All of the collections with any of these subjects are about Art. With the current script, the tag Art is associated with 179 collections from Syracuse University. You can see from this tiny subset of subjects that other themes would be revealed when these subjects were decomposed more completely – and this just scratches the surface.

Out of the 6676 subjects, 5658 (about 85%) are assigned to single collections. Out of the 6632 tags the current script extracted from those subjects, 5594 (about 84%) are assigned to single collections. That is not much improvement with the current state of the script.
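A tally like the one above can be produced with a few lines of Python. This is only a minimal sketch, assuming the parsed data is available as a mapping from a finding aid identifier to its list of tags; the function name and the sample data are hypothetical and are not taken from the ArchivesZ script.

```python
from collections import Counter


def count_singletons(tags_by_collection):
    """Return (tags used by exactly one collection, total unique tags)."""
    counts = Counter()
    for tags in tags_by_collection.values():
        # Use a set so a tag repeated within one finding aid counts once.
        counts.update(set(tags))
    singles = sum(1 for n in counts.values() if n == 1)
    return singles, len(counts)


# Hypothetical sample data, for illustration only.
sample = {
    "aid_a": ["Art", "Cuba"],
    "aid_b": ["Art", "Art", "Poetry"],
}
singles, total = count_singletons(sample)
print(f"{singles} of {total} tags are assigned to a single collection")
```

Running the same tally on raw subjects and then on extracted tags gives a quick way to see whether a change to the splitting rules actually reduces the share of single-collection values.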

While the script currently does a good job with the Library of Congress double-dash separation pattern, the Syracuse University data has shown me a number of other standard patterns that need to be handled, some of which can be seen in the small sampling of art-related subjects shown above. The easy change is removing periods and stripping spaces from the end of subject values. The harder change will be to implement smart separation of subjects into tags based on commas. The code would need to break up only <subject> values while leaving <persname> and <corpname> values alone. I will also need to examine <geogname> values from across various institutions to decide whether it is better to break them up or leave them be.
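The splitting rules described above could look something like the following. This is a rough Python sketch under stated assumptions, not the actual ArchivesZ parsing script: the function names and the `COMMA_SPLIT_ELEMENTS` set are hypothetical, and <geogname> is left whole here only because that decision is still open.

```python
import re

# Assumption: only <subject> values are split on commas.
# <persname>, <corpname> and (for now) <geogname> are left whole.
COMMA_SPLIT_ELEMENTS = {"subject"}


def clean(value):
    """Collapse runs of whitespace and strip trailing periods and spaces."""
    return re.sub(r"\s+", " ", value).strip().rstrip(". ")


def split_into_tags(element_name, value):
    """Split one controlled-access value into a list of unique tags."""
    # The Library of Congress double-dash pattern applies to every element.
    parts = value.split("--")
    if element_name in COMMA_SPLIT_ELEMENTS:
        parts = [piece for part in parts for piece in part.split(",")]
    tags = []
    for part in parts:
        tag = clean(part)
        if tag and tag not in tags:
            tags.append(tag)
    return tags


print(split_into_tags("subject", "Art, Cuban, 20th century."))
print(split_into_tags("persname", "Smith, John, 1900-1980."))
```

The first call breaks the subject into three tags, while the second (a made-up name) stays in one piece. One caveat with this sketch: stripping trailing periods would also trim abbreviations such as "Inc." at the end of a <corpname>, so that rule may need to be limited to <subject> values as well.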

Other than these subject issues, there are a few other script modifications that I will need to make based on scenarios the data in the Syracuse finding aids have shown me:

All these issues are being added to my master ‘to do’ list for updating the EAD parsing script. Onward to the next data set.

Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form [2].

Image Credit: Syracuse University image above from Syracuse University Special Collections Research Center [1] home page.


Comments Disabled To "ArchivesZ Data Challenges: Syracuse University Special Collections Research Center"

#1 Comment By Jackie Dooley On March 7, 2009 @ 1:06 pm

What’s your thinking in assigning .25 l.f. to a single-item collection? Seems like way too much.

#2 Comment By Jeanne On March 7, 2009 @ 1:42 pm

You are probably right, but at the moment the smallest unit I had been using was 1/4 of a linear foot. I guess I could go with .1 or .01 linear feet for a single item. Sound better?

#3 Comment By Michele On March 9, 2009 @ 9:25 am

Glad you managed to wade through our data (and happy there were no XML errors!) Amazing that so many (5658) subjects are assigned to single collections! I had no idea we had so many unique subject headings. When you parse them at the double-dash I wonder if that will change, since the different pieces of an LCSH heading will be able to mix and match?

Re. the size of the small collections: we use 0.1 linear ft.; they’re not actually physically that big, most of them, but there is a minimum amount of time and effort that goes into processing and EAD creation, no matter how small the collection, and we figured that 0.1 reflected that fairly accurately.

FYI, the art-related subjects that you noted are all @source=local and are used to create a list of the core subject areas in which we collect (see the drop-down subject list [3] or the full subject list [4]). You could probably omit them without distorting the data, since they’re also reflected more formally in the @source=lcsh subject tags.

#4 Comment By Michele On March 9, 2009 @ 9:29 am

Correction: Should have said, “Some of the art-related subjects you mention, as well as a few other common subject headings, are @source=local…” Too hasty first thing in the morning!

#5 Pingback By ArchivesZ Data Challenges: Utah Government Archives & Records Service – Spellbound Blog On April 26, 2009 @ 1:33 am

[…] If I want to divide subjects such as “Art, Cuban, 20th century”, as I discuss in my Syracuse University post, then I end up also dividing these umbrella subjects which separate such very divergent terms with […]

#6 Pingback By ArchivesZ Data Challenges: Forest History Society – Spellbound Blog On May 6, 2009 @ 5:30 pm

[…] This is in contrast with this example of encoding from Syracuse University: […]