ArchivesZ | Spellbound Blog

Support EAD Tagging Research

December 6, 2010

In case you haven’t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions’ finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers.

Is part of your job to encode finding aids in EAD? Then please ask if you can send a dozen of them to the researchers on this project!

Seeking EAD records from repositories that have implemented EAD

Standards have been entering the archival lexicon at a fast pace to ensure data reliability, enable data aggregation, and manage data over the long term. However, we have not yet examined the use of these standards across the archival community. As we move into the next phase of standards-creation, a broad look at current implementations will help to inform the next generation of these standards. To do this, Kathy Wisser (Simmons College) and Jackie Dean (UNC Chapel Hill) are conducting research on EAD tag usage in the encoding community.

This project is intended to inform the TS-EAD revision process of the standard, and results will be disseminated through traditional publication avenues.

We are seeking a sample of encoded finding aids from institutions that have implemented EAD. If you are willing to participate in this project, please submit via electronic mail 12 to 15 finding aids to eadtagresearch@gmail.com by December 15, 2010.

The goal of the project is to identify encoding behavior and not to evaluate the quality of the encoding or the content of the finding aid. We will be noting the presence and absence of elements and attributes and the way that elements are used within the context of an EAD instance.

All results will be anonymized; no institution-specific information will be linked to the results. Institutions willing to participate will be acknowledged.

In order to obtain an accurate account of the use of the standard, we are looking for EAD instances from as many institutions as possible. We hope you will consider contributing to this effort.

If you have any questions about the project, please contact:

Kathy Wisser (Simmons College – wisser@simmons.edu)

Jackie Dean (UNC Chapel Hill – jdean@email.unc.edu)

ArchivesZ Needs You!

July 7, 2010 1 Comment

I got a kind email today asking “Whither ArchivesZ?”. My reply was: “it is sleeping” (projects do need their rest) and “I just started a new job” (I am now a Metadata and Taxonomy Consultant at The World Bank) and “I need to find enthusiastic people to help me”. That final point brings me to this post.

I find myself in the odd position of having finished my Master’s Degree and not wanting to sign on for the long haul of a PhD. So I have a big project that was born in academia, initially as a joint class project and more recently as independent research with a grant-funded programmer, but I am no longer in academia.

What happens to projects like ArchivesZ? Is there an evolutionary path towards it being a collaborative project among dispersed enthusiastic individuals? Or am I more likely to succeed by recruiting current graduate students at my former (and still nearby) institution? I have discussed this one-on-one with a number of individuals, but I haven’t thrown open the gates for those who follow me here online.

For those of you who have been waiting patiently, the ArchivesZ version 2 prototype is available online. I can’t promise it will stay online for long – it is definitely brittle for reasons I haven’t totally identified. A few things to be aware of:

When you load the main page, you should see tags listed at the bottom. If you don’t see any at all, drop me an email via my contact form and I will try to get Tomcat and Solr back up. If you have a small screen, you may need to view your browser full screen to get to all the parts of the UI.
I know there are lots of bugs of various sizes. Some paths through the app work – some don’t. Some screens are just placeholders. Feel free to poke around and try things – you can’t break it for anyone else!

I think there are a few key challenges to building what I would think of as the first ‘full’ version of ArchivesZ – listed here in no particular order:

In the process of creating version 2, I was too ambitious. The current version of ArchivesZ has lots of issues, some usability – some bugs (see prototype above!)
Wherever a collaborative workspace of ArchivesZ were going to live, it would need large data sets. I did a lot of work on data from eleven institutions in the spring of 2009, so there is a lot of data available – but it is still a challenge.
A lot of my future ideas for ArchivesZ are trapped in my head. The good news is that I am honestly open to others’ ideas for where to take it in the future.
How do we build a community around the creation of ArchivesZ?

I still feel that there is a lot to be gained by building a centralized visualization tool/service through which researchers and archivists could explore and discover archival materials. I even think there is promise to a freestanding tool that supports exploration of materials within a single institution. I can’t build it alone. This is a good thing – it will be much better in the end with the input, energy and knowledge of others. I am good at ideas and good at playing the devil’s advocate. I have lots of strength on the data side of things and visualization has been a passion of mine for years. I need smart people with new ideas, strong tech skills (or a desire to learn) and people who can figure out how to organize the herd of cats I hope to recruit.

So – what can you do to help ArchivesZ? Do you have mad ActionScript 3 skills? Do you want to dig into the scary little Ruby script that populates the database? Maybe you prefer to organize and coordinate? Have you always wanted to figure out how a project like this could grow from a happy (or awkward?) prototype into a real service that people depend on?

Do you have a vision for how to tackle this as a project? Open source? Grant funded? Something else clever?

Know any graduate students looking for good research topics? There are juicy bits here for those interested in data, classification, visualization and cross-repository search.

I will be at SAA in DC in August chairing a panel on search engine optimization of archival websites. If there is even just one of you out there who is interested, I would cheerfully organize an ArchivesZ summit of some sort in which I could show folks the good, bad and ugly of the prototype as it stands. Let me know in the comments below.

Won’t be at SAA but want to help? Chime in here too. I am happy to set up some shared desktop tours of whatever you would like to see.

PS: Yes, I do have all the version 2 code – and what is online at the Google Code ArchivesZ page is not up to date. Updating the ArchivesZ website and uploading the current code is on my to do list!

ArchivesZ Data Challenges: University of Texas at San Antonio

May 13, 2009 2 Comments

Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA’s Archives and Special Collections finding aids in XML format.

With the current scripts, these are the fun tag stats:

1,684 total tags extracted
75% (1,266 tags) are associated with only one finding aid
3% (51 tags) are associated with 10 or more finding aids

Collection Size

235 out of the 253 collections ended up with a collection size of 0.

Consider the encoding of the collection size in the Guide to the Women’s Overseas Service League Records, 1910-2007:

<physdesc label="Extent:" encodinganalog="300$a">
    77 linear feet (approximately 44,000 items)
</physdesc>

Contrast this with one of the examples where the size of the collection was extracted properly by the current script:

<physdesc label="Extent:" encodinganalog="300$a">
    <extent>8.4 linear feet</extent> 
    (14 boxes)
</physdesc>

Sometimes it feels like a game of Where’s Waldo. In this case we are simply missing the set of <extent> tags from the first example. Off I went to the EAD tag descriptions to find the guidelines for use of the <physdesc> tag, where I found this overview of the tag:

A wrapper element for bundling information about the appearance or construction of the described materials, such as their dimensions, a count of their quantity or statement about the space they occupy, and terms describing their genre, form, or function, as well as any other aspects of their appearance, such as color, substance, style, and technique or method of creation. The information may be presented as plain text, or it may be divided into the <dimension>, <extent>, <genreform>, and <physfacet> subelements.

Bad news for my script logic – both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the <extent> tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the <physdesc> tags without enclosing <extent> or <dimension> tags.
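One way to cope with both valid encodings is to prefer <extent> children and fall back to the plain text of <physdesc> when there are none. The following is a minimal Python sketch of that idea, not the Ruby extraction script itself; the function names and the regular expression are assumptions made for illustration, and it only looks for sizes stated in linear feet.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a number followed by "linear feet", "linear foot" or "linear ft".
LINEAR_FEET = re.compile(r"(\d+(?:\.\d+)?)\s*linear\s+(?:feet|foot|ft)\b", re.IGNORECASE)


def physdesc_size_texts(physdesc_xml):
    """Return candidate size strings from a single <physdesc> element.

    <extent> children are preferred; when none are present, the whole text
    content of <physdesc> is used instead.
    """
    physdesc = ET.fromstring(physdesc_xml)
    extents = [" ".join("".join(e.itertext()).split()) for e in physdesc.iter("extent")]
    extents = [text for text in extents if text]
    if extents:
        return extents
    fallback = " ".join("".join(physdesc.itertext()).split())
    return [fallback] if fallback else []


def linear_feet(physdesc_xml):
    """Return the first size in linear feet found, or None if there is none."""
    for text in physdesc_size_texts(physdesc_xml):
        match = LINEAR_FEET.search(text)
        if match:
            return float(match.group(1))
    return None


WITH_EXTENT = """<physdesc label="Extent:" encodinganalog="300$a">
    <extent>8.4 linear feet</extent>
    (14 boxes)
</physdesc>"""

WITHOUT_EXTENT = """<physdesc label="Extent:" encodinganalog="300$a">
    77 linear feet (approximately 44,000 items)
</physdesc>"""

print(linear_feet(WITH_EXTENT), linear_feet(WITHOUT_EXTENT))
```

With this fallback, both of the UTSA examples above yield a size instead of 0. Whether a plain-text fallback is safe across a broader sample of finding aids is exactly the open question.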

Inclusive Dates

Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of <unitdate> formats that my current script could not parse properly, including the examples below:

1917-1980 (bulk 1920-1945)
1876-1903, 1914-1919, 1940-2002
1940s, 1970s-1990s

Another encoding approach that could not be parsed was the one used for the finding aid of the Church Women United of San Antonio Records. In this case the <unitdate> tag is within the <unittitle> tag as seen here:

<unittitle label="Title:" encodinganalog="245">
Church Women United of San Antonio Records,
<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate>
</unittitle>

Among the finding aids for which I did extract a range of inclusive date years, I also found issues with values like 1950s-1990s. The current script interpreted this to represent 1950 through 1990, but I believe it would be more properly translated as representing 1950 through 1999.
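A rough Python sketch of a more forgiving year parser follows. It is an illustration under stated assumptions rather than the actual script: bulk dates in parentheses are ignored, a decade such as 1990s is read as running through 1999, and multiple ranges are collapsed into a single earliest-to-latest span (which loses any gaps). The names YEAR_TOKEN and inclusive_years are made up for this example.

```python
import re

# A four-digit year, optionally followed by "s" to mark a decade (e.g. 1950s).
YEAR_TOKEN = re.compile(r"\b(\d{4})(s?)\b")


def inclusive_years(date_text):
    """Return (earliest, latest) for a free-text date value, or None."""
    # Assumption: bulk dates are a subset of the inclusive dates, so drop them.
    text = re.sub(r"\(bulk[^)]*\)", "", date_text, flags=re.IGNORECASE)
    spans = []
    for match in YEAR_TOKEN.finditer(text):
        year = int(match.group(1))
        if match.group(2):
            # Assumption: a decade covers all ten of its years.
            spans.append((year, year + 9))
        else:
            spans.append((year, year))
    if not spans:
        return None
    return min(start for start, _ in spans), max(end for _, end in spans)


for value in ("1917-1980 (bulk 1920-1945)",
              "1876-1903, 1914-1919, 1940-2002",
              "1940s, 1970s-1990s",
              "1950s-1990s"):
    print(value, "->", inclusive_years(value))
```

For a <unitdate> nested inside a <unittitle>, the same function could be applied to the text of the nested element once it has been located.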

General Code Fixes

The University of Texas at San Antonio’s finding aids have provided additional examples of the following data and encoding issues already identified in earlier data sets:

Inconsistent repository titles (26 different variations of “The University of Texas at San Antonio Library”)
Titles with embedded and tagged dates
Carriage return and tab characters that need to be removed
Emphasis within a title or abstract added via a tag (such as <emph render="italic">Storyletters</emph> seen in A Guide to the Storyletters Records, 1991-2000) which interrupts extraction of text at that point

Next Steps

This is the last data set I am analyzing before tackling actual updates to the ArchivesZ data extraction script. My next step is to review and prioritize my long to do list for updates to this script. Most of what I have found in my examination of the data sets are ways in which my script was not smart enough to handle valid variations in encoding and the tabs, carriage returns, formatting tags and special characters found throughout everyone’s XML. Yes, there are some cases in which the data itself is less than optimal (such as non-standardized repository titles) or the values challenging (so many ways to describe the size of a collection!), but overall I am optimistic about how much more I can improve the extraction script before I have to resort to hand correcting records in the database.

Thanks to everyone for your patience with these data analysis posts. Onward to programming!

ArchivesZ Data Challenges: Forest History Society

May 6, 2009 2 Comments

Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:

Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like William E. Towell Papers, 1941 – 1988, my database ended up with only “William E Towell Papers,” based on this encoding: <titleproper>Inventory of the William E. Towell Papers, <date normal="1941/1988">1941 – 1988</date></titleproper>
Need to handle a conversion factor for a size of “1 folder” (as found in the Inventory of the Biltmore Forest School Images, 1890 – 1988)
My script chokes on the Inclusive Year format “1910 and 1931 – 1937” (as found in the Inventory of the Alfred Cunningham Papers, 1910 and 1931 – 1937)
The presence of a <lb/> element within the <extent> tag, used to force a line break, is preventing my script from extracting any size information at all (as found in the Inventory of the DeWitt Nelson Papers, 1940 – 1976)
Within the <abstract> tag, my script drops everything after an <emph render="doublequote"> tag (making for a very short abstract in the case of the Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 – 1947).
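Several of these problems share a cause: the text of an element is interrupted by child elements such as <date>, <emph> or <lb/>. A minimal Python sketch of one way to flatten such mixed content is shown below. It is not the Ruby script; the helper name flatten_text is hypothetical, and the <extent> and <abstract> strings are invented for illustration (only the <titleproper> value comes from the finding aid quoted above).

```python
import xml.etree.ElementTree as ET


def flatten_text(element):
    """Return all text inside an element, ignoring inline markup.

    <lb/> is treated as a space so the words on either side of a forced
    line break do not run together.
    """
    parts = []

    def walk(node):
        if node.tag == "lb":
            parts.append(" ")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                # Text that follows a child element belongs to the parent.
                parts.append(child.tail)

    walk(element)
    # Collapse tabs, carriage returns and repeated spaces.
    return " ".join("".join(parts).split())


title = ET.fromstring(
    '<titleproper>Inventory of the William E. Towell Papers, '
    '<date normal="1941/1988">1941 – 1988</date></titleproper>'
)
extent = ET.fromstring("<extent>3 boxes<lb/>1.5 linear feet</extent>")  # invented value
abstract = ET.fromstring(
    '<abstract>Photographs from <emph render="doublequote">a field trip</emph>\n'
    '\tand related notes.</abstract>'  # invented value
)

for element in (title, extent, abstract):
    print(flatten_text(element))
```

The same helper also takes care of stray tabs and carriage returns, since all runs of whitespace are collapsed to single spaces.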

The most dramatic issue, seen across all the finding aids in this set, is that no subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of <list> and <item> tags as shown here:

<controlaccess>
<head>Subject Headings</head>
<list type="simple">
<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>

This is in contrast with this example of encoding from Syracuse University:

<controlaccess>
<head>Subject and Genre Headings</head>
<subject encodinganalog="650" source="local">Adult education</subject>
<persname encodinganalog="600" source="lcnaf">Adolphson, L. H.</persname>
<persname encodinganalog="600" source="lcnaf">Bradford, Leland Powers, 1905-</persname>

Or this sample from Oregon State University:

<controlaccess id="a12">
	 <controlaccess>
		  <persname encodinganalog="600" source="local" rules="aacr2"
		  role="subject">Aitken, Frances Alva, 1889-1970.</persname>
	 </controlaccess>
	 <controlaccess>
		  <corpname encodinganalog="610" source="local" role="subject"
		  rules="aacr2">Oregon Agricultural College. Class of 1910.</corpname>
		  <corpname source="lcnaf" encodinganalog="610" role="subject">Oregon
				Agricultural College--Students.</corpname>
	 </controlaccess>
	 <controlaccess>
		  <geogname source="lcsh" role="subject" encodinganalog="651">Corvallis
				(Or.)</geogname>
	 </controlaccess>
	 <controlaccess>
		  <subject encodinganalog="650" source="lcsh">Student
				activities--Oregon--Corvallis.</subject>
	 </controlaccess>

Both the Syracuse and OSU examples are handled by the current state of the data extract script.

Amanda pointed me to the NCEAD Best Practice Guidelines for EAD 2002. Down in Appendix G: How Do I Encode…, the second question down is “What if I have multi-part scope notes, biographical notes or subject headings?” followed by exactly the <list> and <item> tag usage found in the Forest History Society finding aids. This format clearly should be handled.
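If the working theory is right, one fix is to stop caring how deeply the headings are nested and simply collect every heading element found anywhere under <controlaccess>. The Python sketch below illustrates that approach; it is an assumption about a possible fix, not the current Ruby logic. The names HEADING_TAGS and access_headings are hypothetical, the samples are trimmed from the excerpts above with closing tags added so they parse, and the files are assumed to use no XML namespace.

```python
import xml.etree.ElementTree as ET

# Element names treated as subject-like headings in this sketch.
HEADING_TAGS = {"subject", "persname", "corpname", "geogname", "genreform"}


def access_headings(controlaccess_xml):
    """Return (element name, text) pairs for every heading, at any depth."""
    root = ET.fromstring(controlaccess_xml)
    headings = []
    for element in root.iter():  # visits all descendants in document order
        if element.tag in HEADING_TAGS:
            text = " ".join("".join(element.itertext()).split())
            if text:
                headings.append((element.tag, text))
    return headings


LIST_STYLE = """<controlaccess>
<head>Subject Headings</head>
<list type="simple">
<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
</list>
</controlaccess>"""

NESTED_STYLE = """<controlaccess id="a12">
  <controlaccess>
    <geogname source="lcsh" role="subject" encodinganalog="651">Corvallis
        (Or.)</geogname>
  </controlaccess>
  <controlaccess>
    <subject encodinganalog="650" source="lcsh">Student
        activities--Oregon--Corvallis.</subject>
  </controlaccess>
</controlaccess>"""

print(access_headings(LIST_STYLE))
print(access_headings(NESTED_STYLE))
```

Because the search is by element name rather than by position, the flat, nested and <list>/<item> layouts are all handled the same way.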

So, no fun tag stats for this run – but I hope to fix my ruby script so that the Forest History Society finding aids can be incorporated into the data set I use for testing version 2 of ArchivesZ. My ruby script to do list is getting quite long!

ArchivesZ Data Challenges: Utah Government Archives & Records Service

April 26, 2009 6 Comments

Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD encoded finding aids. These EAD 2.0 XML files are a product of a grant funded project completed last year to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD Project.

These finding aids have helped me identify three types of ArchivesZ data challenges:

strange characters
broad composite subjects
determination of accurate collection size

Strange and mysterious characters!

These finding aids use a special character in the place of the standard Library of Congress double dash which normally appears between subsections of the subject heading.

An example subject from the Utah Government XML looks like this:

Women—Suffrage—Utah.

Viewing the same subject in a pure text editor (such as vi):

Women—Suffrage—Utah.

By the time it gets into my database and is pulled out via a query in MySQL Query Browser it looks like this:

Women?¢‚Ç¨‚ÄùSuffrage?¢‚Ç¨‚ÄùUtah.

Rather than just stripping out all instances of —, my plan is to replace them with the standard Library of Congress double dash. This will ensure that the existing code that breaks the subjects down to tags will still work.
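In Python, the replacement plan might look like the sketch below. The function names are hypothetical and this is not the Ruby script; it only covers the em dash (U+2014) and does nothing about the garbling seen in the database, which looks like a separate character-encoding mismatch somewhere between the file and MySQL.

```python
EM_DASH = "\u2014"


def normalize_subject_dashes(subject):
    """Replace em dashes with the Library of Congress style double dash."""
    return subject.replace(EM_DASH, "--")


def subject_to_tags(subject):
    """Split a subject heading into tags on the double dash."""
    parts = normalize_subject_dashes(subject).split("--")
    # Trim spaces and a trailing period from each piece; skip empty pieces.
    tags = [part.strip().rstrip(".").strip() for part in parts]
    return [tag for tag in tags if tag]


print(subject_to_tags("Women\u2014Suffrage\u2014Utah."))
```

Normalizing first means the existing double-dash splitting logic does not need to know about the special character at all.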

Composite Subjects

When I say “composite subject” what I mean is a subject that includes multiple very disparate terms. Rather than the Library of Congress style subjects, all aspects of which relate to the collection in question, these composite subjects cover multiple subjects which are grouped together for convenience.

This is a list of some of the most popular subjects for the Utah Gov collections:

Politics, Government, and Law
Business, Industry, Labor, and Commerce
Science, Technology, and Health
Arts, Humanities, and Social Sciences

These subjects throw a monkey wrench into my theories about decomposing subjects based on commas. The collections to which these subjects are assigned likely fit in only one of the component themes. For example, the “Inventory of Publications from Department of Technology Services, 1993-2008” is assigned the subject “Science, Technology, and Health”. If I divide this subject into 3 separate tags, the Science and Health tags would be quite misleading.

So that leaves me a bit trapped. If I want to divide subjects such as “Art, Cuban, 20th century”, as I discuss in my Syracuse University post, then I end up also dividing these umbrella subjects which separate such very divergent terms with commas.

This issue goes on my list of reasons to add a repository configuration file for use by the data extraction script.
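What such a configuration could contain is sketched below in Python. Everything here is hypothetical: the dictionary layout, the repository key "utah" and the function name comma_tags are invented for illustration, and a real configuration file might look quite different.

```python
# Hypothetical per-repository settings: subjects that must never be split on commas.
UMBRELLA_SUBJECTS = {
    "utah": frozenset({
        "Politics, Government, and Law",
        "Business, Industry, Labor, and Commerce",
        "Science, Technology, and Health",
        "Arts, Humanities, and Social Sciences",
    }),
}


def comma_tags(subject, repository_key):
    """Split a subject on commas unless the repository lists it as an umbrella subject."""
    cleaned = subject.strip().rstrip(".").strip()
    if cleaned in UMBRELLA_SUBJECTS.get(repository_key, frozenset()):
        return [cleaned]
    return [part.strip() for part in cleaned.split(",") if part.strip()]


print(comma_tags("Science, Technology, and Health", "utah"))
print(comma_tags("Art, Cuban, 20th century", "syracuse"))
```

An explicit list is the simplest option; a repository-level switch that turns comma splitting off entirely would be another.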

Accurate Collection Size

In my quest to convert all sizes to linear feet – sizes such as these are challenging:

0.20 cubic foot and 1 microfilm reel
0.35 cubic foot and 2 microfilm reels

I also have situations where sizes are specified in multiple sections of the finding aid. The Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991 has a collection level size of “0.50 cubic foot and 2 microfilm reels”, but further down in this finding aid I see this:

series: ALERT Foundation records

box 1, folder 1: Documentary: “”Letters from our Children,”” Motion picture film reel, 16mm
box 1, folder 2: Documentary: “”Letters from our Children,”” VHS videocassette
box 1, folder 3: Documentary: “”Letters from our Children,”” VHS videocassette
box 1, folder 4: Documentary: “”Letters from our Children,”” VHS videocassette

When they said 2 microfilm reels – do they really mean a 16mm motion picture film reel and a VHS videocassette? Is there 1 VHS videocassette or 3? How sizes are specified in a specific repository’s finding aids is another possible candidate for a repository level configuration script.

Tagging Statistics

Finally, here are a few tag stats:

Only 31 tags (1.5% of all Utah Government tags) are associated with 10 or more collections
1404 tags (71.5%) are assigned to only a single collection
107 collections have been assigned only 1 tag
10 collections have no subjects

Of course these statistics are based on the current incarnation of the data extraction script. After I modify the script, there will be a greater number of tags and (hopefully) more overlap of tags across multiple collections. These types of statistics should help me gauge how well my data extraction logic is working.

ArchivesZ Poster Wins 2nd Place at GRID 2009

April 22, 2009 2 Comments

The title says it all. I won 2nd place in the “Smart Computers and Computing” section of the University of Maryland’s Graduate Research Interaction Day (GRID) for my poster ArchivesZ: Visualizing Archival Collections (what is in all those boxes?).

1st place in “Smart Computers and Computing” went to the fabulous Dave Levin for his presentation on TrInc: Small Trusted Hardware for Large Distributed Systems.

Overall, it was a great experience. I wish I could have been in multiple rooms at the same time so I could have seen more posters and presentations. I also wish I had understood that I could have presented with either a poster or a PowerPoint deck. That was not entirely clear ahead of time. The downside of my choice was being tied to my poster, but the upside is that I still have the poster that can be examined by readers like you. Obviously it all worked out in the end.

A big thanks to everyone in the Graduate Student Government who worked so hard to bring this event together.

ArchivesZ Poster at UMD’s GRID 2009

April 12, 2009

Come meet me and hear my 8 minute talk in front of a poster about ArchivesZ.

When? April 13, 2009, 1:30-3pm
What? University of Maryland’s Graduate Research Interaction Day (GRID)
Where? University of Maryland’s Stamp Student Union

My ArchivesZ poster has been assigned to the “Smart Computers and Computer Science” theme. I will be with my poster in the Benjamin Banneker B room at UMD’s Stamp Student Union from 1:30 to 3pm. If you are attending GRID, please stop by and say hello!

Want a preview or can’t make it? Here is the poster in question:

ArchivesZ Data Challenges: Princeton University

March 23, 2009

I received a zip file of 1,771 EAD encoded finding aids from the kind EAD enthusiasts at the Seeley G. Mudd Manuscript Library. These finding aids came from five divisions within Princeton’s Library:

So onward to the data issues and what they mean for my ever growing ‘script fix to-do list’.

Repository Names

As we saw with the Oregon State University finding aids, the finding aids from Princeton University had a wide range of different values for repository names. In the list below we spot some issues. Some end in periods, some do not. One has extra space (probably a carriage return) in the middle. One does not include Princeton in the repository name. Once we have many repositories’ finding aids in ArchivesZ, a repository name of ‘Engineering Library’ does not tell the user enough about where those collections can be found.

Here is the list of repository titles my script extracted:

Princeton University Library. Department of Rare Books and Special Collections.
Engineering Library
Princeton University Library
Princeton University Library. Department of Rare Books and Special Collections.
Princeton University Library.

My script can handle the extra period and the extra spaces, but the non-specific name would need to ultimately be fixed on the source side.
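The period and whitespace cleanup is simple enough to sketch in a few lines of Python. This is an illustration only, not the Ruby script; the function name is hypothetical, and the second sample value is constructed here to imitate a carriage return in the middle of a name.

```python
def normalize_repository_name(name):
    """Collapse internal whitespace and drop a trailing period."""
    collapsed = " ".join(name.split())  # also removes carriage returns and tabs
    return collapsed.rstrip(".").strip()


extracted = [
    "Princeton University Library. Department of Rare Books and Special Collections.",
    "Princeton University Library. Department of Rare Books and\n    Special Collections.",
    "Princeton University Library",
    "Princeton University Library.",
    "Engineering Library",
]

print(sorted({normalize_repository_name(name) for name in extracted}))
```

After normalization the five values collapse to three distinct names, and "Engineering Library" remains as ambiguous as before.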

Collection Size

The current script assumes that there is only one extent value specified to express the size of the collection. Princeton’s finding aids showed me examples of multiple extent values. For example, the Christina Georgina Rossetti Collection has both a collection level size of 0.4 linear feet (1 archival box) as well as a 2nd extent specification corresponding to a specific folder with the value of (1 poem, 3 drawings, 1 photo, 1 incomplete article). The script must be modified to only consider the collection level size.
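One way to restrict the search is to read only the <physdesc> that sits directly inside the top-level <did> of <archdesc>, and ignore any <did> nested under components. The Python sketch below shows the idea; it is not the Ruby script, the function name is hypothetical, and the sample XML is a simplified, invented structure that borrows only the two size values quoted above. It assumes the file uses no XML namespace.

```python
import xml.etree.ElementTree as ET


def collection_level_extents(ead_xml):
    """Return the text of <extent> elements found in archdesc/did/physdesc only."""
    root = ET.fromstring(ead_xml)
    archdesc = root if root.tag == "archdesc" else root.find(".//archdesc")
    if archdesc is None:
        return []
    extents = []
    # "./did/physdesc" matches only the direct <did> child of <archdesc>,
    # so <did> elements nested inside components are skipped.
    for physdesc in archdesc.findall("./did/physdesc"):
        for extent in physdesc.iter("extent"):
            text = " ".join("".join(extent.itertext()).split())
            if text:
                extents.append(text)
    return extents


SAMPLE = """<archdesc level="collection">
  <did>
    <physdesc><extent>0.4 linear feet (1 archival box)</extent></physdesc>
  </did>
  <dsc>
    <c01 level="file">
      <did>
        <physdesc><extent>(1 poem, 3 drawings, 1 photo, 1 incomplete article)</extent></physdesc>
      </did>
    </c01>
  </dsc>
</archdesc>"""

print(collection_level_extents(SAMPLE))
```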

Complicated Titles

The current script logic apparently does not handle what I would call ‘complicated collection titles’. For example, I ended up with “Edward Livingston Papers, ” as the title for a collection with a full title of Edward Livingston Papers, 1683-1877 (bulk 1764-1836). This is the way that this title is encoded:

<unittitle encodinganalog="245$a" label="Title and dates: ">Edward Livingston Papers, <unitdate encodinganalog="245$f" normal="1683/1877" type="inclusive">1683-1877</unitdate> (bulk <unitdate encodinganalog="245$g" normal="1764/1836" type="bulk">1764-1836</unitdate>)</unittitle>

Too Many Tags

The Engineering Library’s Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid has 522 tags assigned to it! Almost all of these are the names of the authors of the individual reports. This scenario goes on the list of reasons why I might choose to not include (at least for this version) persname subjects. The other option for handling this situation is to only use subjects assigned at the collection level and ignoring subjects assigned at lower unit/container levels. Without the author tags, this single collection ends up with this nice, reasonable list of tags:

Fluid mechanics
Mechanical engineering
Combustion
Aerospace engineering
Propulsion systems

Year Challenges
I found two different issues related to year ranges:

Women in Argentina, VI, 1989-2001: Finding Aid: The current script does not properly extract the inclusive dates which are encoded within the titleproper tags, but rather assumes that it will be encoded using a unitdate tag.
An assortment of finding aids include subjects which have year spans as part of the subject. When these subjects are decomposed into tags, we end up with tags like ‘1850-1950’. Since we have the time period communicated via the inclusive dates, I will likely just drop these portions of the subjects rather than create a tag for each unique year span.
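Dropping year spans from the decomposed subjects could be as small as the Python sketch below. The pattern and function name are assumptions for illustration, and the sample tag list is invented; only spans of two four-digit years are removed, so a tag such as "20th century" is kept.

```python
import re

# Two four-digit years joined by a hyphen or an en dash, with nothing else in the tag.
YEAR_SPAN = re.compile("^\\d{4}\\s*[-\u2013]\\s*\\d{4}$")


def drop_year_spans(tags):
    """Remove tags that consist only of a year span such as 1850-1950."""
    return [tag for tag in tags if not YEAR_SPAN.match(tag.strip())]


print(drop_year_spans(["Women", "Argentina", "1850-1950", "20th century"]))
```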

General Code Fixes

It is reassuring at this point to spot the same issues with data from multiple repositories. Here are data and code logic issues that I have seen elsewhere that are revalidated by Princeton’s finding aids:

Need to strip \n and \t characters
Need to break subjects up based on commas
Need to drop final periods from repository names, subjects and titles
The designation of size in volumes, as in “793 volumes”. I need to pick an approach for translating from volumes to linear feet

The script to-do list is still getting longer, but I am not done cycling through new institutions’ XML files to find new issues. Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.

Image Credit: Top image from the Seeley G. Mudd Manuscript Library homepage.

ArchivesZ Data Challenges: Syracuse University Special Collections Research Center

March 7, 2009 6 Comments

The Syracuse University Special Collections Research Center has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.

My scripts extracted 6632 unique ‘tags’ from the subjects assigned to the finding aids. As part of the data parsing and loading of data for use in the visualizations, the script divides up compound subjects into tags. For example, in the subjects we find assigned to Syracuse University finding aids we find these values (number shown is number of finding aids to which that subject is assigned):

Art — American — 20th century (1)
Art — Cartoonists (68)
Art — Cartoonists. (3)
Art — Exhibitions. (1)
Art — Illustrators (36)
Art — Illustrators. (1)
Art — Painters (77)
Art — Philosophy. (1)
Art — Sculpture (33)

As well as subjects, where the components are separated by commas such as these (number listed indicates total finding aids assigned that subject):

Art, American (33)
Art, American. (46)
Art, American, 20th century (28)
Art, American, 20th century. (31)
Art, Cuban, 20th century (1)
Art, Modern (1)
Art, French, 20th century. (1)

The goal is to capture the core ideas – to capture the overlap in subject matter among diverse collections. All of the collections with any of these subjects are about Art. With the current script, the tag Art is associated with 179 collections from Syracuse University. You can see from this tiny subset of subjects that other themes would be revealed when these subjects were decomposed more completely – and this just scratches the surface.

Out of the 6676 subjects, 5658 subjects are assigned to single collections. Out of the 6632 tags the current script extracted from those subjects, 5594 tags are assigned to single collections. Not much improvement with the current state of the script.

While currently the script does a good job with the Library of Congress double dash separation pattern, the Syracuse University data has shown me a number of other standard patterns that need to be handled which can be seen in the small sampling of art related subjects shown above. The easy one is removing periods and stripping spaces from the end of subject values. The harder change will be to implement smart separation of subjects into tags based on commas. This would need the code to only break up <subject> values while leaving <persname> and <corpname> alone. I will also need to examine <geogname> values from across various institutions to decide if it is better to break them up or leave them be.
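The planned subject handling can be sketched in Python as follows. This is an illustration of the rules described above, not the Ruby script: the function name is hypothetical, an em dash is treated the same as a double dash, and both trailing-period removal and comma splitting are applied to <subject> values only, so names such as "Adolphson, L. H." are left untouched. What to do with <geogname> is left undecided here, so those values are only split on the double dash.

```python
def tags_for_heading(element_name, value):
    """Break one controlled-access heading into tags.

    element_name is the EAD element the value came from, e.g. "subject"
    or "persname".
    """
    cleaned = value.replace("\u2014", "--").strip()
    if element_name == "subject":
        cleaned = cleaned.rstrip(".").strip()
    parts = [part.strip() for part in cleaned.split("--")]
    if element_name == "subject":
        # Only plain subjects are divided on commas.
        parts = [piece.strip() for part in parts for piece in part.split(",")]
    return [part for part in parts if part]


print(tags_for_heading("subject", "Art, American, 20th century."))
print(tags_for_heading("subject", "Art -- Cartoonists."))
print(tags_for_heading("persname", "Bradford, Leland Powers, 1905-"))
```

With these rules, "Art, American", "Art, American." and "Art -- Cartoonists." all contribute to the same Art tag.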

Other than these subject issues, there are a few other script modifications that I will need to make based on scenarios the data in the Syracuse finding aids have shown me:

Syracuse University uses an entity to populate the repository values – the current script does not handle this at all.
Ensure that single item collections are assigned a size of .25 linear feet
Linear ft must be added as another recognized abbreviation for linear feet

All these issues are being added to my master ‘to do’ list for updating the EAD parsing script. Onward to the next data set.

Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.

Image Credit: Syracuse University image above from Syracuse University Special Collections Research Center home page.

ArchivesZ Data Challenges: Oregon State University Archives

February 22, 2009 1 Comment

The Oregon State University Archives has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of ArchivesZ. This is my first post in what will likely be a series of looks behind the scenes at the challenges facing a project like ArchivesZ on the data level.

Version one of ArchivesZ only used finding aids from the University of Maryland and the Library of Congress. This was definitely a case of the path of least resistance. I attend the University of Maryland and the Library of Congress has a very convenient page providing links to all their Finding Aids source XML files. A very key aspect of creating version 2 of ArchivesZ is making sure that the scripts that pull data from EAD XML files are robust enough to handle the encoding practices of a very diverse range of institutions.

Please keep in mind that OSU is likely to bear the brunt of many basic data issues that I would have unearthed with whatever data sets I tried first!

There are 3 crucial data elements on which the visualizations of ArchivesZ depend: subject, inclusive dates, and collection size. Each element presents unique challenges. The script parsing issues I am uncovering with the OSU finding aids are currently worst for collection size. In order to make pretty charts which let people compare the quantity of materials in each collection (or record group – please forgive that I use the term ‘collection’ to mean any set of records for which a finding aid has been created), we need to be able to assign a single number to represent the size of each collection. Based on the values used in the LOC and UMD finding aids, we chose to go with linear feet as our standard unit of measurement. So the trick is to translate whatever archivists choose to put into the <physdesc> element of their finding aid into some number of linear feet.

These are the size conversion rules we implemented for version 1 of ArchivesZ:

1 microfilm reel = 1 linear foot
Collections represented only by a number of items will be represented as .25 linear feet
If size only specified in number of boxes, then 1 box = .5 linear feet
When the size is given in several different types of units, they are prioritized in the following order: linear feet > boxes > microfilm reels > items
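
As a rough illustration, the four rules above could be sketched in Python as follows. This is not the version 1 Ruby script, whose parsing details are not shown here, so its results may differ from what that script produces. The function name, the regular expressions, and the decision to return None for unrecognized units are all assumptions made for this sketch.

```python
import re

# Each entry: (regex that captures the number, conversion to linear feet).
# Entries are in priority order: linear feet > boxes > microfilm reels > items.
# The patterns are illustrative assumptions, not the original script's logic.
UNIT_RULES = [
    (re.compile(r"(\d*\.?\d+)\s*linear\s+(?:feet|foot|ft)", re.I), lambda n: n),
    (re.compile(r"(\d*\.?\d+)\s*box(?:es)?", re.I), lambda n: n * 0.5),
    (re.compile(r"(\d*\.?\d+)\s*microfilm\s+reels?", re.I), lambda n: n * 1.0),
    # A collection described only by item count is always .25 linear feet.
    (re.compile(r"(\d*\.?\d+)\s*items?", re.I), lambda n: 0.25),
]


def size_in_linear_feet(description):
    """Return a size in linear feet, or None if no known unit is found."""
    for pattern, convert in UNIT_RULES:
        match = pattern.search(description)
        if match:
            return convert(float(match.group(1)))
    return None


print(size_in_linear_feet("10 linear feet (20 boxes)"))  # linear feet wins
print(size_in_linear_feet("12 boxes"))
print(size_in_linear_feet("5.5 cubic feet"))  # unit not covered by the rules
```

Returning None for an unrecognized unit, rather than guessing, makes it easy to list the descriptions that still need a rule.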

This works reasonably well when the physical description values are simple – it starts to fall apart when what is entered is more complicated. Here are some examples of the physical descriptions in the OSU finding aids:

Guide to the Phi Kappa Phi-OSU Chapter Records: The display in the ‘pretty’ version of the finding aid online shows this: 5.5 cubic feet (9 boxes, including 2 oversize boxes) (3 microfilm reels)

The version in the XML file is this:

<physdesc>
  <extent>5.5 cubic feet</extent>
  <extent>9 boxes, including 2 oversize boxes</extent>
  <extent>3 microfilm reels</extent>
</physdesc>

With the current algorithm, this finding aid would be marked as being 3 linear feet in size. At a bare minimum, I must add ‘cubic feet’ as another unit to be converted. More difficult to discern is if I should have a value of 5.5 linear feet (assuming 1 cubic foot = 1 linear foot for the purposes of these comparisons) or a value of 8.5 linear feet (5.5 + 3 linear feet for the 3 microfilm reels). There is never going to be a perfect answer here, but clearly my logic needs to be more sophisticated than it is now.

Harvey L. McAlister Collection: The display in the pretty version of this finding aid online is this: 1 cubic foot, including 26 photographs (4 boxes, including 2 oversize boxes, and 1 map folder)

The version in the XML file is this:

<physdesc>
  <extent encodinganalog="300$a">1 cubic foot, including 26 photographs</extent>
  <extent encodinganalog="300$a">4 boxes, including 2 oversize boxes, and 1 map folder</extent>
</physdesc>

With the current algorithm, this finding aid would be marked as being 1 linear foot in size. From looking at these two examples, it would seem that this would be fine and in fact – for the purposes of calculating a comparable size – only looking at the first <extent> value might be the way to go – at least for OSU finding aids.
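
The "first <extent> only" idea can be sketched in Python like this. It is a minimal sketch under several assumptions: the XML has no namespace, 1 cubic foot is treated as 1 linear foot purely for comparison, and the function name and regular expression are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption for comparison purposes only: 1 cubic foot = 1 linear foot.
FEET_PATTERN = re.compile(r"(\d*\.?\d+)\s*(?:cubic|linear)\s+(?:feet|foot)", re.I)


def first_extent_feet(physdesc_xml):
    """Read only the first <extent> child and return its size in feet, or None."""
    physdesc = ET.fromstring(physdesc_xml)
    first = physdesc.find("extent")  # first direct <extent> child, if any
    if first is None:
        return None
    text = "".join(first.itertext())  # include text inside any nested tags
    match = FEET_PATTERN.search(text)
    return float(match.group(1)) if match else None


PHI_KAPPA_PHI = """<physdesc>
  <extent>5.5 cubic feet</extent>
  <extent>9 boxes, including 2 oversize boxes</extent>
  <extent>3 microfilm reels</extent>
</physdesc>"""

print(first_extent_feet(PHI_KAPPA_PHI))  # 5.5
```

Finding aids from other institutions may not put the most useful measurement first, so this approach would need checking against each new data set.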

There are some other simpler issues relating to standardization in the way that certain values are entered. For example, after ingesting 173 finding aids from OSU (the number I got through before my script flat out choked on a size designation), I ended up with five different repositories added to my REPOSITORIES table. I had expected only one. Each of these was entered as a repository name, and I have included the length of each value to show how extra spaces are causing part of the problem:

Oregon State University Libraries – length 36
Oregon State University – length 23
Oregon State UniversityLibraries – length 32
Oregon State University Libraries – length 36
Oregon State University Libraries – length 33

Some of these I can handle by adding smarter trimming of trailing spaces – but in this case it is clear that typos and inconsistency are also a challenge. I checked, and each of these different <corpname> values within the <repository> element is used by at least 10 finding aids. Perhaps they have been inherited over time from a template?

I have considered creating a repository definition file that could be used when loading finding aids from one repository at a time. This would remove dependence on perfect replication of these sorts of values while still supplying the data needed to let people limit their searches by a named repository.
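
One possible shape for this, sketched in Python: collapse whitespace first, then look the result up in a small alias table. The table below stands in for a repository definition file. The canonical name, the alias entries, and the function name are all assumptions for illustration.

```python
# Hypothetical canonical name; the right choice is up to the repository.
CANONICAL = "Oregon State University Libraries"

# Hypothetical alias table, keyed by whitespace-normalized variants.
ALIASES = {
    "Oregon State University": CANONICAL,
    "Oregon State UniversityLibraries": CANONICAL,
}


def normalize_repository(raw):
    """Collapse runs of whitespace, trim the ends, then apply known aliases."""
    cleaned = " ".join(raw.split())  # split() also drops newlines and tabs
    return ALIASES.get(cleaned, cleaned)


# Illustrative variants; the exact position of the extra spaces is assumed.
variants = [
    "Oregon State University Libraries   ",
    "Oregon State University",
    "Oregon State UniversityLibraries",
    "Oregon  State University  Libraries ",
    "Oregon State University Libraries",
]
print({normalize_repository(v) for v in variants})
```

Whitespace cleanup alone merges the variants that differ only in spacing. The alias table covers the typo and the shortened name, which no amount of trimming would catch.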

The last issue is the most minor. There are many newline (\n) and tab (\t) characters throughout the XML documents. These I plan to simply strip out as the script parses the XML file.

A big thank you to Elizabeth Nielsen, Senior Staff Archivist at OSU Archives. Her response to my query about OSU’s comfort with my taking apart their finding aids in public on my blog was “Bring it on – we’re tough!”.

It is fascinating to dig into new finding aids and see how the parsing script handles what it finds. I plan to test the existing script on XML from more sources to see all the things that must be fixed. Then I get to wrap my head around code that someone else wrote (another member of the original ArchivesZ team wrote the version 1 Ruby script). For those of you who are not programmers, you can skim through my Book Review of Dreaming in Code to get a handle on why this can be harder than it sounds.

Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via my contact form.

Image Credit: OSU Archives image above from the OSU Archives Home Page.

Category: ArchivesZ