<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Spellbound Blog &#187; ArchivesZ</title>
	<atom:link href="http://www.spellboundblog.com/category/archivesz/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spellboundblog.com</link>
	<description>Archives, Digital Humanities, Cultural Heritage, Technology</description>
	<lastBuildDate>Sat, 14 Aug 2010 03:54:17 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>ArchivesZ Needs You!</title>
		<link>http://www.spellboundblog.com/2010/07/07/archivesz-needs-you/</link>
		<comments>http://www.spellboundblog.com/2010/07/07/archivesz-needs-you/#comments</comments>
		<pubDate>Wed, 07 Jul 2010 04:48:24 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[archival community]]></category>
		<category><![CDATA[learning technology]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[virtual collaboration]]></category>
		<category><![CDATA[what if]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=996</guid>
		<description><![CDATA[I got a kind email today asking &#8220;Whither ArchivesZ?&#8221;. My reply was: &#8220;it is sleeping&#8221; (projects do need their rest) and &#8220;I just started a new job&#8221; (I am now a Metadata and Taxonomy Consultant at The World Bank) and &#8220;I need to find enthusiastic people to help me&#8221;. That final point brings me to [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/07/07/archivesz-needs-you/">ArchivesZ Needs You!</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.spellboundblog.com/wp-content/uploads/2010/07/Unclesamwantyou2.jpg"><img class="alignright size-full wp-image-997" title="I Want You!" src="http://www.spellboundblog.com/wp-content/uploads/2010/07/Unclesamwantyou2.jpg" alt="" width="288" height="320" /></a>I got a kind email today asking &#8220;Whither ArchivesZ?&#8221;. My reply was: &#8220;it is sleeping&#8221; (projects do need their rest) and &#8220;I just started a new job&#8221; (I am now a Metadata and Taxonomy Consultant at The World Bank) and &#8220;I need to find enthusiastic people to help me&#8221;. That final point brings me to this post.</p>
<p>I find myself in the odd position of having finished my Master&#8217;s Degree and not wanting to sign on for the long haul of a PhD. So I have a big project that was born in academia, initially as a joint class project and more recently as independent research with a grant-funded programmer, but I am no longer in academia.</p>
<p>What happens to projects like ArchivesZ? Is there an evolutionary path towards it being a collaborative project among dispersed enthusiastic individuals? Or am I more likely to succeed by recruiting current graduate students at my former (and still nearby) institution? I have discussed this one-on-one with a number of individuals, but I haven&#8217;t thrown open the gates for those who follow me here online.</p>
<p>For those of you who have been waiting patiently, the <a title="ArchivesZ" href="http://zaphod.mindlab.umd.edu/ArchivesZ/Main.html">ArchivesZ version 2 prototype</a> is available online. I can&#8217;t promise it will stay online for long &#8211; it is definitely brittle for reasons I haven&#8217;t totally identified. A few things to be aware of:</p>
<ul>
<li>When you load the main page, you should see tags listed at the bottom &#8211; if you don&#8217;t see any, drop me an email via my contact form and I will try to get Tomcat and Solr back up. If you have a small screen, you may need to view your browser full screen to get to all the parts of the UI.</li>
<li>I know there are lots of bugs of various sizes. Some paths through the app work &#8211; some don&#8217;t. Some screens are just placeholders. Feel free to poke around and try things &#8211; you can&#8217;t break it for anyone else!</li>
</ul>
<p>I think there are a few key challenges to building what I would think of as the first &#8216;full&#8217; version of ArchivesZ &#8211; listed here in no particular order:</p>
<ul>
<li>In the process of creating version 2, I was too ambitious. The current version of ArchivesZ has lots of issues: some are usability problems, some are bugs (see prototype above!).</li>
<li>Wherever a collaborative workspace of ArchivesZ were going to live, it would need large data sets. I did a lot of work on data from eleven institutions in the spring of 2009, so there is a lot of data available &#8211; but it is still a challenge.</li>
<li>A lot of my future ideas for ArchivesZ are trapped in my head. The good news is that I am honestly open to others&#8217; ideas for where to take it in the future.</li>
<li>How do we build a community around the creation of ArchivesZ?</li>
</ul>
<p>I still feel that there is a lot to be gained by building a centralized visualization tool/service through which researchers and archivists could explore and discover archival materials. I even think there is promise to a freestanding tool that supports exploration of materials within a single institution. I can&#8217;t build it alone. This is a good thing &#8211; it will be much better in the end with the input, energy and knowledge of others. I am good at ideas and good at playing the devil&#8217;s advocate. I have lots of strength on the data side of things and visualization has been a passion of mine for years. I need smart people with new ideas, strong tech skills (or a desire to learn) and people who can figure out how to organize the herd of cats I hope to recruit.</p>
<p>So &#8211; what can you do to help ArchivesZ? Do you have mad ActionScript 3 skills? Do you want to dig into the scary little Ruby script that populates the database? Maybe you prefer to organize and coordinate? Have you always wanted to figure out how a project like this could grow from a happy (or awkward?) prototype into a real service that people depend on?</p>
<p>Do you have a vision for how to tackle this as a project? Open source? Grant funded? Something else clever?</p>
<p>Know any graduate students looking for good research topics? There are juicy bits here for those interested in data, classification, visualization and cross-repository search.</p>
<p>I will be at SAA in DC in August chairing a panel on search engine optimization of archival websites. If there is even just one of you out there who is interested, I would cheerfully organize an ArchivesZ summit of some sort in which I could show folks the good, bad and ugly of the prototype as it stands. Let me know in the comments below.</p>
<p>Won&#8217;t be at SAA but want to help? Chime in here too. I am happy to set up some shared desktop tours of whatever you would like to see.</p>
<p>PS: Yes, I do have all the version 2 code &#8211; and what is online at the <a title="Google Code: ArchivesZ" href="http://code.google.com/p/archivesz/">Google Code ArchivesZ page</a> is not up to date. Updating the <a title="ArchivesZ" href="http://www.archivesz.org">ArchivesZ website</a> and uploading the current code is on my to do list!</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/07/07/archivesz-needs-you/">ArchivesZ Needs You!</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2010/07/07/archivesz-needs-you/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: University of Texas at San Antonio</title>
		<link>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/</link>
		<comments>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/#comments</comments>
		<pubDate>Wed, 13 May 2009 06:28:53 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=534</guid>
		<description><![CDATA[Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA&#8217;s Archives and Special Collections finding aids in XML format. With the current scripts, these are the fun tag stats: 1,684 total tags extracted 75% (1,266 tags) [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/">ArchivesZ Data Challenges: University of Texas at San Antonio</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="UTSA Archives and Special Collections" href="http://www.lib.utexas.edu/taro/browse/browse_utsa1.html"><img class="alignright size-full wp-image-535" title="University of Texas San Antonio Archives and Special Collections" src="http://www.spellboundblog.com/wp-content/uploads/2009/05/logo-utsa.gif" alt="University of Texas San Antonio Archives and Special Collections" width="205" height="101" /></a></p>
<p><a title="Mark Shelstad" href="http://www.linkedin.com/pub/dir/mark/shelstad">Mark Shelstad</a>, head of <a title="Archives and Special Collections at University of Texas at San Antonio" href="http://www.lib.utsa.edu/archives/">Archives and Special Collections at University of Texas at San Antonio</a>, sent me a link to the <a title="TARO: UTSA" href="http://www.lib.utexas.edu/taro/utsa/utsa_xml.html">TARO</a> (Texas Archival Resources Online) page for <a title="UTSA Archives and Special Collections" href="http://www.lib.utexas.edu/taro/browse/browse_utsa1.html">UTSA&#8217;s Archives and Special Collections finding aids</a> in XML format.</p>
<p>With the current scripts, these are the fun tag stats:</p>
<ul>
<li>1,684 total tags extracted</li>
<li>75% (1,266 tags) are associated with only one finding aid</li>
<li>3% (51 tags) are associated with 10 or more finding aids</li>
</ul>
<p><strong>Collection Size</strong></p>
<p>235 out of the 253 collections ended up with a collection size of 0.</p>
<p>Consider the encoding of the collection size in the <a title="A Guide to the Women's Overseas Service League Records, 1910-2007" href="http://www.lib.utexas.edu/taro/utsa/00008/utsa-00008.html">Guide to the Women&#8217;s Overseas Service League Records, 1910-2007</a>:</p>
<pre>&lt;physdesc label="Extent:" encodinganalog="300$a"&gt;
    77 linear feet (approximately 44,000 items)
&lt;/physdesc&gt;</pre>
<p>Contrast this with one of the examples where the size of the collection was extracted properly by the current script:</p>
<pre>&lt;physdesc label="Extent:" encodinganalog="300$a"&gt;
    &lt;extent&gt;8.4 linear feet&lt;/extent&gt;
    (14 boxes)
&lt;/physdesc&gt;</pre>
<p>Sometimes it feels like a game of Where&#8217;s Waldo. In this case we are simply missing the set of &lt;extent&gt; tags  from the first example. Off I went to the EAD tag descriptions to find the <a title="LOC: physdesc tag library description" href="http://www.loc.gov/ead/tglib/elements/physdesc.html">guidelines for use of the &lt;physdesc&gt; tag</a>, where I found this overview of the tag:</p>
<p style="padding-left: 30px;">A wrapper element for bundling information about the appearance or construction   of the described materials, such as their dimensions, a count of their quantity   or statement about the space they occupy, and terms describing their genre,   form, or function, as well as any other aspects of their appearance, such as   color, substance, style, and technique or method of creation. The information   may be presented as plain text, or it may be divided into the &lt;dimension&gt;, &lt;extent&gt;, &lt;genreform&gt;,   and &lt;physfacet&gt; subelements.</p>
<p>Bad news for my script logic &#8211; both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the &lt;extent&gt; tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the &lt;physdesc&gt; tags without enclosing &lt;extent&gt; or &lt;dimension&gt; tags.</p>
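<p>A minimal sketch of one way to accept both encodings, written in Python rather than the Ruby of the real extraction script (which is not shown in this post). It assumes the finding aid has no XML namespace, and the function name <code>extent_text</code> is made up for illustration: prefer an &lt;extent&gt; child when one exists, otherwise fall back to the plain text of &lt;physdesc&gt;.</p>

```python
import re
import xml.etree.ElementTree as ET

def extent_text(physdesc_xml):
    """Return the size statement from a <physdesc> fragment."""
    physdesc = ET.fromstring(physdesc_xml)
    extent = physdesc.find("extent")
    # Prefer an explicit <extent>; otherwise fall back to the plain text
    # of <physdesc>, which the EAD tag library also allows.
    source = extent if extent is not None else physdesc
    raw = "".join(source.itertext())
    # Collapse line breaks, tabs and repeated spaces.
    return re.sub(r"\s+", " ", raw).strip()

tagged = """<physdesc label="Extent:" encodinganalog="300$a">
    <extent>8.4 linear feet</extent>
    (14 boxes)
</physdesc>"""

untagged = """<physdesc label="Extent:" encodinganalog="300$a">
    77 linear feet (approximately 44,000 items)
</physdesc>"""

print(extent_text(tagged))    # 8.4 linear feet
print(extent_text(untagged))  # 77 linear feet (approximately 44,000 items)
```

<p>The fallback still returns free text, so the parenthetical item count in the second example would need its own handling before the value could be converted to a number.</p>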
<p><strong>Inclusive Dates</strong></p>
<p>Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of &lt;unitdate&gt; formats that my current script could not parse properly, including the examples below:</p>
<ul>
<li>1917-1980 (bulk 1920-1945)</li>
<li>1876-1903, 1914-1919, 1940-2002</li>
<li>1940s, 1970s-1990s</li>
</ul>
<p>Another encoding approach that could not be parsed was the one used for the finding aid of the <a title="Church Women United of San Antonio Records" href="http://www.lib.utexas.edu/taro/utsa/00046/utsa-00046.html">Church Women United of San Antonio Records</a>. In this case the &lt;unitdate&gt; tag is within the &lt;unittitle&gt; tag as seen here:</p>
<pre style="padding-left: 30px;">&lt;unittitle label="Title:" encodinganalog="245"&gt;
Church Women United of San Antonio Records,
&lt;unitdate label="Dates:" encodinganalog="245$a"&gt;1961-2005&lt;/unitdate&gt;
&lt;/unittitle&gt;</pre>
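<p>As a rough illustration of how a nested date could be pulled apart from its title, here is a Python sketch (not the actual Ruby script; the helper name <code>title_and_date</code> is hypothetical, and the fragment is assumed to carry no XML namespace). It reads the &lt;unitdate&gt; wherever it sits inside &lt;unittitle&gt; and rebuilds the title from the remaining text.</p>

```python
import xml.etree.ElementTree as ET

def title_and_date(unittitle_xml):
    """Split a <unittitle> into (title, date), even when <unitdate> is nested."""
    unittitle = ET.fromstring(unittitle_xml)
    unitdate = unittitle.find(".//unitdate")
    date = None
    if unitdate is not None and unitdate.text:
        date = unitdate.text.strip()
    # Rebuild the title from everything except the text of <unitdate> children.
    parts = [unittitle.text or ""]
    for child in unittitle:
        if child.tag != "unitdate":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    title = " ".join("".join(parts).split()).rstrip(",")
    return title, date

sample = """<unittitle label="Title:" encodinganalog="245">
Church Women United of San Antonio Records,
<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate>
</unittitle>"""

print(title_and_date(sample))
# ('Church Women United of San Antonio Records', '1961-2005')
```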
<p>Among the finding aids for which I did extract a range of inclusive date years, I also found issues with values like 1950s-1990s. The current script interpreted this to represent 1950 through 1990, but I believe it would be more properly translated as representing 1950 through 1999.</p>
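<p>The date formats above could be reduced to an earliest and latest year along these lines. This is a Python sketch under my own assumptions, not the logic of the current script: parenthetical bulk dates are ignored, a decade such as 1990s is taken to end in 1999, and gaps between ranges are collapsed into one span. The name <code>inclusive_years</code> is illustrative.</p>

```python
import re

# A four-digit year, optionally followed by "s" to mark a decade ("1950s").
YEAR = re.compile(r"\b(\d{4})(s?)\b")

def inclusive_years(unitdate):
    """Return (earliest, latest) year for a free-text date, or None."""
    # Drop parenthetical notes such as "(bulk 1920-1945)".
    text = re.sub(r"\([^)]*\)", "", unitdate)
    starts, ends = [], []
    for year, decade in YEAR.findall(text):
        first = int(year)
        starts.append(first)
        # A decade such as "1990s" runs through 1999.
        ends.append(first + 9 if decade else first)
    if not starts:
        return None
    return min(starts), max(ends)

for value in [
    "1917-1980 (bulk 1920-1945)",
    "1876-1903, 1914-1919, 1940-2002",
    "1940s, 1970s-1990s",
    "1950s-1990s",
]:
    print(value, "->", inclusive_years(value))
```

<p>Collapsing to a single span loses the gaps in a value like 1876-1903, 1914-1919, 1940-2002, so a fuller version would probably return a list of ranges instead.</p>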
<p><strong>General Code Fixes</strong></p>
<p>The University of Texas at San Antonio’s finding aids have provided additional examples of the following data and encoding issues already identified in earlier data sets:</p>
<ul>
<li>Inconsistent repository titles (26 different variations of &#8220;The University of Texas at San Antonio Library&#8221;)</li>
<li>Titles with embedded and tagged dates</li>
<li>Carriage return and tab characters that need to be removed</li>
<li>Emphasis within a title or abstract added via a tag (such as &lt;emph render="italic"&gt;Storyletters&lt;/emph&gt; seen in <a title="A Guide to the Storyletters Records, 1991-2000" href="http://www.lib.utexas.edu/taro/utsa/00021/utsa-00021.html">A Guide to the Storyletters Records, 1991-2000</a>) which interrupts extraction of text at that point</li>
</ul>
<p><strong>Next Steps</strong></p>
<p>This is the last data set I am analyzing before tackling actual updates to the ArchivesZ data extraction script. My next step is to review and prioritize my long to do list for updates to this script. Most of what I have found in my examination of the data sets are ways in which my script was not smart enough to handle valid variations in encoding and the tabs, carriage returns, formatting tags and special characters found throughout everyone&#8217;s XML. Yes, there are some cases in which the data itself is less than optimal (such as non-standardized repository titles) or the values challenging (so many ways to describe the size of a collection!), but overall I am optimistic about how much more I can improve the extraction script before I have to resort to hand correcting records in the database.</p>
<p>Thanks to everyone for your patience with these data analysis posts. Onward to programming!</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/">ArchivesZ Data Challenges: University of Texas at San Antonio</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Forest History Society</title>
		<link>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/</link>
		<comments>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/#comments</comments>
		<pubDate>Wed, 06 May 2009 21:30:48 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=470</guid>
		<description><![CDATA[Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address: Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/">ArchivesZ Data Challenges: Forest History Society</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Forest History Society" href="http://www.foresthistory.org"><img class="alignright size-full wp-image-471" title="The Forest History Society" src="http://www.spellboundblog.com/wp-content/uploads/2009/04/fhs_logo_small.jpg" alt="The Forest History Society" width="82" height="130" /></a><a title="Amanda Ross" href="http://fhsarchives.wordpress.com/author/amandatross/">Amanda Ross</a>, project archivist for the <a title="Forest History Society" href="http://www.foresthistory.org/">Forest History Society</a>, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:</p>
<ul>
<li>Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like <a title="William E. Towell Papers, 1941 - 1988" href="http://foresthistory.org/ead/Towell_William_E.html">William E. Towell Papers, 1941 &#8211; 1988</a>, my database ended up only with &#8220;William E Towell Papers,&#8221; based on this encoding: &lt;titleproper&gt;Inventory of the William E. Towell Papers, &lt;date normal="1941/1988"&gt;1941 &#8211; 1988&lt;/date&gt;&lt;/titleproper&gt;</li>
<li>Need to handle a conversion factor for a size of &#8220;1 folder&#8221; (as found in the <a title="Inventory of the Biltmore Forest School Images, 1890 - 1988" href="http://foresthistory.org/ead/Biltmore_Forest_School_Images.html">Inventory of the Biltmore Forest School Images, 1890 &#8211; 1988</a>)</li>
<li>My script chokes on the Inclusive Year format &#8220;1910 and 1931 &#8211; 1937&#8221; (as found in the <a title="Inventory of the Alfred Cunningham Papers, 1910 and 1931 - 1937" href="http://foresthistory.org/ead/Cunningham_Alfred.html">Inventory of the Alfred Cunningham Papers, 1910 and 1931 &#8211; 1937</a>)</li>
<li>The presence of an &lt;lb/&gt; element within the &lt;extent&gt; tag, used to force a line break, is preventing my script from extracting any size information at all (as found in the <a title="Inventory of the DeWitt Nelson Papers, 1940 - 1976" href="http://foresthistory.org/ead/Nelson_DeWitt.html">Inventory of the DeWitt Nelson Papers, 1940 &#8211; 1976</a>)</li>
<li>Within the &lt;abstract&gt; tag, my script drops everything after an &lt;emph render="doublequote"&gt; tag (making for a very short abstract in the case of the <a title="Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 - 1947" href="http://foresthistory.org/ead/Recknagel_Arthur_Bernard.html">Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 &#8211; 1947</a>).</li>
</ul>
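<p>Several of the items above come down to text being cut off at the first child tag. A small Python sketch of one possible approach follows (the real script is Ruby and is not shown; <code>flat_text</code> is a hypothetical helper, and the &lt;extent&gt; and &lt;abstract&gt; samples are invented for illustration). It gathers the text before, inside and after every child element, and treats &lt;lb/&gt; as whitespace.</p>

```python
import xml.etree.ElementTree as ET

def flat_text(fragment):
    """Return all text in an element, including text in and after child tags."""
    element = ET.fromstring(fragment)
    # <lb/> is a forced line break, so treat it as a space.
    for line_break in element.iter("lb"):
        line_break.tail = " " + (line_break.tail or "")
    # itertext() yields text and tails of every descendant in document order.
    return " ".join("".join(element.itertext()).split())

title = ('<titleproper>Inventory of the William E. Towell Papers, '
         '<date normal="1941/1988">1941 - 1988</date></titleproper>')
extent = "<extent>3 linear feet<lb/>(6 boxes)</extent>"  # invented sample
abstract = ('<abstract>Photographs from <emph render="doublequote">Forestry'
            '</emph> field trips.</abstract>')  # invented sample

print(flat_text(title))     # Inventory of the William E. Towell Papers, 1941 - 1988
print(flat_text(extent))    # 3 linear feet (6 boxes)
print(flat_text(abstract))  # Photographs from Forestry field trips.
```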
<p>The most dramatic issue, seen across all the finding aids in this set, is that <strong>no</strong> subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of &lt;list&gt; and &lt;item&gt; tags as shown here:</p>
<pre>&lt;controlaccess&gt;
&lt;head&gt;Subject Headings&lt;/head&gt;
&lt;list type="simple"&gt;
&lt;item&gt;&lt;genreform source="lcnaf" encodinganalog="655"&gt;Audiotapes&lt;/genreform&gt;&lt;/item&gt;
&lt;item&gt;&lt;persname source="lcnaf" encodinganalog="600"&gt;Ainsworth, John H., 1909-&lt;/persname&gt;&lt;/item&gt;
&lt;item&gt;&lt;subject source="lcnaf" encodinganalog="650"&gt;Businessmen -- United States&lt;/subject&gt;&lt;/item&gt;</pre>
<p>This is in contrast with this example of encoding from <a title="ArchivesZ Data Challenges: Syracuse University Special Collections Research Center" href="http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/">Syracuse University</a>:</p>
<pre>&lt;controlaccess&gt;
&lt;head&gt;Subject and Genre Headings&lt;/head&gt;
&lt;subject encodinganalog="650" source="local"&gt;Adult education&lt;/subject&gt;
&lt;persname encodinganalog="600" source="lcnaf"&gt;Adolphson, L. H.&lt;/persname&gt;
&lt;persname encodinganalog="600" source="lcnaf"&gt;Bradford, Leland Powers, 1905-&lt;/persname&gt;</pre>
<p>Or this sample from <a title="ArchivesZ Data Challenges: Oregon State University Archives" href="http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/">Oregon State University</a>:</p>
<pre>&lt;controlaccess id="a12"&gt;
	 &lt;controlaccess&gt;
		  &lt;persname encodinganalog="600" source="local" rules="aacr2"
		  role="subject"&gt;Aitken, Frances Alva, 1889-1970.&lt;/persname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;corpname encodinganalog="610" source="local" role="subject"
		  rules="aacr2"&gt;Oregon Agricultural College. Class of 1910.&lt;/corpname&gt;
		  &lt;corpname source="lcnaf" encodinganalog="610" role="subject"&gt;Oregon
				Agricultural College--Students.&lt;/corpname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;geogname source="lcsh" role="subject" encodinganalog="651"&gt;Corvallis
				(Or.)&lt;/geogname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;subject encodinganalog="650" source="lcsh"&gt;Student
				activities--Oregon--Corvallis.&lt;/subject&gt;
	 &lt;/controlaccess&gt;</pre>
<p>Both the Syracuse and OSU examples are handled by the current state of the data extraction script.</p>
<p>Amanda pointed me to the <a title="NCEAD Best Practice Guidelines for EAD 2002" href="http://www.ncecho.org/dig/ead2002.shtml">NCEAD Best Practice Guidelines for EAD 2002</a>. Down in <a title="APPENDIX G: HOW DO I ENCODE...?" href="http://www.ncecho.org/dig/ead2002.shtml#appendixG">Appendix G: How Do I Encode&#8230;</a>, the second question down is &#8220;What if I have multi-part scope notes, biographical notes or subject headings?&#8221; followed by exactly the &lt;list&gt; and &lt;item&gt; tag usage found in the Forest History Society finding aids. This format clearly should be handled.</p>
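<p>One way to cover all three &lt;controlaccess&gt; layouts at once is to search every descendant instead of only direct children. The Python sketch below is an illustration, not the existing Ruby script: the set of element names counted as subjects is my assumption, the fragments have closing tags added so they parse, and no XML namespace is expected.</p>

```python
import xml.etree.ElementTree as ET

# Access-point elements treated as subjects here; this set is an assumption.
SUBJECT_TAGS = {"subject", "persname", "corpname", "geogname", "genreform"}

def subject_terms(controlaccess_xml):
    """Collect subject terms at any depth below <controlaccess>."""
    root = ET.fromstring(controlaccess_xml)
    terms = []
    # iter() visits descendants at every level, so terms wrapped in
    # <list>/<item> or in nested <controlaccess> elements are found too.
    for element in root.iter():
        if element.tag in SUBJECT_TAGS:
            terms.append(" ".join("".join(element.itertext()).split()))
    return terms

list_style = """<controlaccess>
<head>Subject Headings</head>
<list type="simple">
<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
</list>
</controlaccess>"""

nested_style = """<controlaccess id="a12">
<controlaccess>
<geogname source="lcsh" role="subject" encodinganalog="651">Corvallis
    (Or.)</geogname>
</controlaccess>
</controlaccess>"""

print(subject_terms(list_style))
print(subject_terms(nested_style))
```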
<p>So, no fun tag stats for this run &#8211; but I hope to fix my ruby script so that the Forest History Society finding aids can be incorporated into the data set I use for testing version 2 of ArchivesZ. My ruby script to do list is getting quite long!</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/">ArchivesZ Data Challenges: Forest History Society</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Utah Government Archives &amp; Records Service</title>
		<link>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/</link>
		<comments>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/#comments</comments>
		<pubDate>Sun, 26 Apr 2009 05:33:17 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[interface design]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=424</guid>
		<description><![CDATA[Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD encoded finding aids. These EAD 2.0 XML files are a product of a grant funded project completed last year to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/">ArchivesZ Data Challenges: Utah Government Archives &#038; Records Service</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Utah State Archives and Records Service" href="http://www.archives.state.ut.us/"><img class="alignright size-full wp-image-425" title="Utah dot Gov Logo" src="http://www.spellboundblog.com/wp-content/uploads/2009/03/utahgovlogoglow.png" alt="Utah dot Gov Logo" width="87" height="66" /></a><a title="Gina Strack" href="http://ginastrack.com/">Gina Strack</a> of the <a title="Utah State Archives and Records Service" href="http://www.archives.state.ut.us/"><span class="il">Utah</span> State Archives and Records Service</a> provided me with access to the XML of 1,196 EAD encoded finding aids. These EAD 2.0 XML files are a product of a grant funded project completed last year to migrate from <a title="EAD version 1 finding aids" href="http://historyresearch.utah.gov/inventories/inventories-ac.htm">EAD 1.0 finding aids</a>. Their website includes a <a title="Utah State Archives EAD Project" href="http://archives.utah.gov/research/inventories/ead.html">detailed account of the EAD Project</a>.</p>
<p>These finding aids have helped me identify three types of ArchivesZ data challenges:</p>
<ul>
<li>strange characters</li>
<li>broad composite subjects</li>
<li>determination of accurate collection size</li>
</ul>
<p><strong>Strange and mysterious characters!</strong></p>
<p>These finding aids use a special character in the place of the standard Library of Congress double dash which normally appears between subsections of the subject heading.</p>
<p>An example subject from the Utah Government XML looks like this:</p>
<p style="padding-left: 30px;">Women—Suffrage—Utah.</p>
<p>Viewing the same subject in a pure text editor (such as <a title="Wikipedia: vi" href="http://en.wikipedia.org/wiki/Vi">vi</a>):</p>
<p style="padding-left: 30px;">Women&amp;#8212;Suffrage&amp;#8212;Utah.</p>
<p>By the time it gets into my database and is pulled out via a query in MySQL Query Browser it looks like this:</p>
<p style="padding-left: 30px;">Womenâ€”Suffrageâ€”Utah.</p>
<p>Rather than just stripping out all instances of &amp;#8212;,  my plan is to replace them with the standard Library of Congress double dash. This will ensure that the existing code that breaks the subjects down to tags will still work.</p>
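<p>The replacement plan could look something like the Python sketch below (again, not the Ruby script itself; the function names are made up). It swaps both the literal em dash and the &amp;#8212; reference for a double dash before splitting. The last helper shows one common way this kind of garbling arises, UTF-8 bytes read as Windows-1252; whether that is what happens on the way into my database is an assumption.</p>

```python
EM_DASH = "\u2014"

def normalize_subject(subject):
    """Replace em dashes (literal or as &#8212;) with the LC double dash."""
    return subject.replace("&#8212;", "--").replace(EM_DASH, "--")

def to_tags(subject):
    """Split a normalized subject heading on the double dash."""
    return [part.strip() for part in normalize_subject(subject).split("--") if part.strip()]

def misread_as_cp1252(text):
    """Show how UTF-8 text looks when its bytes are decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

print(normalize_subject("Women\u2014Suffrage\u2014Utah."))  # Women--Suffrage--Utah.
print(to_tags("Women&#8212;Suffrage&#8212;Utah."))          # ['Women', 'Suffrage', 'Utah.']
print(misread_as_cp1252(EM_DASH))
```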
<p><strong>Composite Subjects</strong></p>
<p>When I say &#8220;composite subject&#8221; what I mean is a subject that includes multiple very disparate terms. Rather than the Library of Congress style subjects, all aspects of which relate to the collection in question, these composite subjects cover multiple subjects which are grouped together for convenience.</p>
<p>This is a list of some of the most popular subjects for the Utah Gov collections:</p>
<ul>
<li>Politics, Government, and Law</li>
<li>Business, Industry, Labor, and Commerce</li>
<li>Science, Technology, and Health</li>
<li>Arts, Humanities, and Social Sciences</li>
</ul>
<p>These subjects throw a monkey wrench into my theories about decomposing subjects based on commas. The collections to which these subjects are assigned likely fit in only one of the component themes. For example, the &#8220;Inventory of Publications from Department of Technology Services, 1993-2008&#8221; is assigned the subject &#8220;Science, Technology, and Health&#8221;. If I divide this subject into 3 separate tags, the Science and Health tags would be quite misleading.</p>
<p>So that leaves me a bit trapped. If I want to divide subjects such as &#8220;Art, Cuban, 20th century&#8221;, as I discuss in <a title="ArchivesZ Data Challenges: Syracuse University Special Collections Research Center" href="http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/">my Syracuse University post</a>, then I end up also dividing these umbrella subjects which separate such very divergent terms with commas.</p>
<p>This issue goes on my list of reasons to add a repository configuration file for use by the data extraction script.</p>
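<p>To make the idea of a repository configuration concrete, here is a hypothetical Python sketch. The dictionary layout, the repository keys and the <code>keep_whole</code> setting are all invented for illustration and do not exist in the current script. Subjects listed for a repository are kept intact; everything else is still split on commas.</p>

```python
# Hypothetical per-repository settings; keys and structure are illustrative only.
REPOSITORY_CONFIG = {
    "utah": {
        "keep_whole": {
            "Politics, Government, and Law",
            "Business, Industry, Labor, and Commerce",
            "Science, Technology, and Health",
            "Arts, Humanities, and Social Sciences",
        }
    },
    "syracuse": {"keep_whole": set()},
}

def subject_to_tags(subject, repository):
    """Split a subject on commas unless the repository marks it as an umbrella term."""
    config = REPOSITORY_CONFIG.get(repository, {})
    if subject in config.get("keep_whole", set()):
        return [subject]
    return [part.strip() for part in subject.split(",") if part.strip()]

print(subject_to_tags("Science, Technology, and Health", "utah"))
# ['Science, Technology, and Health']
print(subject_to_tags("Art, Cuban, 20th century", "syracuse"))
# ['Art', 'Cuban', '20th century']
```

<p>An explicit list is easy to reason about but has to be maintained by hand for each repository, which may or may not scale.</p>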
<p><strong>Accurate Collection Size</strong></p>
<p>In my quest to convert all sizes to linear feet &#8211; sizes such as these are challenging:</p>
<ul>
<li>0.20 cubic foot and 1 microfilm reel</li>
<li>0.35 cubic foot and 2 microfilm reels</li>
</ul>
<p class="label">I also have situations where sizes are specified in multiple sections of the finding aid. The <a title="Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991." href="http://images.archives.utah.gov/cdm4/item_viewer.php?CISOROOT=/ead&amp;CISOPTR=991&amp;CISOBOX=1&amp;REC=1">Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991</a> has a collection level size of &#8220;0.50 cubic foot and 2 microfilm reels&#8221;, but further down in this finding aid I see this:</p>
<p class="label"><em><span class="label">series: </span>ALERT Foundation records </em></p>
<ul>
<li><span class="label">box 1, folder 1: </span>Documentary: &#8220;&#8221;Letters from our Children,&#8221;" Motion picture film reel, 16mm</li>
<li> <span class="label">box 1, folder 2: </span>Documentary: &#8220;&#8221;Letters from our Children,&#8221;" VHS videocassette</li>
<li> <span class="label">box 1, folder 3: </span>Documentary: &#8220;&#8221;Letters from our Children,&#8221;" VHS videocassette</li>
<li> <span class="label">box 1, folder 4: </span>Documentary: &#8220;&#8221;Letters from our Children,&#8221;" VHS videocassette</li>
</ul>
<p>When they said 2 microfilm reels &#8211; do they really mean a 16mm motion picture film reel and a VHS videocassette? Is there 1 VHS videocassette or 3? How sizes are specified in a specific repository&#8217;s finding aids is another possible candidate for a repository level configuration script.</p>
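<p>Before any conversion to linear feet can happen, a compound size has to be broken into its parts. The Python sketch below does only that step and deliberately stops short of conversion, since I have not settled on conversion factors. The function name and the alias table are hypothetical.</p>

```python
import re

# Hypothetical spelling variants mapped to a single unit name.
UNIT_ALIASES = {
    "cubic feet": "cubic foot",
    "microfilm reels": "microfilm reel",
}

QUANTITY = re.compile(r"^(\d+(?:\.\d+)?)\s+(.+)$")

def size_parts(statement):
    """Split a compound size statement into (quantity, unit) pairs."""
    parts = []
    for piece in re.split(r"\s+and\s+", statement.strip()):
        match = QUANTITY.match(piece.strip())
        if match is None:
            continue  # leave unparsed pieces for manual review
        unit = match.group(2).strip().lower()
        parts.append((float(match.group(1)), UNIT_ALIASES.get(unit, unit)))
    return parts

print(size_parts("0.20 cubic foot and 1 microfilm reel"))
# [(0.2, 'cubic foot'), (1.0, 'microfilm reel')]
print(size_parts("0.35 cubic foot and 2 microfilm reels"))
# [(0.35, 'cubic foot'), (2.0, 'microfilm reel')]
```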
<p><strong>Tagging Statistics</strong></p>
<p>Finally, here are a few tag stats:</p>
<ul>
<li>Only 31 tags (1.5% of all Utah Government tags) are associated with 10 or more collections</li>
<li>1404 tags  (71.5%) are assigned to only a single collection</li>
<li>107 collections have been assigned only 1 tag</li>
<li>10 collections have no subjects</li>
</ul>
<p>Of course these statistics are based on the current incarnation of the data extraction script. After I modify the script, there will be a greater number of tags and (hopefully) more overlap of tags across multiple collections. These types of statistics should help me gauge how well my data extraction logic is working.</p>
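<p>Statistics of this kind can be computed from a simple mapping of collection to tags. The Python sketch below assumes such a mapping is available (my numbers actually come from database queries that are not shown here), and the function and key names are made up for illustration.</p>

```python
from collections import Counter

def tag_stats(collection_tags, threshold=10):
    """Summarize tag overlap for a mapping of collection id to list of tags."""
    # Count each tag once per collection, even if it was assigned twice.
    counts = Counter(tag for tags in collection_tags.values() for tag in set(tags))
    return {
        "total_tags": len(counts),
        "single_collection_tags": sum(1 for n in counts.values() if n == 1),
        "frequent_tags": sum(1 for n in counts.values() if n >= threshold),
        "collections_with_one_tag": sum(
            1 for tags in collection_tags.values() if len(set(tags)) == 1
        ),
        "collections_without_tags": sum(
            1 for tags in collection_tags.values() if not tags
        ),
    }

# Tiny invented sample, with the threshold lowered so it has something to report.
sample = {"c1": ["Utah", "Women"], "c2": ["Utah"], "c3": []}
print(tag_stats(sample, threshold=2))
```

<p>Running the same function before and after a change to the extraction logic would give a quick read on whether tag overlap across collections is improving.</p>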
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Poster Wins 2nd Place at GRID 2009</title>
		<link>http://www.spellboundblog.com/2009/04/22/archivesz-poster-wins-2nd-place-at-grid-2009/</link>
		<comments>http://www.spellboundblog.com/2009/04/22/archivesz-poster-wins-2nd-place-at-grid-2009/#comments</comments>
		<pubDate>Wed, 22 Apr 2009 12:09:06 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[learning technology]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=473</guid>
		<description><![CDATA[The title says it all. I won 2nd place in the &#8220;Smart Computers and Computing&#8221; section of the University of Maryland&#8217;s Graduate Research Interaction Day (GRID) for my poster ArchivesZ: Visualizing Archival Collections (what is in all those boxes?). 1st place in &#8220;Smart Computers and Computing&#8221; went to the fabulous Dave Levin for his presentation [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/04/22/archivesz-poster-wins-2nd-place-at-grid-2009/">ArchivesZ Poster Wins 2nd Place at GRID 2009</a></p>
]]></description>
			<content:encoded><![CDATA[<p><img class="alignright size-full wp-image-476" title="2nd Place" src="http://www.spellboundblog.com/wp-content/uploads/2009/04/2nd-place.jpg" alt="2nd Place" width="200" height="200" />The title says it all. <a title="GRID Winners" href="http://www.gsg.umd.edu/go/events/grid/winners">I won 2nd place</a> in the &#8220;Smart Computers and Computing&#8221; section of the University of Maryland&#8217;s Graduate Research Interaction Day (<a title="GRID" href="http://www.gsg.umd.edu/go/events/grid">GRID</a>) for my poster <a title="ArchivesZ Poster" href="http://www.spellboundblog.com/wp-content/uploads/2009/04/archivesz-poster.jpg">ArchivesZ: Visualizing Archival Collections (what is in all those boxes?)</a>.</p>
<p>1st place in &#8220;Smart Computers and Computing&#8221; went to the fabulous <a title="Dave Levin" href="http://www.cs.umd.edu/~dml/">Dave Levin</a> for his presentation on <a title="TrInc: Small Trusted Hardware for Large Distributed Systems" href="http://www.gsg.umd.edu/index.cfm?objectid=FDC1B628-07F9-E799-B11A856B3ACEA60B">TrInc: Small Trusted Hardware for Large Distributed Systems</a>.</p>
<p>Overall, it was a great experience. I wish I could have been in multiple rooms at the same time so I could have seen more posters and presentations. I also wish I had understood that I could have presented with either a poster or a PowerPoint deck. That was not entirely clear ahead of time. The downside of my choice was being tied to my poster, but the upside is that I still have the poster, which can be examined by readers like you. Obviously it all worked out in the end.</p>
<p>A big thanks to everyone in the <a title="University of Maryland Graduate Student Government" href="http://www.gsg.umd.edu">Graduate Student Government</a> who worked so hard to bring this event together.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/04/22/archivesz-poster-wins-2nd-place-at-grid-2009/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Poster at UMD&#8217;s GRID 2009</title>
		<link>http://www.spellboundblog.com/2009/04/12/archivesz-poster-umd-grid-2009/</link>
		<comments>http://www.spellboundblog.com/2009/04/12/archivesz-poster-umd-grid-2009/#comments</comments>
		<pubDate>Sun, 12 Apr 2009 19:02:43 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[learning technology]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=432</guid>
		<description><![CDATA[Come meet me and hear my 8 minute talk in front of a poster about ArchivesZ. When? April 13, 2009, 1:30-3pm What? University of Maryland&#8217;s Graduate Research Interaction Day (GRID) Where? University of Maryland&#8217;s Stamp Student Union My ArchivesZ poster has been assigned to the &#8220;Smart Computers and Computer Science&#8221; theme. I will be with [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/04/12/archivesz-poster-umd-grid-2009/">ArchivesZ Poster at UMD&#8217;s GRID 2009</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Come meet me and hear my 8 minute talk in front of a poster about ArchivesZ.</p>
<ul>
<li>When? April 13, 2009, 1:30-3pm</li>
<li>What? <a title="UMD Graduate Research Interaction Day" href="http://www.gsg.umd.edu/go/events/grid">University of Maryland&#8217;s Graduate Research Interaction Day</a> (GRID)</li>
<li>Where? <a title="UMD Stamp Student Union" href="http://www.union.umd.edu/">University of Maryland&#8217;s Stamp Student Union</a></li>
</ul>
<p>My ArchivesZ poster has been assigned to the &#8220;Smart Computers and Computer Science&#8221; theme. I will be with my poster in the Benjamin Banneker B room at UMD&#8217;s Stamp Student Union from 1:30 to 3pm. If you are attending GRID, please stop by and say hello!</p>
<p>Want a preview or can&#8217;t make it? Here is the poster in question:</p>
<p style="text-align: center;"><a href="http://www.spellboundblog.com/wp-content/uploads/2009/04/archivesz-poster.jpg"><img class="size-medium wp-image-433 aligncenter" title="ArchivesZ Poster" src="http://www.spellboundblog.com/wp-content/uploads/2009/04/archivesz-poster-300x241.jpg" alt="ArchivesZ Poster" width="300" height="241" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/04/12/archivesz-poster-umd-grid-2009/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Princeton University</title>
		<link>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/</link>
		<comments>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 04:13:54 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=395</guid>
		<description><![CDATA[I received a zip file of 1,771 EAD encoded finding aids from the kind EAD enthusiasts at the Seeley G. Mudd Manuscript Library. These finding aids came from five divisions within Princeton&#8217;s Library: University Archives Public Policy Papers Manuscript Division Latin American Ephemera Collection Engineering Library So onward to the data issues and what they [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/">ArchivesZ Data Challenges: Princeton University</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Princeton University Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/mudd/"><img class="aligncenter size-full wp-image-406" title="Princeton University Seeley G. Mudd Manuscript Library" src="http://www.spellboundblog.com/wp-content/uploads/2009/03/princeton-mudd.jpg" alt="Princeton University Seeley G. Mudd Manuscript Library" width="501" height="96" /></a><br />
I received a zip file of 1,771 EAD encoded finding aids from the kind EAD enthusiasts at the <a title="Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/~mudd/">Seeley G. Mudd Manuscript Library</a>. These finding aids came from five divisions within <a title="Princeton University Library" href="http://library.princeton.edu/">Princeton&#8217;s Library</a>:</p>
<ul>
<li><a title="Princeton University Archives" href="http://www.princeton.edu/~mudd/finding_aids/archives.html">University Archives</a></li>
<li><a title="Princeton University: Public Policy Papers" href="http://www.princeton.edu/~mudd/finding_aids/policy.html">Public Policy Papers</a></li>
<li><a title="Princeton Manuscript Division" href="http://www.princeton.edu/~rbsc/department/manuscripts/index.shtml">Manuscript Division</a></li>
<li><a title="Princeton: Latin American Ephemera Collection" href="http://firestone.princeton.edu/latinam/ephemera.php">Latin American Ephemera Collection</a></li>
<li><a title="Princeton Engineering Library" href="http://libblogs.princeton.edu/englib/?s=">Engineering Library</a></li>
</ul>
<p>So onward to the data issues and what they mean for my ever growing &#8216;script fix to-do list&#8217;.</p>
<p><strong>Repository Names</strong></p>
<p>As we saw with the Oregon State University finding aids, the finding aids from Princeton University had a wide range of different values for repository names. In the list below we spot some issues. Some end in periods, some do not. One has extra space (probably a carriage return) in the middle. One does not include Princeton in the repository name. Once we have many repositories&#8217; finding aids in ArchivesZ, a repository name of &#8216;Engineering Library&#8217; does not tell the user enough about where those collections can be found.</p>
<p>Here is the list of repository titles my script extracted:</p>
<ul>
<li>Princeton University Library. Department of Rare Books and Special Collections.</li>
<li>Engineering Library</li>
<li>Princeton University Library</li>
<li>Princeton University Library. Department of Rare                    Books and Special Collections.</li>
<li>Princeton University Library.</li>
</ul>
<p>My script can handle the extra period and the extra spaces, but the non-specific name would need to ultimately be fixed on the source side.</p>
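<p>The whitespace and trailing-period cleanup could look something like the following. This is a minimal Python sketch for illustration, not the ArchivesZ script itself; the function name is made up for the example.</p>

```python
def normalize_repository_name(raw):
    """Collapse runs of whitespace and drop one trailing period."""
    # str.split() with no arguments splits on any run of spaces, tabs, or
    # newlines, so a carriage return inside the name disappears as well.
    name = " ".join(raw.split())
    # Remove a single final period, as in "Princeton University Library."
    if name.endswith("."):
        name = name[:-1]
    return name


raw_names = [
    "Princeton University Library. Department of Rare Books and Special Collections.",
    "Engineering Library",
    "Princeton University Library",
    "Princeton University Library. Department of Rare \n        Books and Special Collections.",
    "Princeton University Library.",
]
# The five raw values collapse to three distinct names.
print(sorted({normalize_repository_name(name) for name in raw_names}))
```

<p>Note that &#8216;Engineering Library&#8217; survives unchanged: no amount of string cleanup can add the missing institution name.</p>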
<p><strong>Collection Size</strong></p>
<p>The current script assumes that there is only one extent value specified to express the size of the collection. Princeton&#8217;s finding aids showed me examples of multiple extent values. For example, the <a title="Christina Georgina Rossetti Collection" href="http://diglib.princeton.edu/ead/getEad?eadid=C0222">Christina Georgina Rossetti Collection</a> has both a collection level size of 0.4 linear feet (1 archival box) as well as a 2nd extent specification corresponding to a specific folder with the value of (1 poem, 3 drawings, 1 photo, 1 incomplete article). The script must be modified to only consider the collection level size.</p>
<p><strong>Complicated Titles</strong></p>
<p>The current script logic apparently does not handle what I would call &#8216;complicated collection titles&#8217;. For example, I ended up with &#8220;Edward Livingston Papers, &#8221; as the title for a collection with a full title of <a title="Edward Livingston Papers, 1683-1877 (bulk 1764-1836)" href="http://diglib.princeton.edu/ead/getEad?eadid=C0280">Edward Livingston Papers, 1683-1877 (bulk 1764-1836)</a>. This is the way that this title is encoded:<code><br />
&lt;unittitle encodinganalog="245$a" label="Title and dates: "&gt;Edward Livingston Papers, &lt;unitdate encodinganalog="245$f" normal="1683/1877" type="inclusive"&gt;1683-1877&lt;/unitdate&gt; (bulk &lt;unitdate encodinganalog="245$g" normal="1764/1836" type="bulk"&gt;1764-1836&lt;/unitdate&gt;)&lt;/unittitle&gt;</code></p>
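<p>One plausible cause (an assumption, since the script itself is not shown here) is that only the text before the first child element is being read. The sketch below, in Python with the standard library&#8217;s ElementTree, rebuilds the structure of the unittitle shown above without its attributes and joins the text of the element and all of its children.</p>

```python
import xml.etree.ElementTree as ET


def full_title(unittitle):
    """Join the text of an element and all of its descendants."""
    # itertext() yields text and tail strings in document order, so the
    # dates nested in unitdate children stay in place.
    return " ".join("".join(unittitle.itertext()).split())


# Rebuild the shape of the unittitle element programmatically.
title = ET.Element("unittitle")
title.text = "Edward Livingston Papers, "
inclusive = ET.SubElement(title, "unitdate", type="inclusive")
inclusive.text = "1683-1877"
inclusive.tail = " (bulk "
bulk = ET.SubElement(title, "unitdate", type="bulk")
bulk.text = "1764-1836"
bulk.tail = ")"

print(repr(title.text))   # 'Edward Livingston Papers, '
print(full_title(title))  # Edward Livingston Papers, 1683-1877 (bulk 1764-1836)
```

<p>Reading only <code>title.text</code> reproduces the truncated title, while walking the whole element recovers the full one.</p>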
<div id=":1aw" class="ii gt">
<p><strong>Too Many Tags</strong></p>
<p>The Engineering Library&#8217;s <a title="Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid" href="http://diglib.princeton.edu/ead/getEad?id=ark:/88435/qf85nb33h">Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid</a> has 522 tags assigned to it! Almost all of these are the names of the authors of the individual reports. This scenario goes on the list of reasons why I might choose to not include (at least for this version) persname subjects. The other option for handling this situation is to only use subjects assigned at the collection level and ignoring subjects assigned at lower unit/container levels. Without the author tags, this single collection ends up with this nice, reasonable list of tags:</p>
<ul>
<li>Fluid mechanics</li>
<li>Mechanical engineering</li>
<li>Combustion</li>
<li>Aerospace engineering</li>
<li>Propulsion systems</li>
</ul>
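<p>The two options described above (skipping persname values, and ignoring subjects below the collection level) can be sketched together. This Python example assumes an EAD 2002 style layout with no XML namespace, and a controlaccess block sitting directly under archdesc; the sample content is invented for illustration.</p>

```python
import xml.etree.ElementTree as ET


def collection_level_subjects(ead_root, skip=("persname",)):
    """Return heading text found directly under archdesc/controlaccess."""
    subjects = []
    # The path only matches controlaccess blocks that are direct children of
    # archdesc, so headings nested inside dsc components are ignored.
    for heading in ead_root.findall("archdesc/controlaccess/*"):
        if heading.tag in skip:
            continue
        text = " ".join("".join(heading.itertext()).split())
        if text:
            subjects.append(text)
    return subjects


# A tiny stand-in for a finding aid, built by hand.
ead = ET.Element("ead")
archdesc = ET.SubElement(ead, "archdesc")
top = ET.SubElement(archdesc, "controlaccess")
ET.SubElement(top, "subject").text = "Combustion"
ET.SubElement(top, "persname").text = "Doe, Jane"
component = ET.SubElement(ET.SubElement(archdesc, "dsc"), "c01")
nested = ET.SubElement(component, "controlaccess")
ET.SubElement(nested, "persname").text = "Roe, Richard"
ET.SubElement(nested, "subject").text = "Report authors"

print(collection_level_subjects(ead))  # ['Combustion']
```

<p>Finding aids that nest one controlaccess block inside another, or that declare a namespace, would need a different path expression.</p>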
<p><strong>Year Challenges</strong><br />
I found two different issues related to year ranges:</p>
<ul>
<li><a title="Women in Argentina, VI, 1989-2001: Finding Aid" href="http://diglib.princeton.edu/ead/getEad?id=ark:/88435/2z10wq25w">Women in Argentina, VI, 1989-2001: Finding Aid</a>: The current script does not properly extract the inclusive dates, which are encoded within the titleproper tags; it assumes instead that they will be encoded using a unitdate tag.</li>
<li>An assortment of finding aids include subjects which have year spans as part of the subject. When these subjects are decomposed into tags, we end up with tags like &#8216;1850-1950&#8217;. Since we have the time period communicated via the inclusive dates, I will likely just drop these portions of the subjects rather than create a tag for each unique year span.</li>
</ul>
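<p>Dropping year spans from decomposed subjects could be done with a simple pattern check. This is a minimal Python sketch; the pattern only covers four-digit spans joined by a hyphen, which is an assumption about how the spans are written, and the sample subject parts are made up.</p>

```python
import re

# Matches a part that is nothing but a year span, e.g. "1850-1950".
YEAR_SPAN = re.compile(r"\d{4}\s*-\s*\d{4}")


def drop_year_spans(parts):
    """Remove parts of a decomposed subject that are only a year span."""
    return [part for part in parts if not YEAR_SPAN.fullmatch(part.strip())]


print(drop_year_spans(["Art", "American", "1850-1950"]))  # ['Art', 'American']
```

<p>Parts that merely contain a year, or phrases such as &#8216;20th century&#8217;, are left alone because the whole part has to match.</p>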
<p><strong>General Code Fixes</strong></p>
<p>It is reassuring at this point to spot the same issues with data from multiple repositories. Here are data and code logic issues that I have seen elsewhere that are revalidated by Princeton&#8217;s finding aids:</p>
<ul>
<li>Need to strip \n &amp; \t characters</li>
<li>Need to break subjects up based on commas</li>
<li>Need to drop final periods from repository names, subjects and titles</li>
<li>Need to handle sizes designated in volumes, as in &#8220;793 volumes&#8221;. I need to pick an approach for translating from volumes to linear feet</li>
</ul>
<p>The script to-do list is still getting longer, but I am not done cycling through new institutions&#8217; XML files to find new issues. Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via <a title="Contact Jeanne" href="../contact/">my contact form</a>.</p>
<p><em>Image Credit: Top image from the <a title="Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/mudd/">Seeley G. Mudd Manuscript Library homepage</a>.</em></p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Syracuse University Special Collections Research Center</title>
		<link>http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/</link>
		<comments>http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/#comments</comments>
		<pubDate>Sat, 07 Mar 2009 04:48:44 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=353</guid>
		<description><![CDATA[The Syracuse University Special Collections Research Center has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files. My scripts extracted 6632 unique &#8216;tags&#8217; from the subjects [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/">ArchivesZ Data Challenges: Syracuse University Special Collections Research Center</a></p>
]]></description>
			<content:encoded><![CDATA[<p>The <a title="Syracuse University Special Collections Research Center" href="http://library.syr.edu/information/spcollections/"><img class="alignright size-full wp-image-357" title="Syracuse University" src="http://www.spellboundblog.com/wp-content/uploads/2009/03/syracuse-university.jpg" alt="Syracuse University" width="200" height="200" /></a><a title="Syracuse University Special Collections Research Center" href="http://library.syr.edu/information/spcollections/">Syracuse University Special Collections Research Center</a> has also been so kind as to provide the XML source files for their finding aids for use in the ArchivesZ project. I loaded 572 finding aids and no errors were generated during the parsing of the XML files.</p>
<p>My scripts extracted 6632 unique &#8216;tags&#8217; from the subjects assigned to the finding aids. As part of the data parsing and loading of data for use in the visualizations, the script divides up compound subjects into tags. For example, among the subjects assigned to Syracuse University finding aids we find these values (the number shown is the number of finding aids to which that subject is assigned):</p>
<ul>
<li>Art &#8212; American &#8212; 20th century (1)</li>
<li>Art &#8212; Cartoonists (68)</li>
<li>Art &#8212; Cartoonists. (3)</li>
<li>Art &#8212; Exhibitions. (1)</li>
<li>Art &#8212; Illustrators (36)</li>
<li>Art &#8212; Illustrators. (1)</li>
<li>Art &#8212; Painters (77)</li>
<li>Art &#8212; Philosophy. (1)</li>
<li>Art &#8212; Sculpture (33)</li>
</ul>
<p>There are also subjects whose components are separated by commas, such as these (the number listed is the total number of finding aids assigned that subject):</p>
<ul>
<li>Art, American (33)</li>
<li>Art, American. (46)</li>
<li>Art, American, 20th century (28)</li>
<li>Art, American, 20th century. (31)</li>
<li>Art, Cuban, 20th century (1)</li>
<li>Art, Modern (1)</li>
<li>Art, French, 20th century. (1)</li>
</ul>
<p>The goal is to capture the core ideas &#8211; to capture the overlap in subject matter among diverse collections. All of the collections with any of these subjects are about Art. With the current script, the tag Art is associated with 179 collections from Syracuse University. You can see from this tiny subset of subjects that other themes would be revealed when these subjects were decomposed more completely &#8211; and this just scratches the surface.</p>
<p>Out of the 6676 subjects, 5658 subjects are assigned to single collections. Out of the 6632 tags the current script extracted from those subjects, 5594 tags are assigned to single collections. Not much improvement with the current state of the script.</p>
<p>While currently the script does a good job with the Library of Congress double dash separation pattern, the Syracuse University data has shown me a number of other standard patterns that need to be handled which can be seen in the small sampling of art related subjects shown above. The easy one is removing periods and stripping spaces from the end of subject values.  The harder change will be to implement smart separation of subjects into tags based on commas. This would need the code to only break up &lt;subject&gt; values while leaving &lt;persname&gt; and &lt;corpname&gt; alone. I will also need to examine &lt;geogname&gt; values from across various institutions to decide if it is better to break them up or leave them be.</p>
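<p>The decomposition described above might be sketched as follows. This is illustrative Python, not the ArchivesZ script; it assumes the double dash appears in the raw EAD as two hyphens, and the function names and sample values are invented for the example.</p>

```python
import re
from collections import Counter


def subject_to_tags(subject, split_commas=True):
    """Split one subject heading into cleaned tags."""
    # Library of Congress style subdivisions are separated by a double dash.
    parts = re.split(r"\s*--\s*", subject)
    if split_commas:
        parts = [piece for part in parts for piece in part.split(",")]
    # Trim whitespace and trailing periods, then discard empty pieces.
    tags = [part.strip().rstrip(".").strip() for part in parts]
    return [tag for tag in tags if tag]


def tags_for(element_name, text):
    """Only subject values are split on commas; names stay whole."""
    return subject_to_tags(text, split_commas=(element_name == "subject"))


headings = [
    ("subject", "Art -- Cartoonists"),
    ("subject", "Art -- Cartoonists."),
    ("subject", "Art, American, 20th century."),
    ("persname", "Doe, Jane"),
]
counts = Counter(tag for name, text in headings for tag in tags_for(name, text))
print(counts.most_common(2))  # [('Art', 3), ('Cartoonists', 2)]
```

<p>With the trailing periods gone, &#8216;Cartoonists&#8217; and &#8216;Cartoonists.&#8217; land on the same tag, and the comma-separated heading contributes to &#8216;Art&#8217; as well.</p>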
<p>Other than these subject issues, there are a few other script modifications that I will need to make based on scenarios the data in the Syracuse finding aids have shown me:</p>
<ul>
<li>Syracuse University uses an entity to populate the repository values &#8211; the current script does not handle this at all.</li>
<li>Ensure that single item collections are assigned a size of .25 linear feet</li>
<li>Linear ft must be added as another recognized abbreviation for linear feet</li>
</ul>
<p>All these issues are being added to my master &#8216;to do&#8217; list for updating the EAD parsing script. Onward to the next data set.</p>
<p>Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via <a title="Contact Jeanne" href="../contact/">my contact form</a>.</p>
<p><em>Image Credit: Syracuse University image above from <a title="Syracuse University Special Collections Research Center" href="http://library.syr.edu/information/spcollections/">Syracuse University Special Collections Research Center</a> home page.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Oregon State University Archives</title>
		<link>http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/</link>
		<comments>http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/#comments</comments>
		<pubDate>Sun, 22 Feb 2009 07:48:45 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=344</guid>
		<description><![CDATA[The Oregon State University Archives has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of ArchivesZ. This is my first post in what will likely be a series of looks behind the scenes at the challenges facing a project like ArchivesZ on the data [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/">ArchivesZ Data Challenges: Oregon State University Archives</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a href="http://osulibrary.oregonstate.edu/archives/"><img class="alignright size-full wp-image-346" title="OSU Archives" src="http://www.spellboundblog.com/wp-content/uploads/2009/02/osu_archives_home1.jpg" alt="OSU Archives" width="233" height="179" /></a>The <a title="Oregon State University Archives" href="http://osulibrary.oregonstate.edu/archives/archive/">Oregon State University Archives</a> has generously contributed 356 of their finding aids in EAD format for use in the development of version 2 of <a title="ArchivesZ" href="http://www.archivesz.com/">ArchivesZ</a>. This is my first post in what will likely be a series of looks behind the scenes at the challenges facing a project like ArchivesZ on the data level.</p>
<p>Version one of ArchivesZ only used finding aids from the University of Maryland and the Library of Congress. This was definitely a case of the path of least resistance. I attend the University of Maryland and the Library of Congress has a very convenient <a title="Library of Congress Finding Aid Source" href="http://lcweb2.loc.gov/faid/source.html">page providing links to all their Finding Aids source XML files</a>. A key aspect of creating version 2 of ArchivesZ is making sure that the scripts that pull data from EAD XML files are robust enough to handle the encoding practices of a very diverse range of institutions.</p>
<p>Please keep in mind that OSU is likely to bear the brunt of many basic data issues that I would have unearthed with whatever data sets I tried first!</p>
<p>There are 3 crucial data elements on which the visualizations of ArchivesZ depend: subject, inclusive dates, and collection size. Each element presents unique challenges. The script parsing issues I am uncovering with the OSU finding aids are currently worst for collection size. In order to make pretty charts which let people compare the quantity of materials in each collection (or record group &#8211; please forgive that I use the term &#8216;collection&#8217; to mean any set of records for which a finding aid has been created), we need to be able to assign a single number to represent the size of each collection. Based on the values used in the LOC and UMD finding aids, we chose to go with linear feet as our standard unit of measurement. So the trick is to translate whatever archivists choose to put into the &lt;physdesc&gt; element of their finding aid into some number of linear feet.</p>
<p>These are the size conversion rules we implemented for version 1 of ArchivesZ:</p>
<ul>
<li> 1 microfilm reel = 1 linear foot</li>
<li> Collections represented only by a number of items will be represented as .25 linear feet</li>
<li> If size only specified in number of boxes, then 1 box = .5 linear feet</li>
<li> When the size is given in some different types of units, they are prioritized in the following order: linear feet &gt; boxes &gt; microfilm reels &gt; items</li>
</ul>
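<p>To make the rules above concrete, here is one way they could be expressed. This is a Python sketch of my reading of the listed rules, not a port of the version 1 script, so it will not reproduce that script&#8217;s results in every case; the regular expressions and names are assumptions made for the example.</p>

```python
import re

# Linear feet per unit, following the conversion rules listed above.
FEET_PER_UNIT = {"linear feet": 1.0, "boxes": 0.5, "microfilm reels": 1.0}
ITEM_ONLY_SIZE = 0.25

# Units in priority order; the first unit found in the description wins.
PATTERNS = [
    ("linear feet", re.compile(r"(\d*\.?\d+)\s*linear\s+f(?:ee|oo)t", re.I)),
    ("boxes", re.compile(r"(\d*\.?\d+)\s*box(?:es)?\b", re.I)),
    ("microfilm reels", re.compile(r"(\d*\.?\d+)\s*microfilm\s+reels?\b", re.I)),
]
ITEMS = re.compile(r"(\d+)\s*items?\b", re.I)


def size_in_linear_feet(physdesc_text):
    """Return an approximate size in linear feet, or None if nothing matches."""
    for unit, pattern in PATTERNS:
        match = pattern.search(physdesc_text)
        if match:
            return float(match.group(1)) * FEET_PER_UNIT[unit]
    # A collection described only by an item count gets a fixed small size.
    if ITEMS.search(physdesc_text):
        return ITEM_ONLY_SIZE
    return None


for text in ["0.4 linear feet (1 archival box)", "3 boxes", "12 items", "5.5 cubic feet"]:
    print(text, "=", size_in_linear_feet(text))
```

<p>The last sample returns None, which shows how an unrecognized unit simply falls through every rule.</p>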
<p>This works reasonably well when the physical description values are simple &#8211; it starts to fall apart when what is entered is more complicated. Here are some examples of the physical descriptions in the OSU finding aids:</p>
<p><a title="OSU Archives: Guide to the Phi Kappa Phi-OSU Chapter Records " href="http://nwda-db.wsulibs.wsu.edu/findaid/ark:/80444/xv95428">Guide to the Phi Kappa Phi-OSU Chapter Records</a>: The display in the &#8216;pretty&#8217; version of the finding aid online shows this: 5.5 cubic feet (9 boxes, including 2 oversize boxes) (3 microfilm reels)</p>
<p>The version in the XML file is this:</p>
<pre>&lt;physdesc&gt;
  &lt;extent&gt;5.5 cubic feet&lt;/extent&gt;
  &lt;extent&gt;9 boxes, including 2 oversize boxes&lt;/extent&gt;
  &lt;extent&gt;3 microfilm reels&lt;/extent&gt;
&lt;/physdesc&gt;</pre>
<p>With the current algorithm, this finding aid would be marked as being 3 linear feet in size. At a bare minimum, I must add &#8216;cubic feet&#8217; as another unit to be converted. More difficult to discern is if I should have a value of  5.5 linear feet (assuming 1 cubic foot = 1 linear foot for the purposes of these comparisons) or a value of 8.5 linear feet (5.5 + 3 linear feet for the 3 microfilm reels). There is never going to be a perfect answer here, but clearly my logic needs to be more sophisticated than it is now.</p>
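<p>The two candidate answers can be compared side by side. In this Python sketch the extents are assumed to be already parsed into (quantity, unit) pairs, the 1 cubic foot = 1 linear foot equivalence is the working assumption stated above, and boxes are left out of the table on the guess that they describe the same material as the cubic feet figure.</p>

```python
def combine_extents(extents, feet_per_unit, additive):
    """Combine parsed (quantity, unit) pairs into one size in linear feet.

    additive=False keeps only the first recognized extent;
    additive=True sums every recognized extent.
    """
    known = [qty * feet_per_unit[unit] for qty, unit in extents if unit in feet_per_unit]
    if not known:
        return None
    return sum(known) if additive else known[0]


# Conversion table for this comparison only.
FEET_PER_UNIT = {"cubic feet": 1.0, "microfilm reels": 1.0}
phi_kappa_phi = [(5.5, "cubic feet"), (9, "boxes"), (3, "microfilm reels")]

print(combine_extents(phi_kappa_phi, FEET_PER_UNIT, additive=False))  # 5.5
print(combine_extents(phi_kappa_phi, FEET_PER_UNIT, additive=True))   # 8.5
```

<p>Making the choice a parameter means it could be set differently for each repository once their conventions are understood.</p>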
<p><a title="Harvey L. McAlister Collection" href="http://osulibrary.oregonstate.edu/archives/archive/mss/documents/OREmcalister.pdf">Harvey L. McAlister Collection</a>: The display in the pretty version of this finding aid online is this: 1 cubic foot, including 26 photographs (4 boxes, including 2 oversize boxes, and 1 map folder)</p>
<p>The version in the XML file is this:</p>
<pre>&lt;physdesc&gt;
  &lt;extent encodinganalog="300$a"&gt;1 cubic foot, including 26 photographs&lt;/extent&gt;
  &lt;extent encodinganalog="300$a"&gt;4 boxes, including 2 oversize boxes, and 1 map folder&lt;/extent&gt;
&lt;/physdesc&gt;</pre>
<p>With the current algorithm, this finding aid would be marked as being 1 linear foot in size. From looking at these two examples, it would seem that this would be fine and in fact &#8211; for the purposes of calculating a comparable size &#8211; only looking at the first &lt;extent&gt; value might be the way to go &#8211; at least for OSU finding aids.</p>
<p>There are some other simpler issues relating to standardization in the way that certain values are entered. For example, after ingesting 173 finding aids from OSU (the number I got through before my script flat out choked on a size designation), I ended up with five different repositories added to my REPOSITORIES table. I had expected only one. Each of these was entered as repository name &#8212; and I have included the length of each value to show how extra spaces are causing part of the problem:</p>
<ul>
<li>Oregon State University                Libraries &#8211; length 36</li>
<li>Oregon State University    &#8211; length 23</li>
<li>Oregon State UniversityLibraries    &#8211; length 32</li>
<li>Oregon State University             Libraries  &#8211; length 36</li>
<li>Oregon State University Libraries    &#8211; length 33</li>
</ul>
<p>Some of these I can handle by adding smarter trimming of trailing spaces &#8211; but in this case it is clear that typos and inconsistency are also a challenge. I checked, and each of these different &lt;corpname&gt; values within the &lt;repository&gt; element is used by at least 10 finding aids. Perhaps they have been inherited over time from a template?</p>
<p>I have considered creating a repository definition file that could be used when loading finding aids from one repository at a time. This would remove dependence on perfect replication of these sorts of values while still supplying the data needed to let people limit their searches by a named repository.</p>
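<p>A repository definition file could be as simple as a small JSON document that is applied to every record in a batch. The sketch below is in Python; the field names, the file contents, and the shape of the parsed records are all hypothetical.</p>

```python
import json

# Hypothetical definition file contents; the field names are assumptions.
DEFINITION = json.loads("""
{
  "repository_name": "Oregon State University Libraries",
  "finding_aid_dir": "osu/"
}
""")


def apply_repository_definition(records, definition):
    """Return copies of parsed records with the repository name taken from
    the definition instead of from each EAD file."""
    return [dict(record, repository=definition["repository_name"]) for record in records]


# Invented records standing in for the output of the parsing step.
parsed = [
    {"title": "Example collection A", "repository": "Oregon State UniversityLibraries"},
    {"title": "Example collection B", "repository": "Oregon State University"},
]
fixed = apply_repository_definition(parsed, DEFINITION)
print({record["repository"] for record in fixed})
```

<p>Because the name comes from one place, typos in individual finding aids no longer create extra rows in the repositories table.</p>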
<p>The last issue is the most minor. There are many \n and \t characters throughout the XML documents. These I plan to simply strip out as the script parses the XML file.</p>
<p>A big thank you to <a title="Elizabeth Nielsen" href="http://osulibrary.oregonstate.edu/staff/nielseel">Elizabeth Nielsen</a>, Senior Staff Archivist at OSU Archives. Her response to my query about OSU&#8217;s comfort with my taking apart their finding aids in public on my blog was &#8220;Bring it on – we’re tough!&#8221;.</p>
<p>It is fascinating to dig into new finding aids and see how the parsing script handles what it finds. I plan to test the existing script on XML from more sources to see all the things that must be fixed. Then I get to wrap my head around code that someone else wrote (another member of the original ArchivesZ team wrote the version 1 Ruby script). For those of you who are not programmers, you can skim through my <a title="Book Review of Dreaming in Code" href="http://www.spellboundblog.com/2007/05/24/book-review-dreaming-in-code-a-book-about-why-software-is-hard/">Book Review of Dreaming in Code</a> to get a handle on why this can be harder than it sounds like it should be.</p>
<p>Want to share your institution&#8217;s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via <a title="Contact Jeanne" href="http://www.spellboundblog.com/contact/">my contact form</a>.</p>
<p><em>Image Credit: OSU Archives image above from the <a title="OSU Archives" href="http://osulibrary.oregonstate.edu/archives/">OSU Archives Home Page</a>.</em></p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/">ArchivesZ Data Challenges: Oregon State University Archives</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>NEH Digital Humanities Startup Grant News: Visualizing Archival Collections</title>
		<link>http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/</link>
		<comments>http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/#comments</comments>
		<pubDate>Fri, 12 Sep 2008 05:23:28 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[information visualization]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/</guid>
		<description><![CDATA[As of August 22nd, 2008 it was official. There is even a blog post over on the NEH Office of Digital Humanities updates page to prove it. The University of Maryland was granted a Level I NEH Digital Humanities Startup Grant to fund work on the &#8216;Visualizing Archival Collections&#8217; project. The official one-liner is [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/">NEH Digital Humanities Startup Grant News: Visualizing Archival Collections</a></p>
]]></description>
			<content:encoded><![CDATA[
<p style="text-align: center"><a title="ArchivesZ" href="http://www.archivesz.com"><img src="http://www.spellboundblog.com/wp-content/uploads/2008/09/archivesz-ng.jpg" alt="archivesz ng" width="450" height="130" /></a></p>
<p>As of August 22nd, 2008 it was official. There is even a <a title="NEH ODH: Announcement of Awardees" href="http://www.neh.gov/ODH/ODHHome/tabid/36/EntryID/81/Default.aspx">blog post over on the NEH Office of Digital Humanities</a> updates page to prove it. The <a title="University of Maryland" href="http://www.umd.edu">University of Maryland</a> was granted a Level I <a title="NEH Digital Humanities Startup Grant" href="http://www.neh.gov/grants/guidelines/digitalhumanitiesstartup.html">NEH Digital Humanities Startup Grant</a> to fund work on the &#8216;Visualizing Archival Collections&#8217; project. The official one-liner is that the project will support &#8220;The development of visualization tools for assessing information contained in electronic archival finding aids created with Encoded Archival Description (EAD)&#8221;. Why did I wait so long to announce this on the blog? I wanted to have something fun to announce at the end of my SAA presentation out in San Francisco!</p>
<p>The project director is <a title="Dr. Jennifer Golbeck" href="http://www.cs.umd.edu/~golbeck/index.shtml">Dr. Jennifer Golbeck</a>. I also have the support of University of Maryland&#8217;s Jennie Levine, <a title="Dr. Bruce Ambacher" href="http://ischool.umd.edu/people/ambacher/">Dr. Bruce Ambacher</a>, and <a title="Dr. Doug Oard" href="http://www.glue.umd.edu/~oard/">Dr. Doug Oard</a>. This amazing set of collaborators should help me stay on the right track and make sure I keep the sometimes competing issues relating to archives, information retrieval and interface design in balance.</p>
<p>I will be collecting EAD-encoded finding aids over the next few months. My goal is to gather a broad sample of English-language finding aids from a wide range of institutions and work on the script that extracts this data into a database. Once we have the data extracted I get to look at what we have, do some data cleanup and start thinking about what sorts of visualizations might work with our real-world data. During the spring term we will design and build a second-generation prototype of <a title="ArchivesZ" href="http://www.archivesz.com">ArchivesZ</a>.</p>
<p>Want your data to be part of this? If you would like to contribute EAD finding aids in XML format to the project, please send me the following information:</p>
<ol>
<li>Archives Name</li>
<li>Archives Parent Institution (if applicable)</li>
<li>Archives Location</li>
<li>Contact at Archives for questions about the finding aids (name, email and phone number)</li>
<li>Estimate of # of finding aids being offered</li>
<li>Controlled Vocabulary or Thesaurus used for Subject values (as many as are used)</li>
<li>Method of finding aid delivery (sending me a zip file? pointing me at a directory online? some other way?)</li>
<li>Do I have your permission to post a discussion of the data issues I may find in your finding aids here on Spellbound Blog? (Please see the <a title="OSU ArchivesZ Data Challenges" href="http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/">OSU Archives</a> post as an example of the types of issues I discuss)</li>
</ol>
<p>You can either put this into the form on my <a title="Contact Jeanne" href="http://www.spellboundblog.com/contact/">Contact Page</a> or send email directly to jeanne AT spellboundblog dot com.</p>
<p>Thank you to everyone for their enthusiasm about the ArchivesZ project. It is very exciting to have the opportunity to take all these shiny ideas to the next level.</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/">NEH Digital Humanities Startup Grant News: Visualizing Archival Collections</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2008/09/12/neh-digital-humanities-startup-grant-news-visualizing-archival-collections/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
