<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Spellbound Blog &#187; metadata</title>
	<atom:link href="http://www.spellboundblog.com/category/metadata/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.spellboundblog.com</link>
	<description>Archives, Digital Humanities, Cultural Heritage, Technology</description>
	<lastBuildDate>Mon, 06 Feb 2012 14:49:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Support EAD Tagging Research</title>
		<link>http://www.spellboundblog.com/2010/12/06/support-ead-tagging-research/</link>
		<comments>http://www.spellboundblog.com/2010/12/06/support-ead-tagging-research/#comments</comments>
		<pubDate>Mon, 06 Dec 2010 15:27:44 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[access]]></category>
		<category><![CDATA[archival community]]></category>
		<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=1069</guid>
		<description><![CDATA[In case you haven&#8217;t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions&#8217; finding aids to extract data for my ArchivesZ project, I am here to tell you that [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/12/06/support-ead-tagging-research/">Support EAD Tagging Research</a></p>
]]></description>
			<content:encoded><![CDATA[<p>In case you haven&#8217;t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions&#8217; finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers.</p>
<p>Is part of your job to encode finding aids in EAD? Then please ask if you can send a dozen of them to the researchers on this project!</p>
<blockquote><p><strong>Seeking EAD records from repositories that have implemented EAD</strong></p>
<p>Standards have been entering the archival lexicon at a fast pace to ensure data reliability, enable data aggregation, and manage data over the long term. However, we have not yet examined the use of these standards across the archival community. As we  move into the next phase of standards-creation, a broad look at current implementations will help to inform the next  generation of these standards. To do this, Kathy Wisser (Simmons College) and Jackie Dean (UNC Chapel Hill) are conducting research on EAD tag usage in the encoding community.</p>
<p>This project is intended to inform the TS-EAD revision process of the standard, and results will be disseminated through traditional publication avenues.</p>
<p>We are seeking a sample of encoded finding aids from institutions that have implemented EAD.  If you are willing to participate in this project, please submit via electronic mail 12 to 15 finding aids to eadtagresearch@gmail.com by December 15, 2010.</p>
<p>The goal of the project is to identify encoding behavior and <strong>not</strong> to evaluate the quality of the encoding or the content of the finding aid. We will be noting the presence and absence of elements and attributes and the way that elements are used within the context of an EAD instance.</p>
<p>All results will be <strong>anonymized</strong>; no institution-specific information will be linked to the results.  Institutions willing to participate will be acknowledged.</p>
<p>In order to obtain an accurate account of the use of the standard, we are looking for EAD instances from as many institutions as possible. We hope you will consider contributing to this effort.</p>
<p>If you have any questions about the project, please contact:</p>
<ul>
<li>Kathy Wisser (Simmons College &#8211; wisser@simmons.edu)</li>
<li>Jackie Dean (UNC Chapel Hill &#8211; jdean@email.unc.edu)</li>
</ul>
</blockquote>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/12/06/support-ead-tagging-research/">Support EAD Tagging Research</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2010/12/06/support-ead-tagging-research/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Gridworks: Super Data Cleanup and Exploration Tool</title>
		<link>http://www.spellboundblog.com/2010/05/29/gridworks-data-cleanup-exploration-tool/</link>
		<comments>http://www.spellboundblog.com/2010/05/29/gridworks-data-cleanup-exploration-tool/#comments</comments>
		<pubDate>Sat, 29 May 2010 06:26:31 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[electronic records]]></category>
		<category><![CDATA[information visualization]]></category>
		<category><![CDATA[learning technology]]></category>
		<category><![CDATA[MARAC]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[software]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=987</guid>
		<description><![CDATA[In my presentation at the Spring 2010 Mid-Atlantic Regional Archives Conference (MARAC), Whirlwind Tour of Visualization-Land,  I showed some screenshots of a tool called Gridworks. At the time, Gridworks was not available to the general public. The good news is that earlier this month Gridworks 1.0 was officially released and you can get Gridworks right [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/05/29/gridworks-data-cleanup-exploration-tool/">Gridworks: Super Data Cleanup and Exploration Tool</a></p>
]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://code.google.com/p/freebase-gridworks/"><img class="size-full wp-image-988  aligncenter" title="Gridworks" src="http://www.spellboundblog.com/wp-content/uploads/2010/05/gridworks.jpg" alt="" width="400" height="100" /></a></p>
<p>In my presentation at the Spring 2010 <a title="MARAC" href="http://www.marac.info">Mid-Atlantic Regional Archives Conference</a> (MARAC), <a title="Whirlwind Tour of Visualization-Land" href="http://www.slideshare.net/JKramerSmyth/marac-2010-visualization">Whirlwind Tour of  Visualization-Land</a>,  I showed some screenshots of a tool called Gridworks. At the time, Gridworks was not available to the general public. The good news is that earlier this month <a title="Gridworks 1.0 Announcement" href="http://blog.freebase.com/2010/05/10/announcing-the-release-of-freebase-gridworks-1-0/">Gridworks 1.0 was officially released</a> and you can <a title="Gridworks on Google Code" href="http://code.google.com/p/freebase-gridworks/">get Gridworks right now</a>.</p>
<p>For those of you who didn&#8217;t see my presentation, Gridworks is a tool you run locally on your computer via a web browser. It permits you to load &#8216;grid-shaped data&#8217; for examination, filtering and data cleanup. That makes it sound so much less exciting than it is. The best way to get a sense of what you can do is to watch the <a title="Gridworks Videos" href="http://vimeo.com/groups/gridworks/videos">Gridworks Videos</a>.</p>
<p>What sort of data do I think there is in archives to be pumped  into Gridworks? How about collection descriptive data and electronic  record datasets? Since all the data is kept locally, you don&#8217;t need to worry about uploading your data to some anonymous server in order to work with it. It all stays safely on your local computer the whole time.</p>
<p>A quick list of things that Gridworks can do:</p>
<ul>
<li>Cluster data to find values that are almost the same so you can normalize your data (for example &#8211; NYC vs N.Y.C.)</li>
<li>Create instant faceted browsing based on any column in your data</li>
<li>Provide scatterplots of the values from any two numeric columns as well as a way to spot the most interesting combinations across many possible columns</li>
<li>Reconcile and validate values against data from <a title="Freebase.com" href="http://www.freebase.com/">Freebase.com</a></li>
<li>Pull data from Freebase.com based on a matched column &#8211; such as the population of a country, if your dataset has a column that specifies the country</li>
<li>Split data within a cell based on a specified delimiter</li>
<li>Apply <a title="Wikipedia: Regular Expressions" href="http://en.wikipedia.org/wiki/Regular_expression">regular expressions</a> and other simple code to data to create new columns</li>
</ul>
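<p>To make the clustering item in the list above concrete, here is a minimal sketch of one way to find &#8216;almost the same&#8217; values: build a normalized key for each value and group the values that share a key. This is an illustration of the general key-collision idea, not Gridworks&#8217; own code; the function names <code>fingerprint</code> and <code>cluster</code> are hypothetical.</p>

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: lowercase, no punctuation, sorted unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # A group of exactly one value has no variants to normalize, so skip it.
    return {key: sorted(group) for key, group in groups.items() if len(group) != 1}


if __name__ == "__main__":
    print(cluster(["NYC", "N.Y.C.", "Boston"]))
```

<p>With this approach NYC and N.Y.C. land in the same group because both reduce to the same key once case and punctuation are stripped. A person still decides which spelling wins.</p>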
<p>This list just scratches the surface, but it should give you a decent idea of the power of Gridworks. Even if the only feature you ever use is the one which lets you cluster and update your data to remove the &#8216;almost the same&#8217; values, Gridworks can save you hours of painstaking data cleanup.</p>
<p>Why is data cleanup exciting? Because once you have nice clean data with all the attributes that are useful to have for your data set &#8211; then you can start playing with the data in visualization tools! So go watch some <a title="Gridworks Videos" href="http://vimeo.com/groups/gridworks/videos">Gridworks Videos</a>, <a title="Gridworks on Google Code" href="http://code.google.com/p/freebase-gridworks/">get Gridworks for yourself</a> and start playing with data. It is free and it makes working with data fun!</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/05/29/gridworks-data-cleanup-exploration-tool/">Gridworks: Super Data Cleanup and Exploration Tool</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2010/05/29/gridworks-data-cleanup-exploration-tool/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MARAC Spring 2010: Hurray for Archival Metadata (Session S2)</title>
		<link>http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/</link>
		<comments>http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/#comments</comments>
		<pubDate>Fri, 07 May 2010 05:16:30 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[learning technology]]></category>
		<category><![CDATA[MARAC]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[virtual collaboration]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/</guid>
		<description><![CDATA[The official title for this session is &#8220;Discovery Tools for Archival Collections: Getting the Most Out of Your Metadata&#8221; and was divided into two presentations with introduction and question moderation by Jaime L. Margalotti, senior assistant librarian in Special Collections at the University of Delaware. Introduction to Metadata Standards Michael Bolam, metadata librarian for digital [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/">MARAC Spring 2010: Hurray for Archival Metadata (Session S2)</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Flickr Commons: South Colonnade, arches and statues by Henry Hering" href="http://www.flickr.com/photos/field_museum_library/3333082739/"><img class="alignright" title="Statue of Research" src="http://www.spellboundblog.com/wp-content/uploads/2010/05/research-statue.jpg" alt="research-statue.jpg" width="185" height="272" /></a>The official title for this session is &#8220;Discovery Tools for Archival Collections: Getting the Most Out of Your Metadata&#8221; and was divided into two presentations with introduction and question moderation by <a title="LinkedIn: Jaime Margalotti" href="http://www.linkedin.com/pub/jaime-margalotti/11/259/406">Jaime L. Margalotti</a>, senior assistant librarian in Special Collections at the <a title="University of Delaware" href="http://www.udel.edu/">University of Delaware</a>.</p>
<p><strong>Introduction to Metadata Standards</strong></p>
<p><a title="LinkedIn: Michael Bolam" href="http://www.linkedin.com/pub/michael-bolam/4/a41/978">Michael Bolam</a>, metadata librarian for digital production, is in charge of all the metadata for all the collections at the <a title="University of Pittsburgh Digital Research Library" href="http://www.library.pitt.edu/libraries/drl/">Digital Research Library</a> at the <a title="University of Pittsburgh" href="http://www.pitt.edu/">University of Pittsburgh</a>. He is not an archivist &#8211; but does know where the archives is at Pitt! He has put lots of archival material online through digitization and assignment of metadata.</p>
<p>The best definition he has found of metadata, good for all audiences: &#8220;Metadata consists of statements we make about resources to help us find, identify, use, manage, evaluate and preserve them.&#8221; &#8211; Marty Kurth, Head of Metadata Services, Cornell University Libraries</p>
<p>He reviewed examples of metadata for images, text documents and archival collections. There is also data related to the business of scanning and making content available &#8211; administrative, behind-the-scenes data. Standards let you take your data and use it for other purposes.</p>
<p>Overview of alphabet soup of metadata standards:</p>
<ul>
<li><a title="MARC" href="http://www.loc.gov/marc/">MARC</a>: bibliographic information in machine-readable form (a <span class="pcolor"><strong>MA</strong></span>chine-<span class="pcolor"><strong>R</strong></span>eadable <span class="pcolor"><strong>C</strong></span>ataloging record).</li>
<li><a title="Dublin Core" href="http://dublincore.org/documents/dces/">Dublin Core</a>: the goal of Dublin Core was to create a core set of metadata fields that could be used across platforms, across various disciplines.</li>
<li><a title="MARCXML" href="http://www.loc.gov/standards/marcxml/">MARCXML</a>: schema for representing MARC in XML. Makes it easy to convert to and from MARC without losing any data. May have more data than you need. MARCXML is not very &#8216;human readable&#8217;. You need to recall all the code numbers for the different data elements. Can be exported from the Archivists&#8217; Toolkit.</li>
<li><a title="MODS" href="http://www.loc.gov/standards/mods/">MODS</a>: <strong>M</strong>etadata <strong>O</strong>bject <strong>D</strong>escription <strong>S</strong>chema &#8211; sort of a &#8216;MARCXML light&#8217;. Tries to be a step between MARCXML (robust &amp; complicated) and Dublin Core (really simple). May result in compacting multiple MARCXML fields into single MODS fields. May lose some of the granularity of the data. The tags ARE human readable. The tag is the word &#8216;author&#8217; &#8211; not a number. Also can be exported from the Archivists&#8217; Toolkit.</li>
<li><a title="ONIX" href="http://www.editeur.org/8/ONIX">ONIX</a>: <strong>ON</strong>line <strong>I</strong>nformation e<strong>X</strong>change &#8211; standard used by the book publishing industry. XML-based standard for making available intellectual property in published form, both physical &amp; digital. Data created by the publisher. They use different ways of representing authors, keywords, etc. in comparison to LoC and library cataloging.</li>
<li><a title="METS" href="http://www.loc.gov/standards/mets">METS</a>: <strong>M</strong>etadata <strong>E</strong>ncoding &amp; <strong>T</strong>ransmission <strong>S</strong>tandard. XML standard wrapper for describing divergent types of content within a digital library. Books, images, collections, etc. keep their metadata in different formats &#8211; METS lets you bring them together.</li>
<li><a title="OAI-PMH" href="http://www.loc.gov/standards/marcxml/">OAI-PMH</a>: Not a metadata standard &#8211; but rather a protocol for sharing metadata. Gives us a way to pull baseline information about a digital object out of a database and put it out somewhere where it can be harvested and used.</li>
</ul>
<p>Examples of projects built on shared metadata:</p>
<ul>
<li><a title="Worldcat.org" href="http://www.worldcat.org">Worldcat.org</a>: Has everything that is shared with OCLC. They do expose their records to Google and Yahoo harvesting.</li>
<li><a title="OAIster" href="http://oaister.worldcat.org">OAIster</a>: Searches a harvested data set &#8211; it does not search live out on the web. The OAIster records are also available in WorldCat. Example: search for Pittsburgh City Photographer (that is a provider of data). Most digitization software will generate an OAIster harvestable version. In his example we see that address and location get compressed into Notes. This is because there is not always a place in Dublin Core that maps to the level of detail you collect at your local institution. http://www.oclc.org/us/en/oaister/default.htm &#8211; has the info about contributing your content for crawling.</li>
<li><a href="http://www.archivegrid.org">Archive Grid</a>: The goal is to pull in finding aids from many sources. It is a service &#8211; requires some sort of subscription and payment to see the data. Uses Lucene for searching. The content in Archive Grid is now available in WorldCat. To participate &#8211; see http://www.oclc.org/us/en/archivegrid/default.htm</li>
</ul>
<p>Google and Yahoo do index OAIster and WorldCat, so that is one path to being found in search engines.</p>
<p><strong>MARC Records for Archival Materials in WorldCat Local</strong></p>
<p><a title="LinkedIn: Jennifer MacDonald" href="http://www.linkedin.com/pub/jennifer-macdonald/4/663/609">Jennifer MacDonald</a> from the <a title="University of Delaware" href="http://www.udel.edu/">University of Delaware</a> presented a cataloger&#8217;s perspective of a WorldCat Local environment. She is a &#8220;concerned enthusiast&#8221; with regard to metadata. The University of Delaware was the first institution to buy <a title="WorldCat Local" href="http://www.oclc.org/worldcatlocal/default.htm">WorldCat Local</a>. She ended up on the WorldCat Local Special collections and Archives Task Force. The task force made their <a title="WorldCat Local Special Collections and Archives Task Force Report 2008" href="http://www.rbms.info/committees/bibliographic_standards/committee-docs/FinalReportWCLSpecCollTaskForce.pdf">final report in 2008</a> and got a <a title="OCLC Feedback to Task Force 2009" href="http://www.rbms.info/committees/bibliographic_standards/committee-docs/OCLCResponseWCLTaskForce.pdf">response from OCLC in 2009</a>. They did get some immediate changes based on their feedback &#8211; like moving the 520 &#8220;summary&#8221; data element higher in the display. For some problems the task force identified, such as Archival Materials that were not being identified properly (Internet Resource is the type for all OAI records), it is hard to tell if the issue has been fixed.</p>
<p>She showed some screenshots from WorldCat Local to show what data elements are there and how they are organized. In the FirstSearch screenshot (only available at the school), Notes and General Info holds a mishmash of content from various data elements consolidated into single fields. The task force asked for the &#8220;Browse&#8221; feature but apparently this feature is dead. They got no response from OCLC to this request in their report.</p>
<p>If you use the <a title="University of Delaware WorldCat Local" href="http://udel.worldcat.org">University of Delaware instance of WorldCat Local</a> to search for <a title="UDel Worldcat Local: walter penn shipley" href="http://udel.worldcat.org/search?q=walter+penn+shipley&amp;qt=results_page&amp;scope=0&amp;oldscope=1">walter penn shipley</a> and drill down to the detail record display for the <a title="Walter Penn Shipley Papers" href="http://udel.worldcat.org/title/walter-penn-shipley-papers-1879-1951/oclc/502285399&amp;referer=brief_results">Walter Penn Shipley Papers</a> you will see what was shown during the session. This display is customizable at the institution level in WorldCat Local. Some data is shown. You see lots of Web 2.0 options to add your own data, but the display is missing some of the data from the original MARC record. The full MARC record is indexed for keyword search, but since some of it is not displayed, users may not be able to determine why a record was returned.</p>
<p>Fields missing from the WorldCat Local display:</p>
<ul>
<li>351 &#8211; Organization and Arrangement of Materials</li>
<li>545 &#8211; biographical note</li>
<li>506 &#8211; restrictions on access</li>
<li>540 &#8211; Use of materials &#8211; with link to an askspec page: http://www.lib.udel.edu/cgi-bin/askspec.cgi</li>
<li>524 &#8211; preferred citation form &#8211; and this is where the manuscript number is</li>
<li>655 &#8211; some of the parts of the genre terms are missing</li>
<li>656 &#8211; occupation</li>
</ul>
<p>OCLC says that they have not included all this because people don&#8217;t want this displayed. Given that the local organization is already deciding what to show, the task force would prefer the option to display all data elements. Due to this missing data, Jennifer prefers the FirstSearch interface &#8211; but this option is not available at all institutions. You should take advantage of the Web 2.0 features. Archivists can create an account on WorldCat Local and add data elements.</p>
<p><strong>Questions and Answers</strong></p>
<p><strong>QUESTION:</strong> You talk about having the metadata in a format that is accessible to harvesting. What I have is a bunch of CDs with images on them that have a folder and descriptor structure. Is there a metadata harvester that can go in and pull that metadata out? New York Stock Exchange photographer sent these.</p>
<p><strong>ANSWER (Michael):</strong> So the metadata you are looking to extract is the filename and descriptors? You could have someone write a little script and extract what you need. I would hand it to the guy I work with because he writes perl. If then you made that available via your website &#8211; then people could find it. To get it into a database &#8211; it is just a small script.</p>
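<p>As an illustration of the kind of &#8216;little script&#8217; described in this answer, here is a minimal sketch that walks a folder tree and records each file&#8217;s folder and filename in a CSV inventory. It is written in Python rather than Perl, and the function names and the three column names are hypothetical choices, not anything shown in the session.</p>

```python
import csv
import os


def collect_image_metadata(root):
    """Return one row per file: folder (relative to root), filename, extension."""
    rows = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            rows.append({
                "folder": os.path.relpath(folder, root),
                "filename": name,
                "extension": os.path.splitext(name)[1].lower(),
            })
    return rows


def write_inventory(rows, csv_path):
    """Write the collected rows to a CSV file with a header line."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["folder", "filename", "extension"])
        writer.writeheader()
        writer.writerows(rows)
```

<p>Calling <code>collect_image_metadata</code> on the top folder of a copied CD and passing the result to <code>write_inventory</code> would give a spreadsheet-ready list that could then be loaded into a database or cleaned up by hand.</p>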
<p><strong>QUESTION:</strong> Are there any specifically useful webinars/seminars for becoming familiar with these formats for skillbuilding?</p>
<p><strong>ANSWER (Michael):</strong> Tons on the web. The LoC websites are very useful. You may have heard the term &#8216;crosswalking&#8217; &#8211; that is where you take one format and turn it into another. Looking at the crosswalks can make it much easier to understand how a format you understand maps to one you are trying to learn about. Shareable Metadata &#8211; metadata for you and me. Not online yet &#8211; but someone in the audience said the plan is to post the materials. There have been a couple of books and ALA publications. Most of the ones I know of are about 10 years old. <strong>Jaime</strong>: SAA has a good workshop series.</p>
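<p>For readers new to crosswalking, here is a minimal sketch of the idea: a lookup table that maps fields in one format to elements in another. The mapping below is a small, simplified illustration and not an official crosswalk; the names <code>CROSSWALK</code> and <code>crosswalk</code> are hypothetical. It also shows how several distinct MARC note fields can collapse into a single Dublin Core element, which is the loss of granularity discussed in this session.</p>

```python
# Illustrative, simplified MARC-to-Dublin Core mapping; not an official crosswalk.
CROSSWALK = {
    "100": "creator",
    "245": "title",
    "500": "description",
    "520": "description",
    "545": "description",
    "650": "subject",
}


def crosswalk(marc_fields):
    """Map (tag, value) pairs to Dublin Core elements.

    Returns a dict of element name to list of values, plus the list of
    tags that had no mapping and were therefore dropped.
    """
    record = {}
    unmapped = []
    for tag, value in marc_fields:
        element = CROSSWALK.get(tag)
        if element is None:
            unmapped.append(tag)
        else:
            record.setdefault(element, []).append(value)
    return record, unmapped
```

<p>Once a summary (520) and a biographical note (545) both sit under description, nothing in the output says which was which. Returning the unmapped tags makes it visible what a given crosswalk leaves behind.</p>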
<p><strong>QUESTION:</strong> One of the first things you said was to take data out of EAD and you didn&#8217;t go into detail in that. Were you talking about DAO tagged items?</p>
<p><strong>ANSWER (Michael):</strong> I was just talking about reusing data in a new environment. For example, we just started digitizing manuscripts and each item is becoming an individual digital object. The only metadata we have is in the EAD finding aid &#8211; so we are using that data to make descriptive data about the digital objects. We are going to create a MODS or METS record for every digital object. <strong>Jaime:</strong> We use EAD to make MODS records. She has been manually extracting EAD data as Dublin Core data for ContentDM.</p>
<p><strong>My QUESTION:</strong> What format does OAIster want?</p>
<p><strong>ANSWER (Michael):</strong> OAIster is just harvesting Dublin Core. You can share MODS and other metadata types and you may find other aggregators that are expecting their users to work in a more detailed environment. You may publish more data elements for other harvesters as well &#8211; but OAIster will only pull the Dublin Core data elements.</p>
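<p>To show what &#8216;harvesting Dublin Core&#8217; looks like at the protocol level, here is a minimal sketch that builds an OAI-PMH ListRecords request asking for the <code>oai_dc</code> (unqualified Dublin Core) format. The base URL in the usage line is a placeholder and the function name <code>list_records_url</code> is hypothetical; a real repository publishes its own OAI-PMH base URL.</p>

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        # Optional: limit the harvest to one set defined by the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder base URL, for illustration only.
    print(list_records_url("http://example.org/oai"))
```

<p>A repository that also exposes MODS would answer the same request with a different <code>metadataPrefix</code> value, which is how one data provider can serve both simple and more detailed harvesters.</p>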
<p><strong>QUESTION:</strong> We are working on a digitization project to digitize local historical societies, museums and libraries. Might the catalogers be able to deal with MODS or will the loss of granularity be a problem?</p>
<p><strong>ANSWER (Michael):</strong> I am not a MODS expert. MARC is very granular. Maybe look at the MARCXML &#8211; MODS crosswalk?</p>
<p><strong>QUESTION:</strong> At the University of Delaware, do you have any other systems?</p>
<p><strong>ANSWER (Jennifer):</strong> When we first got WorldCat Local you had to know the URL to get to the library. That changed fast! The patrons couldn&#8217;t find anything. <strong>Jaime:</strong> In WorldCat Local you cannot scope the search to specific sub-collections.</p>
<p><strong>QUESTION:</strong> Thank you Jennifer for your remarks. Is there a problem with catalogers trying to &#8216;sneak&#8217; data elements into other places &#8211; are standards in danger?</p>
<p><strong>ANSWER (Jennifer):</strong> I would hope we wouldn&#8217;t move 524 data into a 500 field just to get it displayed. There is some danger of losing the granularity by pushing everything to Dublin Core. I don&#8217;t know how real that danger is at this point.</p>
<p><strong>QUESTION:</strong> A political question for Jennifer: Who has the clout to push for changes with OCLC?</p>
<p><strong>ANSWER (Jennifer):</strong> I think encouraging users to give feedback is important. We were told that users don&#8217;t want that: &#8220;we have proven that users don&#8217;t want that&#8221;. Users need to make comments about their challenges in dealing with the interface. <strong>FROM AUDIENCE:</strong> The strongest approach is to say that you are looking at Sky River. <strong>FROM AUDIENCE:</strong> Make your data more discoverable outside the catalog world &#8211; internal websites and Google. <strong>Jaime:</strong> We are working hard to make MARC records to push access to our collections. The push is to make the data available in as many locations as possible.</p>
<p><strong>QUESTION:</strong> Are these all different levels of subscriptions? Are they trying to push people to buy more subscriptions?</p>
<p><strong>ANSWER (Jennifer):</strong> There is a sense that WorldCat Local is pushed at local public libraries. Yes &#8211; WorldCat Local is something they have to pay for. <strong>Michael:</strong> With Archive Grid you are going a step further &#8211; EVERYTHING in the finding aid is indexed. Every search I did in there returned thousands of records. Then I filtered by institution &#8211; and it never loaded. <strong>FROM AUDIENCE:</strong> I think they are revamping Archive Grid &#8211; but I don&#8217;t know how far they are in the process. <strong>Michael:</strong> I love the detail &#8211; you don&#8217;t have to dig through other data to find something useful. Depending on the institution &#8211; and how they are allowing their data to be harvested &#8211; you may see less information. <strong>Jaime:</strong> You have to actively work with OCLC to get Archive Grid to pick up your data.</p>
<p><strong>QUESTION:</strong> We are tinkering with users adding tags &#8211; are you having any success with people adding tags?</p>
<p><strong>ANSWER (Jaime):</strong> No &#8211; it isn&#8217;t something we have dealt with. WorldCat Local does let you add stuff like that.</p>
<p><strong>QUESTION:</strong> Will OCLC provide that UGC (user generated content) back to the institution?</p>
<p><strong>ANSWER:</strong> We wouldn&#8217;t know.</p>
<p><strong>QUESTION:</strong> Have they provided access to the user studies?</p>
<p><strong>ANSWER:</strong> Yes &#8211; but it is based on watching individuals use the tools.</p>
<p><em>Image Credit:</em> Statue representing Research by Henry Hering from <a title="Flickr Commons: South Colonnade, arches and statues by Henry Hering" href="http://www.flickr.com/photos/field_museum_library/3333082739/">image of the interior of the Field Museum of Natural History</a>.</p>
<p><em>As is the case with all my session summaries from MARAC, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my <a title="Contact Jeanne" href="http://www.spellboundblog.com/contact/">contact form</a>.</em></p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/">MARAC Spring 2010: Hurray for Archival Metadata (Session S2)</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2010/05/07/marac-spring-2010-hurray-for-archival-metadata-session-s2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Topic Modeling, Auto-Classification and Archival Description</title>
		<link>http://www.spellboundblog.com/2010/04/27/topic-modeling-auto-classification-archival-description/</link>
		<comments>http://www.spellboundblog.com/2010/04/27/topic-modeling-auto-classification-archival-description/#comments</comments>
		<pubDate>Tue, 27 Apr 2010 06:28:08 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[access]]></category>
		<category><![CDATA[interface design]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[text mining]]></category>
		<category><![CDATA[what if]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=963</guid>
		<description><![CDATA[In an example of Twitter serendipity, @silverasm&#8217;s (Aditi Muralidharan) tweet pointed me to @historying&#8217;s blog post about Topic Modeling. In this post Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst&#8217;s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary. I have spent a lot of time [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/04/27/topic-modeling-auto-classification-archival-description/">Topic Modeling, Auto-Classification and Archival Description</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a href="http://mallet.cs.umass.edu/index.php"><img class="alignright size-full wp-image-964" title="MALLET logo" src="http://www.spellboundblog.com/wp-content/uploads/2010/04/logo3.png" alt="" width="215" height="95" /></a>In an example of Twitter serendipity, <a title="Twitter: silverasm" href="http://twitter.com/silverasm">@silverasm</a>&#8217;s (Aditi Muralidharan) <a title="tweet about text mining" href="http://twitter.com/silverasm/statuses/12842112825">tweet</a> pointed me to <a title="Twitter: historying" href="http://twitter.com/historying">@historying</a>&#8217;s <a title="Topic Modeling Martha Ballard’s Diary" href="http://historying.org/2010/04/01/topic-modeling-martha-ballards-diary/">blog post about Topic Modeling</a>. In this post Cameron Blevins explains the results of using the <a title="MALLET: Topic Modeling" href="http://mallet.cs.umass.edu/topics.php">topic modeling</a> feature of <a title="UMass Amherst" href="http://www.umass.edu/">UMass Amherst</a>&#8217;s <a title="MAchine Learning for LanguagE Toolkit" href="http://mallet.cs.umass.edu/index.php">MAchine Learning for LanguagE Toolkit</a> (MALLET) on the text of <a title="Martha Ballard's Diary Online" href="http://dohistory.org/diary/">Martha Ballard’s Diary</a>.</p>
<p>I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, <a title="ArchivesZ Blog Posts" href="http://www.spellboundblog.com/category/archivesz/">ArchivesZ</a>, aims to provide ways of understanding aggregated archival description data, either from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects.</p>
<p><strong>Topic Modeling with MALLET</strong></p>
<p>To get a sense of what MALLET generates, see the excerpt below from Blevins&#8217;s post:</p>
<blockquote><p>With some tinkering, MALLET generated a list of thirty topics  comprised of twenty words each, which I then labeled with a descriptive  title. Below is a quick sample of what the program<em> </em>“thinks” are  some of the topics in the diary:</p>
<ul>
<li><strong>MIDWIFERY:</strong> birth deld safe morn receivd calld left  cleverly pm labour fine reward arivd infant expected recd shee born  patient</li>
<li><strong>CHURCH: </strong>meeting attended  afternoon reverend worship foren mr famely performd vers attend public  supper st service lecture discoarst administred supt</li>
<li><strong>DEATH:</strong> day yesterday  informd morn years death ye hear expired expird weak dead las past heard  days drowned departed evinn</li>
<li><strong>GARDENING:</strong> gardin sett  worked clear beens corn warm planted matters cucumbers gatherd potatoes  plants ou sowd door squash wed seeds</li>
</ul>
</blockquote>
<p>He goes on to explain that &#8220;MALLET also allows us to track those topics across the text.&#8221; What if, instead of text mining a diary, we pumped the descriptions of every archival collection from a single institution into MALLET? Of course we would need a good list of stop words, including such common terms as archives, history, sources and records. But I wonder how the topics MALLET suggests would compare to the official subjects associated with each collection. Could this give us a broad overview of the topics covered by a specific repository and give us a new way to build paths to the collections based on topic?</p>
<p><strong>Auto-Classification Using Castanet</strong></p>
<p>Text miner <a title="Aditi Muralidharan" href="http://www.cs.berkeley.edu/~aditi/">Aditi Muralidharan</a> also posted recently on this theme in <a title="Castanet: automatically generating a browsing structure for a collection" href="http://mininghumanities.com/2010/04/24/castanet-automatically-generating-a-browsing-structure-for-a-collection/">Castanet: automatically generating a browsing structure for a collection</a> and explains:</p>
<blockquote><p>Castanet automatically carves a sub-structure from the hierarchical  concept dictionary, WordNet (<a href="http://wordnet.princeton.edu/">http://wordnet.princeton.edu</a>),  and matches items in the collection to one or many appropriate places  within that hierarchy. Then, after some automated trimming and  flattening, the result is a hierarchical browsing system.</p></blockquote>
<p>I have heard of Castanet before via the <a title="Flamenco Search Interface Project" href="http://flamenco.berkeley.edu/">Flamenco Search Interface Project</a>. Apparently Muralidharan did a project using Castanet last summer to create <a href="http://go2.wordpress.com/?id=725X1342&amp;site=textdigihum.wordpress.com&amp;url=http%3A%2F%2Forange.sims.berkeley.edu%2Fcgi-bin%2Fflamenco.cgi%2Fflickr%2FFlamenco&amp;sref=http%3A%2F%2Fmininghumanities.com%2F2010%2F04%2F24%2Fcastanet-automatically-generating-a-browsing-structure-for-a-collection%2F">a category system</a> for <a title="Flickr Commons" href="http://www.flickr.com/commons">Flickr Commons</a> images based on the images&#8217;  tags which is then rendered using a Flamenco interface. I include a partial screen-shot below to give you a taste of what the navigation of images feels like a few levels down in the hierarchy. I love the classification of &#8216;Group Action&#8217; then filtered by a sub-classification of &#8216;Commerce&#8217;. The first images shown are of &#8216;horse trading&#8217; &#8211; with additional headings and images beneath them as well as additional filter options on the left.</p>
<p style="text-align: center;"><a title="Flickr Commons: group_action &gt; commerce" href="http://orange.sims.berkeley.edu/cgi-bin/flamenco.cgi/flickr/Flamenco?q=actX:322&amp;group=actX"><img class="aligncenter size-full wp-image-966" title="Flickr Commons Images via Castanet &amp; Flamenco" src="http://www.spellboundblog.com/wp-content/uploads/2010/04/flickr-canasta.jpg" alt="" width="547" height="308" /></a></p>
<p><strong>What If?</strong></p>
<p>What if we pulled all the English language archival descriptions from around the world as our original data set? If we used this data for topic modeling, our subject clusters would be cross-institutional. Maybe we could map the subjects assigned by each local institution to the topics generated by the topic model for each collection and get a sort of automated crosswalk for finding related collections. If we used the locally assigned subjects from the archival descriptions for Castanet-style auto-classification, maybe we could generate a way to hierarchically browse collections topically.</p>
<p>Both MALLET and Flamenco are open source (I am not sure of the status of Castanet) and, as I discovered working on ArchivesZ, many institutions will share their archival description data for a good cause. So &#8211; is this a good cause? I need to tease these ideas out a bit more, but what do you all think of it at first blush? Feasible? Interesting? Worthwhile experiments?</p>
<p><em>Image Credits:</em> MALLET logo from <a title="MALLET Homepage" href="http://mallet.cs.umass.edu/index.php">MALLET homepage</a>. Images in screen shot from <a title="Flickr Commons" href="http://www.flickr.com/commons">Flickr Commons</a> with no known copyright.</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2010/04/27/topic-modeling-auto-classification-archival-description/">Topic Modeling, Auto-Classification and Archival Description</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2010/04/27/topic-modeling-auto-classification-archival-description/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>DH2009: Digital Curiosities and Amateur Collections</title>
		<link>http://www.spellboundblog.com/2009/06/29/dh2009-digital-curiosities/</link>
		<comments>http://www.spellboundblog.com/2009/06/29/dh2009-digital-curiosities/#comments</comments>
		<pubDate>Tue, 30 Jun 2009 02:25:33 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[access]]></category>
		<category><![CDATA[at risk records]]></category>
		<category><![CDATA[DH2009]]></category>
		<category><![CDATA[digital humanities]]></category>
		<category><![CDATA[digitization]]></category>
		<category><![CDATA[learning technology]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[outreach]]></category>
		<category><![CDATA[virtual collaboration]]></category>
		<category><![CDATA[web 2.0]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/2009/06/25/dh2009-digital-curiosities/</guid>
		<description><![CDATA[Session Title: Digital Curiosities: Resource Creation Via Amateur Digitisation Speaker: Melissa Terras Overview: Review of 100 virtual museum websites and multiple flickr groups plus surveys of amateur website creators, memory institutions and Arts &#38; Humanities academics leads to a new perspective on digitization and creation of collections online by dedicated enthusiasts. Session Highlights Areas of &#8220;Amateur&#8221; [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/06/29/dh2009-digital-curiosities/">DH2009: Digital Curiosities and Amateur Collections</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Flickr Image from Curio Cabinet Group by mms0131" href="http://www.flickr.com/photos/mms0131/500142786/in/set-72157605079911413/"><img class="alignright size-full wp-image-617" title="Flickr Image from Curio Cabinet Group by mms0131" src="http://www.spellboundblog.com/wp-content/uploads/2009/06/curio-image.jpg" alt="curio-image" width="282" height="398" /></a><strong>Session Title:</strong> Digital Curiosities: Resource Creation Via Amateur Digitisation<br />
<strong>Speaker:</strong> <a title="Dr Melissa Terras" href="http://www.ucl.ac.uk/infostudies/melissa-terras/">Melissa Terras</a></p>
<p><strong>Overview:</strong> Review of 100 virtual museum websites and multiple flickr groups plus surveys of amateur website creators, memory institutions and Arts &amp; Humanities academics leads to a new perspective on digitization and creation of collections online by dedicated enthusiasts.</p>
<p><strong>Session Highlights</strong></p>
<p>Areas of &#8220;Amateur&#8221; endeavor  have a long history of launching collections, such as:</p>
<ul>
<li>cabinet of curiosities</li>
<li>foundation of astronomical research</li>
<li>British flora and amateur botanists</li>
<li>weather observations</li>
<li>open source software movement</li>
</ul>
<p>Being an amateur doesn&#8217;t necessarily mean being bad at what you do!</p>
<p>Within the realm of self-defined museums some common topics often emerge:</p>
<ul>
<li>ephemera (advertising, packaging, nostalgia)</li>
<li>comics</li>
<li>technology &#8211; especially old tech, there is a surprising trend of being fascinated by technology approximately 10 years older than the collector</li>
<li>personal and &#8220;embarrassing&#8221; collections</li>
<li>genealogy</li>
</ul>
<p>For these self-defined museums the scope is self-defined &#8211; these are self-delineated collections. Virtual museums can document aspects of cultural heritage considered socially taboo or in some way too sensitive to collect. A great example of this is the <a title="Museum of Menstruation" href="http://www.mum.org/">Museum of Menstruation</a> which claims to have been created 14 years ago and is currently trying to establish a <a title="Future of MUM" href="http://www.mum.org/future.htm">permanent public display</a>.</p>
<p>Platforms have evolved over the life of the web, starting with static html, then blogs and now Flickr images as a mode of presentation.</p>
<p>This is a list of successful amateur collections online:</p>
<ul>
<li><a title="Today's Inspiration" href="http://todaysinspiration.blogspot.com/">Today&#8217;s Inspiration</a> &#8211; illustration from the 40&#8217;s and 50&#8217;s</li>
<li><a title="JonWilliamson.com" href="http://jonwilliamson.com/">JonWilliamson.com</a> &#8211; advertising 1940s-1960s</li>
<li><a title="Pulp Fiction Flickr Group" href="http://www.flickr.com/groups/pulpfiction/pool/">Pulp Fiction Flickr Group</a> &#8211; 882 members who provide basic metadata and often label stuff within the image &#8211; currently contains 3,385 items.</li>
<li><a title="Curio Cabinet Flickr Group" href="http://www.flickr.com/groups/curiocabinet/">Curio Cabinet Flickr Group</a> &#8211; 1,206 members and 5,537 items</li>
</ul>
<p><a title="VADS (Visual Arts Data Service)" href="http://www.vads.ac.uk/">Visual Arts Data Service</a> (VADS) is a more traditional site created by a cultural heritage institution. It contains 100,000+ images copyright cleared for use in teaching, learning and research in the UK. VADS is a very detailed static source of images with metadata, but provides no interaction.</p>
<p>Amateurs do provide metadata, but it is intuitive metadata. It might not fit into rigid buckets of data, but that doesn&#8217;t mean that the metadata available isn&#8217;t useful.</p>
<p>What are the boundaries between amateur and professional? Work vs hobby?</p>
<p>Many of these amateur sites get much more traffic than most standard museum sites. More than 50% of museum digitized images are never visited.</p>
<p>Memory institutions are starting to put things into the wider online community:</p>
<ul>
<li><a title="Smithsonian Institution" href="http://www.si.edu/">Smithsonian</a>: photos in <a title="Flickr Commons: Smithsonian" href="http://www.flickr.com/photos/smithsonian/">Smithsonian Flickr Commons</a></li>
<li><a title="Tate Online" href="http://www.tate.org.uk/">Tate</a>: The <a title="How We Are Now" href="http://www.tate.org.uk/britain/exhibitions/howweare/slideshow.shtm">How We Are Now</a> project invited the public to contribute photos to the <a title="Flickr: How We Are Now Group" href="http://www.flickr.com/groups/howwearenow/">How We Are Flickr Group</a>. The images were <a title="Flickr Photos Streamed in the Tate" href="http://www.flickr.com/photos/tategallery/507813139/in/set-72157600238798389/">streamed to screens</a> within the <a title="How We Are: Photographing Britain" href="http://www.tate.org.uk/britain/exhibitions/howweare/default.shtm">How We Are: Photographing Britain exhibit</a> and 40 photos were chosen to be included as the last set of photos in the physical exhibit.</li>
<li><a title="Victoria &amp; Albert Museum" href="http://www.vam.ac.uk/">Victoria &amp; Albert Museum</a>: created a <a title="Flickr: Photos from Victoria &amp; Albert Museum" href="http://www.flickr.com/groups/va_museum/">Flickr group of photos taken at the V&amp;A museum</a> along with a long list of other <a title="V&amp;A Flickr Groups and Streams" href="http://www.vam.ac.uk/activ_events/do_online/flickr_group/index.html">V&amp;A Flickr groups and streams</a></li>
<li>Oxford University&#8217;s <a title="Oxford Great War Archive" href="http://www.oucs.ox.ac.uk/ww1lit/gwa">Great War Archive</a>: contains 6,500 items contributed by the public and related to the First World War.</li>
<li><a title="Facebook" href="http://www.facebook.com/">Facebook</a> and <a title="Twitter" href="http://twitter.com/">Twitter</a> are being used more often for informing the community about their collections</li>
</ul>
<p>Much of amateur research has been driven by advances in technology. A great example of this is how the advent of affordable <a title="Wikipedia: metal detector" href="http://en.wikipedia.org/wiki/Metal_detector">metal detectors</a> led to dramatic changes in archaeology. The internet and Web 2.0 technology are arming a whole new generation of enthusiasts who can find one another and collaborate more easily than might ever have been dreamed of 20 years ago.</p>
<p><strong>Next Steps &amp; Conclusions</strong></p>
<p>Future research will involve looking at the psychology of collection: archives vs collections. For now it is important to realize that institutions are not the only hosts of &#8220;worthwhile&#8221; digital objects. Pro-ams (aka pro-amateurs) are doing better at using web 2.0 &amp; getting more traffic.</p>
<p>What can memory institutions learn from this?</p>
<ul>
<li>interact with user communities</li>
<li>use the &#8216;grand central stations&#8217; of flickr, twitter, facebook</li>
<li>usability of flickr is better than what most memory institutions build for themselves</li>
</ul>
<p><strong>My Thoughts</strong></p>
<p>This session considers the ways cultural memory institutions can take advantage of the web by looking at what the successful enthusiasts are achieving. This research-backed approach confirms what I would have expected. Libraries, museums and archives are leaving a lot on the table when it comes to putting their collections online. Sites run by non-professionals are doing an amazing job of drawing in new audiences, keeping people around and then initiating conversation within that audience.</p>
<p>The Flickr Commons is a big step forward, but it isn&#8217;t the only option. There are also varying opinions about <a title="Flickr Commons Discussion: Question re Crowdsourcing: fail or win?" href="http://www.flickr.com/groups/flickrcommons/discuss/72157620593449864/">how successful the crowdsourcing aspect of the Flickr Commons is for memory institutions</a>. A lot of this goes back to a core question: &#8220;how do we know if we have succeeded?&#8221; There is much to be said for setting out clear goals when launching online initiatives. Is your goal increased traffic to your site or crowdsourcing of metadata? A great example of an initiative whose goal is clearly collection of crowdsourced metadata is the <a title="German Federal Archives, Crowdsourcing &amp; the Wikimedia Commons" href="http://www.spellboundblog.com/2009/01/26/german-federal-archives-crowdsourcing-wikimedia-commons/">German Federal Archives who chose to use the Wikimedia Commons for their photo metadata initiative</a>.</p>
<p>If you are trying to extend your mission of providing access to materials to the public, then how do you measure success? Putting your materials in what Melissa called &#8220;grand central stations&#8221; (or what I have also heard termed &#8220;public crosswalks&#8221;) definitely increases the chances of serendipitous discovery by new individuals. That said, we can see from the successful blogs mentioned above that tackling a niche with enthusiasm and consistent posting can go a long way to building a following. JonWilliamson.com seems to have only launched back in November of 2008 with a post featuring a <a title="JonWilliamson.com: Scotch Tape Christmas ad from 1951" href="http://jonwilliamson.com/template_permalink.asp?id=88">Scotch Tape Christmas ad from 1951</a>. The author posted in May of 2009 that his <a title="JonWilliamson.com: 100,000 Hits n Flickr" href="http://jonwilliamson.com/template_archives_cat.asp?cat=25">images in Flickr had surpassed 100,000 views</a>.</p>
<p>To conclude this post I leave you with a list of inspirational digitized collections online that were created by various cultural heritage institutions:</p>
<ul>
<li><a title="Publishers' Bindings Online" href="http://bindings.lib.ua.edu/">Publishers&#8217; Bindings Online</a> &#8211; discussed in <a title="SAA2007: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building (Session 707)" href="http://www.spellboundblog.com/2007/09/22/saa2007-publishers%E2%80%99-bindings-online-digitization-collaboration-standardization-and-community-building-session-707/">SAA2007&#8217;s Session: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building</a>, a multi-institutional project that includes <a title="PBO Galleries" href="http://bindings.lib.ua.edu/gallery2.html">galleries</a> of topical images combined with an essay that gives the images context. Two of my favorites are:
<ul>
<li><a title="From Domestic Goddesses to Suffragists: The Story of Women Told on Bookbindings, 1820-1920" href="http://bindings.lib.ua.edu/gallery/women.html">From Domestic Goddesses to Suffragists: The Story of Women Told on Bookbindings, 1820-1920</a></li>
<li><a title="Indians, the Frontier, and the West in American Bookbindings" href="http://bindings.lib.ua.edu/gallery/west.html">Indians, the Frontier, and the West in American Bookbindings</a></li>
</ul>
</li>
<li><a title="Calisphere" href="http://www.calisphere.universityofcalifornia.edu/">Calisphere</a> &#8211; more than 150,000 digitized items <span>organized for easy use by K-12 teachers. This is especially interesting in that it represents items already available in <a title="Online Archive of California" href="http://oac4.cdlib.org/">Online Archive of California</a>, but organized in a way to make them easy to find and use with their target audience in mind.</span></li>
<li><span><a title="Yiddish Books Online" href="http://yiddishbookcenter.org/+yb">Yiddish Books Online</a> &#8211; A project by the <a title="National Yiddish Book Center" href="http://www.yiddishbookcenter.org">National Yiddish Book Center</a> that uses the Internet Archive as a platform to host </span>11,000 digitized out-of-print Yiddish books. This project is a nice cross between a branded custom site and a grand-central station</li>
</ul>
<p>Have a favorite online collection website? Please share it in the comments below.</p>
<p><strong><span style="font-weight: normal;"><em>As is the case with all my session summaries from <a title="Digital Humanities 2009" href="http://www.mith2.umd.edu/dh09/">DH2009</a>, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my <a title="Contact Jeanne" href="../2009/06/25/contact/">contact form</a>.</em></span></strong></p>
<p><strong><em>Image credit:</em></strong> <a rel="cc:attributionURL" href="http://www.flickr.com/photos/mms0131/">http://www.flickr.com/photos/mms0131/</a> / <a rel="license" href="http://creativecommons.org/licenses/by-nc-nd/2.0/">CC BY-NC-ND 2.0</a></p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/06/29/dh2009-digital-curiosities/">DH2009: Digital Curiosities and Amateur Collections</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/06/29/dh2009-digital-curiosities/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: University of Texas at San Antonio</title>
		<link>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/</link>
		<comments>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/#comments</comments>
		<pubDate>Wed, 13 May 2009 06:28:53 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=534</guid>
		<description><![CDATA[Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA&#8217;s Archives and Special Collections finding aids in XML format. With the current scripts, these are the fun tag stats: 1,684 total tags extracted 75% (1,266 tags) [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/">ArchivesZ Data Challenges: University of Texas at San Antonio</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="UTSA Archives and Special Collections" href="http://www.lib.utexas.edu/taro/browse/browse_utsa1.html"><img class="alignright size-full wp-image-535" title="University of Texas San Antonio Archives and Special Collections" src="http://www.spellboundblog.com/wp-content/uploads/2009/05/logo-utsa.gif" alt="University of Texas San Antonio Archives and Special Collections" width="205" height="101" /></a></p>
<p><a title="Mark Shelstad" href="http://www.linkedin.com/pub/dir/mark/shelstad">Mark Shelstad</a>, head of <a title="Archives and Special Collections at University of Texas at San Antonio" href="http://www.lib.utsa.edu/archives/">Archives and Special Collections at University of Texas at San Antonio</a>, sent me a link to the <a title="TARO: UTSA" href="http://www.lib.utexas.edu/taro/utsa/utsa_xml.html">TARO</a> (Texas Archival Resources Online) page for <a title="UTSA Archives and Special Collections" href="http://www.lib.utexas.edu/taro/browse/browse_utsa1.html">UTSA&#8217;s Archives and Special Collections finding aids</a> in XML format.</p>
<p>With the current scripts, these are the fun tag stats:</p>
<ul>
<li>1,684 total tags extracted</li>
<li>75% (1,266 tags) are associated with only one finding aid</li>
<li>3% (51 tags) are associated with 10 or more finding aids</li>
</ul>
<p><strong>Collection Size</strong></p>
<p>235 out of the 253 collections ended up with a collection size of 0.</p>
<p>Consider the encoding of the collection size in the <a title="A Guide to the Women's Overseas Service League Records, 1910-2007" href="http://www.lib.utexas.edu/taro/utsa/00008/utsa-00008.html">Guide to the Women&#8217;s Overseas Service League Records, 1910-2007</a>:</p>
<pre>&lt;physdesc label="Extent:" encodinganalog="300$a"&gt;
    77 linear feet (approximately 44,000 items)
&lt;/physdesc&gt;</pre>
<p>Contrast this with one of the examples where the size of the collection was extracted properly by the current script:</p>
<pre>&lt;physdesc label="Extent:" encodinganalog="300$a"&gt;
    &lt;extent&gt;8.4 linear feet&lt;/extent&gt;
    (14 boxes)
&lt;/physdesc&gt;</pre>
<p>Sometimes it feels like a game of Where&#8217;s Waldo. In this case we are simply missing the set of &lt;extent&gt; tags  from the first example. Off I went to the EAD tag descriptions to find the <a title="LOC: physdesc tag library description" href="http://www.loc.gov/ead/tglib/elements/physdesc.html">guidelines for use of the &lt;physdesc&gt; tag</a>, where I found this overview of the tag:</p>
<p style="padding-left: 30px;">A wrapper element for bundling information about the appearance or construction   of the described materials, such as their dimensions, a count of their quantity   or statement about the space they occupy, and terms describing their genre,   form, or function, as well as any other aspects of their appearance, such as   color, substance, style, and technique or method of creation. The information   may be presented as plain text, or it may be divided into the &lt;dimension&gt;, &lt;extent&gt;, &lt;genreform&gt;,   and &lt;physfacet&gt; subelements.</p>
<p>Bad news for my script logic &#8211; both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the &lt;extent&gt; tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the &lt;physdesc&gt; tags without enclosing &lt;extent&gt; or &lt;dimension&gt; tags.</p>
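<p>One way to handle both valid encodings might look like the sketch below. It is an illustration in Python, not the actual ArchivesZ extraction script: the helper name <code>linear_feet</code> and the regular expression are assumptions, and it only recognizes sizes stated in linear feet. It prefers the &lt;extent&gt; subelement when one exists and otherwise falls back to the plain text of &lt;physdesc&gt;.</p>

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: a number followed by "linear feet" or "linear foot".
LINEAR_FEET = re.compile(r"(\d+(?:\.\d+)?)\s*linear\s+f(?:ee|oo)t", re.IGNORECASE)


def linear_feet(physdesc_xml):
    """Return the size in linear feet from a <physdesc> fragment, or None."""
    physdesc = ET.fromstring(physdesc_xml)
    extent = physdesc.find("extent")
    # Prefer <extent> when present; otherwise read all text inside <physdesc>.
    source = extent if extent is not None else physdesc
    # itertext() also collects text around child tags; split/join collapses whitespace.
    text = " ".join("".join(source.itertext()).split())
    match = LINEAR_FEET.search(text)
    return float(match.group(1)) if match else None


plain = (
    '<physdesc label="Extent:" encodinganalog="300$a">'
    "77 linear feet (approximately 44,000 items)</physdesc>"
)
tagged = (
    '<physdesc label="Extent:" encodinganalog="300$a">'
    "<extent>8.4 linear feet</extent> (14 boxes)</physdesc>"
)

print(linear_feet(plain))   # 77.0
print(linear_feet(tagged))  # 8.4
```

<p>Sizes in other units (boxes, folders, items) return None here and would need their own conversion rules.</p>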
<p><strong>Inclusive Dates</strong></p>
<p>Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of &lt;unitdate&gt; formats that my current script could not parse properly, including the examples below:</p>
<ul>
<li>1917-1980 (bulk 1920-1945)</li>
<li>1876-1903, 1914-1919, 1940-2002</li>
<li>1940s, 1970s-1990s</li>
</ul>
<p>Another encoding approach that could not be parsed was the one used for the finding aid of the <a title="Church Women United of San Antonio Records" href="http://www.lib.utexas.edu/taro/utsa/00046/utsa-00046.html">Church Women United of San Antonio Records</a>. In this case the &lt;unitdate&gt; tag is within the &lt;unittitle&gt; tag as seen here:</p>
<pre style="padding-left: 30px;">&lt;unittitle label="Title:" encodinganalog="245"&gt;
Church Women United of San Antonio Records,
&lt;unitdate label="Dates:" encodinganalog="245$a"&gt;1961-2005&lt;/unitdate&gt;
&lt;/unittitle&gt;</pre>
<p>Among the finding aids for which I did extract a range of inclusive date years, I also found issues with values like 1950s-1990s. The current script interpreted this to represent 1950 through 1990, but I believe it would be more properly translated as representing 1950 through 1999.</p>
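<p>The date formats above could be reduced to a single earliest/latest pair with something like the following sketch. It is written in Python for illustration and is not the ArchivesZ script; the function name <code>year_span</code> is hypothetical. It assumes that a parenthetical such as &#8220;(bulk 1920-1945)&#8221; can be ignored and that a decade such as 1990s runs through 1999.</p>

```python
import re
import xml.etree.ElementTree as ET

# A four-digit year, optionally followed by "s" to mark a decade (e.g. 1940s).
YEAR = re.compile(r"\b(\d{4})(s?)\b")


def year_span(unitdate):
    """Return (earliest, latest) year found in a date string, or None."""
    # Assumption: parenthetical notes such as "(bulk 1920-1945)" are skipped.
    text = re.sub(r"\([^)]*\)", "", unitdate)
    starts, ends = [], []
    for digits, decade in YEAR.findall(text):
        year = int(digits)
        starts.append(year)
        # Assumption: a decade like "1990s" extends to its last year, 1999.
        ends.append(year + 9 if decade else year)
    if not starts:
        return None
    return min(starts), max(ends)


print(year_span("1917-1980 (bulk 1920-1945)"))       # (1917, 1980)
print(year_span("1876-1903, 1914-1919, 1940-2002"))  # (1876, 2002)
print(year_span("1940s, 1970s-1990s"))               # (1940, 1999)
print(year_span("1950s-1990s"))                      # (1950, 1999)

# A <unitdate> nested inside <unittitle> can be located with a descendant search.
title = ET.fromstring(
    '<unittitle label="Title:" encodinganalog="245">'
    "Church Women United of San Antonio Records, "
    '<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate>'
    "</unittitle>"
)
print(year_span(title.find(".//unitdate").text))     # (1961, 2005)
```

<p>Collapsing several ranges into one span loses the gaps between them, which may or may not matter depending on how the years are used downstream.</p>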
<p><strong>General Code Fixes</strong></p>
<p>The University of Texas at San Antonio’s finding aids have provided additional examples of the following data and encoding issues already identified in earlier data sets:</p>
<ul>
<li>Inconsistent repository titles (26 different variations of &#8220;The University of Texas at San Antonio Library&#8221;)</li>
<li>Titles with embedded and tagged dates</li>
<li>Carriage return and tab characters that need to be removed</li>
<li>Emphasis within a title or abstract added via a tag (such as &lt;emph render=&quot;italic&quot;&gt;Storyletters&lt;/emph&gt; seen in <a title="A Guide to the Storyletters Records, 1991-2000" href="http://www.lib.utexas.edu/taro/utsa/00021/utsa-00021.html">A Guide to the Storyletters Records, 1991-2000</a>) which interrupts extraction of text at that point</li>
</ul>
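<p>Several of the issues in this list (embedded tags, tabs and carriage returns) come down to reading only the text before the first child element. A possible approach, sketched here in Python rather than taken from the ArchivesZ script, is to gather every text node under the element and then collapse whitespace. The helper name <code>flat_text</code> and the sample fragment are invented for illustration; the real finding aid may be encoded differently.</p>

```python
import xml.etree.ElementTree as ET


def flat_text(element):
    """Join all text under an element, ignoring child tags, and tidy whitespace."""
    # itertext() yields text before, inside and after every nested element.
    raw = "".join(element.itertext())
    # split() with no arguments drops tabs, carriage returns and newlines.
    return " ".join(raw.split())


# Invented sample: a title with an <emph> tag, a tagged date and stray whitespace.
title = ET.fromstring(
    "<unittitle>A Guide to the\r\n\t"
    '<emph render="italic">Storyletters</emph> Records,\n'
    "\t<unitdate>1991-2000</unitdate></unittitle>"
)

print(flat_text(title))  # A Guide to the Storyletters Records, 1991-2000
```

<p>The same helper would also cover a forced line break tag inside a size statement, since the break element contributes no text of its own.</p>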
<p><strong>Next Steps</strong></p>
<p>This is the last data set I am analyzing before tackling actual updates to the ArchivesZ data extraction script. My next step is to review and prioritize my long to-do list for updates to this script. Most of what I have found in my examination of the data sets are ways in which my script was not smart enough to handle valid variations in encoding and the tabs, carriage returns, formatting tags and special characters found throughout everyone&#8217;s XML. Yes, there are some cases in which the data itself is less than optimal (such as non-standardized repository titles) or the values are challenging (so many ways to describe the size of a collection!), but overall I am optimistic about how much more I can improve the extraction script before I have to resort to hand correcting records in the database.</p>
<p>Thanks to everyone for your patience with these data analysis posts. Onward to programming!</p>
<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/">ArchivesZ Data Challenges: University of Texas at San Antonio</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/05/13/archivesz-data-challenges-university-of-texas-san-antonio/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Forest History Society</title>
		<link>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/</link>
		<comments>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/#comments</comments>
		<pubDate>Wed, 06 May 2009 21:30:48 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=470</guid>
		<description><![CDATA[Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address: Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title [...]<p>This post is from: <a href="http://www.spellboundblog.com">Spellbound Blog</a>.<br/><br/><a href="http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/">ArchivesZ Data Challenges: Forest History Society</a></p>
]]></description>
			<content:encoded><![CDATA[<p><a title="Forest History Society" href="http://www.foresthistory.org"><img class="alignright size-full wp-image-471" title="The Forest History Society" src="http://www.spellboundblog.com/wp-content/uploads/2009/04/fhs_logo_small.jpg" alt="The Forest History Society" width="82" height="130" /></a><a title="Amanda Ross" href="http://fhsarchives.wordpress.com/author/amandatross/">Amanda Ross</a>, project archivist for the <a title="Forest History Society" href="http://www.foresthistory.org/">Forest History Society</a>, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:</p>
<ul>
<li>Titles with embedded tags or punctuation. The script generally drops everything after it hits either, so rather than a title like <a title="William E. Towell Papers, 1941 - 1988" href="http://foresthistory.org/ead/Towell_William_E.html">William E. Towell Papers, 1941 &#8211; 1988</a>, my database ended up with only &#8220;William E Towell Papers,&#8221; based on this encoding: &lt;titleproper&gt;Inventory of the William E. Towell Papers, &lt;date normal=&quot;1941/1988&quot;&gt;1941 &#8211; 1988&lt;/date&gt;&lt;/titleproper&gt;</li>
<li>The script needs a conversion factor for a size of &#8220;1 folder&#8221; (as found in the <a title="Inventory of the Biltmore Forest School Images, 1890 - 1988" href="http://foresthistory.org/ead/Biltmore_Forest_School_Images.html">Inventory of the Biltmore Forest School Images, 1890 &#8211; 1988</a>).</li>
<li>My script chokes on the inclusive year format &#8220;1910 and 1931 &#8211; 1937&#8221; (as found in the <a title="Inventory of the Alfred Cunningham Papers, 1910 and 1931 - 1937" href="http://foresthistory.org/ead/Cunningham_Alfred.html">Inventory of the Alfred Cunningham Papers, 1910 and 1931 &#8211; 1937</a>).</li>
<li>The presence of an &lt;lb/&gt; element within the &lt;extent&gt; tag, used to force a line break, prevents my script from extracting any size information at all (as found in the <a title="Inventory of the DeWitt Nelson Papers, 1940 - 1976" href="http://foresthistory.org/ead/Nelson_DeWitt.html">Inventory of the DeWitt Nelson Papers, 1940 &#8211; 1976</a>).</li>
<li>Within the &lt;abstract&gt; tag, my script drops everything after an &lt;emph render=&quot;doublequote&quot;&gt; tag (making for a very short abstract in the case of the <a title="Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 - 1947" href="http://foresthistory.org/ead/Recknagel_Arthur_Bernard.html">Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 &#8211; 1947</a>).</li>
</ul>
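<p>Several of the items above come down to the same thing: text that continues after an embedded child element gets lost. One possible approach, sketched here in Python rather than the Ruby of the extraction script, is to gather every text fragment inside the element. The function name <code>flatten_text</code> and the sample extent value are assumptions made for illustration only.</p>

```python
import xml.etree.ElementTree as ET


def flatten_text(element):
    """Return all text inside an element, including text after child tags."""
    # itertext() yields every text fragment in document order, so text that
    # follows an embedded <date>, <lb/> or <emph> is kept.
    fragments = " ".join(element.itertext())
    # Collapse runs of whitespace, including line breaks, to single spaces.
    return " ".join(fragments.split())


# Title encoding quoted above, with a plain hyphen standing in for the dash.
title = flatten_text(ET.fromstring(
    '<titleproper>Inventory of the William E. Towell Papers, '
    '<date normal="1941/1988">1941 - 1988</date></titleproper>'
))
# Hypothetical <extent> content; the real DeWitt Nelson values are not shown.
extent = flatten_text(ET.fromstring(
    "<extent>3 boxes<lb/>1.5 linear feet</extent>"
))
print(title)
print(extent)
```

<p>Joining the fragments with a space keeps the text on either side of an empty &lt;lb/&gt; apart. The trade-off is that a stray space can appear before punctuation that directly follows a child element.</p>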
<p>The most dramatic issue, seen across all the finding aids in this set, is that <strong>no</strong> subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of &lt;list&gt; and &lt;item&gt; tags as shown here:</p>
<pre>&lt;controlaccess&gt;
&lt;head&gt;Subject Headings&lt;/head&gt;
&lt;list type="simple"&gt;
&lt;item&gt;&lt;genreform source="lcnaf" encodinganalog="655"&gt;Audiotapes&lt;/genreform&gt;&lt;/item&gt;
&lt;item&gt;&lt;persname source="lcnaf" encodinganalog="600"&gt;Ainsworth, John H., 1909-&lt;/persname&gt;&lt;/item&gt;
&lt;item&gt;&lt;subject source="lcnaf" encodinganalog="650"&gt;Businessmen -- United States&lt;/subject&gt;&lt;/item&gt;</pre>
<p>This is in contrast with this example of encoding from <a title="ArchivesZ Data Challenges: Syracuse University Special Collections Research Center" href="http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/">Syracuse University</a>:</p>
<pre>&lt;controlaccess&gt;
&lt;head&gt;Subject and Genre Headings&lt;/head&gt;
&lt;subject encodinganalog="650" source="local"&gt;Adult education&lt;/subject&gt;
&lt;persname encodinganalog="600" source="lcnaf"&gt;Adolphson, L. H.&lt;/persname&gt;
&lt;persname encodinganalog="600" source="lcnaf"&gt;Bradford, Leland Powers, 1905-&lt;/persname&gt;</pre>
<p>Or this sample from <a title="ArchivesZ Data Challenges: Oregon State University Archives" href="http://www.spellboundblog.com/2009/02/22/archivesz-data-challenges-oregon-state-university/">Oregon State University</a>:</p>
<pre>&lt;controlaccess id="a12"&gt;
	 &lt;controlaccess&gt;
		  &lt;persname encodinganalog="600" source="local" rules="aacr2"
		  role="subject"&gt;Aitken, Frances Alva, 1889-1970.&lt;/persname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;corpname encodinganalog="610" source="local" role="subject"
		  rules="aacr2"&gt;Oregon Agricultural College. Class of 1910.&lt;/corpname&gt;
		  &lt;corpname source="lcnaf" encodinganalog="610" role="subject"&gt;Oregon
				Agricultural College--Students.&lt;/corpname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;geogname source="lcsh" role="subject" encodinganalog="651"&gt;Corvallis
				(Or.)&lt;/geogname&gt;
	 &lt;/controlaccess&gt;
	 &lt;controlaccess&gt;
		  &lt;subject encodinganalog="650" source="lcsh"&gt;Student
				activities--Oregon--Corvallis.&lt;/subject&gt;
	 &lt;/controlaccess&gt;</pre>
<p>Both the Syracuse and OSU examples are handled correctly by the current version of the data extraction script.</p>
<p>Amanda pointed me to the <a title="NCEAD Best Practice Guidelines for EAD 2002" href="http://www.ncecho.org/dig/ead2002.shtml">NCEAD Best Practice Guidelines for EAD 2002</a>. In <a title="APPENDIX G: HOW DO I ENCODE...?" href="http://www.ncecho.org/dig/ead2002.shtml#appendixG">Appendix G: How Do I Encode&#8230;</a>, the second question is &#8220;What if I have multi-part scope notes, biographical notes or subject headings?&#8221; The answer shows exactly the &lt;list&gt; and &lt;item&gt; usage found in the Forest History Society finding aids, so this format clearly should be handled.</p>
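<p>If the working theory is right, one way to treat the flat, nested and &lt;list&gt;/&lt;item&gt; encodings alike is to search all descendants of &lt;controlaccess&gt; instead of only its direct children. This is a minimal Python sketch under that assumption, not the actual extraction script; <code>HEADING_TAGS</code> and <code>extract_headings</code> are names invented for the example.</p>

```python
import xml.etree.ElementTree as ET

# Element names treated as subject-like headings (an assumption of this sketch).
HEADING_TAGS = {"subject", "persname", "corpname", "geogname", "genreform"}


def extract_headings(controlaccess):
    """Collect (tag, text) pairs from anywhere below a <controlaccess>."""
    headings = []
    # iter() visits every descendant, so headings wrapped in <list>/<item>
    # or in nested <controlaccess> elements are found the same way.
    for node in controlaccess.iter():
        if node.tag in HEADING_TAGS:
            text = " ".join("".join(node.itertext()).split())
            headings.append((node.tag, text))
    return headings


nested = ET.fromstring("""
<controlaccess>
  <head>Subject Headings</head>
  <list type="simple">
    <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
    <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
  </list>
</controlaccess>
""")
flat = ET.fromstring("""
<controlaccess>
  <head>Subject and Genre Headings</head>
  <subject encodinganalog="650" source="local">Adult education</subject>
</controlaccess>
""")
print(extract_headings(nested))
print(extract_headings(flat))
```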
<p>So, no fun tag stats for this run &#8211; but I hope to fix my Ruby script so that the Forest History Society finding aids can be incorporated into the data set I use for testing version 2 of ArchivesZ. My Ruby script to-do list is getting quite long!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/05/06/archivesz-data-challenges-forest-history-society/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Another Thrilling Digital Adventure With Team Digital Preservation</title>
		<link>http://www.spellboundblog.com/2009/05/06/another-thrilling-digital-adventure-with-team-digital-preservation/</link>
		<comments>http://www.spellboundblog.com/2009/05/06/another-thrilling-digital-adventure-with-team-digital-preservation/#comments</comments>
		<pubDate>Wed, 06 May 2009 04:27:47 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[born digital records]]></category>
		<category><![CDATA[future-proofing]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[preservation]]></category>
		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=507</guid>
		<description><![CDATA[Thanks to Archivism.net for this animated gem from DigitalPreservationEurope. Somehow they manage to include digital preservation, trusted data repositories, metadata and refreshing storage media in their story of Team Digital Preservation vs Team Chaos. I really want a t-shirt with the Bit-Rot guy on it!
]]></description>
			<content:encoded><![CDATA[<p>Thanks to <a title="Archivism.net" href="http://archivism.net/journal/">Archivism.net</a> for this animated gem from <a title="Digital Preservation Europe" href="http://www.digitalpreservationeurope.eu/">DigitalPreservationEurope</a>. Somehow they manage to include digital preservation, trusted data repositories, metadata and refreshing storage media in their story of <a title="YouTube: Team Digital Preservation vs Team Chaos" href="http://www.youtube.com/watch?v=pbBa6Oam7-w">Team Digital Preservation vs Team Chaos</a>.</p>
<p><center><object width="490" height="298" data="http://www.youtube.com/v/pbBa6Oam7-w&amp;hl=en&amp;fs=1&amp;rel=0" type="application/x-shockwave-flash"><param name="allowFullScreen" value="true" /><param name="allowscriptaccess" value="always" /><param name="src" value="http://www.youtube.com/v/pbBa6Oam7-w&amp;hl=en&amp;fs=1&amp;rel=0" /><param name="allowfullscreen" value="true" /></object></center></p>
<p>I really want a t-shirt with the Bit-Rot guy on it!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/05/06/another-thrilling-digital-adventure-with-team-digital-preservation/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Utah Government Archives &amp; Records Service</title>
		<link>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/</link>
		<comments>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/#comments</comments>
		<pubDate>Sun, 26 Apr 2009 05:33:17 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[interface design]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=424</guid>
		<description><![CDATA[Gina Strack of the Utah State Archives and Records Service provided me with access to the XML of 1,196 EAD encoded finding aids. These EAD 2.0 XML files are a product of a grant funded project completed last year to migrate from EAD 1.0 finding aids. Their website includes a detailed account of the EAD [...]
]]></description>
			<content:encoded><![CDATA[<p><a title="Utah State Archives and Records Service" href="http://www.archives.state.ut.us/"><img class="alignright size-full wp-image-425" title="Utah dot Gov Logo" src="http://www.spellboundblog.com/wp-content/uploads/2009/03/utahgovlogoglow.png" alt="Utah dot Gov Logo" width="87" height="66" /></a><a title="Gina Strack" href="http://ginastrack.com/">Gina Strack</a> of the <a title="Utah State Archives and Records Service" href="http://www.archives.state.ut.us/"><span class="il">Utah</span> State Archives and Records Service</a> provided me with access to the XML of 1,196 EAD-encoded finding aids. These EAD 2.0 XML files are a product of a grant-funded project completed last year to migrate from <a title="EAD version 1 finding aids" href="http://historyresearch.utah.gov/inventories/inventories-ac.htm">EAD 1.0 finding aids</a>. Their website includes a <a title="Utah State Archives EAD Project" href="http://archives.utah.gov/research/inventories/ead.html">detailed account of the EAD Project</a>.</p>
<p>These finding aids have helped me identify three types of ArchivesZ data challenges:</p>
<ul>
<li>strange characters</li>
<li>broad composite subjects</li>
<li>determination of accurate collection size</li>
</ul>
<p><strong>Strange and mysterious characters!</strong></p>
<p>These finding aids use a special character in the place of the standard Library of Congress double dash which normally appears between subsections of the subject heading.</p>
<p>An example subject from the Utah Government XML looks like this:</p>
<p style="padding-left: 30px;">Women—Suffrage—Utah.</p>
<p>Viewing the same subject in a pure text editor (such as <a title="Wikipedia: vi" href="http://en.wikipedia.org/wiki/Vi">vi</a>):</p>
<p style="padding-left: 30px;">Women&amp;#8212;Suffrage&amp;#8212;Utah.</p>
<p>By the time it gets into my database and is pulled out via a query in MySQL Query Browser it looks like this:</p>
<p style="padding-left: 30px;">Womenâ€”Suffrageâ€”Utah.</p>
<p>Rather than just stripping out all instances of &amp;#8212;,  my plan is to replace them with the standard Library of Congress double dash. This will ensure that the existing code that breaks the subjects down to tags will still work.</p>
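<p>The replacement described above could look roughly like this. It is a sketch in Python, not the Ruby extraction script, and it assumes that the existing tag logic splits subjects on the double dash; <code>normalize_subject</code> and <code>subject_to_tags</code> are hypothetical names.</p>

```python
import html

EM_DASH = "\u2014"  # the character that the reference &#8212; stands for


def normalize_subject(raw):
    """Decode character references and use the standard double dash."""
    # html.unescape turns the numeric reference into the dash character itself;
    # text that already contains the character passes through unchanged.
    decoded = html.unescape(raw)
    return decoded.replace(EM_DASH, "--")


def subject_to_tags(raw):
    """Split a normalized subject on the double dash, dropping a final period."""
    parts = normalize_subject(raw).split("--")
    return [part.strip().rstrip(".") for part in parts if part.strip()]


print(normalize_subject("Women&#8212;Suffrage&#8212;Utah."))
print(subject_to_tags("Women&#8212;Suffrage&#8212;Utah."))
```

<p>Doing the replacement before the subject reaches the database also sidesteps the garbled display shown above, because the double dash is plain ASCII.</p>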
<p><strong>Composite Subjects</strong></p>
<p>When I say &#8220;composite subject&#8221; what I mean is a subject that includes multiple very disparate terms. Rather than the Library of Congress style subjects, all aspects of which relate to the collection in question, these composite subjects cover multiple subjects which are grouped together for convenience.</p>
<p>This is a list of some of the most popular subjects for the Utah Gov collections:</p>
<ul>
<li>Politics, Government, and Law</li>
<li>Business, Industry, Labor, and Commerce</li>
<li>Science, Technology, and Health</li>
<li>Arts, Humanities, and Social Sciences</li>
</ul>
<p>These subjects throw a monkey wrench into my theories about decomposing subjects based on commas. The collections to which these subjects are assigned likely fit in only one of the component themes. For example, the &#8220;Inventory of Publications from Department of Technology Services, 1993-2008&#8221; is assigned the subject &#8220;Science, Technology, and Health&#8221;. If I divide this subject into 3 separate tags, the Science and Health tags would be quite misleading.</p>
<p>So that leaves me a bit trapped. If I want to divide subjects such as &#8220;Art, Cuban, 20th century&#8221;, as I discuss in <a title="ArchivesZ Data Challenges: Syracuse University Special Collections Research Center" href="http://www.spellboundblog.com/2009/03/07/archivesz-data-syracuse-university-archives/">my Syracuse University post</a>, then I end up also dividing these umbrella subjects which separate such very divergent terms with commas.</p>
<p>This issue goes on my list of reasons to add a repository configuration file for use by the data extraction script.</p>
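<p>As a rough illustration of what such a configuration might control, here is a Python sketch in which comma splitting is switched on or off per repository. The repository keys, the <code>split_on_commas</code> setting and the function name are all invented for this example; no such file exists yet.</p>

```python
# Hypothetical per-repository settings; keys and values are invented here.
REPOSITORY_CONFIG = {
    "utah": {"split_on_commas": False},
    "syracuse": {"split_on_commas": True},
}


def split_subject(subject, repository):
    """Split a subject into tags according to the repository's settings."""
    settings = REPOSITORY_CONFIG.get(repository, {})
    # Unknown repositories fall back to keeping the subject whole.
    if not settings.get("split_on_commas", False):
        return [subject.strip()]
    return [part.strip() for part in subject.split(",") if part.strip()]


print(split_subject("Science, Technology, and Health", "utah"))
print(split_subject("Art, Cuban, 20th century", "syracuse"))
```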
<p><strong>Accurate Collection Size</strong></p>
<p>In my quest to convert all sizes to linear feet &#8211; sizes such as these are challenging:</p>
<ul>
<li>0.20 cubic foot and 1 microfilm reel</li>
<li>0.35 cubic foot and 2 microfilm reels</li>
</ul>
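<p>A first step toward handling sizes like these might be to break the statement into its components before any conversion is attempted. The Python sketch below only parses. It does not convert to linear feet, because no conversion factors have been settled on here. The pattern and the name <code>parse_extent</code> are assumptions.</p>

```python
import re

# One "<number> <unit words>" component, for example "0.20 cubic foot".
COMPONENT = re.compile(r"(\d+(?:\.\d+)?)\s+([A-Za-z][A-Za-z ]*)")


def parse_extent(statement):
    """Split a size statement joined by "and" into (quantity, unit) pairs."""
    components = []
    for piece in re.split(r"\s+and\s+", statement.strip()):
        match = COMPONENT.fullmatch(piece.strip())
        if match is None:
            # Keep anything unexpected so it can be reviewed by hand.
            components.append((None, piece.strip()))
            continue
        quantity = float(match.group(1))
        unit = match.group(2).strip().lower()
        components.append((quantity, unit))
    return components


print(parse_extent("0.20 cubic foot and 1 microfilm reel"))
print(parse_extent("0.35 cubic foot and 2 microfilm reels"))
```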
<p class="label">I also have situations where sizes are specified in multiple sections of the finding aid. The <a title="Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991." href="http://images.archives.utah.gov/cdm4/item_viewer.php?CISOROOT=/ead&amp;CISOPTR=991&amp;CISOBOX=1&amp;REC=1">Inventory of ALERT Foundation records from Governor Bangerter, 1986-1991</a> has a collection level size of &#8220;0.50 cubic foot and 2 microfilm reels&#8221;, but further down in this finding aid I see this:</p>
<p class="label"><em><span class="label">series: </span>ALERT Foundation records </em></p>
<ul>
<li><span class="label">box 1, folder 1: </span>Documentary: &#8220;Letters from our Children,&#8221; Motion picture film reel, 16mm</li>
<li> <span class="label">box 1, folder 2: </span>Documentary: &#8220;Letters from our Children,&#8221; VHS videocassette</li>
<li> <span class="label">box 1, folder 3: </span>Documentary: &#8220;Letters from our Children,&#8221; VHS videocassette</li>
<li> <span class="label">box 1, folder 4: </span>Documentary: &#8220;Letters from our Children,&#8221; VHS videocassette</li>
</ul>
<p>When they said 2 microfilm reels &#8211; did they really mean a 16mm motion picture film reel and a VHS videocassette? Is there 1 VHS videocassette or 3? How sizes are specified in a specific repository&#8217;s finding aids is another possible candidate for a repository-level configuration file.</p>
<p><strong>Tagging Statistics</strong></p>
<p>Finally, here are a few tag stats:</p>
<ul>
<li>Only 31 tags (1.5% of all Utah Government tags) are associated with 10 or more collections</li>
<li>1404 tags  (71.5%) are assigned to only a single collection</li>
<li>107 collections have been assigned only 1 tag</li>
<li>10 collections have no subjects</li>
</ul>
<p>Of course these statistics are based on the current incarnation of the data extraction script. After I modify the script, there will be a greater number of tags and (hopefully) more overlap of tags across multiple collections. These types of statistics should help me gauge how well my data extraction logic is working.</p>
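<p>Statistics of this kind can be recomputed after each script change from a simple mapping of collections to tags. The Python sketch below assumes such a mapping is available. The function name, the dictionary keys and the three sample collections are made up and are not the Utah data.</p>

```python
from collections import Counter


def tag_statistics(collection_tags):
    """Summarize tag usage from a {collection_id: set_of_tags} mapping."""
    usage = Counter()
    for tags in collection_tags.values():
        # Count each tag once per collection.
        usage.update(set(tags))
    total = len(usage)
    single_use = sum(1 for count in usage.values() if count == 1)
    return {
        "total_tags": total,
        "tags_on_10_or_more": sum(1 for c in usage.values() if c >= 10),
        "tags_on_one_collection": single_use,
        "percent_single_use": round(100 * single_use / total, 1) if total else 0.0,
        "collections_with_one_tag": sum(
            1 for tags in collection_tags.values() if len(set(tags)) == 1
        ),
        "collections_without_tags": sum(
            1 for tags in collection_tags.values() if not tags
        ),
    }


# Made-up sample data for illustration only.
sample = {
    "c1": {"Women", "Suffrage", "Utah"},
    "c2": {"Utah"},
    "c3": set(),
}
stats = tag_statistics(sample)
print(stats)
```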
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/04/26/archivesz-data-challenges-utah-government-archives-records-service/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>ArchivesZ Data Challenges: Princeton University</title>
		<link>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/</link>
		<comments>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/#comments</comments>
		<pubDate>Mon, 23 Mar 2009 04:13:54 +0000</pubDate>
		<dc:creator>Jeanne</dc:creator>
				<category><![CDATA[ArchivesZ]]></category>
		<category><![CDATA[EAD]]></category>
		<category><![CDATA[metadata]]></category>

		<guid isPermaLink="false">http://www.spellboundblog.com/?p=395</guid>
		<description><![CDATA[I received a zip file of 1,771 EAD encoded finding aids from the kind EAD enthusiasts at the Seeley G. Mudd Manuscript Library. These finding aids came from five divisions within Princeton&#8217;s Library: University Archives, Public Policy Papers, Manuscript Division, Latin American Ephemera Collection, and Engineering Library. So onward to the data issues and what they [...]
]]></description>
			<content:encoded><![CDATA[<p><a title="Princeton University Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/mudd/"><img class="aligncenter size-full wp-image-406" title="Princeton University Seeley G. Mudd Manuscript Library" src="http://www.spellboundblog.com/wp-content/uploads/2009/03/princeton-mudd.jpg" alt="Princeton University Seeley G. Mudd Manuscript Library" width="501" height="96" /></a><br />
I received a zip file of 1,771 EAD-encoded finding aids from the kind EAD enthusiasts at the <a title="Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/~mudd/">Seeley G. Mudd Manuscript Library</a>. These finding aids came from five divisions within <a title="Princeton University Library" href="http://library.princeton.edu/">Princeton&#8217;s Library</a>:</p>
<ul>
<li><a title="Princeton University Archives" href="http://www.princeton.edu/~mudd/finding_aids/archives.html">University Archives</a></li>
<li><a title="Princeton University: Public Policy Papers" href="http://www.princeton.edu/~mudd/finding_aids/policy.html">Public Policy Papers</a></li>
<li><a title="Princeton Manuscript Division" href="http://www.princeton.edu/~rbsc/department/manuscripts/index.shtml">Manuscript Division</a></li>
<li><a title="Princeton: Latin American Ephemera Collection" href="http://firestone.princeton.edu/latinam/ephemera.php">Latin American Ephemera Collection</a></li>
<li><a title="Princeton Engineering Library" href="http://libblogs.princeton.edu/englib/?s=">Engineering Library</a></li>
</ul>
<p>So onward to the data issues and what they mean for my ever growing &#8216;script fix to-do list&#8217;.</p>
<p><strong>Repository Names</strong></p>
<p>As we saw with the Oregon State University finding aids, the finding aids from Princeton University had a wide range of different values for repository names. In the list below we spot some issues. Some end in periods, some do not. One has extra space (probably a carriage return) in the middle. One does not include Princeton in the repository name. Once we have many repositories&#8217; finding aids in ArchivesZ, a repository name of &#8216;Engineering Library&#8217; does not tell the user enough about where those collections can be found.</p>
<p>Here is the list of repository titles my script extracted:</p>
<ul>
<li><span class="il">Princeton</span> University Library. Department of Rare Books and Special Collections.</li>
<li>Engineering Library</li>
<li><span class="il">Princeton</span> University Library</li>
<li><span class="il">Princeton</span> University Library. Department of Rare                    Books and Special Collections.</li>
<li><span class="il">Princeton</span> University Library.</li>
</ul>
<p>My script can handle the extra period and the extra spaces, but the non-specific name would need to ultimately be fixed on the source side.</p>
<p><strong>Collection Size</strong></p>
<p>The current script assumes that there is only one extent value specified to express the size of the collection. Princeton&#8217;s finding aids showed me examples of multiple extent values. For example, the <a title="Christina Georgina Rossetti Collection" href="http://diglib.princeton.edu/ead/getEad?eadid=C0222">Christina Georgina Rossetti Collection</a> has both a collection level size of 0.4 linear feet (1 archival box) as well as a 2nd extent specification corresponding to a specific folder with the value of (1 poem, 3 drawings, 1 photo, 1 incomplete article). The script must be modified to only consider the collection level size.</p>
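<p>One way to restrict the size to the collection level is to look for &lt;extent&gt; only along the path from &lt;archdesc&gt; through its own &lt;did&gt;, without descending into &lt;dsc&gt;. The Python sketch below assumes that layout and uses a simplified, made-up document rather than the actual Rossetti finding aid. <code>collection_extents</code> is a hypothetical name.</p>

```python
import xml.etree.ElementTree as ET


def collection_extents(ead_root):
    """Return extent strings from the collection-level <did> only."""
    # The path starts at <archdesc> and never enters <dsc>, so extents
    # recorded for individual series or folders are ignored.
    extents = ead_root.findall("./archdesc/did/physdesc/extent")
    return [" ".join("".join(e.itertext()).split()) for e in extents]


# Simplified, invented document structure for illustration.
sample = ET.fromstring("""
<ead>
  <archdesc level="collection">
    <did>
      <physdesc><extent>0.4 linear feet</extent><extent>1 archival box</extent></physdesc>
    </did>
    <dsc>
      <c01 level="file">
        <did><physdesc><extent>1 poem, 3 drawings</extent></physdesc></did>
      </c01>
    </dsc>
  </archdesc>
</ead>
""")
print(collection_extents(sample))
```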
<p><strong>Complicated Titles</strong></p>
<p>The current script logic apparently does not handle what I would call &#8216;complicated collection titles&#8217;. For example, I ended up with &#8220;Edward Livingston Papers, &#8221; as the title for a collection with a full title of <a title="Edward Livingston Papers, 1683-1877 (bulk 1764-1836)" href="http://diglib.princeton.edu/ead/getEad?eadid=C0280">Edward Livingston Papers, 1683-1877 (bulk 1764-1836)</a>. This is the way that this title is encoded:<code><br />
&lt;unittitle encodinganalog="245$a" label="Title and dates: "&gt;Edward Livingston Papers, &lt;unitdate encodinganalog="245$f" normal="1683/1877" type="inclusive"&gt;1683-1877&lt;/unitdate&gt; (bulk &lt;unitdate encodinganalog="245$g" normal="1764/1836" type="bulk"&gt;1764-1836&lt;/unitdate&gt;)&lt;/unittitle&gt;</code></p>
<div id=":1aw" class="ii gt">
<p><strong>Too Many Tags</strong></p>
<p>The Engineering Library&#8217;s <a title="Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid" href="http://diglib.princeton.edu/ead/getEad?id=ark:/88435/qf85nb33h">Department of Mechanical and Aerospace Engineering Technical Reports: Finding Aid</a> has 522 tags assigned to it! Almost all of these are the names of the authors of the individual reports. This scenario goes on the list of reasons why I might choose to not include (at least for this version) persname subjects. The other option for handling this situation is to use only subjects assigned at the collection level and ignore subjects assigned at lower unit/container levels. Without the author tags, this single collection ends up with this nice, reasonable list of tags:</p>
<ul>
<li>Fluid mechanics</li>
<li>Mechanical engineering</li>
<li>Combustion</li>
<li>Aerospace engineering</li>
<li>Propulsion systems</li>
</ul>
<p><strong>Year Challenges</strong><br />
I found two different issues related to year ranges:</p>
<ul>
<li><a title="Women in Argentina, VI, 1989-2001: Finding Aid" href="http://diglib.princeton.edu/ead/getEad?id=ark:/88435/2z10wq25w">Women in Argentina, VI, 1989-2001: Finding Aid</a>: The current script does not properly extract inclusive dates that are encoded within the titleproper tags; it assumes that they will be encoded using a unitdate tag.</li>
<li>An assortment of finding aids include subjects which have year spans as part of the subject. When these subjects are decomposed into tags, we end up with tags like &#8216;1850-1950&#8217;. Since we have the time period communicated via the inclusive dates, I will likely just drop these portions of the subjects rather than create a tag for each unique year span.</li>
</ul>
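<p>Dropping the year-span portions could be done with a filter applied after a subject has been split into parts. This is a Python sketch only. The pattern covers four-digit spans such as 1850-1950 and nothing else, and the sample subject parts are invented.</p>

```python
import re

# Two four-digit years joined by a hyphen or an en dash.
YEAR_SPAN = re.compile(r"\d{4}\s*[-\u2013]\s*\d{4}")


def drop_year_spans(parts):
    """Remove subject parts that consist only of a year span."""
    return [part for part in parts if not YEAR_SPAN.fullmatch(part.strip())]


# Invented subject parts for illustration.
print(drop_year_spans(["Argentina", "Politics and government", "1850-1950"]))
```

<p>Using a full match keeps parts that merely contain a span, such as a title with dates in it, untouched.</p>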
<p><strong>General Code Fixes</strong></p>
<p>It is reassuring at this point to spot the same issues with data from multiple repositories. Here are data and code logic issues that I have seen elsewhere that are revalidated by Princeton&#8217;s finding aids:</p>
<ul>
<li>Need to strip \n &amp; \t characters</li>
<li>Need to break subjects up based on commas</li>
<li>Need to drop final periods from repository names, subjects and titles</li>
<li>The designation of size in volumes, as in &#8220;793 volumes&#8221;. I need to pick an approach for translating from volumes to linear feet</li>
</ul>
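<p>The whitespace and final-period fixes are small enough to sketch in a few lines. This is Python, not the Ruby script, and <code>clean_value</code> is a hypothetical helper. The sample values are the Princeton repository names listed earlier, with a newline and tabs standing in for the extra space.</p>

```python
def clean_value(value):
    """Collapse whitespace and drop a single trailing period."""
    # split() with no arguments treats newlines, tabs and repeated spaces
    # alike, so rejoining with one space removes them all.
    cleaned = " ".join(value.split())
    if cleaned.endswith("."):
        cleaned = cleaned[:-1].rstrip()
    return cleaned


names = [
    "Princeton University Library. Department of Rare Books and Special Collections.",
    "Princeton University Library. Department of Rare\n\t\tBooks and Special Collections.",
    "Princeton University Library.",
    "Princeton University Library",
]
print(sorted({clean_value(name) for name in names}))
```

<p>With this cleanup the four variants collapse to two distinct names. One caveat: blindly dropping a final period would also trim names that end in an initial, such as &#8220;Adolphson, L. H.&#8221;, so persname values may need to be exempted.</p>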
<p>The script to-do list is still getting longer, but I am not done cycling through new institutions&#8217; XML files to find new issues. Want to share your institution’s EAD finding aids in XML format with the ArchivesZ project? Please drop me a line via <a title="Contact Jeanne" href="../contact/">my contact form</a>.</p>
<p><em>Image Credit: Top image from the <a title="Seeley G. Mudd Manuscript Library" href="http://www.princeton.edu/mudd/">Seeley G. Mudd Manuscript Library homepage</a>.</em></p></div>
]]></content:encoded>
			<wfw:commentRss>http://www.spellboundblog.com/2009/03/23/archivesz-data-challenges-princeton-university/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

