Chapter 9 of Partners for Preservation is ‘Sharing Research Data, Data Standards and Improving Opportunities for Creating Visualisations’ by Dr. Vetria Byrd. This is the second chapter of Part III: Data and Programming. I originally had envisioned a chapter focused on the ways that standardization, controlled vocabularies, and consistent documentation could increase the re-use of data. All these things help people, separated by either space or time, to understand and leverage the work of others. Scientific communities around the world have led a lot of this work. The work of archivists to preserve data in a meaningful way is made easier by it. ...
Caffè Lena opened in Saratoga Springs, NY in May of 1960. Since then, the coffee house has kept its doors open featuring predominately performances by folk musicians. Often the performers were at the start of their careers. The café has featured now familiar songwriters including Bob Dylan, Arlo Guthrie, Ani DiFranco, and Kate and Anna McGarrigle - to name just a few. After the death of the founder, Lena Spencer, in 1989 Caffè Lena was converted to a non-profit institution.
The Caffè Lena History Project has launched an online searchable database for the complete Caffè Lena collection. The processing of this collection was made possible with support from The Andrew W. Mellon Foundation, administered through the Council on Library and Information Resources’ Cataloging Hidden Special Collections and Archives Project. The digitization of the material was made possible through generous funding from the EMC Corporation.
The Grateful Dead Archive Online threw open its virtual doors in late June, 2012. This project has gotten a lot of attention from both the archives community and the Grateful Dead community. I got a message from my husband shortly after it went online directing me to the envelope shown above from the fan art section of the site. This was the envelope I helped decorate for our mail order ticket request sent back in January of 1992. The theory was that if you made your envelope beautiful, it was more likely to get pulled out of the pile of orders vying for a limited number of tickets. It worked for us this time – we plan to upload images of the tickets we received from that order (yes, we still have them!). ...
In case you haven’t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions’ finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers. ...
In my presentation at the Spring 2010 Mid-Atlantic Regional Archives Conference (MARAC), Whirlwind Tour of Visualization-Land, I showed some screenshots of a tool called Gridworks. At the time, Gridworks was not available to the general public. The good news is that earlier this month Gridworks 1.0 was officially released and you can get Gridworks right now.
For those of you who didn’t see my presentation, Gridworks is a tool you run locally on your computer via a web browser. It permits you to load ‘grid-shaped data’ for examination, filtering and data cleanup. That makes it sound so much less exciting than it is. The best way to get a sense of what you can do is to watch the Gridworks Videos. ...
The official title for this session was “Discovery Tools for Archival Collections: Getting the Most Out of Your Metadata”. It was divided into two presentations, with introduction and question moderation by Jaime L. Margalotti, senior assistant librarian in Special Collections at the University of Delaware.
Introduction to Metadata Standards
Michael Bolam, metadata librarian for digital production, is in charge of all the metadata for all the collections at the Digital Research Library at the University of Pittsburgh. He is not an archivist – but does know where the archives is at Pitt! He has put lots of archival material online through digitization and assignment of metadata. ...
In an example of Twitter serendipity, @silverasm‘s (Aditi Muralidharan) tweet pointed me to @historying‘s blog post about Topic Modeling. In this post Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst‘s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary.
I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, ArchivesZ, aims to provide ways of understanding aggregated archival description data, either from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects. ...
Session Title: Digital Curiosities: Resource Creation Via Amateur Digitisation
Speaker: Melissa Terras
Overview: Review of 100 virtual museum websites and multiple flickr groups plus surveys of amateur website creators, memory institutions and Arts & Humanities academics leads to new perspective on digitization and creation of collections online by dedicated enthusiasts.
Areas of “Amateur” endeavor have a long history of launching collections, such as: ...
Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA’s Archives and Special Collections finding aids in XML format.
With the current scripts, these are the fun tag stats:
- 1,684 total tags extracted
- 75% (1,266 tags) are associated with only one finding aid
- 3% (51 tags) are associated with 10 or more finding aids
235 out of the 253 collections ended up with a collection size of 0.
Consider the encoding of the collection size in the Guide to the Women’s Overseas Service League Records, 1910-2007:
<physdesc label="Extent:" encodinganalog="300$a"> 77 linear feet (approximately 44,000 items) </physdesc>
Contrast this with one of the examples where the size of the collection was extracted properly by the current script:
<physdesc label="Extent:" encodinganalog="300$a"> <extent>8.4 linear feet</extent> (14 boxes) </physdesc>
Sometimes it feels like a game of Where’s Waldo. In this case we are simply missing the set of <extent> tags from the first example. Off I went to the EAD tag descriptions to find the guidelines for use of the <physdesc> tag, where I found this overview of the tag:
A wrapper element for bundling information about the appearance or construction of the described materials, such as their dimensions, a count of their quantity or statement about the space they occupy, and terms describing their genre, form, or function, as well as any other aspects of their appearance, such as color, substance, style, and technique or method of creation. The information may be presented as plain text, or it may be divided into the <dimension>, <extent>, <genreform>, and <physfacet> subelements.
Bad news for my script logic – both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the <extent> tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the <physdesc> tags without enclosing <extent> or <dimension> tags.
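As a rough illustration of how a script could cope with both encodings, here is a minimal Python sketch. It is not the ArchivesZ extraction script. It assumes the finding aid has no XML namespace, that the size is stated in linear feet, and that the helper name `linear_feet` and the regular expression are acceptable stand-ins (both are invented for this example). It reads all the text inside <physdesc>, whether or not an <extent> tag is present.

```python
import re
import xml.etree.ElementTree as ET

# Matches a number followed by "linear feet" or "linear foot".
# Other units (boxes, folders, items) are deliberately not handled here.
LINEAR_FEET = re.compile(r"(\d+(?:\.\d+)?)\s+linear\s+(?:feet|foot)", re.IGNORECASE)


def linear_feet(physdesc_xml):
    """Return the size in linear feet from a <physdesc> fragment, or None."""
    physdesc = ET.fromstring(physdesc_xml)
    # itertext() yields the text of the element and of every descendant,
    # so the presence or absence of <extent> makes no difference.
    text = " ".join("".join(physdesc.itertext()).split())
    match = LINEAR_FEET.search(text)
    return float(match.group(1)) if match else None


without_extent = (
    '<physdesc label="Extent:" encodinganalog="300$a"> '
    "77 linear feet (approximately 44,000 items) </physdesc>"
)
with_extent = (
    '<physdesc label="Extent:" encodinganalog="300$a"> '
    "<extent>8.4 linear feet</extent> (14 boxes) </physdesc>"
)

print(linear_feet(without_extent))
print(linear_feet(with_extent))
```

The trade-off is that searching free text is more forgiving but also easier to fool than trusting the <extent> tag, so a broader sample of finding aids would still be needed to see how often it guesses wrong.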
Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of <unitdate> formats that my current script could not parse properly, including the examples below:
- 1917-1980 (bulk 1920-1945)
- 1876-1903, 1914-1919, 1940-2002
- 1940s, 1970s-1990s
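One possible way to handle date strings like those listed above is to pull out every four-digit year and keep the earliest and latest. The Python sketch below is an assumption about how that might look, not the current script: the name `year_span` is invented, a decade such as "1940s" is assumed to run through 1949, parenthetical bulk dates are ignored, and gaps between ranges are lost because only the overall span is returned.

```python
import re

# A four-digit year, optionally followed by "s" for a decade (e.g. "1940s").
YEAR = re.compile(r"\b(\d{4})(s?)\b")


def year_span(unitdate_text):
    """Return (earliest, latest) year found in a date string, or None."""
    # Drop parenthetical notes such as "(bulk 1920-1945)".
    text = re.sub(r"\([^)]*\)", "", unitdate_text)
    years = []
    for match in YEAR.finditer(text):
        year = int(match.group(1))
        years.append(year)
        if match.group(2):
            # Assumption: a decade covers its full ten years.
            years.append(year + 9)
    return (min(years), max(years)) if years else None


for sample in (
    "1917-1980 (bulk 1920-1945)",
    "1876-1903, 1914-1919, 1940-2002",
    "1940s, 1970s-1990s",
):
    print(sample, "->", year_span(sample))
```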
Another encoding approach that could not be parsed was the one used for the finding aid of the Church Women United of San Antonio Records. In this case the <unitdate> tag is within the <unittitle> tag as seen here:
<unittitle label="Title:" encodinganalog="245"> Church Women United of San Antonio Records, <unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate> </unittitle> ...
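A script that only looks for <unitdate> in one fixed position will miss this nesting. As a hedged sketch (again not the actual ArchivesZ code, and assuming no XML namespace), Python's standard library can search for <unitdate> at any depth. The function name `title_and_dates` is made up for the example, and the title is taken only from the text that precedes the first child element.

```python
import xml.etree.ElementTree as ET


def title_and_dates(unittitle_xml):
    """Split a <unittitle> fragment into its leading title text and nested dates."""
    unittitle = ET.fromstring(unittitle_xml)
    # iter() walks all descendants, so <unitdate> is found wherever it is nested.
    dates = [d.text.strip() for d in unittitle.iter("unitdate") if d.text]
    # .text holds only the text before the first child element.
    title = (unittitle.text or "").strip().rstrip(",").strip()
    return title, dates


fragment = (
    '<unittitle label="Title:" encodinganalog="245"> '
    "Church Women United of San Antonio Records, "
    '<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate> '
    "</unittitle>"
)

print(title_and_dates(fragment))
```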
Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:
- Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like William E. Towell Papers, 1941 – 1988, my database ended up only with “William E Towell Papers,” based on this encoding: <titleproper>Inventory of the William E. Towell Papers, <date normal="1941/1988">1941 – 1988</date></titleproper>
- Need to handle a conversion factor for a size of “1 folder” (as found in the Inventory of the Biltmore Forest School Images, 1890 – 1988)
- My script chokes on the Inclusive Year format “1910 and 1931 – 1937” (as found in the Inventory of the Alfred Cunningham Papers, 1910 and 1931 – 1937)
- The presence of a <lb/> element within the <extent> tag, used to force a line break, is preventing my script from extracting any size information at all (as found in the Inventory of the DeWitt Nelson Papers, 1940 – 1976)
- Within the <abstract> tag, my script drops everything after an <emph render="doublequote"> tag (making for a very short abstract in the case of the Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 – 1947).
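Several of the problems in this list share a cause: text that continues after a child tag gets dropped. A minimal Python sketch of one way around that is to flatten the whole element, child text and trailing text included. This is an illustration under assumptions, not the current script: `flat_text` is an invented helper, the fragments are assumed to have no XML namespace, and the <extent> and <abstract> samples are hypothetical stand-ins rather than text from the actual finding aids. Only the <titleproper> sample comes from the encoding quoted above.

```python
import xml.etree.ElementTree as ET


def flat_text(fragment):
    """Return all text inside an XML fragment, ignoring child tag boundaries."""
    element = ET.fromstring(fragment)
    # <lb/> marks a forced line break, so put a space where it stood.
    for line_break in element.iter("lb"):
        line_break.tail = " " + (line_break.tail or "")
    # itertext() yields the element's text, each child's text, and each tail.
    return " ".join("".join(element.itertext()).split())


# Encoding quoted in the list above.
title = flat_text(
    "<titleproper>Inventory of the William E. Towell Papers, "
    '<date normal="1941/1988">1941 – 1988</date></titleproper>'
)

# Hypothetical fragments, invented to show the <lb/> and <emph> cases.
size = flat_text("<extent>2 linear feet<lb/>(4 boxes)</extent>")
abstract = flat_text(
    '<abstract>Photographs from <emph render="doublequote">an example title'
    "</emph> and related items.</abstract>"
)

print(title)
print(size)
print(abstract)
```

Note that this sketch discards the render="doublequote" hint, so the quotation marks it implies do not appear in the output.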
The most dramatic issue, seen across all the finding aids in this set, is that no subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of <list> and <item> tags as shown here:
<controlaccess> <head>Subject Headings</head> <list type="simple"> <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item> <item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item> <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item> ...
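If the <list> and <item> wrappers are indeed the cause, one way to sidestep them is to search every descendant of <controlaccess> rather than only its direct children. The Python sketch below is an assumption about how that could work, not a confirmed fix: `access_terms` and the `ACCESS_TAGS` set are invented for the example, no XML namespace is assumed, and the fragment is the excerpt above with closing tags supplied so that it parses.

```python
import xml.etree.ElementTree as ET

# Assumption: the element names treated as access points in this sketch.
ACCESS_TAGS = {"subject", "persname", "corpname", "geogname", "genreform"}


def access_terms(controlaccess_xml):
    """Return (tag, text) pairs for access points at any depth in <controlaccess>."""
    controlaccess = ET.fromstring(controlaccess_xml)
    terms = []
    # iter() visits every descendant, so <list>/<item> wrappers are transparent.
    for element in controlaccess.iter():
        if element.tag in ACCESS_TAGS:
            text = " ".join("".join(element.itertext()).split())
            terms.append((element.tag, text))
    return terms


# The excerpt above, closed off by hand so that it is well-formed.
fragment = """
<controlaccess>
  <head>Subject Headings</head>
  <list type="simple">
    <item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
    <item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
    <item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
  </list>
</controlaccess>
"""

for tag, text in access_terms(fragment):
    print(tag, "|", text)
```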