metadata | Spellbound Blog

Chapter 9: Sharing Research Data, Data Standards and Improving Opportunities for Creating Visualisations by Dr. Vetria Byrd

January 27, 2019 1 Comment

Chapter 9 of Partners for Preservation is ‘Sharing Research Data, Data Standards and Improving Opportunities for Creating Visualisations’ by Dr. Vetria Byrd. This is the second chapter of Part III: Data and Programming. I originally had envisioned a chapter focused on the ways that standardization, controlled vocabularies, and consistent documentation could increase the re-use of data. All these things help people, separated by either space or time, to understand and leverage the work of others. Scientific communities around the world have led a lot of this work. The work of archivists to preserve data in a meaningful way is made easier by it.

Luckily for all of us, my hunt for contributing authors brought Dr. Vetria Byrd to this project. Her professional focus on visualization led me to approach the topic of sharing data and data standards from a different direction.

I am a very visual thinker. I truly believe that with a large enough whiteboard, you could plan (or explain) anything. Those who are familiar with my research back in graduate school may recall my visualization project, ArchivesZ, focused on visualizing archival descriptive information. So, when I realized that this chapter could talk about both data standards and visualization, I was sold.

The introduction to the chapter explains:

“This chapter looks at the collaborative nature of sharing the underlying data that propels the system, rather than focusing on systems and services. It provides an overview of the visualisation process, and discusses the challenge of sharing research data and ways data standards can increase opportunities for creating and sharing visualisations, while also increasing visualisation capacity building among researchers and scientists.”

A single visualization is often reliant on multiple sets of data that have been analyzed, linked, and summarized over multiple iterations to generate the final product. Take the xkcd webcomic featured at the top of this post. The citation within the webcomic itself reads “based on map data from US Drought Monitor/NOAA/Richard Tinker”. Digging a bit deeper, I found my way to the US Drought Monitor website which provides easy access to data and maps. You can learn more about the data included on the site and read the details about how they calculate the drought classifications.

I was able to quickly generate this chart, showing California Drought data over time. While it certainly makes it clear that there has been a dramatic increase of drought over time, it does not communicate the same information as the maps in the webcomic above.

I think this is a great example of how different ways of visualizing information can fundamentally change our understanding of something. Documentation of and transparency in sharing data is key. It gives the tools to a broader audience of creative individuals who can then increase the visibility of the original work and build upon it.

Bio:
Dr. Vetria Byrd is an Assistant Professor of Computer Graphics Technology and Director of the Byrd Data Visualization Lab in the Polytechnic Institute at Purdue University’s main campus in West Lafayette, Indiana. Dr. Byrd is introducing and integrating visualization capacity building into the undergraduate data visualization curriculum. She is the founder of the Broadening Participation in Visualization (BPViz) Workshop. She served as a steering committee member on the Midwest Big Data Hub (2016-2018). She has taught data visualization courses on national and international platforms as an invited lecturer of the International High Performance Computing Summer School (IHPCSS). Her visualization webinars on Blue Waters, a petascale supercomputer at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, introduce data visualization to audiences around the world. As described in her invited plenary talk featured on HPC Wire, Dr. Byrd utilizes data visualization as a catalyst for communication, a conduit for collaboration and as a platform to broaden participation of underrepresented groups in data visualization. Dr. Byrd’s research interests include data visualization, data analytics, data integration, visualizing heterogeneous data and the science of learning and incorporating data visualization at the curriculum level and everyday practice.

Image source: xkcd comic: California. https://www.xkcd.com/1410/

PS: If you haven’t yet discovered xkcd (self-described as “A webcomic of romance, sarcasm, math, and language.”) – you are in for a treat!

Caffè Lena History Project’s Searchable Database

May 13, 2014

Caffè Lena opened in Saratoga Springs, NY in May of 1960. Since then, the coffee house has kept its doors open, featuring predominately performances by folk musicians. Often the performers were at the start of their careers. The café has featured now familiar songwriters including Bob Dylan, Arlo Guthrie, Ani DiFranco, and Kate and Anna McGarrigle – to name just a few. After the death of the founder, Lena Spencer, in 1989, Caffè Lena was converted to a non-profit institution.

The Caffè Lena History Project has launched an online searchable database for the complete Caffè Lena collection. The processing of this collection was made possible with support from The Andrew W. Mellon Foundation, administered through the Council on Library and Information Resources’ Cataloging Hidden Special Collections and Archives Project. The digitization of the material was made possible through generous funding from the EMC Corporation.

This collection is physically in many places, but the Omeka based website serves as a centralized index for browsing and discovering the rich set of papers, audio recordings, photographs and ephemera documenting the history of this performance space. The database was architected by Monte Evans, who managed to bring together all the large and disparate data sets and organize them within Omeka. This database is the third part of the overall history project, which also produced a three-CD set of audio recordings and a lavish book documenting the history of the coffee house through stories and many previously unseen photos.

Over the course of more than 10 years, this project has been a labor of love by Jocelyn Arem to hunt for long lost caches of materials – in attics and garages and archives across the country. Then her efforts moved on to digitization and promotion of the amazing materials she found. She was supported by many people (in addition to the funders) who helped move the work forward — including dedicated community members, volunteers, the Caffè Lena Board of Directors, and friends around the country who shared their expertise and support.

The video above is the ‘trailer’ for the project and does a great job introducing both Caffè Lena and the history project through interviews, photographs and audio clips.

The Database

The contents of the website are divided into three sections — recordings, ephemera and photographs. While it is possible to search across the full array of contents, there are additional options for filtering by tags and sorting within each sub-section of the database.

Of all the materials gathered during this project, the audio recordings were the most elusive. In many cases it required detective work to track down recordings that were remembered by some and forgotten by others. Recordings were donated from as far away as Ontario and Ohio, and digitized by the GRAMMY Award-winning Magic Shop Studio in NYC. Jocelyn Arem shared with me one of her favorite stories of serendipity during the search for recordings – that of the ZBS Radio Series tapes. A former ZBS producer in Panama, Robert Durand, contacted them because he heard about the project online. He connected them with an engineer at ZBS in New York. As a result they received the edited Caffè Lena ZBS collection. However, they always wondered where the unedited tapes had ended up. A few months later, Jocelyn followed a trail to another engineer who had left a note at the Caffè years earlier saying he had tapes to donate. Along with former Caffè Lena board member (and audio tape donor) Dick Kavanaugh, she drove to the mountains of upstate NY to retrieve the tapes from this new donor and lo and behold … there was the unedited Caffè Lena tape collection!

In the Recordings section of the site, you can find descriptive information about 514 recordings now held by the Library of Congress’s American Folklife Center. The browse interface lets you filter by using a drop-down list of artist names. Occasionally the entries include short audio samples you can listen to online, such as this one for Kate and Anna McGarrigle recorded during the 20th Anniversary Music Festival described on the flyer shown here. One thing that surprised me is that sometimes an image will lead to a multi-page PDF. In the Recordings section this was common, and the PDFs of the tape transfer reports include detailed notes about the recording.

The Ephemera section of the site features 32 boxes of materials from five separate collections, listed below.

the Lena Spencer Papers, Performer Files and Jan Nargi Collections — all held by the Saratoga Springs History Museum,
the Board of Directors Collection held by the American Folklife Center
the Lively Lucys Coffeehouse Collection held by Skidmore College

Each collection can be browsed individually, with its own dedicated options for filtering by tags. Here it was a bit more obvious that the image you saw was likely just the first page of a multi-page folder digitized and presented as a single PDF.

Finally, the Photographs section includes over 6,000 black and white photographs made by Joe Alper at Caffè Lena between 1960 and 1967. These images were catalogued and digitized by Edward Elbers and are held by the Joe Alper Photo Collection LLC. This section lets you filter by ‘artist’. For example, there are 296 photos of Bob Dylan.

Tagging and Controlled Vocabularies

One of the recurring challenges for those tagging content from multiple sources is the different versions of terms that mean the same thing. Controlled vocabularies are hard to enforce across different collections held by different organizations or individuals. In the case of the Caffè Lena History Project, I noticed that there were different values used for tagging across the materials. For example, the values used to tag and populate the list of recording artists’ names exactly match the names listed on the tape transfer reports. In some cases the same artist may be listed in multiple ways. For example, there are entries for The McGarrigles, Kate McGarrigle and Anna McGarrigle. Another example can be found in the tagging for materials related to Bob Dylan. Bob Dylan, bob, Dylan and Dylan Bob are all tags that give you different subsets of the search results that just searching on Dylan provides.

These types of issues are often compounded when materials come from so many different sources and pass through many different hands along the way. That said, the combination of artist lists, tags and search functions makes it easy to discover materials related to your favorite folk music artists. Just keep in mind that looking for multiple versions of a performer’s name might help you find more materials.
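For the technically curious, here is a rough sketch of how variant tags like the ones above could be folded into a single preferred heading. The lookup table and the preferred form "Dylan, Bob" are my own illustrative assumptions; nothing here is taken from how the Caffè Lena database actually works.

```python
# Hypothetical authority table: variant tag -> preferred heading.
# The variants come from the tags mentioned above; the preferred
# form is an assumption chosen only for illustration.
PREFERRED = {
    "bob dylan": "Dylan, Bob",
    "dylan bob": "Dylan, Bob",
    "dylan": "Dylan, Bob",
    "bob": "Dylan, Bob",
}


def normalize_tag(tag: str) -> str:
    """Return the preferred heading for a tag, or the tag unchanged."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(tag.lower().split())
    return PREFERRED.get(key, tag)


def group_by_heading(tags):
    """Group raw tags under their preferred heading."""
    groups = {}
    for tag in tags:
        groups.setdefault(normalize_tag(tag), []).append(tag)
    return groups


if __name__ == "__main__":
    raw = ["Bob Dylan", "bob", "Dylan", "Dylan Bob", "Arlo Guthrie"]
    print(group_by_heading(raw))
```

A table like this only helps if someone maintains it, which is exactly why controlled vocabularies are hard to enforce across collections held by different organizations.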

Other Virtual Archives

The approach of creating a single website to unify materials that are not co-located but do all relate to a single unifying theme reminded me of two earlier projects: The Publishers’ Bindings Online (PBO) and the Greene & Greene Virtual Archive.

The Publishers’ Bindings Online project now features over 13,000 images online of over 5,000 book bindings from 1815 to 1930. These books are held in libraries at multiple institutions and the project’s success at (and challenges with) using a single unified vocabulary for tagging was discussed in detail during a session at SAA in 2007: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building. I used the subject vocabulary to find a book cover featuring polar bears.

The Greene & Greene Virtual Archives presents materials of the southern California design firm Greene & Greene. Active from 1894-1922, they are associated with the architecture and craftsmanship of the American Arts and Crafts Movement. The Greene & Greene website presents images and metadata of a selected set of 4000 items held by four different repositories. They also provide links to the full descriptions of materials held by each of the repositories. The search functionality on the website is geared towards exploration of individual architectural projects, but also permits advanced search by topic, repository, location, document type and date. Unlike PBO and Caffè Lena, this site doesn’t expose the tagging that lets the results be returned by these groupings.

My Personal Connection

Finally, I would like to share my personal connection to this project. Jocelyn contacted me over a year ago while in the final stages of working on the book to ask if my father had taken a particular photo of Loudon Wainwright III. My father was his manager at the start of his career and did take many photos of him, though not the one in question.

After the launch of the site, I was curious to see what might be in the database related to Loudon and perhaps my father. I found this great photo in the Ephemera file for Wainwright Loudon. My father is the gentleman on the left with the mustache!

More about Caffè Lena

The New York Times published a great article back in 2013 that talks all about the history project and the two products which preceded the online database. In September 2013, the history project created a three CD box set featuring the best of the historical audio material: Live at Caffe Lena: Music From America’s Legendary Coffeehouse, 1967-2013.

The book was released in October 2013. Soon to be available via a second printing direct from the publisher, you can still find copies of the first printing of Caffe Lena: Inside America’s Legendary Folk Music Coffeehouse on Amazon.com.

There is a traveling exhibit that can be brought to your local venue with tons of details available online. You can also subscribe to an online newsletter. If you have a project for which you would like to use any of the materials (audio included) – there is a form for making licensing requests.

Of course, some of the best sources of information about Caffè Lena and its founder are the materials featured on the history project website.

Finally, do you have materials to donate to the archive? You can contact the Caffè Lena History Project team directly via this online form.

Image Credits: Courtesy of the Caffè Lena Collection and the Saratoga Springs History Museum

Grateful Dead Archive Online: First Impressions

July 9, 2012 3 Comments

The Grateful Dead Archive Online threw open its virtual doors in late June, 2012. This project has gotten a lot of attention from both the archives community and the Grateful Dead community. I got a message from my husband shortly after it went online directing me to the envelope shown above from the fan art section of the site. This was the envelope I helped decorate for our mail order ticket request sent back in January of 1992. The theory was that if you made your envelope beautiful, it was more likely to get pulled out of the pile of orders vying for a limited number of tickets. It worked for us this time – we plan to upload images of the tickets we received from that order (yes, we still have them!).

A little digging shows that the site is built on the open source Omeka platform. The prominent milestones timeline was built using the Neatline suite of Omeka plugins. The Omeka platform gives the site creator a lot of flexibility in what data is used to manage the collection.

The amount of metadata that the GDAO staff have populated on their 45,000 digitized items is quite impressive. They have tied the materials into the logical structure dictated by the Grateful Dead’s concerts. You can search for items related to a specific venue or a specific show by zooming in to locations on a map of the world to pick out individual venues where the Dead played. A wealth of media from the Internet Archive is tied into the site so that it is easy to find using the standard search mechanisms and cross linking based on metadata. The artists section features both photographers and poster artists. Two exhibits are in place for the site launch – one on Europe ’72 and the other on the Posters of the Grateful Dead Archive.

The resolution of the scanned fan art is amazing. Take a look at how far I could zoom in to the bluebell I drew way back when. I wonder what their default scanning resolution was?

The site also invites the Grateful Dead community to contribute items to the collection. They have a wish list for content to flesh out gaps they see in what they have.

These are the types available for selection when contributing content:

Audio
Image
Video
Your Story
Poster
Ticket
Laminate
Backstage Pass
Article

For each of the contributions, the user is asked for a mandatory Title, and optional Description, Date of Show, Venue Name and Venue Location. The contribution page also prompts the user to enter their name, e-mail, copyright and license.

For license, the user is given three options and encouraged to select one of the broader Creative Commons licenses rather than the more restrictive default license only granting rights to the University of California.

I am contributing this work and irrevocably grant a non-exclusive, perpetual, royalty-free, worldwide license for this work to the University of California Regents to display, distribute, reproduce, perform, or create derivatives works based upon it.
I am contributing this item under a Creative Commons Attribution (CC BY) License. Others are free to share, remix, or make commercial use of the work as long as they credit me.
I am contributing this item under a Creative Commons Attribution-NonCommercial (CC BY-NC) License. Others are free to share or remix the work noncommercially, as long as they credit me.

For the ‘Your Story’ option, the user is also prompted with the following:

How did you become a Deadhead?
What is your favorite Dead show, and why?
What is your favorite Dead song, and why?
What is your favorite aspect of the Dead scene?
What, if anything, do you think is important about the Dead, and about the Dead phenomenon?

I really loved that they provide a phone number which you can call to leave a message of up to three minutes, which they will then transcribe for you. This looks to be an example (the first according to the comment) of a ‘Your Story’ submission.

The Advanced Search page gives us a full list of formats:

Album Cover
Article
Backstage Pass
Envelope
Fan Art
Fan Tape
Fanzine
Image
Laminate
Newsletter
Notebook
Poster
Program
Story
T-Shirt
Ticket
Sound
Video
Website

The search results let you filter by item type, creator name, venue, year and subject. I wish I could see a full list of the subjects. They seem to be a mix of named individuals associated with the band, events and song titles.

The Grateful Dead Archive Online is a great example of what can be done with good planning and the staff necessary to follow through with the vision. I appreciated the thought that clearly went into the copyright and license issues – both for content being contributed as well as content owned by the archive. I also see evidence of efforts to build a sense of community. The ‘Your Story’ contribution form specifically mentions that contributors should consider carefully what they share and how it might reflect on others. Each item offers the option to post comments as well as to add tags. It will be interesting to see how these communal aspects of the site grow over time. Many archives have to work to build community – but the Grateful Dead fan community has a long and strong history.

Finally – as I mentioned above, the item level description is impressive. I was amazed to note that the envelope shown above was linked to the two shows the request was for, the creation date was tied to the post mark, the extent was the envelope’s measurements and the citation included the name of the creator from the return address. And yes – we know that they misspelled Smyth as Smith. We have already let them know about the typo and received a prompt response with a promise to fix the spelling.

Support EAD Tagging Research

December 6, 2010

In case you haven’t seen this request via other channels, please consider supporting the research effort described below into how different organizations encode finding aids using EAD. As someone who has dug into the gory details of eleven institutions’ finding aids to extract data for my ArchivesZ project, I am here to tell you that this work is VERY important. With better standards in place we will have a better foundation upon which to create interesting new tools and services to support archivists and researchers.

Is part of your job to encode finding aids in EAD? Then please ask if you can send a dozen of them to the researchers on this project!

Seeking EAD records from repositories that have implemented EAD

Standards have been entering the archival lexicon at a fast pace to ensure data reliability, enable data aggregation, and manage data over the long term. However, we have not yet examined the use of these standards across the archival community. As we move into the next phase of standards-creation, a broad look at current implementations will help to inform the next generation of these standards. To do this, Kathy Wisser (Simmons College) and Jackie Dean (UNC Chapel Hill) are conducting research on EAD tag usage in the encoding community.

This project is intended to inform the TS-EAD revision process of the standard, and results will be disseminated through traditional publication avenues.

We are seeking a sample of encoded finding aids from institutions that have implemented EAD. If you are willing to participate in this project, please submit via electronic mail 12 to 15 finding aids to eadtagresearch@gmail.com by December 15, 2010.

The goal of the project is to identify encoding behavior and not to evaluate the quality of the encoding or the content of the finding aid. We will be noting the presence and absence of elements and attributes and the way that elements are used within the context of an EAD instance.
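To give a feel for what noting "the presence and absence of elements and attributes" might look like in practice, here is a minimal sketch that counts element and attribute usage in a single XML instance. It is my own illustration, not the researchers' method, and the sample fragment is made up and far from a complete, valid EAD finding aid.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix, e.g. '{urn:x}did' -> 'did'."""
    return tag.rsplit("}", 1)[-1]


def tag_usage(xml_text: str):
    """Count each element and each (element, attribute) pair in a document."""
    elements, attributes = Counter(), Counter()
    for node in ET.fromstring(xml_text).iter():
        name = local_name(node.tag)
        elements[name] += 1
        for attr in node.attrib:
            attributes[(name, local_name(attr))] += 1
    return elements, attributes


# Tiny made-up fragment using a few EAD element names, for illustration only.
SAMPLE = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Papers</unittitle>
      <unitdate normal="1900/1950">1900-1950</unitdate>
    </did>
    <dsc><c01 level="series"><did><unittitle>Letters</unittitle></did></c01></dsc>
  </archdesc>
</ead>"""

if __name__ == "__main__":
    elements, attributes = tag_usage(SAMPLE)
    print(elements.most_common())
    print(attributes.most_common())
```

Run across a dozen finding aids from many institutions, counts like these would show which elements are used everywhere and which are rarely touched. An element that never appears simply comes back with a count of zero.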

All results will be anonymized; no institution-specific information will be linked to the results. Institutions willing to participate will be acknowledged.

In order to obtain an accurate account of the use of the standard, we are looking for EAD instances from as many institutions as possible. We hope you will consider contributing to this effort.

If you have any questions about the project, please contact:

Kathy Wisser (Simmons College – wisser@simmons.edu)

Jackie Dean (UNC Chapel Hill – jdean@email.unc.edu)

Gridworks: Super Data Cleanup and Exploration Tool

May 29, 2010

In my presentation at the Spring 2010 Mid-Atlantic Regional Archives Conference (MARAC), Whirlwind Tour of Visualization-Land, I showed some screenshots of a tool called Gridworks. At the time, Gridworks was not available to the general public. The good news is that earlier this month Gridworks 1.0 was officially released and you can get Gridworks right now.

For those of you who didn’t see my presentation, Gridworks is a tool you run locally on your computer via a web browser. It permits you to load ‘grid-shaped data’ for examination, filtering and data cleanup. That makes it sound so much less exciting than it is. The best way to get a sense of what you can do is to watch the Gridworks Videos.

What sort of data do I think there is in archives to be pumped into Gridworks? How about collection descriptive data and electronic record datasets? Since all the data is kept locally, you don’t need to worry about uploading your data to some anonymous server in order to work with it. It all stays safely on your local computer the whole time.

A quick list of things that Gridworks can do:

Cluster data to find values that are almost the same so you can normalize your data (for example – NYC vs N.Y.C.)
Create instant faceted browsing based on any column in your data
Provide scatterplots of the values from any two numeric columns as well as a way to spot the most interesting combinations across many possible columns
Reconciliation and validation of values based on data from within Freebase.com
Pull data from Freebase.com based on a matched column – such as the population of a country, if you have a column in your dataset with country specified
Splitting data within a cell based on a specified delimiter
Application of regular expressions and other simple code to data to create new columns
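
To make the first item on the list a little more concrete, here is a small sketch of one way to find ‘almost the same’ values: reduce each value to a key that ignores case, punctuation and word order, then group values that share a key. This is written in the spirit of the clustering described above. I am not claiming it is how Gridworks implements the feature, and the function names are my own.

```python
import re
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a key that ignores case, punctuation and word order."""
    cleaned = re.sub(r"[^\w\s]", "", value.strip().lower())
    # Sort the unique tokens so 'Dylan, Bob' and 'Bob Dylan' collide.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Return groups of distinct values that share a fingerprint."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one spelling need a human decision.
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(cluster(["NYC", "N.Y.C.", "Boston", "Bob Dylan", "Dylan, Bob"]))
```

A key like this is deliberately blunt. It will not catch misspellings, and a person still has to decide which spelling wins for each group.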

This list just scratches the surface, but it should give you a decent idea of the power of Gridworks. Even if the only feature you ever use is the one which lets you cluster and update your data to remove the ‘almost the same’ values, Gridworks can save you hours of painstaking data cleanup.

Why is data cleanup exciting? Because once you have nice clean data with all the attributes that are useful to have for your data set – then you can start playing with the data in visualization tools! So go watch some Gridworks Videos, get Gridworks for yourself and start playing with data. It is free and it makes working with data fun!

MARAC Spring 2010: Hurray for Archival Metadata (Session S2)

May 7, 2010

The official title for this session was “Discovery Tools for Archival Collections: Getting the Most Out of Your Metadata”. It was divided into two presentations, with introduction and question moderation by Jaime L. Margalotti, senior assistant librarian in Special Collections at the University of Delaware.

Introduction to Metadata Standards

Michael Bolam, metadata librarian for digital production, is in charge of all the metadata for all the collections at the Digital Research Library at the University of Pittsburgh. He is not an archivist – but does know where the archives is at Pitt! He has put lots of archival material online through digitization and assignment of metadata.

The best definition he has found of metadata, good for all audiences: “Metadata consists of statements we make about resources to help us find, identify, use, manage, evaluate and preserve them.” – Marty Kurth, Head of Metadata Services, Cornell University Libraries

He reviewed examples of metadata for images, text documents and archival collections. There is also data related to the business of scanning and making content available: the administrative, behind-the-scenes data. Standards let you take your data and use it for other purposes.

Overview of alphabet soup of metadata standards:

MARC: bibliographic information in machine-readable form (a MAchine-Readable Cataloging record).
Dublin Core: the goal of Dublin Core was to create a core set of metadata fields that could be used across platforms, across various disciplines.
MARCXML: schema for representing MARC in XML. Makes it easy to convert to and from MARC without losing any data. May have more data than you need. MARCXML is not very ‘human readable’. You need to recall all the code numbers for the different data elements. Can be exported from Archivists’ Toolkit.
MODS: Metadata Object Description Schema – sort of a ‘MARCXML light’. Tries to be a step between MARCXML (robust & complicated) and Dublin Core (really simple). May result in compacting multiple MARCXML fields into single MODS fields. May lose some of the granularity of the data. The tags ARE human readable. The tag is the word ‘author’ – not a number. Also can be exported from Archivists’ Toolkit.
ONIX: ONline Information eXchange – standard used by the book publishing industry. XML-based standard for making available intellectual property in published form, both physical & digital. Data created by the publisher. They use different ways of representing authors, keywords, etc. in comparison to LoC and library cataloging.
METS: Metadata Encoding & Transmission Standard. XML standard wrapper for describing divergent types of content within a digital library. The metadata for books, images, collections, etc. is kept in different formats – METS lets you bring them together.
OAI-PMH: Not a metadata standard – but rather a protocol for sharing metadata. Gives us a way to pull baseline information about a digital object out of a database and put it out somewhere where it can be harvested and used.
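
As a rough illustration of the last item, here is a sketch of what asking a repository for its records over OAI-PMH and reading the Dublin Core that comes back could look like. The `verb` and `metadataPrefix` parameters and the two namespaces are part of OAI-PMH and Dublin Core. The base URL, the sample record and the function names are assumptions of mine, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace for the simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for a repository."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def dc_fields(record_xml: str):
    """Collect the Dublin Core elements in a record as {name: [values]}."""
    fields = {}
    for node in ET.fromstring(record_xml).iter():
        if node.tag.startswith("{" + DC_NS + "}"):
            name = node.tag.split("}", 1)[1]
            fields.setdefault(name, []).append((node.text or "").strip())
    return fields


# Made-up record in the oai_dc format, for illustration only.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example street scene</dc:title>
  <dc:creator>Example Photographer</dc:creator>
  <dc:date>1935</dc:date>
</oai_dc:dc>"""

if __name__ == "__main__":
    # example.org is a placeholder, not a real repository endpoint.
    print(list_records_url("https://example.org/oai"))
    print(dc_fields(SAMPLE))
```

A real harvester would also need to handle the rest of the response envelope and paging through large result sets, which this sketch leaves out.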

Examples of projects built on shared metadata:

Worldcat.org: Has everything that is shared with OCLC. They do expose their records to Google and Yahoo harvesting.
OAIster: Searches a harvested data set – it is not going live out on the web. The OAIster records are also available in Worldcat. Example: search for Pittsburgh City Photographer (that is a provider of data). Most digitization software will generate an OAIster harvestable version. In his example we see that address and location get compressed into Notes. This is because there is not always a place in Dublin Core that maps to the level of detail you collect at your local institution. http://www.oclc.org/us/en/oaister/default.htm – has the info about contributing your content for crawling.
ArchiveGrid: The goal is to pull in finding aids from many sources. It is a service – requires some sort of subscription and payment to see the data. Uses Lucene for searching. The content in ArchiveGrid is now available in Worldcat. To participate – see http://www.oclc.org/us/en/archivegrid/default.htm
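
The way address and location get compressed into Notes in the OAIster example can be sketched as a simple crosswalk. The local field names and the mapping below are hypothetical, and I am using the Dublin Core `description` element as a stand-in for the Notes field mentioned in the session.

```python
# Hypothetical mapping from local field names to Dublin Core elements.
CROSSWALK = {
    "title": "title",
    "photographer": "creator",
    "date_taken": "date",
}


def to_dublin_core(local_record: dict) -> dict:
    """Map a local record to Dublin Core, folding unmapped fields into one note."""
    dc = {}
    leftovers = []
    for field, value in local_record.items():
        target = CROSSWALK.get(field)
        if target:
            dc[target] = value
        else:
            # No matching element: keep the detail, but lose the structure.
            leftovers.append(f"{field}: {value}")
    if leftovers:
        dc["description"] = "; ".join(leftovers)
    return dc


if __name__ == "__main__":
    record = {
        "title": "Example street scene",
        "photographer": "Example Photographer",
        "address": "123 Main St",
        "location": "Pittsburgh",
    }
    print(to_dublin_core(record))
```

The information survives the trip, but anyone harvesting the record can no longer tell the address from the location without parsing free text.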

Google and Yahoo do index OAIster and WorldCat, so that is one path to being found in search engines.

MARC Records for Archival Materials in WorldCat Local

Jennifer MacDonald from the University of Delaware presented a cataloger’s perspective of a WorldCat Local environment. She is a “concerned enthusiast” with regard to metadata. The University of Delaware was the first institution to buy WorldCat Local. She ended up on the WorldCat Local Special Collections and Archives Task Force. The task force made their final report in 2008 and got a response from OCLC in 2009. They did get some immediate changes based on their feedback – like moving the 520 “summary” data element higher in the display. For some problems the task force identified, such as Archival Materials that were not being identified properly (Internet Resource is the type for all OAI records), it is hard to tell if the issue has been fixed.

She showed some screenshots from WorldCat Local to show what data elements are there and how they are organized. In the FirstSearch screenshot (only available at the school), Notes and General Info holds a mishmash of content from various data elements consolidated into single fields. The task force asked for the “Browse” feature but apparently this feature is dead. They got no response from OCLC to this request in their report.

If you use the University of Delaware instance of WorldCat Local to search for walter penn shipley and drill down to the detail record display for the Walter Penn Shipley Papers you will see what was shown during the session. This display is customizable at the institution level in WorldCat Local. Some data is shown. You see lots of Web 2.0 options to add your own data, but the display is missing some of the data from the original MARC record. The full MARC record is indexed for keyword search, but since some of it is not displayed, users may not be able to determine why a record was returned.

Fields missing from the WorldCat Local display:

351 – Organization and Arrangement of Materials
545 – biographical note
506 – restrictions on access
540 – Use of materials – with link to an askspec page: http://www.lib.udel.edu/cgi-bin/askspec.cgi
524 – preferred citation form – and this is where the manuscript number is
655 – some of the parts of the genre terms are missing
656 – occupation

OCLC says that they have not included all this because people don’t want this displayed. Given that each local organization is already deciding what to show, the task force would prefer the option to display all data elements. Due to this missing data, Jennifer prefers the FirstSearch interface – but this option is not available at all institutions. You should take advantage of the Web 2.0 features. Archivists can create an account on WorldCat Local and add data elements.

Questions and Answers

QUESTION: You talk about having the metadata in a format that is accessible to harvesting. What I have is a bunch of CDs with images on them that have a folder and descriptor structure. Is there a metadata harvester that can go in and pull that metadata out? A New York Stock Exchange photographer sent these.

ANSWER (Michael): So the metadata you are looking to extract is the filename and descriptors? You could have someone write a little script and extract what you need. I would hand it to the guy I work with because he writes Perl. If you then made that available via your website, people could find it. To get it into a database – it is just a small script.
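
To give a sense of how small such a script can be, here is a minimal sketch. It is in Python rather than the Perl Michael mentioned, and it rests on assumptions of mine: that each folder name is a descriptor, that underscores in filenames stand in for spaces, and that the listed image extensions cover what is on the CDs. The function name is made up for illustration.

```python
import os

# Assumed image types; adjust to whatever is actually on the discs.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff", ".png"}

def harvest_image_metadata(root):
    """Walk a folder tree and return one metadata record per image file."""
    records = []
    for folder, subfolders, filenames in os.walk(root):
        subfolders.sort()  # visit folders in a predictable order
        for name in sorted(filenames):
            stem, extension = os.path.splitext(name)
            if extension.lower() not in IMAGE_EXTENSIONS:
                continue
            relative = os.path.relpath(folder, root)
            records.append({
                "filename": name,
                # Assumption: underscores in the filename stand in for spaces.
                "title": stem.replace("_", " "),
                # Each folder level becomes a descriptor, outermost first.
                "descriptors": [] if relative == "." else relative.split(os.sep),
            })
    return records
```

The resulting list of records could then be written out as a web page, a spreadsheet or database rows, whichever makes the images easiest to find.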

QUESTION: Are there any specifically useful webinars/seminars for becoming familiar with these formats for skillbuilding?

ANSWER (Michael): Tons on the web. The LoC websites are very useful. You may have heard the term ‘crosswalking’ – that is where you take one format and turn it into another. Looking at the crosswalks can make it much easier to understand how a format you understand maps to one you are trying to learn about. Shareable Metadata – metadata for you and me. Not online yet – but someone in the audience said the plan is to post the materials. There have been a couple of books and ALA publications. Most of the ones I know of are about 10 years old. Jaime: SAA has a good workshop series.

QUESTION: One of the first things you said was to take data out of EAD and you didn’t go into detail on that. Were you talking about DAO tagged items?

ANSWER (Michael): I was just talking about reusing data in a new environment. For example, we just started digitizing manuscripts and each item is becoming an individual digital object. The only metadata we have is in the EAD finding aid – so we are using that data to make descriptive data about the digital objects. We are going to create a MODS or METS record for every digital object. Jaime: We use EAD to make MODS records. She has been manually extracting EAD data as Dublin Core data for ContentDM.
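
As a rough illustration of reusing finding aid data this way, the sketch below pulls a handful of EAD elements into Dublin Core style fields. It is written in Python with only the standard library. The element-to-field mapping is my own simplified assumption and not a published crosswalk, the sample finding aid fragment is invented, and the code assumes EAD 2002 without XML namespaces.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping only; a real project should follow a published crosswalk.
EAD_TO_DC = {
    "unittitle": "title",
    "unitdate": "date",
    "unitid": "identifier",
    "origination": "creator",
    "abstract": "description",
    "subject": "subject",
}

def ead_to_dublin_core(ead_xml):
    """Collect the text of mapped EAD elements under Dublin Core field names."""
    record = {}
    for element in ET.fromstring(ead_xml).iter():
        dc_name = EAD_TO_DC.get(element.tag)
        if dc_name is None:
            continue
        # Gather text from the element and its children, then tidy whitespace.
        text = " ".join("".join(element.itertext()).split())
        if text:
            record.setdefault(dc_name, []).append(text)
    return record

# Invented sample data, for illustration only.
SAMPLE = """<archdesc level="collection">
  <did>
    <unittitle>Jane Doe Papers</unittitle>
    <unitdate>1900-1950</unitdate>
    <unitid>MS 001</unitid>
    <origination><persname>Doe, Jane</persname></origination>
    <abstract>Letters and diaries.</abstract>
  </did>
  <controlaccess><subject>Diaries</subject></controlaccess>
</archdesc>"""

print(ead_to_dublin_core(SAMPLE))
```

Every field holds a list because Dublin Core elements are repeatable. Turning the result into MODS or METS would need a richer mapping than this.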

My QUESTION: What format does OAIster want?

ANSWER (Michael): OAIster is just harvesting Dublin Core. You can share MODS and other metadata types and you may find other aggregators that are expecting their users to work in a more detailed environment. You may publish more data elements for other harvesters as well – but OAIster will only pull the Dublin Core data elements.

QUESTION: We are working on a digitization project to digitize local historical societies, museums and libraries. Might the catalogers be able to deal with MODS or will the loss of granularity be a problem?

ANSWER (Michael): I am not a MODS expert. MARC is very granular. Maybe look at the MARCXML – MODS crosswalk?

QUESTION: At the University of Delaware, do you have any other systems?

ANSWER (Jennifer): When we first got WorldCat Local you had to know the URL to get to the library. That changed fast! The patrons couldn’t find anything. Jaime: In WorldCat Local you cannot scope the search to specific sub-collections.

QUESTION: Thank you Jennifer for your remarks. Is there a problem with catalogers trying to ‘sneak’ data elements into other places – are standards in danger?

ANSWER (Jennifer): I would hope we wouldn’t move 524 data into a 500 field just to get it displayed. There is some danger of losing the granularity by pushing everything to Dublin Core. I don’t know how real that danger is at this point.

QUESTION: A political question for Jennifer: Who has the clout to push for changes with OCLC?

ANSWER (Jennifer): I think encouraging users to give feedback is important. We were told “we have proven that users don’t want that”. Users need to make comments about their challenges in dealing with the interface. FROM AUDIENCE: The strongest approach is to say that you are looking at Sky River. FROM AUDIENCE: Make your data more discoverable outside the catalog world – internal websites and Google. Jaime: We are working hard to make MARC records to push access to our collections. The push is to make the data available in as many locations as possible.

QUESTION: Are these all different levels of subscriptions? Are they trying to push people to buy more subscriptions?

ANSWER (Jennifer): There is a sense that WorldCat Local is pushed at local public libraries. Yes – WorldCat Local is something they have to pay for. Michael: With Archive Grid you are going a step further – EVERYTHING in the finding aid is indexed. Every search I did in there returned thousands of records. Then I filtered by institution – and it never loaded. FROM AUDIENCE: I think they are revamping Archive Grid – but I don’t know how far they are in the process. Michael: I love the detail – you don’t have to dig through other data to find something useful. Depending on the institution – and how they are allowing their data to be harvested – you may see less information. Jaime: You have to actively work with OCLC to get Archive Grid to pick up your data.

QUESTION: We are tinkering with users adding tags – are you having any success with people adding tags?

ANSWER (Jaime): No – it isn’t something we have dealt with. WorldCat Local does let you add stuff like that.

QUESTION: Will OCLC provide that UGC (user generated content) back to the institution?

ANSWER: We wouldn’t know.

QUESTION: Have they provided access to the user studies?

ANSWER: Yes – but it is based on watching individuals use the tools.

Image Credit: Statue representing Research by Henry Hering, from an image of the interior of the Field Museum of Natural History.

As is the case with all my session summaries from MARAC, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Topic Modeling, Auto-Classification and Archival Description

April 27, 2010 4 Comments

In an example of Twitter serendipity, @silverasm‘s (Aditi Muralidharan) tweet pointed me to @historying‘s blog post about Topic Modeling. In this post Cameron Blevins explains the results of using the topic modeling feature of UMass Amherst‘s MAchine Learning for LanguagE Toolkit (MALLET) on the text of Martha Ballard’s Diary.

I have spent a lot of time thinking about how to generate thematic overviews of groups of archival collections. My information visualization project, ArchivesZ, aims to provide ways of understanding aggregated archival description data, whether from a single institution or across institutional boundaries. Now I find myself wondering if text mining with a tool like MALLET might generate smart topic groupings more elegantly than fighting with the wide range of non-standardized collection subjects.

Topic Modeling with MALLET

To get a sense of what MALLET generates, see the excerpt below from Blevins’s post:

With some tinkering, MALLET generated a list of thirty topics comprised of twenty words each, which I then labeled with a descriptive title. Below is a quick sample of what the program “thinks” are some of the topics in the diary:

MIDWIFERY: birth deld safe morn receivd calld left cleverly pm labour fine reward arivd infant expected recd shee born patient

CHURCH: meeting attended afternoon reverend worship foren mr famely performd vers attend public supper st service lecture discoarst administred supt

DEATH: day yesterday informd morn years death ye hear expired expird weak dead las past heard days drowned departed evinn

GARDENING: gardin sett worked clear beens corn warm planted matters cucumbers gatherd potatoes plants ou sowd door squash wed seeds

He goes on to explain that “MALLET also allows us to track those topics across the text.” What if, instead of text mining a diary, we pumped the descriptions of every archival collection from a single institution into MALLET? Of course we would need a good list of stop words including such common terms as archives, history, sources and records. But I wonder how the topics MALLET suggests would compare to the official subjects associated with each collection? Could this give us a broad overview of the topics covered by a specific repository and give us a new way to build paths to the collections based on topic?
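
To make the stop word idea concrete, here is a small Python sketch of the kind of filtering I have in mind before any topic modeling happens. It does not call MALLET at all. The archival stop words are the four examples named above, the general stop list is a tiny stand-in for a real one, and the function name is hypothetical.

```python
import re

# The archival terms suggested above; a real list would be much longer.
ARCHIVAL_STOP_WORDS = {"archives", "history", "sources", "records"}
# A tiny stand-in for a general English stop list.
COMMON_STOP_WORDS = {"a", "an", "and", "from", "in", "of", "on", "the", "to"}

def tokens_for_topic_modeling(description):
    """Lowercase a collection description and drop stop words."""
    words = re.findall(r"[a-z]+", description.lower())
    stop_words = ARCHIVAL_STOP_WORDS | COMMON_STOP_WORDS
    return [word for word in words if word not in stop_words]

print(tokens_for_topic_modeling(
    "Records of the Garden Club: history and planting sources"))
# ['garden', 'club', 'planting']
```

Whether the filtering is done up front like this or handed to the topic modeling tool, the point is the same: words that appear in nearly every finding aid should not be allowed to define a topic.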

Auto-Classification Using Castanet

Text miner Aditi Muralidharan also posted recently on this theme in Castanet: automatically generating a browsing structure for a collection and explains:

Castanet automatically carves a sub-structure from the hierarchical concept dictionary, WordNet (http://wordnet.princeton.edu), and matches items in the collection to one or many appropriate places within that hierarchy. Then, after some automated trimming and flattening, the result is a hierarchical browsing system.

I have heard of Castanet before via the Flamenco Search Interface Project. Apparently Muralidharan did a project using Castanet last summer to create a category system for Flickr Commons images based on the images’ tags which is then rendered using a Flamenco interface. I include a partial screen-shot below to give you a taste of what the navigation of images feels like a few levels down in the hierarchy. I love the classification of ‘Group Action’ then filtered by a sub-classification of ‘Commerce’. The first images shown are of ‘horse trading’ – with additional headings and images beneath them as well as additional filter options on the left.

What If?

What if we pulled all the English language archival descriptions from around the world as our original data set? If we used this data for topic modeling, our subject clusters would be cross-institutional. Maybe we could map the local institution assigned subjects to the topic model generated topics for each collection and get a sort of automated crosswalk for finding related collections. If we used the local institution assigned subjects from the archival descriptions for Castanet-style auto-classification, maybe we could generate a way to hierarchically browse collections topically.

Both MALLET and Flamenco are open source (I am not sure of the status of Castanet) and, as I discovered working on ArchivesZ, many institutions will share their archival description data for a good cause. So – is this a good cause? I need to tease these ideas out a bit more, but what do you all think of it at first blush? Feasible? Interesting? Worthwhile experiments?

Image Credits: MALLET logo from MALLET homepage. Images in screen shot from Flickr Commons with no known copyright.

DH2009: Digital Curiosities and Amateur Collections

June 29, 2009 3 Comments

Session Title: Digital Curiosities: Resource Creation Via Amateur Digitisation
Speaker: Melissa Terras

Overview: A review of 100 virtual museum websites and multiple Flickr groups, plus surveys of amateur website creators, memory institutions and Arts & Humanities academics, leads to a new perspective on digitization and the creation of collections online by dedicated enthusiasts.

Session Highlights

Areas of “Amateur” endeavor have a long history of launching collections, such as:

cabinet of curiosities
foundation of astronomical research
British flora and amateur botanists
weather observations
open source software movement

Being an amateur doesn’t necessarily mean being bad at what you do!

Within the realm of self-defined museums some common topics often emerge:

ephemera (advertising, packaging, nostalgia)
comics
technology – especially old tech, there is a surprising trend of being fascinated by technology approximately 10 years older than the collector
personal and “embarrassing” collections
genealogy

For these self-defined museums the scope is self-defined – these are self-delineated collections. Virtual museums can document aspects of cultural heritage considered socially taboo or in some way too sensitive to collect. A great example of this is the Museum of Menstruation, which claims to have been created 14 years ago and is currently trying to establish a permanent public display.

Platforms have evolved over the life of the web, starting with static html, then blogs and now Flickr images as a mode of presentation.

This is a list of successful amateur collections online:

Today’s Inspiration – illustration from the 40’s and 50’s
JonWilliamson.com – advertising 1940s-1960s
Pulp Fiction Flickr Group – 882 members who provide basic metadata and often label stuff within the image – currently contains 3,385 items.
Curio Cabinet Flickr Group – 1,206 members and 5,537 items

Visual Arts Data Service (VADS) is a more traditional site created by a cultural heritage institution. It contains 100,000+ images copyright cleared for use in teaching, learning and research in the UK. VADS is a very detailed static source of images with metadata, but provides no interaction.

Amateurs do provide metadata, but it is intuitive metadata. It might not fit into rigid buckets of data, but that doesn’t mean that the metadata available isn’t useful.

What are the boundaries between amateur and professional? Work vs hobby?

Many of these amateur sites get much more traffic than most standard museum sites. More than 50% of museum digitized images are never visited.

Memory institutions are starting to put things into the wider online community:

Smithsonian: photos in Smithsonian Flickr Commons
Tate: The How We Are Now project invited the public to contribute photos to the How We Are Flickr Group. The images were streamed to screens within the How We Are: Photographing Britain exhibit and 40 photos were chosen to be included as the last set of photos in the physical exhibit.
Victoria & Albert Museum: created a Flickr group of photos taken at the V&A museum along with a long list of other V&A Flickr groups and streams
Oxford University’s Great War Archive: contains 6,500 items contributed by the public and related to the First World War.
Facebook and Twitter are being used more often for informing the community about their collections

Much of amateur research has been driven by advances in technology. A great example of this is how the advent of affordable metal detectors led to dramatic changes in archaeology. The internet and Web 2.0 technology are arming a whole new generation of enthusiasts who can find one another and collaborate more easily than might ever have been dreamed of 20 years ago.

Next Steps & Conclusions

Future research will involve looking at the psychology of collection: archives vs collections. For now it is important to realize that institutions are not the only hosts of “worthwhile” digital objects. Pro-ams (aka pro-amateurs) are doing better at using Web 2.0 and are getting more traffic.

What can memory institutions learn from this?

interact with user communities
use the ‘grand central stations’ of Flickr, Twitter and Facebook
usability of Flickr is better than what most memory institutions build for themselves

My Thoughts

This session considers the ways cultural memory institutions can take advantage of the web by looking at what the successful enthusiasts are achieving. This research-backed approach confirms what I would have expected. Libraries, museums and archives are leaving a lot on the table when it comes to putting their collections online. Sites run by non-professionals are doing an amazing job of drawing in new audiences, keeping people around and then initiating conversation within that audience.

The Flickr Commons is a big step forward, but it isn’t the only option. There are also varying opinions about how successful the crowdsourcing aspect of the Flickr Commons is for memory institutions. A lot of this goes back to a core question: “how do we know if we have succeeded?” There is much to be said for setting out clear goals when launching online initiatives. Is your goal increased traffic to your site or crowdsourcing of metadata? A great example of an initiative whose goal is clearly collection of crowdsourced metadata is the German Federal Archives, who chose to use the Wikimedia Commons for their photo metadata initiative.

If you are trying to extend your mission of providing access to materials to the public, then how do you measure success? Putting your materials in what Melissa called “grand central stations” (or what I have also heard termed “public crosswalks”) definitely increases the chances of serendipitous discovery by new individuals. That said, we can see from the successful blogs mentioned above that tackling a niche with enthusiasm and consistent posting can go a long way to building a following. JonWilliamson.com seems to have only launched back in November of 2008 with a post featuring a Scotch Tape Christmas ad from 1951. The author posted in May of 2009 that his images in Flickr had surpassed 100,000 views.

To conclude this post I leave you with a list of inspirational digitized collections online that were created by various cultural heritage institutions:

Publishers’ Bindings Online – discussed in SAA2007’s Session: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building, a multi-institutional project that includes galleries of topical images combined with an essay that gives the images context. Two of my favorites are:
- From Domestic Goddesses to Suffragists: The Story of Women Told on Bookbindings, 1820-1920
- Indians, the Frontier, and the West in American Bookbindings
Calisphere – more than 150,000 digitized items organized for easy use by K-12 teachers. This is especially interesting in that it represents items already available in Online Archive of California, but organized in a way to make them easy to find and use with their target audience in mind.
Yiddish Books Online – A project by the National Yiddish Book Center that uses the Internet Archive as a platform to host 11,000 digitized out-of-print Yiddish books. This project is a nice cross between a branded custom site and a grand central station.

Have a favorite online collection website? Please share it in the comments below.

As is the case with all my session summaries from DH2009, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Image credit: http://www.flickr.com/photos/mms0131/ / CC BY-NC-ND 2.0

ArchivesZ Data Challenges: University of Texas at San Antonio

May 13, 2009 2 Comments

Mark Shelstad, head of Archives and Special Collections at University of Texas at San Antonio, sent me a link to the TARO (Texas Archival Resources Online) page for UTSA’s Archives and Special Collections finding aids in XML format.

With the current scripts, these are the fun tag stats:

1,684 total tags extracted
75% (1,266 tags) are associated with only one finding aid
3% (51 tags) are associated with 10 or more finding aids
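
For anyone curious how numbers like these are tallied, here is a minimal Python sketch. It is not the ArchivesZ script itself. It assumes the extracted data is available as a mapping from finding aid to the set of tags found in it, and the function and variable names are made up for illustration.

```python
from collections import Counter

def tag_stats(tags_by_finding_aid):
    """Count how many finding aids each tag is associated with."""
    usage = Counter()
    for tags in tags_by_finding_aid.values():
        usage.update(set(tags))  # count each tag once per finding aid
    return {
        "total": len(usage),
        "single_use": sum(1 for count in usage.values() if count == 1),
        "ten_or_more": sum(1 for count in usage.values() if count >= 10),
    }

def percent(part, whole):
    """Whole-number percentage, as reported in the stats above."""
    return round(100 * part / whole)

# Checking the UTSA figures quoted above.
print(percent(1266, 1684))  # 75
print(percent(51, 1684))    # 3
```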

Collection Size

235 out of the 253 collections ended up with a collection size of 0.

Consider the encoding of the collection size in the Guide to the Women’s Overseas Service League Records, 1910-2007:

<physdesc label="Extent:" encodinganalog="300$a">
    77 linear feet (approximately 44,000 items)
</physdesc>

Contrast this with one of the examples where the size of the collection was extracted properly by the current script:

<physdesc label="Extent:" encodinganalog="300$a">
    <extent>8.4 linear feet</extent> 
    (14 boxes)
</physdesc>

Sometimes it feels like a game of Where’s Waldo. In this case we are simply missing the set of <extent> tags from the first example. Off I went to the EAD tag descriptions to find the guidelines for use of the <physdesc> tag, where I found this overview of the tag:

A wrapper element for bundling information about the appearance or construction of the described materials, such as their dimensions, a count of their quantity or statement about the space they occupy, and terms describing their genre, form, or function, as well as any other aspects of their appearance, such as color, substance, style, and technique or method of creation. The information may be presented as plain text, or it may be divided into the <dimension>, <extent>, <genreform>, and <physfacet> subelements.

Bad news for my script logic – both versions are valid! This is a great example of how valid encoding can still present challenges. While in this example it seems just as easy to parse the version with the <extent> tags as without, it will only be through examination of a much broader sample of data that we can determine how much of a problem we have on our hands with this scenario of size data included in the <physdesc> tags without enclosing <extent> or <dimension> tags.
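
One way to cope with both valid encodings is to stop looking for the <extent> tag specifically and instead read all of the text inside <physdesc>. The Python sketch below shows the idea. It is not my current extraction script, it only recognizes sizes stated in linear feet, and the function name and regular expression are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches "77 linear feet", "8.4 linear feet", "1 linear foot", "1,200 linear feet".
LINEAR_FEET = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s+linear\s+f(?:ee|oo)t", re.IGNORECASE)

def linear_feet(physdesc_xml):
    """Return the size in linear feet, with or without an <extent> wrapper."""
    physdesc = ET.fromstring(physdesc_xml)
    # itertext() yields the text of <physdesc> and of every child element.
    text = " ".join(" ".join(physdesc.itertext()).split())
    match = LINEAR_FEET.search(text)
    return float(match.group(1).replace(",", "")) if match else None

without_extent = """<physdesc label="Extent:" encodinganalog="300$a">
    77 linear feet (approximately 44,000 items)
</physdesc>"""

with_extent = """<physdesc label="Extent:" encodinganalog="300$a">
    <extent>8.4 linear feet</extent>
    (14 boxes)
</physdesc>"""

print(linear_feet(without_extent))  # 77.0
print(linear_feet(with_extent))     # 8.4
```

Sizes given only in boxes, folders or items would still come back as None, so this addresses the missing tag problem and not the wider problem of units.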

Inclusive Dates

Twenty of the UTSA collections came through with no years. When I examined the data, I found an assortment of <unitdate> formats that my current script could not parse properly, including the examples below:

1917-1980 (bulk 1920-1945)
1876-1903, 1914-1919, 1940-2002
1940s, 1970s-1990s

Another encoding approach that could not be parsed was the one used for the finding aid of the Church Women United of San Antonio Records. In this case the <unitdate> tag is within the <unittitle> tag as seen here:

<unittitle label="Title:" encodinganalog="245">
Church Women United of San Antonio Records,
<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate>
</unittitle>
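
A tolerant approach is to search for <unitdate> at any depth and to build the title from everything in <unittitle> except the date. Here is a hedged Python sketch of that idea. The helper names are mine, and stripping the trailing comma from the title is a choice I made for the example, not something EAD requires.

```python
import xml.etree.ElementTree as ET

def squeeze(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return " ".join(text.split())

def find_unitdate(element):
    """Return the text of the first <unitdate> found at any depth, or None."""
    for unitdate in element.iter("unitdate"):
        return squeeze("".join(unitdate.itertext()))
    return None

def title_without_date(unittitle):
    """Return the title text, skipping any <unitdate> nested inside it."""
    parts = [unittitle.text or ""]
    for child in unittitle:
        if child.tag != "unitdate":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text that follows the child element
    return squeeze("".join(parts)).rstrip(",")

unittitle = ET.fromstring("""<unittitle label="Title:" encodinganalog="245">
Church Women United of San Antonio Records,
<unitdate label="Dates:" encodinganalog="245$a">1961-2005</unitdate>
</unittitle>""")

print(title_without_date(unittitle))  # Church Women United of San Antonio Records
print(find_unitdate(unittitle))       # 1961-2005
```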

Among the finding aids for which I did extract a range of inclusive date years, I also found issues with values like 1950s-1990s. The current script interpreted this to represent 1950 through 1990, but I believe it would be more properly translated as representing 1950 through 1999.
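
Here is a sketch of one way to read all of these date formats, again in Python and not taken from my current script. It collects every four-digit year in the string, treats a trailing “s” as a decade that runs through its ninth year, and returns the earliest start and the latest end. The function name is hypothetical, and the sketch deliberately ignores gaps between ranges and the distinction between bulk and inclusive dates.

```python
import re

# A four-digit year, optionally followed by "s" to mark a decade.
YEAR = re.compile(r"\b([12]\d{3})(s)?\b")

def inclusive_years(unitdate):
    """Return (first_year, last_year) covered by a date expression, or None."""
    starts, ends = [], []
    for year, decade in YEAR.findall(unitdate):
        first = int(year)
        starts.append(first)
        # "1990s" is read as 1990 through 1999.
        ends.append(first + 9 if decade else first)
    if not starts:
        return None
    return min(starts), max(ends)

print(inclusive_years("1917-1980 (bulk 1920-1945)"))       # (1917, 1980)
print(inclusive_years("1876-1903, 1914-1919, 1940-2002"))  # (1876, 2002)
print(inclusive_years("1940s, 1970s-1990s"))               # (1940, 1999)
print(inclusive_years("1950s-1990s"))                      # (1950, 1999)
```

One caveat: a value like “1900s” could mean either the decade or the whole century, and this sketch always assumes the decade.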

General Code Fixes

The University of Texas at San Antonio’s finding aids have provided additional examples of the following data and encoding issues already identified in earlier data sets:

Inconsistent repository titles (26 different variations of “The University of Texas at San Antonio Library”)
Titles with embedded and tagged dates
Carriage return and tab characters that need to be removed
Emphasis within a title or abstract added via a tag (such as <emph render="italic">Storyletters</emph> seen in A Guide to the Storyletters Records, 1991-2000) which interrupts extraction of text at that point
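
The last two items on this list can be handled by one small helper that gathers the text of an element together with the text of any formatting tags inside it, then collapses stray whitespace. The Python sketch below shows the idea. The sample element is constructed for illustration and is not copied from the actual Storyletters finding aid.

```python
import xml.etree.ElementTree as ET

def clean_text(element):
    """Return all text inside an element, ignoring inline tags and extra whitespace."""
    # itertext() walks into children such as <emph>, so nothing is cut off.
    everything = "".join(element.itertext())
    # split() with no arguments breaks on tabs, carriage returns and newlines too.
    return " ".join(everything.split())

# Constructed example with an inline tag, a tab and a line break.
title = ET.fromstring(
    '<unittitle>A Guide to the\r\n\t<emph render="italic">Storyletters</emph>'
    " Records,\t1991-2000</unittitle>"
)

print(clean_text(title))  # A Guide to the Storyletters Records, 1991-2000
```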

Next Steps

This is the last data set I am analyzing before tackling actual updates to the ArchivesZ data extraction script. My next step is to review and prioritize my long to-do list for updates to this script. Most of what I have found in my examination of the data sets are ways in which my script was not smart enough to handle valid variations in encoding and the tabs, carriage returns, formatting tags and special characters found throughout everyone’s XML. Yes, there are some cases in which the data itself is less than optimal (such as non-standardized repository titles) or the values challenging (so many ways to describe the size of a collection!), but overall I am optimistic about how much more I can improve the extraction script before I have to resort to hand correcting records in the database.

Thanks to everyone for your patience with these data analysis posts. Onward to programming!

ArchivesZ Data Challenges: Forest History Society

May 6, 2009 2 Comments

Amanda Ross, project archivist for the Forest History Society, sent me 57 EAD finding aids to include in the ArchivesZ project. These are the data challenges that the current data extraction script does not address:

Titles with embedded tags or punctuation. Generally the script drops anything after it hits either, so rather than a title like William E. Towell Papers, 1941 – 1988, my database ended up only with “William E Towell Papers,” based on this encoding: <titleproper>Inventory of the William E. Towell Papers, <date normal="1941/1988">1941 – 1988</date></titleproper>
Need to handle a conversion factor for a size of “1 folder” (as found in the Inventory of the Biltmore Forest School Images, 1890 – 1988)
My script chokes on the Inclusive Year format “1910 and 1931 – 1937” (as found in the Inventory of the Alfred Cunningham Papers, 1910 and 1931 – 1937)
The presence of a <lb/> tag within the <extent> tag, used to force a line break, is preventing my script from extracting any size information at all (as found in the Inventory of the DeWitt Nelson Papers, 1940 – 1976)
Within the <abstract> tag, my script drops everything after an <emph render="doublequote"> tag (making for a very short abstract in the case of the Inventory of the Arthur Bernard Recknagel Auxiliary Photograph Collection, 1911 – 1947).
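
For the first item on this list, the <date> tag is actually a gift: its normal attribute already holds the years in a machine-readable form. The Python sketch below reads the full title text and takes the years from that attribute when it is present. It is an illustration and not the ArchivesZ script, the function name is mine, and it assumes normal contains values that begin with a four-digit year, as in the Towell example.

```python
import xml.etree.ElementTree as ET

def title_and_years(titleproper_xml):
    """Return the complete title text and a (start, end) year pair if available."""
    titleproper = ET.fromstring(titleproper_xml)
    # Read through embedded tags so the title is not cut off at <date>.
    title = " ".join("".join(titleproper.itertext()).split())
    years = None
    date = titleproper.find(".//date")
    if date is not None and date.get("normal"):
        start, _, end = date.get("normal").partition("/")
        # A single normalized date has no "/" and serves as both start and end.
        years = (int(start[:4]), int(end[:4] or start[:4]))
    return title, years

towell = (
    "<titleproper>Inventory of the William E. Towell Papers, "
    '<date normal="1941/1988">1941 – 1988</date></titleproper>'
)

title, years = title_and_years(towell)
print(title)
print(years)  # (1941, 1988)
```

When the normal attribute is missing, the years would still have to be parsed out of the display text.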

The most dramatic issue, seen across all the finding aids in this set, is that no subject data was extracted from any of the finding aids. My working theory for the moment is that this is due to the use of <list> and <item> tags as shown here:

<controlaccess>
<head>Subject Headings</head>
<list type="simple">
<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>

This is in contrast with this example of encoding from Syracuse University:

<controlaccess>
<head>Subject and Genre Headings</head>
<subject encodinganalog="650" source="local">Adult education</subject>
<persname encodinganalog="600" source="lcnaf">Adolphson, L. H.</persname>
<persname encodinganalog="600" source="lcnaf">Bradford, Leland Powers, 1905-</persname>

Or this sample from Oregon State University:

<controlaccess id="a12">
	 <controlaccess>
		  <persname encodinganalog="600" source="local" rules="aacr2"
		  role="subject">Aitken, Frances Alva, 1889-1970.</persname>
	 </controlaccess>
	 <controlaccess>
		  <corpname encodinganalog="610" source="local" role="subject"
		  rules="aacr2">Oregon Agricultural College. Class of 1910.</corpname>
		  <corpname source="lcnaf" encodinganalog="610" role="subject">Oregon
				Agricultural College--Students.</corpname>
	 </controlaccess>
	 <controlaccess>
		  <geogname source="lcsh" role="subject" encodinganalog="651">Corvallis
				(Or.)</geogname>
	 </controlaccess>
	 <controlaccess>
		  <subject encodinganalog="650" source="lcsh">Student
				activities--Oregon--Corvallis.</subject>
	 </controlaccess>

Both the Syracuse and OSU examples are handled by the current state of the data extract script.

Amanda pointed me to the NCEAD Best Practice Guidelines for EAD 2002. Down in Appendix G: How Do I Encode…, the second question down is “What if I have multi-part scope notes, biographical notes or subject headings?” followed by exactly the <list> and <item> tag usage as is being done for the Forest History Society finding aids. This format clearly should be handled.
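
If my working theory is right, the fix is to look for access terms at any depth inside <controlaccess> instead of expecting them to be direct children. Here is a Python sketch of that approach. It is not my Ruby script, the function name is invented, and the Forest History Society sample has had its closing tags added so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Access point elements that EAD 2002 allows inside <controlaccess>.
ACCESS_TAGS = {
    "corpname", "famname", "function", "genreform", "geogname",
    "name", "occupation", "persname", "subject", "title",
}

def access_terms(controlaccess_xml):
    """Return (tag, term) pairs found at any depth within <controlaccess>."""
    terms = []
    # iter() visits every descendant, so <list>/<item> wrappers and nested
    # <controlaccess> elements are all walked through.
    for element in ET.fromstring(controlaccess_xml).iter():
        if element.tag in ACCESS_TAGS:
            term = " ".join("".join(element.itertext()).split())
            terms.append((element.tag, term))
    return terms

forest_history = """<controlaccess>
<head>Subject Headings</head>
<list type="simple">
<item><genreform source="lcnaf" encodinganalog="655">Audiotapes</genreform></item>
<item><persname source="lcnaf" encodinganalog="600">Ainsworth, John H., 1909-</persname></item>
<item><subject source="lcnaf" encodinganalog="650">Businessmen -- United States</subject></item>
</list>
</controlaccess>"""

for tag, term in access_terms(forest_history):
    print(tag, "|", term)
```

The same function handles the flat Syracuse style and the nested Oregon State style, since neither the wrappers nor the nesting depth matter to it.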

So, no fun tag stats for this run – but I hope to fix my Ruby script so that the Forest History Society finding aids can be incorporated into the data set I use for testing version 2 of ArchivesZ. My Ruby script to-do list is getting quite long!

Category: metadata