SAA2007: Content Aggregation, Shareable Metadata and Access (Session 607)

Focusing on the challenges of sharing metadata to support content aggregation and access, SAA2007 Session 607’s official title was The Dynamics in the Aggregate: Shareable Metadata and Next-Generation Access Systems. Bill Landis, the Head of Arrangement, Description, & Metadata Coordinator at Yale University Library’s Manuscripts and Archives division, began the session by stressing that while it is hard to predict the future, it seems obvious that there will be an increase in the aggregation of content. Google is one type of aggregator. Many institutions are using the standards of the Open Archives Initiative (OAI) both to publish and to harvest data. This session considered shareable metadata and how it can support or hinder content aggregation and access. A pointer was given to the Best Practices for OAI Data Provider Implementations and Shareable Metadata, a joint initiative of the Digital Library Federation and the National Science Digital Library.

Introduction to Shareable Metadata and Interoperability

The first speaker, Sarah Shreeves, started the panel off with her presentation titled The Dynamics of Sharing: Introduction to Shareable Metadata and Interoperability (follow the link to view the full set of slides). Sarah is not an archivist, but she has extensive experience with metadata aggregation.

She began with the assumption that “we” (libraries/archives/museums/cultural organizations) cannot afford to think about our collections only in the context of our local community. There is no way to know where your metadata is going to end up – either grouped with other things or pulled out of your collection into single atomized items.

Why share content? It benefits our users, supports one-stop searching, brings together distributed collections and supports mashups. Sharing helps us and increases our exposure. We have to do this – we cannot assume that our users will come in through the front door. Lorcan Dempsey uses the phrase In The Flow to describe getting your content “out” into the world where users will find it.

Keys to Shareability or Interoperability:

The technical side (Z39.50, OAI-PMH, RSS, etc.)
Organizational commitment of resources (people, training, time, priority)
Standards... lots and lots of standards

There are two main ways to share metadata. The first is known as federated search. In this model a user searches from a single central location. That query is sent to distributed databases, and the answers are sent back to the central query source, which assembles the results. Z39.50 and Search/Retrieval via URL (SRU) are examples of technology used to perform federated searches.
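To make the federated model concrete, here is a minimal Python sketch. It is an illustration only, not something shown in the session: the two in-memory "databases" and the function names are hypothetical stand-ins, and a real system would send the query over a protocol such as Z39.50 or SRU rather than calling local functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for remote databases; real targets would be
# queried over a protocol such as Z39.50 or SRU.
SOURCES = {
    "library_a": ["Roosevelt letters", "Map of Ohio"],
    "archive_b": ["Roosevelt photographs", "Ship logs"],
}


def search_source(name, query):
    """Search one source and tag each hit with where it came from."""
    q = query.lower()
    return [(name, title) for title in SOURCES[name] if q in title.lower()]


def federated_search(query):
    """Send the same query to every source and merge the answers centrally."""
    names = sorted(SOURCES)
    with ThreadPoolExecutor() as pool:
        # map() keeps results in the same order as `names`.
        result_lists = list(pool.map(lambda n: search_source(n, query), names))
    return [hit for hits in result_lists for hit in hits]
```

The central point is that nothing is copied ahead of time: every search goes out to every source, so the results are only as fast and as available as the slowest database.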

The second way of sharing metadata is known as the metadata aggregation model. In this scenario, metadata is pulled from many places into a single location. This is what search engines, union catalogs, OAI PMH, RSS and Atom do. It provides an opportunity to massage and normalize the data. Once users find what they are looking for – they are often redirected to the original source of the item.
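A rough Python sketch of the aggregation model follows, assuming an OAI-PMH data provider that exposes simple Dublin Core. The base URL, the trimmed sample response and the `harvest` helper are all hypothetical; only the `verb` and `metadataPrefix` request arguments and the XML namespaces come from the OAI-PMH and Dublin Core specifications. No network call is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url):
    """Build an OAI-PMH ListRecords request for simple Dublin Core."""
    return base_url + "?" + urlencode(
        {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    )


# Hypothetical, heavily trimmed response; a real one also carries
# responseDate, request, datestamps and possibly a resumptionToken.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>  Map of Ohio </dc:title>
          <dc:identifier>http://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def harvest(xml_text, provider):
    """Pull records into one flat list, normalizing whitespace as we go."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append({
            "provider": provider,
            "oai_id": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            ),
            "title": " ".join(title.split()),
            # Kept so users can be sent back to the original source.
            "source_url": rec.findtext(
                ".//dc:identifier", default="", namespaces=NS
            ),
        })
    return records
```

Running `harvest` over responses from many providers and appending the results to one list is the "single location" of the aggregation model; the whitespace cleanup stands in for the larger massaging and normalizing that aggregators do.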

A major challenge of the metadata aggregation model is “the ability to perform a search over diverse sets of metadata records and obtain meaningful results” (Priscilla Caplan, in Metadata Fundamentals for All Librarians). This is hard because we are not used to what metadata looks like outside our local context. Sarah then showed lots of different examples so the audience could see how different the metadata is.

Metadata is not monolithic. It can be a view projected from a single information object. It is possible to create multiple views appropriate for different uses. Each view will affect the granularity of description, choice of vocabularies, and choice of formats.

You can customize the format of your metadata depending on the context of how the metadata will be consumed. This might sound scary, hard and overwhelming – but Sarah is confident that we can do this in smart ways. She believes that we should be able to lobby for the features we need to support different views.

Sarah’s list of attributes of ‘shareable metadata’:

is quality metadata
promotes search interoperability
is human understandable outside of its local context
must be useful outside its local context – an aggregator can actually build services based on the data in the records provided – example was geographic data that can be used to put the items on a map
preferably is machine processable – Subject clustering – machine created – but still needs lots of human intervention to make it work
provides enough contextual information – the Theodore Roosevelt collection didn’t have a Roosevelt subject term because the title of the collection was assumed to be enough. She also mentioned a map that didn’t include the fact that it was a map in its metadata
is consistent across a collection – i.e., same date field, same controlled vocabulary... this is within a single collection
is coherent
is true to its content but also its audience – different views for different perspectives
conforms to standards – descriptive, technical, etc

There are some safe assumptions you can make. Users often get to your data through shared records – not through your front door. Users either don’t know about your collection or won’t remember. Shared records can lead users to local environments where the full context is available. Users are often entering through deep links that may bypass the introductory information that provides the larger context for a collection.

Implementing Shareable Metadata Practices

Jenn Riley, of the Inquiring Librarian Blog, gave the second presentation: Implementing Shareable Metadata Practices in a Diverse University Environment. Jenn has a grand vision of what we are trying to achieve with all these efforts to share metadata. We need lots of different ways to discover the data... lots of environments.

We need machine-readable descriptive metadata, definitions of the properties of shareable metadata in various communities (this is the focus of this session), and protocols and systems that use them to make sharing automatic. We also need online delivery of content, but that is a big challenge and out of scope for this session.

Archives and digital libraries face different challenges in implementing standard practices related to shareable metadata. Archives are unique, which makes a single workflow model impossible. They are not a ‘homogeneous body’. Archives need to figure out how to support the expanding view of their mission to meet the needs of online users and make more services available. They need to find resources to provide appropriate description as well as technical implementation – and need time, money and skills in order to do this. On the other hand, digital library practice assumes content is digitized – that there will be ‘stuff’ at the end. Metadata-only workflows are not common. Digital libraries usually assume item-level description, but this is often not the case, and concepts of provenance and original order are largely foreign.

Communities need to agree on key definitions to bridge the gulf between digital libraries and archives. Digital libraries need to understand that Encoded Archival Description (EAD) is not a metadata format; it is a markup language.

The good news is that aggregations are not out to replace archives-specific discovery systems. We don’t have to give up the robust local environment; we can and need to do both.

Key shareable metadata principles for archives:

Context: need enough context so the user can figure out if the record is useful for them. At the same time – too much repeated info can cause issues too.
Content: what is the appropriate granularity for shared records from archives — this choice needs to be done per usage and per audience.

Possible strategies include the creation of collection-level records only, creation of an aggregator that understands multi-level descriptions, the design of multi-level descriptions carefully for future item/file-level view, linking to digital objects from the lowest level of description in the finding aid and description at the item level.

Jenn then discussed the experiences at Indiana University’s Digital Library Program:

They have a new EAD finding aid website
the new system is more faithful to encoding with less ‘helpful’ fixed presentation
mutual learning process about archival descriptive practices
many decisions made about when encoding should be changed and when systems should be changed
results of this process: RE-ENGINEERING! There is a new EAD template that supports data we didn’t have before, a report card built on Schematron, and better previewing, so encoders can see how their encoding is really working and what the final finding aid will look like
some EAD files link to digital objects
soon there will be item-level OAI records (Dublin Core and MODS) for digitized items linked from finding aids
central Digital Library repository that allows EAD as the *master* metadata format
new workflow that permits links from any level of a multi-level description in EAD

The more you put stuff online, the more you attract the sort of attention that gets you more money to put more stuff online. Jenn suggests lobbying software vendors for better support of EAD... don’t settle for Dublin Core. We need to talk with our user communities about the need for archives-specific aggregators and consider multi-level description.

Libraries and archives are learning from one another. The item centric view can be too narrow – but it can help with re-engineering. More structure in finding aids can be a good thing. Archives can show libraries why expertise in descriptive practice is still necessary — maybe those who are running out of things to catalog on the library side can spend some time describing over on the archives side?

Archival Frameworks for Shareable Metadata

Kelcy Shepherd, Digital Interfaces Librarian at the University of Massachusetts Amherst, gave the final presentation of the session: “Archival Standards and Tools: A Framework for Shareable Metadata”.

The first framework Kelcy addressed was Describing Archives: A Content Standard (DACS). What about DACS is applicable to sharing metadata? It is compatible for use with controlled vocabularies. It can make sure that our access points will work well with access points from other metadata communities. Since DACS is output agnostic, you can create the data and then use that data to generate different views or formats. A single set of DACS based data can produce printed finding aids, EAD finding aids, MARC 21 or MODS records.

In order to produce each of these different views from a single original format, you must use a crosswalk. A crosswalk maps individual elements from one data format to corresponding elements of another. Unfortunately, crosswalks come with their own challenges:

granularity
missing elements
single element on one side that would need to be split into multiple elements on the other side

You need expertise in both standards addressed by each crosswalk in order to do this well.
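As a small illustration of what a crosswalk looks like in practice, here is a Python sketch. The source field names on the left are hypothetical local labels and the targets are simple Dublin Core element names; this is not an official DACS-to-Dublin Core crosswalk, just a way to see two of the challenges above (missing elements and one-to-many splits) show up in code.

```python
# Hypothetical local field names on the left; simple Dublin Core element
# names on the right. Not an official crosswalk of any standard.
CROSSWALK = {
    "title": "title",
    "creator_name": "creator",
    "date_range": "date",
    "scope_and_content": "description",
}


def split_subjects(value):
    """One source element that must become several target elements."""
    return [term.strip() for term in value.split(";") if term.strip()]


def apply_crosswalk(record):
    """Return (target_record, unmapped_fields) for one source record."""
    target, unmapped = {}, []
    for field, value in record.items():
        if field == "subjects":
            target["subject"] = split_subjects(value)
        elif field in CROSSWALK:
            target.setdefault(CROSSWALK[field], []).append(value)
        else:
            # No corresponding element: this needs a human decision.
            unmapped.append(field)
    return target, unmapped
```

For example, a record with a title, a semicolon-delimited subjects string and an extent field would come back with the title and split subjects mapped and `extent` reported as unmapped, which is exactly where expertise in both standards is needed.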

Next Kelcy discussed Encoded Archival Description (EAD). EAD is a data structure standard: a machine-readable format for encoding archival descriptions. It allows archivists to share the data across institutions. If you want to re-purpose a finding aid’s metadata, the data needs to be in a machine-readable format. EAD gives you this. You can convert an EAD-encoded finding aid into a Metadata Object Description Schema (MODS) document using an XSLT stylesheet and a crosswalk. The stylesheet may take a lot of work (especially for use across many finding aids), but there is a big payoff. Once the work is done, a single stylesheet can be used across many, many finding aids.
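The conversion Kelcy described uses XSLT. Since Python's standard library has no XSLT processor, the sketch below swaps in plain ElementTree code to show the same idea: a handful of collection-level EAD elements mapped to MODS counterparts. The sample finding aid is hypothetical and heavily trimmed, the choice of which elements to map is an assumption for illustration, and a production stylesheet would handle far more.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Hypothetical, heavily trimmed finding aid (EAD 2002 element names,
# no namespace).
SAMPLE_EAD = """<ead><archdesc level="collection">
  <did>
    <unittitle>Smith Family Papers</unittitle>
    <unitdate>1850-1900</unitdate>
  </did>
  <scopecontent><p>Letters and diaries of an Ohio farming family.</p></scopecontent>
</archdesc></ead>"""


def ead_to_mods(ead_text):
    """Map a few collection-level EAD elements to MODS elements."""
    ead = ET.fromstring(ead_text)
    mods = ET.Element(f"{{{MODS_NS}}}mods")

    def add(parent, tag, text=None):
        child = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}")
        child.text = text
        return child

    title = ead.findtext("archdesc/did/unittitle")
    if title:
        add(add(mods, "titleInfo"), "title", title.strip())
    date = ead.findtext("archdesc/did/unitdate")
    if date:
        add(add(mods, "originInfo"), "dateCreated", date.strip())
    scope = ead.findtext("archdesc/scopecontent/p")
    if scope:
        add(mods, "abstract", scope.strip())
    return mods
```

Calling `ET.tostring(ead_to_mods(SAMPLE_EAD), encoding="unicode")` gives the MODS record as text. As with a stylesheet, the function is written once and can then be run over any finding aid that is encoded consistently.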

The Archivists’ Toolkit was cited as an example of a tool that can let you output multiple formats from a single set of data. It can produce EAD, MARC, MODS and Dublin Core records.

Tools can support efforts – but it all comes back to quality archival description. The best tool in the world will never make bad content into good content. If data is inconsistent, you have to manually go back and clean it up. I particularly liked Kelcy’s point about ensuring that your data doesn’t need the screen labels to make sense. If you don’t consider this, when you export that data into a new format or view, the data can lose its meaning.

Her concluding point was that if you don’t have the tech skills or support, work on your content.. use DACS… get your data in order and it will pay off later.

Questions and Answers

Question: How does this work when you are trying to share your metadata with communities that use different controlled vocabularies – thinking about the single EAD that generates MODS, MARC, etc.?

Answer: Aggregators often don’t use subject headings. This is nearly impossible to do in OAI – people use lots of different controlled vocabularies... and sometimes no controlled vocabulary at all. There are experiments being done with subject clustering. Algorithms are used to cluster like things together – but it still requires human intervention to make sure the clusters make sense.
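A deliberately naive Python sketch of subject clustering is below. It is not the algorithm any of the experiments mentioned actually use; it simply groups subject strings that become identical after lowercasing and stripping punctuation, which is enough to show why a human still has to review the results.

```python
import re
from collections import defaultdict


def normalize(term):
    """Lowercase, drop punctuation and collapse whitespace."""
    term = re.sub(r"[^\w\s]", " ", term.lower())
    return " ".join(term.split())


def cluster_subjects(terms):
    """Group subject strings that normalize to the same key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalize(term)].append(term)
    return dict(clusters)
```

This approach would put “Roosevelt, Theodore” and “roosevelt theodore” together, but it would leave “Theodore Roosevelt” in a separate cluster, the kind of miss that needs a person (or a much smarter algorithm) to catch.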

On the other hand – if you are using a standard vocabulary, there is work being done to map from one standard to another. An example of this is the OCLC Metadata Switch project.

Question: What about social tagging?

Jenn: We are in no position to turn down metadata.

Sarah: DSpace has a concept of community. There is a way to let a community organically build their own controlled vocabulary as they go – new contributions are provided choices of terms that have been used before.

Bill talked about the Michaelson article in which the same finding aids were given to 40 archivists, who were asked to pick subject headings using LCSH. The result was 0 consistency! Every single archivist picked different subject headings.

Jordan: PennTags is an example of an effort to combine social tagging with traditional classification. It shows tagging not as competition but as another way to get user generated descriptive information. It is an example of a way to ‘get into the flow’.

Sarah: Google will now use OAI PMH as a site map for indexing, but it throws away the metadata.

Jenn: D-Lib has an article on representing digital collections in Wikipedia articles.

Bill: PennTags is acting as an aggregation system to pull siloed information together.

Question: In some cases EAD data is flattened down for all items, so that each item has all the context data and only one field is different on each. Is this an indication that the mapping could have been better?

Answer: It is a problem – can be a problem…ultimately it is all about use and audience.
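For readers who haven't seen it, the flattening described in the question can be sketched in a few lines of Python. The nested dictionary is a hypothetical stand-in for a multi-level description, and the single `context` string is an assumption about how the inherited information might be carried.

```python
def flatten(node, inherited=None):
    """Yield one record per item, carrying down titles from parent levels."""
    inherited = inherited or []
    children = node.get("children", [])
    if not children:
        # Lowest level: emit an item record with its inherited context.
        yield {"title": node["title"], "context": " > ".join(inherited)}
    for child in children:
        yield from flatten(child, inherited + [node["title"]])
```

Every item under the same series comes out with an identical `context` value and only the title differing, which is the repetition the questioner was pointing at.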

My Thoughts

I came away from this session with my head whirling with ideas. I was so pleased to hear people talk about concrete examples. We need more examples of challenges and real world benefits to further efforts to aggregate, publish and share archival content and its metadata. None of this is easy, but each project will give us new lessons and add to the growing set of best practices.

I truly believe that the sooner we tackle these thorny problems, the sooner we will start seeing the impact in improved access to archival records. The sooner we deal with it, the less we will be adding data that will have to be fixed later.

Anyone who has been following my blog will already know about my ArchivesZ project from last spring. One of the big struggles we had was figuring out how to make the subject term metadata ‘useful’ for aggregation and visualization. Another example of the challenges and benefits of shareable metadata is the SAA presentation about Publisher’s Bindings Online.

I had one last sentence in my notes from this session – an idea for a Facebook application that would let you feature your favorite archival image or record. This would be an amazing example of getting archival records ‘in the flow’ and showing up in surprising new places where no-one is ‘looking’ for records. Hey – maybe I should prod the Footnote people with this idea. It might be right up their alley!

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.
