

Clustering Data: Generating Organization from the Ground Up

[Image: Flickr water tag clusters]

My trip to the 2008 Information Architecture Summit (IA Summit) down in Miami has me thinking a lot about helping people find information. In this post I am going to examine clustering data.

Flickr Tag Clusters
Tag clusters are not new on Flickr – they were announced way back in August of 2005. The best way to understand tag clusters is to look at a few. Some of my favorites are the water clusters (shown in the image above). From this page you can view the reflection/nature/green cluster, the sky/lake/river cluster, the blue/beach/sun cluster or the sea/sand/waves cluster.

So what is going on here? Basically Flickr is analyzing groupings of tags assigned to Flickr images and identifying common clusters of tags. In our water example above – they found four different sets of tags that occurred together and distinctly apart from other sets of tags. The proof is in the pudding – the groupings make sense. They get at very subtle differences even though the mass of data being analyzed is from many different individuals with many different perspectives.
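Flickr has never published the details of its clustering algorithm, but the general idea can be sketched in a few lines of Python: count which tags show up together on the same photos, keep only the strong pairings, and treat each connected group of tags as a cluster. The photo data below is invented for illustration only.

```python
# A minimal sketch (not Flickr's actual algorithm) of clustering the tags
# that co-occur with a seed tag, using only the standard library.
from collections import Counter
from itertools import combinations

# Made-up photos, each represented only by its set of tags.
photos = [
    {"water", "reflection", "nature", "green"},
    {"water", "reflection", "green", "trees"},
    {"water", "sky", "lake", "river"},
    {"water", "sky", "lake", "clouds"},
    {"water", "blue", "beach", "sun"},
    {"water", "blue", "sun", "ocean"},
    {"water", "sea", "sand", "waves"},
    {"water", "sea", "waves", "surf"},
]

SEED = "water"
MIN_COOCCURRENCE = 2  # ignore weak tag pairings

# Count how often each pair of non-seed tags appears together.
pair_counts = Counter()
for tags in photos:
    pair_counts.update(combinations(sorted(tags - {SEED}), 2))

# Build an adjacency list from the strong pairs, then pull out the
# connected components -- each component is one "cluster" of tags.
graph = {}
for (a, b), n in pair_counts.items():
    if n >= MIN_COOCCURRENCE:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

seen, clusters = set(), []
for start in graph:
    if start in seen:
        continue
    stack, component = [start], set()
    while stack:
        tag = stack.pop()
        if tag in component:
            continue
        component.add(tag)
        stack.extend(graph[tag] - component)
    seen |= component
    clusters.append(sorted(component))

# On the toy data this prints four small groups that roughly mirror the
# four Flickr water clusters described above.
for cluster in clusters:
    print(cluster)
```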

Tag clusters are very powerful and quite different from tag clouds. Tag clouds, by their nature, are a blunt instrument. They only show you the most popular tags. Take a look at the tag cloud for the Library of Congress photostream on Flickr. I do learn something from this. I get a sense of the broad brush topics, time periods and locations. But if you look at the full list of Library of Congress Flickr tags you see what a small percentage the top 150 really are (and yes.. that page does take a while to load). Who else is now itching to ask Flickr to generate clusters within the LOC tag set?
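For contrast, a tag cloud takes almost no analysis at all – it is essentially a frequency count with everything below the top N thrown away, which is exactly why the long tail disappears. A toy sketch (the tag list is made up):

```python
from collections import Counter

# Hypothetical tag list -- the real LOC photostream has thousands of distinct tags.
all_tags = ["1910s", "bain", "baseball", "1910s", "glassnegative", "bain", "1910s"]

# A tag cloud is just the most-used tags; everything below the cutoff falls off the page.
cloud = Counter(all_tags).most_common(150)
print(cloud)  # [('1910s', 3), ('bain', 2), ('baseball', 1), ('glassnegative', 1)]
```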

Steve.Museum
Another example of cultural heritage images being tagged is the Steve Museum Art Museum Social Tagging Project which lets individuals tag objects from museums via Steve Tagger. It resembles the Library of Congress on Flickr project in that it includes existing metadata with each image and permits users to add any tags they deem appropriate. I think it would be fascinating to contrast the traffic of image taggers on Steve.Museum vs Flickr for a common set of images. Is it better to build a custom interface that users must seek out but where you have complete control over the user experience and collected data? Or is it better to put images in the already existing path of users familiar with tagging images? I have no answers of course. All I know is I wish I could see the tag clusters one could generate off the Steve.Museum tag database. Perhaps someday we will!

Del.icio.us Tags
[Image: del.icio.us related tags]

Del.icio.us, a web service for storing and tagging your bookmarks online, supports what they call ‘related tags’ and ‘tag bundles’. If you view the page for the tag ‘archives’ – you will see to the far right a list of related tags like those shown in the image here. What is interesting is that if I look at my own personal tag page for archives I see a much longer list of related tags (big surprise that I have a lot of links tagged archives!) and I am given the option of selecting additional tags to filter my list of links via a combination of tags.

Del.icio.us’s ‘tag bundles’ let me create my own named groupings of tags – but I must assemble these groups manually rather than have them generated or suggested. On the plus side, Del.icio.us is very open about publishing its data via APIs and therefore supporting third party tools. I think my favorite off that list for now has to be MySQLicious which mirrors your del.icio.us bookmarks into a MySQL database. Once those tags are in a database, all I need are the right queries to generate the clusters I want to see.
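To sketch what I mean – assuming a simplified table of (url, tag) rows rather than whatever schema MySQLicious actually creates, and using sqlite3 here so the example runs anywhere – a single self-join surfaces related tags, and the same join grouped by pairs of tags is the raw material for generating clusters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bookmark_tags (url TEXT, tag TEXT);
    INSERT INTO bookmark_tags VALUES
        ('http://example.org/ead',  'archives'),
        ('http://example.org/ead',  'ead'),
        ('http://example.org/ead',  'xml'),
        ('http://example.org/nara', 'archives'),
        ('http://example.org/nara', 'government'),
        ('http://example.org/nara', 'digitization'),
        ('http://example.org/loc',  'archives'),
        ('http://example.org/loc',  'digitization');
""")

# Which tags co-occur most often with 'archives'?
related = conn.execute("""
    SELECT b.tag, COUNT(*) AS together
    FROM bookmark_tags a
    JOIN bookmark_tags b ON a.url = b.url AND b.tag <> a.tag
    WHERE a.tag = 'archives'
    GROUP BY b.tag
    ORDER BY together DESC
""").fetchall()

print(related)  # [('digitization', 2), ('ead', 1), ...]
```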

Clusty: Clustered Search Results
[Image: Clusty clusters screen shot]

An example of what this might look like for search results can be seen via the search engine Clusty.com from the folks over at Vivisimo. For example – try a search on the term archives. This is one of those search terms for which general web searching is usually just infuriating. Clusty starts us with the same top 2 results as a search for archives on Google does, but it also gives us a list of clusters on the left sidebar. You can click on any of those clusters to filter the search results.

Those groups don’t look good to you? Click the ‘remix’ link in the upper right hand corner of the cluster list and you get a new list of clusters. In a blog post titled Introducing Clustering 2.0 Vivisimo CEO Raul Valdes-Perez explains what happens when you click remix:

With a single click, remix clustering answers the question: What other, subtler topics are there? It works by clustering again the same search results, but with an added input: ignore the topics that the user just saw. Typically, the user will then see new major topics that didn’t quite make the final cut at the last round, but may still be interesting.

I played for a while.. clicking remix over and over. It was as if it was slicing and dicing the facets for me – picking new common threads to highlight. I liked that I wasn’t stuck with what someone else thought was the right way to group things. It gave me the control to explore other groupings.
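Vivisimo’s clustering engine is proprietary, so the sketch below is only a toy illustration of the remix idea itself: derive topic labels from the result set, then derive them again while ignoring the labels the user has already seen. The result titles are invented.

```python
from collections import Counter

results = [
    "National Archives and Records Administration",
    "Internet Archive Wayback Machine",
    "Web archive of historical newspapers",
    "State archives genealogy records",
    "Email archive search tools",
    "Genealogy records in county archives",
]

def top_topics(docs, already_seen, how_many=3):
    """Pick the most common longer words as crude topic labels."""
    words = Counter()
    for doc in docs:
        for word in doc.lower().split():
            if len(word) > 4 and word not in already_seen and word != "archives":
                words[word] += 1
    return [word for word, _ in words.most_common(how_many)]

first_pass = top_topics(results, already_seen=set())
remix = top_topics(results, already_seen=set(first_pass))  # the "remix" click
print(first_pass, remix)  # the second list surfaces topics that missed the first cut
```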

Ontology is Overrated
Clay Shirky’s talk Ontology is Overrated: Categories, Links and Tags from the spring of 2005 ties a lot of these ideas together in a way that makes a lot of sense to me. I highly recommend you go read it through – but I am going to give away the conclusion here:

It’s all dependent on human context. This is what we’re starting to see with del.icio.us, with Flickr, with systems that are allowing for and aggregating tags. The signal benefit of these systems is that they don’t recreate the structured, hierarchical categorization so often forced onto us by our physical systems. Instead, we’re dealing with a significant break — by letting users tag URLs and then aggregating those tags, we’re going to be able to build alternate organizational systems, systems that, like the Web itself, do a better job of letting individuals create value for one another, often without realizing it.

I currently spend my days working with controlled vocabularies for websites, so please don’t think I am suggesting we throw it all away. And yes, you do need a lot of information to reach the critical mass needed to support the generation of useful clusters. But there is something here that can have a real and positive impact on users of cultural heritage materials actually finding and exploring information. We can’t know how everyone will approach our records. We can’t know what aspects of them they will find interesting.

There Is No Box
Archivists already know that much of the value of records is in the picture they paint as a group. A group of records shares a context and gives the individual records meaning. Librarians and catalogers have long lived in a world of shelves. A book must be assigned a single physical location. Much has been made (both in the Clay Shirky talk and elsewhere) of the fact that on the web there is no shelf.

What if we take the analogy a step further and say that for an online archives there is no box? Of course, just as with books, we still need our metadata telling us who created this record originally (and when and why and which record comes before it and after it) – but picture a world where a single record can be virtually grouped many times over. Computer programs are only going to get better at generating clusters, be they of user assigned tags or search results or other metadata. From where I sit, the opportunity for leveraging clustering to do interesting things with archival records seems very high indeed.

Of Pirates, Treasure Chests and Keys: Improving Access to Digitized Materials

Dan Cohen posted yesterday about what he calls The Pirate Problem. Basically the Pirate Problem can be summed up as “there are ways of acting and thinking that we can’t understand or anticipate.” Why is that a ‘Pirate Problem’? Because a pirate pub opened near his home and rather than folding shortly thereafter due to lack of interest from the ‘very serious professionals’ who populate DC suburbs – the pub was a rousing success due to the pirate aficionados who came out of the woodwork to sing sea shanties and drink grog. This surprising turn of events highlighted for him the fact that there are many ways of acting and thinking (some people even know all the words to sea shanties without needing sheet music).

Dan recently delivered the keynote speech at a workshop at the University of North Carolina at Chapel Hill. The workshop brought together dozens of historians to talk about how the 16 million archival documents of the Southern Historical Collection (SHC) should be put online. He devoted his keynote “to prodding the attendees into recognizing that the future of archives and research might not be like the past” and goes on in his post to explain:

The most memorable response from the audience was from an award-winning historian I know from my graduate school years, who said that during my talk she felt like “a crab being lowered into the warm water of the pot.” Behind the humor was the difficult fact that I was saying that her way of approaching an archive and understanding the past was about to be replaced by techniques that were new, unknown, and slightly scary.

This resistance to thinking in new ways about digital archives and research was reflected in the pre-workshop survey of historians. Extremely tellingly, the historians surveyed wanted the online version of the SHC to be simply a digital reproduction of the physical SHC.

Much of the stress of Dan’s post is on fear of new techniques of analysis. The choppy waters of text mining and pattern recognition threaten to wash away traditional methods of actually reading individual pages and “most historians just want to do their research the way they’ve always done it, by taking one letter out of the box at a time”.

I certainly like the idea of new technologically based ways of analyzing large sets of cultural heritage materials, but I also believe that reading individual letters will always be important. The trick is finding the right letter!

And of course – we still need the context. It isn’t as if, when we digitize major collections like the SHC, we are going to scan and OCR each page without regard to which box it came out of. We can’t slice and dice archival records and manuscripts into their component parts to feed into text analysis with no way back to the originals.

I like to imagine the combination of all the new technology (be it digitization, cross collection searching, text mining or pattern recognition) as creating keys to different treasure chests. Humanities scholars are treasure hunters. Some will find their gems through careful reading of individual passages. Others will discover patterns spread across materials now co-existing virtually that before digitization would have been widely separated by space and time. Both methods will benefit from the digitization of materials and the creation of innovative search and text analysis tools. Both still require an understanding of a material’s origin. The importance of context isn’t going anywhere – we still need to know which box the letter came from (and in a perfect world, which page came before and which came after). I want scholars to still be able to read one page from the box – I just want them to be able to do it from home in the middle of the night if they are so inclined, with their travel budget no worse for wear.

Dan ties his post together by pointing out that:

… in Chapel Hill I was the pirate with the strange garb and ways of behaving, and this is a good lesson for all boosters of digital methods within the humanities. We need to recognize that the digital humanities represent a scary, rule-breaking, swashbuckling movement for many historians and other scholars.

In my opinion, the core message should be that we just found more locked treasure chests – and for those who are interested, we have some new keys that just might open those locks. I enjoyed the Pirate metaphor (obviously) and I appreciate that there are real issues here relating to strong discomfort with the fast changing landscape of technology, but I have to believe that if we do something that prevents historians from being able to read one letter at a time we are abandoning the treasure chests that are already open for the new ones for which we haven’t yet found the right keys. I am greedy. I want all the treasure!

Image credit: key to anything by Stoker Studios via flickr

Using WWI Draft Registration Cards for Research: NARA Records Provide Crucial Data

[Image: NARA World War I photograph, 1918 (ARC Identifier: 285374)]

In the HealthDay article Having Lots of Kids Helps Dads Live to 100, a recent study was described that examined what increased the chances of a man living past 100.

A young, trim farmer with four or more children: According to a new study, that’s the ideal profile for American men hoping to reach 100 years of age. The research, based largely on data from World War I draft cards, suggests that keeping off excess weight in youth, farming and fathering a large number of offspring all help men live past a century.

The article mentions that this research was “spurred by the fact that a treasure trove of information about 20th-century American males has now been put online”. The study was based out of the University of Chicago’s Center on Aging. The paper, New Findings on Human Longevity Predictors, includes the following reference:

Banks, R. (2000). World War I Civilian Draft Registrations. [database on-line]. Provo, UT, Ancestry.com.

With an account on Ancestry.com, you too could examine the online database of World War I Draft Registration Cards. This Ancestry.com page notes the source of the original data as:

United States, Selective Service System. World War I Selective Service System Draft Registration Cards, 1917-1918. Washington, D.C.: National Archives and Records Administration. M1509, 4,582 rolls

NARA’s page for the World War I Selective Service System Draft Registration Cards, M1509 includes similar background information to what can be found on the Ancestry.com page, but of course – no access to the actual records.

It is frustrating to see a study based on archival records making the news without it being clear to the reader that archival records were the source for the research. As I discussed at length in my post Epidemiological Research and Archival Records: Source of Records Used for Research Fails to Make the News, I feel that it is very important to take every opportunity to help the general public understand how archival records are supporting research that impacts our understanding of the world around us. I appreciate that partnering with 3rd parties to get government records digitized is often the only option – but I want people to be clear about why those records still exist in the first place.

Photo Credit: US. National Archives, World War I Photographs, 1918. Army photographs. Battle of St. Mihiel-American Engineers returning from the front; tank going over the top; group photo of the 129th Machine gun Battalion, 35th Division before leaving for the front; views of headquarters of the 89th Division next to destroyed bridge; Company E, 314th Engineers, 89th Division, and making rolling barbed wire entanglements. NAIL Control Number: NRE-75-HAS(PHO)-65

SAA2007: Archives and E-Commerce, Three Case Studies (Session 404)

Diane Kaplan, of Yale University Library’s Manuscripts and Archives unit, started off Session 404 (officially titled Exploring the Headwaters of the Revenue Stream) by thanking everyone for showing up for the last session of the day. This was a one hour session that examined ways to generate new funds through e-commerce. Three different e-commerce case studies were presented, followed by a short question and answer period.

University of Wyoming’s American Heritage Center

Mark Shelstad‘s presentation, “Show Me the Money: Or: How Do We Pay for This?”, detailed the approach taken by the University of Wyoming‘s American Heritage Center (AHC) to find alternate revenue streams. After completing a digitization project in the fall of 2004, the AHC had to figure out how to continue their project after their original grant money ran out.

Since they didn’t have a lot of in-house resources, they chose Zazzle.com for their effort to profit from their existing high resolution images. They can earn up to 17% from the sales through a combination of affiliate sales and profits from the sale of products featuring American Heritage Center images.

They had a lot of good reasons for choosing Zazzle.com. Zazzle.com already had an existing ‘special collections’ area, meaning that their images would have a better chance of being found by those interested in their offerings (for example – take a look at the Library of Congress Vintage Photos store). Zazzle.com also did not require an exclusive license to the images. The American Heritage Center Zazzle on-line store opened in 2005.

Currently they are making about $30 a month in royalties from 200 images. Mark pointed out that everyone needs to keep in mind that the major photo provider, Corbis, has yet to turn a profit in online photo sales. He also mentioned a website called Cogteeth.com that lets you click on any image and use those images on t-shirts, mugs.. etc.

Near the end of his talk, Mark shared an amazing idea to create a non-profit that would be a joint organization for featuring and selling products using archival images. I love it! It is easy to see that many archives are small and don’t have the infrastructure to create and run their own e-commerce websites. At the same time, general sites that let anyone set up a store to sell items with custom images on them threaten to lose the special nature of historical images in the shuffle. Even the special collections section of Zazzle lumps the American Heritage Center and the Library of Congress collections with Disney and Star Wars. I would love to see this idea grow!

Minnesota Historical Society

Kathryn Otto of the Minnesota Historical Society (MHS) spoke next. She first gave an overview of traditional services provided by MHS for a fee, such as photocopies, reader-printer copies, microfilm sales, media sales, inter-library loan fees, classes and photograph sales. MHS also earned income via standard use fees and research services.

The first e-commerce initiative at MHS was the sale of Minnesota State Death Certificates from 1904 to 2001. Made available via the Minnesota Death Certificate Index, it provides the same data as Ancestry.com, but with a better search interface. They have had users tell them that they couldn’t find something on Ancestry.com – but that they were able to find what they needed on the MHS site.

To their existing Visual Resources Database, MHS also added a buy button for most images. Extra steps were added into the standard buy process to deal with the addition of a use fee depending on how the purchaser claims the image will ultimately be used. One approach that did not work for them was to offer expensively printed pre-selected images. The historical society sells classes online and can handle member vs non-member rates. The Veterans Graves Registration Index is a tiny database that was created by reusing the interface used for the death certificates.

The Birth Certificate Index provides “single, non-certified copies of individual birth certificates reproduced from the originals” via the website.. while “[o]fficial, certified copies of these birth certificates are available through the Minnesota Department of Health.” The MHS site provides much faster and easier service than the Department of Health as can be seen from this page detailing how to order a non-certified copy of a birth record from the DOH – which requires printing, filling out and either faxing or snail mailing a form.

Features to keep in mind as you branch into e-commerce:

  • Statistics – Consider the types of statistics you want. Their system just gave them info about orders – not how much they made.
  • Sales tax – Figure out how it is handled
  • Postage/Handling fees – Look at the details! The MHS Library-Archives was stuck with the Museum Store’s postage rates because the e-commerce system could not handle different fees for different types of objects.
  • Can’t afford credit card fees? Consider PayPal.
  • Advertise what you are selling on your own website.

Godfrey Memorial Library, Middletown, CT

The final panelist was Richard Black, Director of the Godfrey Memorial Library in Middletown, Connecticut. The Godfrey is a small, non-profit, genealogical research library with approximately 120,000 genealogical items. They currently have 5 full time staff and 60 volunteers.

About 3 years ago they had exhausted all of their endowment money and faced the strong possibility of closing the doors. They were down to one full time librarian and a few volunteers and were dependent mostly on donations and some minor income from other sources/services.

They had only a few options open to them:

  • find more money from other sources
  • merge with another library
  • close the doors
  • sell some of the content
  • others??

The first approach to raise funds was to create a subscription website. The Godfrey acquired Heritage Quest census records and added other databases as resources allowed. Subscriptions were sold for $35 a year. The board thought they might be lucky to get 100 subscriptions.. but they actually got approximately 14,000!

Now the portal provides access to sites for which a premium has been paid (so that subscribers don’t have to pay), sites that are available free on the Internet (but made easier to find) and sites unique to Godfrey, including digitized material in the library and other material that has been made available to them. They just added 95,000 Jewish grave-sites – brought to them by a local rabbi. Another recent addition was a set of transcriptions of a grave-site made as an Eagle Scout project. They also negotiated to have their books digitized for them for free. The company performing the digitization will pay a royalty to Godfrey as the books are used.

The costs to acquire data for the portal include $60,000 a year for access to premium sites, the cost to digitize and transcribe unique content (there are opportunities to partner and reduce costs) and the cost to acquire patrons. The efforts of the Godfrey staff and volunteers are ‘free’ – but cost time.

The Godfrey subsequently lost access to the Heritage Quest material. This was like taking the anchor store out of the corner of a mall. It forced them to diversify their revenue streams and watch for new opportunities.

Current revenue source distribution:

  • online portal 45%
  • annual appeal 10%
  • patron requests 5%
  • contract services 35% (OCLC analytical cataloging that they do)
  • misc 5%

The endowment funds have been restored and the Godfrey’s staff is now growing again.

Questions

Question: Did you meet resistance in your institutions?
Answer: No.. Minnesota said they had such success that the 2 questions they hear now are A) What do we put online next? B) How long can they protect their income from the rest of the institution?

Question: (From someone from a NJ archives) Is there a way to do e-commerce with government records and not have the money ‘stolen’ from them?
Answer: Minnesota – The Department of Health was happy for the death and birth certificate business to go away. They do worry about the future when they might try to make a marriage index – because that territory is already ‘owned’ by a group that wants to keep that income.

Question: When you charge for use fees – are there people who don’t pay them?
Answer: Minnesota: Probably – no way to really know.
Mark (American Heritage Center): Our images are public domain – they can do what they like with them.

Question: Do you brand your images?
Answer: Mark: Yes.. a logo and URL goes with the images.

My Thoughts

I was particularly impressed by how much information was conveyed in the course of the 1 hour session. My personal highlights were:

  • As I mentioned above, I want Mark’s idea for a non-profit to sell co-located products based on archival images to gain support and momentum.
  • I was pleased by the point that the MHS makes money from their Minnesota Death Certificate Index partly due to their improved and powerful search interface. The data is available elsewhere – but they made it easier to find information, so they will become the destination of choice for that information.
  • The Godfrey’s story is inspirational. In an age when we hear more and more often about archives and libraries being forced to cut back services due to funding shortfalls, it is great to hear about a small archives that pulled itself back from the brink of disaster by brave experimentation.

These three case studies gave a great glimpse of some of the ways that archives can get on the e-commerce bandwagon. There is no magic here – just the willingness to dig in, figure out what can be done and try it. That said – there is definitely lots of room to learn from others’ successes and mistakes. The more real world success and failure stories archives share with the archival community about how to ‘do’ e-commerce, the easier it will be for each subsequent project to be a success.

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Blog Action Day: A Look At Earth Day as Archived Online

In honor of this year’s Blog Action Day theme of discussing the environment, I decided to see what records the Internet had available about the history of Earth Day.

I started by simply Googling Earth Day. In a new browser window I opened the Internet Archive’s Wayback Machine. These were to be my two main avenues for unearthing the way that Earth Day was represented on the internet over the years.

Wikipedia’s first version of an Earth Day page was created on December 16th, 2002. This is the current Earth Day page as of the creation of this post – last updated about a week ago.

The current home page for the Earthday Network appears identical to the most recent version stored in the Wayback Machine, dated June 29, 2007 – until you notice that the featured headline on the link to http://www.earthdaynetwork.tv is different.

The site that claims to be ‘The Official Site of International Earth Day’ is EarthSite.org. The oldest version from the Wayback Machine is from December of 1996. This version shows a web visitor counter perpetually set to 1,671. Earth Day ten years ago was scheduled for March 20th, 1997. If you scroll down a bit on the What’s New page you can read the 1997 State of the World Message By John McConnell (attributed as the founder of Earth Day).

The U.S. Government portal for Earth Day was first archived in the Internet Archive on April 6, 2003. The site, EarthDay.gov, hasn’t changed much in the past 4 years. The EPA has an Earth Day page of its own, which was first archived in early 1999. There is no clear way to know whether that means the EPA’s Earth Day page is actually older or if it was just found earlier by the Internet Archive’s ambitious web crawlers.

Envirolink.org, with the tagline “The Online Environmental Community”, was first archived back in 1996. As you can see on the Wayback Machine page for Envirolink.org, it has a fairly full ten years’ worth of web page archiving.

Next I wanted to explore what the world of government records might produce on the subject. A quick stop over at Footnote.com to search for “Earth Day” didn’t yield a terribly promising list of results (no surprise there – most of their records date to before the time period we are looking for). Next I tried searching in the Archival Research Catalog (ARC) over on the U.S. National Archives website. I got 15 hits – all fairly interesting looking… but none of them linked to digitized content. A search in the Access to Archival Databases (AAD) system found 2 hits – one to some sort of contract between the EPA and a Fairfax, Virginia company named EARTH DAY XXV from 1995 and the other a State Department telegram including this passage:

THIS NATION IS COMMITTED TO STRIVING FOR AN ENVIRONMENT THAT NOT ONLY SUSTAINS LIFE, BUT ALSO ENRICHES THE LIVES OF PEOPLE EVERYWHERE – – HARMONIZING THE WORKS OF MAN AND NATURE. THIS COMMITMENT HAS RECENTLY BEEN REINFORCED BY MY PROCLAMATION, PURSUANT TO A JOINT RESOLUTION OF THE CONGRESS, DESIGNATING MARCH 21, 1975 AS EARTH DAY, AND ASKING THAT SPECIAL ATTENTION BE GIVEN TO EDUCATIONAL EFFORTS DIRECTED TOWARD PROTECTING AND ENHANCING OUR LIFE-GIVING ENVIRONMENT.

I also thought to check the Government Printing Office’s (GPO) website for the Public Papers of the Presidents of the United States. Currently it only permits searching back through 1991 online – but my search for “Earth Day” did bring back 50 speeches, proclamations and other writings by the various presidents.

Frustrated by the total scattering of documents without any big picture, I headed back to Google – this time to search the Google News Archive for articles including “Earth Day” published before 1990. The timeline display showed me articles mostly from TIME, the Washington Post and the New York Times – some of which indicated I would need to pay in order to read them.

Back again to do one more regular Google search – this time for earth day archive. This yielded an assortment of hits – and just above the fold I found my favorite snapshot of Earth Day history. The TIME Earth Day Archive Collection is a selection of the best covers, quotes and articles about Earth Day – from February 2, 1970 to the present. This is the gold mine for getting perspective on Earth Day as it has been perceived and celebrated in the United States. The covers are brilliant! If I had started this post early enough, I would have requested permission to include some here.

With the passionate title Fighting to Save the Earth from Man, the first article in the TIME Earth Day Collection begins by quoting then President Nixon’s first State of the Union Address:

The great question of the seventies is, shall we surrender to our surroundings, or shall we make our peace with nature and begin to make reparations for the damage we have done to our air, to our land, and to our water?

Fast forward to the recent awarding of the 2007 Nobel Peace Prize to the Intergovernmental Panel on Climate Change (IPCC) and Al Gore, and I have to imagine that the answer to that question asked so long ago – were we ready to make peace with nature? – was ‘Not Yet’.

Overall, this was an interesting experiment. The hunt for ‘old’ (such as it is in the fast moving world of the Internet) data about a topic online is a strange and frustrating experience. Even with the Wayback Machine, I often found myself with only part of the picture. Often the pages I tried to view were missing images or other key elements. Sometimes I found a link to something tantalizing, only to realize that the target page was not archived (or was so broken as to be of no use). The search through government records and old newspaper stories did produce some interesting results – but again seemed to fail to produce any sense of the big picture of Earth Day over the years.

The TIME Collection about Earth Day was assembled by humans and arranged nicely for examination by those interested in the subject. It is properly named a ‘collection’ (in the archival sense) because it is not the pure output of activities surrounding Earth Day, but rather a selected snapshot of related articles and images that share a common topic. That said, it is my fervent hope that websites such as these appear more and more. I suspect that the lure of attracting more readers to their websites with existing content will only encourage more content creators with a long history to join in the fun. If others do it as well as TIME seems to have in this case, it will be a win/win situation for everyone.

Visualizing Archival Collections

As I mentioned earlier, I am taking an Information Visualization class this term. For our final class project I managed to inspire two other classmates to join me in creating a visualization tool based on the structured data found in the XML version of EAD finding aids.

We started with the XML of the EAD finding aids from University of Maryland’s ArchivesUM and the Library of Congress Finding Aids. My teammates have written a parser that extracts various things from the XML such as title, collection size, inclusive dates and subjects. Our goal is to create an innovative way to improve the exploration and understanding of archival collections using an interactive visualization.

Our main targets right now are to use a combination of subjects, years and collection size to give users a better impression of the quantity of archival materials that fit various search criteria. I am a bit obsessed with using the collection size as a metric for helping users understand the quantity of materials. If you do a search for a book in a library’s catalog – getting 20 hits usually means that you are considering 20 books. If you consider archival collections – 20 hits could mean 20 linear feet (20 collections each of which is 1 linear foot in size) or it could mean 2000 linear feet (20 collections each of which is 100 linear feet in size). Understanding this difference is something that visualization can help us with. Rather than communicating only the number of results – the visualization will communicate the total size of the collections assigned each of the various subjects.
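As a much-simplified sketch of that aggregation step – assuming stripped-down EAD with no namespaces and tidy extent statements, which real ArchivesUM and Library of Congress finding aids certainly are not – the idea is just to parse each finding aid and sum linear feet per subject:

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# A deliberately minimal, made-up finding aid fragment.
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <unittitle>Jane Doe Papers</unittitle>
      <physdesc><extent>12.5 linear feet</extent></physdesc>
    </did>
    <controlaccess>
      <subject>Maryland</subject>
      <subject>Agriculture</subject>
    </controlaccess>
  </archdesc>
</ead>
"""

def extent_in_linear_feet(text):
    """Pull the first number out of an extent statement like '12.5 linear feet'."""
    match = re.search(r"\d+(\.\d+)?", text or "")
    return float(match.group()) if match else 0.0

def feet_per_subject(ead_documents):
    """Total collection size (in linear feet) for each assigned subject."""
    totals = defaultdict(float)
    for xml_text in ead_documents:
        root = ET.fromstring(xml_text)
        extent = extent_in_linear_feet(root.findtext(".//physdesc/extent"))
        for subject in root.findall(".//controlaccess/subject"):
            totals[subject.text] += extent
    return dict(totals)

print(feet_per_subject([SAMPLE_EAD]))  # {'Maryland': 12.5, 'Agriculture': 12.5}
```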

I have uploaded two preliminary screen mockups (one here and the second here) trying to get at my ideas for how this might work.

Not reflected in the mock-ups is what could happen when a user clicks on the ‘related subject’ bars. Depending on where they click – one of two things could happen. If they click on the ‘related subject’ bar WITHIN the boundaries of the selected subject (in the case above, that would mean within the ‘Maryland’ box), then the search would filter further to only show those collections that have both the ‘Maryland’ subject and the newly added subject. The ‘related subjects’ list and displayed year distribution would change accordingly as well. If, instead, the user clicks on a ‘related subject’ bar OUTSIDE the boundary of the selected subject — then that subject would become the new (and only) selected subject and the displayed collections, related subjects and years would change accordingly.
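In code, the two click behaviors boil down to simple set logic. The collection-to-subject assignments below are hypothetical:

```python
# Hypothetical collections mapped to their assigned subjects.
collections = {
    "Tobacco Farming Records": {"Maryland", "Agriculture"},
    "Chesapeake Bay Photographs": {"Maryland", "Photographs"},
    "Midwest Grange Papers": {"Agriculture"},
}

def refine(selected, clicked):
    """Click INSIDE the selected subject's box: require both subjects."""
    return {title for title, subjects in collections.items()
            if selected in subjects and clicked in subjects}

def switch(clicked):
    """Click OUTSIDE the box: the clicked subject becomes the only filter."""
    return {title for title, subjects in collections.items() if clicked in subjects}

print(refine("Maryland", "Agriculture"))  # {'Tobacco Farming Records'}
print(switch("Agriculture"))              # both agriculture collections
```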

So that is what we have so far. If you want to keep an eye on our progress, our team has a page up on our class wiki about this project. I have a ton of ideas of other things I would love to add to this (my favorite being a map of the world with indications of where the largest amount of archival materials can be found based on a keyword or subject search) – but we have to keep our feet on the ground long enough to actually build something for our class project. This is probably a good thing. Smaller goals make for a greater chance of success.

Google, Privacy, Records Management and Archives

BoingBoing.net posted on March 14 and March 15 about Google’s announcement of a plan to change their log retention policy. Their new plan is to strip parts of IP data from records in order to protect privacy. Read more in the AP article covering the announcement.

For those who are not familiar with them – IP addresses are made up of four numbers between 0 and 255 and look something like 192.39.228.3. To see how good a job they can do figuring out the location you are in right now – go to IP Address or IP Address Guide (click on ‘Find City’).

Google currently keeps IP addresses and their corresponding search requests in their log files (more on this in the personal info section of their Privacy Policy). Their new plan is that after 18-24 months they will permanently erase part of the IP address, so that the address no longer can point to a single computer – rather it would point to a set of 256 computers (according to the AP article linked above).
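Google has not said exactly how the redaction will work, but the effect described in the AP article amounts to something like zeroing out the final octet, so that a log entry points at a block of 256 addresses rather than a single machine. A minimal sketch:

```python
# Illustration only -- not Google's actual redaction process.
def redact_ip(ip_address):
    """Zero the last octet so the address identifies a /24 block, not one machine."""
    first_three = ip_address.split(".")[:3]
    return ".".join(first_three + ["0"])  # e.g. 192.39.228.17 -> 192.39.228.0

# A made-up log entry: the query survives, but only the 256-address block remains.
log_line = {"ip": "192.39.228.17", "query": "civil war pension records"}
log_line["ip"] = redact_ip(log_line["ip"])
print(log_line)
```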

Their choice to permanently redact these records after a set amount of time is interesting. They don’t want to get rid of the records – just remove the IP addresses to reduce the chance that those records could be traced back to specific individuals. This policy will be retroactive – so all log records more than 18-24 months old will be modified.

I am not going to talk about how good an idea this is.. or whether it goes far enough (plenty of others are doing that, see articles at EFF and Wired’s 27B Stroke 6). I want to explore the impact of choices like these on the records we will have the opportunity to preserve in archives in the future.

With my ‘archives’ hat on – the bigger question here is how much the information that Google captures in the process of doing their business could be worth to the historians of the future. I wonder if we will one day regret the fact that the only way to protect the privacy of those who have done Google searches is to erase part of the electronic trail. One of the archivist’s tenets is to never do anything to a record that you cannot undo. In order for Google to succeed at their goal (making the records useless to government investigators) – it will HAVE to be done such that it cannot be undone.

In my information visualization course yesterday, our professor spoke about how great maps are at tying information down. We understand maps and they make a fabulous stable framework upon which we can organize large volumes of information. It sounds like the new modified log records would still permit a general connection to the physical geographic world – so that is a good thing. I do wonder if the ‘edited’ versions of the log records will still permit the grouping of search requests such that they can be identified as having been performed by the same person (or at least from the same computer). Without the context of other searches by the same person/computer, would this data still be useful to a historian? Would being able to examine the searches of a ‘community’ of 256 computers be useful (if that is what the IP change means)?

What if Google could lock up the unmodified version of those stats in a box for 100 years (and we could still read the media it is recorded on and we had documentation telling us what the values meant and we had software that could read the records)? What could a researcher discover about the interests of those of us who used Google in 2007? Would we lose a lot if we didn’t know what each individual user searched for? Would it be enough to know what a gillion groups of 256 people/computers from around the world were searching for – or would losing that tie to an individual turn the data into noise?

Privacy has been such a major issue with the records of many businesses in the past. Health records and school records spring to mind. I also find myself thinking of Arthur Andersen, which would not have gotten into trouble for shredding records if it had done so according to its own records disposition schedules and policies. Googling Electronic Document Retention Policy got me over a million hits. Lots of people (lawyers in particular) have posted articles all over the web talking about the importance of a well implemented Electronic Document Retention Policy. I was intrigued by the final line of a USAToday article from January 2006 about Google and their battle with the government over a pornography investigation:

Google has no stated guidelines on how long it keeps data, leading critics to warn that retention could be for years because of inexpensive data-storage costs.

That isn’t true any longer.

For me, this choice by Google has illuminated a previously hidden perfect storm. That the US government often requests this sort of log data is clear, though Google will not say how often. The intersection of concerns about privacy, government investigations, document retention and tremendous volumes of private sector business data seems destined to cause more major choices such as the one Google has just announced. I just wonder what the researchers of the future will think of what we leave in our wake.