Spellbound Blog

My New Daydream: A Hosting Service for Digitized Collections

September 20, 2006 3 Comments

In her post Predictions over on hangingtogether.org, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.

And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist’s perspective (also having just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.

The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.

Being the optimist I am, I of course imagine this service as providing easy-to-use software with the flexibility for building custom DTDs for metadata and security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web-based solution that supports everything needed by the archival community. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.

I think the biggest difference between this idea and the wikipedias of the world is that there would be some step required for an institution to ‘join’ such that they could use this service. The service wouldn’t control the content (in fact would need to be super careful about security and the like considering all the issues related to privacy and copyright) – rather it would provide the tools to support the work of others. While I know that some institutions would not be willing to let ‘control’ of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.

There would still be a place for the Archons and the Archivists’ Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists’ interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collection and provide access easily without being forced to invent a new wheel along the way.

If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.

Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-IT.org. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”

So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”

Just Promise to Put It All Online

September 19, 2006

As reported in Inside Higher Ed’s article Harming the Historical Record, the NEH Guidelines for Scholarly Editions Grants have been updated. The crucial passage is as follows:

In keeping with the goals of the NEH Digital Humanities Initiative, the Scholarly Editions Program requires that applicants employ digital technology in the preparation, management, and online publication of all critical and documentary editions. Projects that include TEI (Text Encoding Initiative) conformant transcription and offer free online access are encouraged and will be given preference. (emphasis mine)

Offering free online access is encouraged (not required) and the description of the Digital Humanities Initiative does sound inspiring. It includes this sentence:

[The] NEH is interested in fostering the growth of digital humanities and lending support to a wide variety of projects, including those that deploy digital technologies and methods to enhance our understanding of a topic or issue; those that study the impact of digital technology on the humanities–exploring the ways in which it changes how we read, write, think, and learn; and those that digitize important materials thereby increasing the public’s ability to search and access humanities information.

I love a lot of the sites that they list as having been sponsored by the NEH (such as Valley of the Shadow and Maryland Institute for Technology in the Humanities), and I am always one for “increasing the public’s ability to search and access humanities information”, but it is so frustrating that the glamour of digital access to records would cross over into requirements for funding Scholarly Editions. The core goal of these grants is described as “[to] support the preparation by a team of at least two editors and staff of texts and documents that are currently inaccessible or available in inadequate editions.”

I feel strongly that this sort of expectation for digitization is rarely set with a full understanding of all the other elements that need consideration, ongoing support and financial backing.

First of all – what does it mean to be ‘put online’? While I understand that they likely desire online access to some version of the scholarly edition created with the grant funding – the requirement is still very vague. One could easily wonder: are we talking about images of the records? Transcriptions of the text of the records? What sort of supporting data must be provided? It isn’t as if you can just upload 10,000 scans of records and create a single page with links to them and call it a day. Of course no-one would think that interface was useful, but it certainly could be considered as being online. Will the grants provide some provision for supporting the online sites in subsequent years? Websites need hardware, bandwidth and support from IT personnel. Unfortunately there is no accepted, open-source, freely hosted solution for serving up digital records. Some institutions have been experimenting with using Flickr as a Digital Collection Host – but that entire topic (and all the issues inherent therein) is fodder for another post in the future.

Next let us consider copyright and privacy issues. Many archival collections are kept, supported and maintained by an archival institution that does NOT in fact retain the copyright to the records. To demand that a project promise to publish all records for free online would unfairly punish collections that do not have the right to publish all the records online even if they have secured the rights to publish records in books. On the privacy side – archivists must often restrict access to certain records or selected series of a collection due to the private information about individuals included in those records or series. This presents yet another challenge to blanket digitization requirements.

The Inside Higher Ed article went on to mention that by requiring free publication of records online, the NEH is removing creative ways for institutions to find additional funding to support their important work. “Virginia is a major player in archival series, publishing – among others – the Papers of George Washington and the Papers of James Madison. Much material from those projects is placed online, free, Kaiserlian said. But Virginia is also selling site licenses to libraries to enable them to have access to everything, while supporting the work that goes into the project.” As the sources for funding for humanities projects such as these are shrinking every day, it is unfortunate that a grant might force institutions to consider the income they would lose if they apply for a grant with such strings attached.

Browsing through the rest of the NEH Digital Humanities Initiative website I did find a lot to be enthusiastic and hopeful about – such as the Digital Humanities Start-Up Grants :

NEH’s Digital Humanities Start-Up Grants will encourage scholars with bright new ideas and provide the funds to get their projects off the ground. Some projects will be practical, others completely blue sky. Some will fail while others will succeed wildly and develop into important projects. But all will incorporate new ways of studying the humanities.

I love it. I want to see what they fund. I want to participate. I want that grant to still exist when I am done with my graduate degree and have more focus to my ideas.

Browsing the sidebar of the main Digital Humanities Initiative page you can see how they are presenting all their grants in the context of being digital in some way. If I want to “create digital humanities tools for analyzing and manipulating humanities data” I should apply for either a Reference Materials Grant or a Research and Development Grant. If I want to “develop a Web site or other digital project for a general public audience” I should apply for a Special Projects Grant. And if I want to “create a digital or online version of a scholarly edition” I should apply for a Scholarly Editions Grant. In some ways it just feels as if they added a ‘digital’ element to all their grants without any other major restructuring, not that I am an expert on the history of NEH grants.

I wonder – if I had arrived at the NEH Digital Humanities Initiative page without prior knowledge of how these grants have been used in the past on existing projects, would I have ended up with a post with similar questions, but less frustration? Less stress over what appears to be perceived (if the quotes in the article at the start of this post are to be believed) as a major change to the structure of a grant that many doing fine and important work have come to depend upon? That said – I still think that the issues of vague online access expectations, the challenges related to privacy and copyright and the lack of ongoing funding to support websites and their patrons are real and worth consideration.

GIS, Access, Archives and Daydreams

September 13, 2006 9 Comments

Today in my Information Structure class, our topic was Entity Relationship Modeling. While this is a technique that I have used frequently over the many years I have been designing Oracle databases, it was interesting to see a slightly different spin on the ideas. The second half of class was an exercise to take a stab (as a class) at coming up with a preliminary data model for a mythical genealogical database system.

While deciding if we should model PLACE as an entity, a woman in our class who is a genealogy specialist told us that only one database she has ever worked with tries to do any validation of location – but that it is virtually impossible due to the scale of the problem. Since the borders and names of places on earth have changed so rapidly over time, and often with little remaining documentation, it is hard to correlate place names from archival records with fixed locations on the planet. Anyone who has waded through the fabulous ship records on the Ellis Island website hunting for information about their grandparents or great-grandparents has struggled with trying to understand how the place names on those records relate to the physical world we live in.

So – now to my daydream. Imagine if we could somehow work towards a consolidated GIS database that included place names and boundary information throughout history. Each GIS layer would relate to specific years or eras in time. Imagine if you could connect any set of archival records that contained location data to this GIS database and not only visualize the records via a map – but visualize the records with the ability to change the layers so you could see how the boundaries and place names changed. And view the relationship between records that have different place names on them from different eras – but are actually from the same location.
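To make the daydream a little more concrete, here is a minimal sketch of how place names tied to eras might be modeled so that two different names resolve to the same physical location. Everything in it is an assumption for illustration: the `PlaceName` class, the `resolve` function, the identifiers and the sample place names are all invented, and real boundary geometry (the actual GIS layers) is left out entirely.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlaceName:
    """One name a location carried during a span of years (hypothetical model)."""
    location_id: str          # stable identifier for the physical location
    name: str                 # name as it might appear on an archival record
    start_year: int           # first year the name applies (inclusive)
    end_year: Optional[int]   # last year it applies (inclusive); None = still in use


def resolve(names: List[PlaceName], name: str, year: int) -> List[str]:
    """Return the ids of every location that carried `name` in `year`."""
    wanted = name.lower()
    return [
        p.location_id
        for p in names
        if p.name.lower() == wanted
        and p.start_year <= year
        and (p.end_year is None or year <= p.end_year)
    ]


# Invented sample rows: one location that changed its name in 1920.
GAZETTEER = [
    PlaceName("loc-001", "Oldbury", 1700, 1919),
    PlaceName("loc-001", "Newbury Heights", 1920, None),
]

# A 1905 record and a 2006 record carry different names for the same place.
early = resolve(GAZETTEER, "Oldbury", 1905)
late = resolve(GAZETTEER, "Newbury Heights", 2006)
print(early == late)
```

The stable `location_id` is what lets records from different eras be linked. Competing versions of a name or border from the same period would show up as several rows matching the same name and year, which is why `resolve` returns a list rather than a single answer.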

I poked around to see what people are already doing – and found all of this:

Digital Earth and its more recently updated counterpart Geospatial Applications and Interoperability (GAI), a working group of the Federal Geographic Data Committee that seems to now exist within the National Geospatial Program Office of the USGS.
GOS – Geospatial One Stop which led me to the fabulous Lewis and Clark GeoSystems
The National Atlas (also found off GOS) that includes a special History Chapter (that starts to head in the direction I am imagining I think)
GEOnet Names Server (GNS) that provides access to the National Geospatial-Intelligence Agency’s (NGA) and the U.S. Board on Geographic Names‘ (US BGN) database of foreign geographic feature names (take this and add in a history element, and we are getting even warmer)
GIS for the Humanities – funded by a 2003 NEH Focus Grant, this project’s goal is “designed to create, and train faculty in the use of, mapping modules intended to enhance humanities courses”. I included this one because it gives a slice of the kind of teaching my dream GIS database could fuel.
And two clearinghouses for information: the US National Geospatial Data Clearinghouse and the United Nations Environment Programme / Global Resource Information Database (UNEP/GRID) Spatial Data Clearinghouse

I know it is a daydream – but I believe in my heart of hearts that it will exist someday as computing power increases, the price of storing data decreases and more data sources converge. I do foresee another issue related to the challenges presented by different versions of borders and place names from the same time period – but there are ways to address that too. It could happen – believe with me!

Ideas about Zotero and Digitized Archives

September 9, 2006 9 Comments

Dan Cohen posted recently about the soon-to-be-available, open-source Firefox plugin for research support named Zotero. Looking at the quick start guide, I immediately spotted the icon to “add a new collection folder”. As the “archivist-in-training” that I am, my reaction now to the word “collection” is different than it would have been a year ago. Though I strongly suspect it will not be the case (at least not in the first released version) I immediately was daydreaming of browsing a digitized collection online – clicking the “add a new collection folder” icon – and ending up with a copy of the entire collection of records for examination and comparison later.

Of course this would be most useful for the historian digging through and analyzing archival records if Zotero was able to pull down metadata beyond that of a standard citation and retain any hierarchical information or relationships among the records.

Dead Reckoning‘s post on Zotero mentions RDFa. I don’t know anything about RDFa beyond what I have read in the last few hours, so it is not clear to me how complicated the metadata can be – perhaps it can support a full digital object XML record of some kind. So maybe the trick isn’t so much getting Zotero to do things it wasn’t designed to do – but rather the slow migration of sites to using the software packages and standards listed here.

I don’t want anyone to think that I am not excited about Zotero and all the neat things it is likely to do. I suspect I will rapidly become a frequent Zotero user verging on a zealot – but it is fun to daydream. I think it is most fun to daydream now, before I start using it and get lost in all the great stuff it CAN do. I definitely will post more after I get a chance to take it for a spin in early October.

Google Newspaper Archives

September 6, 2006

I was intrigued by the news that Google had launched a News Archive search interface. For my first search, I searched on “Banjo Dancing” (a one man show that spent most of the 1980s in Arena Stage‘s Old Vat Room). It was tantalizing to see articles from “way back when” appear. The ‘timeline’ format was a very useful way to quickly move through the articles and help focus your search.

Many newspapers that provide online access to their archives charge a per article fee for viewing the full article. You are not charged when you click on the link – instead you get a chance to view some sort of short abstract before paying. The advanced search permits you to limit your results based on their cost – so you can search only for those articles which are free or cost below a specific amount. By modifying my original search to only include free articles I found three, one from 1979, one from 2002 and one which did not yield anything.

So what does this mean for archives? In their FAQ, Google states “If you have a historical archive that you think would be a good fit in News archive search, we would love to hear from you.” Take a moment and think about that – archives with digitized news content could raise their hand and ask to be included. Google has suddenly put the tools for increasing access in the hands of everyone. The university that has digitized its newspapers can suddenly be put on the same level with the New York Times and the Washington Post. There currently does not seem to be a fixed list showing “these are the news sources included in the Google news archive” – but I hope they add one.

In their usual fashion, Google has increased the chance of the serendipitous discovery of information – but because everything in the news archive will come from a vetted source, the quality and reliability of the information found should be far above that of your standard web search.

SAA 2006: Research Library Group Roundtable – Internet Archiving

August 29, 2006

Late in the afternoon on Thursday August 3rd I attended the Research Library Group Roundtable at SAA 2006. It was an opportunity for RLG to share information with the archival community about their latest products and services. This session included presentations on the Internet Archive , Archive-It and the Web Archives Workbench.

After some brief business related to the SAA 2007 program committee and the rapid election of Brian Stevens from NYU Archives as the new chair of the group, Anne Van Camp spoke about the period of transition as RLG merges with OCLC. In the interest of the blending of cultures – she told a bar joke (as all OCLC meetings apparently begin). She explained that RLG products and services will be integrated into the OCLC product line. RLG programs will continue as RLG becomes the research arm for the joined interest areas of libraries, archives and museums. This has not existed before and they believe it will be a great chance to explore things in ways that RLG hasn’t had the opportunity to do in the past.

The initiatives on their agenda:

archival gateways: convened 2 meetings recently. The first to see if there is a way to be interactive with international archive databases and the second to bring regional archives together to see how they can work together.
web archiving: started looking at it from a service point of view, but also some community issues that have to be worked out around web archiving. Looking at big problems that will need community involvement – issues like metadata and selection.
standards: continuing to support EAD, pursuing rigorous agenda regarding EAC
OCLC has a whole group of people who work on registries (where you put information about organizations). RLG has talked about building a registry on top of Archive Grid of US archives.

In her introduction, Merrilee (frequent poster on hangingtogether.org ) highlighted that there are lots of questions about the intellectual side of web archiving (vs the technical challenges) such as:

what to archive?
what metadata and description is appropriate for it?
what would end users of web archives need? How would they use a web archive?
what about collaborative collection development? It is expensive to archive the web – how does an institution say “I am archiving this corner of the web – this deep – this often”. This information should be publicly available for others doing research and others archiving the web.

She pointed out that RLG is happy about their work with Internet Archive – they are doing work to make the technical side easier but they understand that there is a lot for the archival community to sort out.

Next up was Kristine Hanna of the Internet Archive giving her presentation ‘Archiving and Preserving the Web’. The Internet Archive has been working with RLG this year and they need information from the users in the RLG community. They are looking into how they are going to work with OCLC and have applied for an NDIIPP grant.

The Internet Archive (IA), founded by Brewster Kahle in 1996, is built on open source principles and dedicated to Open Source software.

What do they collect in the archive? Over 2 billion pages a month in 21 languages. It is free and the largest archive on the web including 55 billion pages from 55 million sites and supporting 60,000 unique users per day.

Why try to collect it all? They don’t feel comfortable making the choices about appraisal. And at-risk websites and collections are disappearing all the time. The average lifespan of a web page is 100 days. They did a case study of crawling websites associated with the Nigerian election – 6 months after the election 70% of the crawled sites were gone, but they live on in the archive.

How do they collect? They use these components and tools:

Heritrix – web crawler
Wayback Machine – access tools for rendering and viewing files
Nutch – search engine
Arc File – archival file format used for preservation

How do they preserve it? They keep multiple copies at different digital repositories (CA, Alexandria (Egypt), France, Amsterdam) using over 1300 server machines.

IA also does targeted archiving for partners. Institutions that want to create specific online collections or curated domain crawls can work with IA. These archives start at 100+ million documents and are based on crawls run by IA crawl engineers. The Library of Congress has arranged for an assortment of targeted archives including archives of US National Elections 2000, September 11 and the War in Iraq (not accessible yet – marked March 2003 – Ongoing). Australia arranged for archiving of the entire .au domain. Also see Purpose, Pragmatism and Perspective – Preserving Australian Web Resources at the National Library of Australia by Paul Koerbin of the National Library of Australia and published in February of 2006.

What’s Next for Internet Archive?

collaboration and partnerships
OCA – open content alliance
Multiple copies around the world

Next, Dan Avery of IA gave a 9 minute version of his 35 minute presentation on Archive-It. Archive-It is a web based annual subscription service provided by IA to permit the capture of up to 10 million pages. Kristine gave some examples of those using Archive-It during her presentation:

Indiana University – web sites
North Carolina State Archives – Government Agencies, Occupational Licensing Boards and commissions.
Library of Virginia – Jamestown 2007 commemoration and Governor Mark Warner’s last year in office. When Mark Warner was listed by the New York Times as a possible presidential candidate, this archive got lots of hits. (This brings up interesting questions of watching content that is being purposefully preserved to get an idea of what some expect for the future. Don’t be surprised by a post on this idea all by itself later. Need to think about it some more!)

He highlighted the different elements and techniques used in Archive-It: crawling, web user interface, storage, playback, text indexing and integration.

Crawling/Browsing:
- Heritrix :
  - open source java
  - Archival-quality (they preserve exactly what they get back from the server)
  - Highly configurable
- Wayback Machine :
  - lets you surf the web as it was
  - in Archive-It – each customer has their own wayback machine
  - not open source yet.. that is a work in progress
The user interface is a web application:
- collects all the info they need to do the crawling the customer requests
- schedule (monthly, daily, weekly, quarterly… etc)
- seed URLs (the starting point for archive web crawls)
- crawl parameters
NutchWAX
- extension of Nutch which is built on Lucene
- full text search plus link analysis
- can search by date instead of relevance – useful for individual archives
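
The three things the web application collects (schedule, seed URLs and crawl parameters) can be pictured as a single crawl request. The sketch below is not Archive-It’s actual data model or API; the `CrawlRequest` class, its field names and the list of allowed schedules are assumptions made up purely to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed set of schedule choices, based on the options mentioned in the talk.
ALLOWED_SCHEDULES = {"daily", "weekly", "monthly", "quarterly"}


@dataclass
class CrawlRequest:
    """Hypothetical bundle of what a customer asks the crawler to do."""
    collection_id: str                  # unique collection identifier
    seed_urls: List[str]                # starting points for the crawl
    schedule: str = "monthly"           # how often to repeat the crawl
    parameters: Dict[str, str] = field(default_factory=dict)  # free-form crawl settings

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if none)."""
        problems = []
        if not self.seed_urls:
            problems.append("at least one seed URL is required")
        if self.schedule not in ALLOWED_SCHEDULES:
            problems.append(f"unknown schedule: {self.schedule}")
        for url in self.seed_urls:
            if not url.startswith(("http://", "https://")):
                problems.append(f"seed URL is not http(s): {url}")
        return problems


request = CrawlRequest("demo-collection", ["http://example.org/"], schedule="weekly")
print(request.validate())
```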

While there are public collections in Archive-It, logging in gives you access to personal sites: shows the total documents archived (and more), lets you check your list of active collections and set up a new collection (includes unique collection identifier). He showed some screen shots of the interface and examples (this was the first time there wasn’t a network available for his presentation – he was amused that his paranoia that forced him to always bring screen captures finally paid off!).

It was interesting seeing this presentation back to back with the general Internet Archive overview. There is a lot of overlap in tools and approaches between them – but Archive-It definitely has its own unique requirements. It puts the tools for managing large-scale web crawling in the hands of archivists (or more likely information managers of some sort) – rather than the technical staff of IA.

The final presentation of the roundtable was by Judy Cobb – a Product Manager from OCLC. She gave an overview of the Web Archives Workbench. (I hunted for a good link to this – but the best I came up with was the acknowledgments document and the login page.) The inspiration for the creation of Workbench was the challenge of collecting from the web. The Internet is a big place. It is hard to define the scope of what to archive.

Workbench is a discovery tool that will permit its users to investigate what domains should be included when crawling a website for archiving. It will ask you which domains should be included. For example, you can tell it not to crawl Adobe.com just because there is a link to it to let people download Acrobat.
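
The scoping idea itself is simple enough to sketch. The snippet below is not how Workbench works internally (I have no idea how it is implemented); it is just an illustration, with a made-up `in_scope` function and example domains, of deciding whether a discovered link belongs to the set of domains an archivist marked as in scope.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


# Hypothetical scope: only the institution's own domain should be crawled.
scope = {"example.edu"}

links = [
    "http://www.example.edu/archives/finding-aids.html",
    "http://www.adobe.com/products/acrobat/",
]
to_crawl = [link for link in links if in_scope(link, scope)]
print(to_crawl)
```

Matching on the whole host name (rather than a plain substring test) keeps a look-alike host such as notexample.edu from slipping into scope.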

Workbench will let you set metadata for your collection based on the domains you said were in scope. It will then let you appraise and rank the entities/domains being harvested, leaving you with a list of organizations or entities in scope and ranked by importance. Next it will translate a site map of what is going to be crawled, define parts of the map as series and put the harvested content and related metadata into a repository. Other configuration options permit setting how frequently you harvest various series, choosing to only get new content and requesting notification if the sitemap changes.

Workbench is currently in beta and is still under development. The 3rd phase will add the support for Richard Pearce-Moses’s Arizona Model for Web Preservation and Access. The focus of the Arizona Model is curation, not technology. It strives to find a solution somewhere between manual harvesting and bulk harvesting that is based on standard archival theories. Workbench will be open source and funded by LOC.

I wasn’t sure what to expect from the roundtable – but I was VERY glad that I attended. The group was very enthusiastic – cramming in everything they could manage to share with those in the room. The Internet Archive, Archive-It and the Web Archives Workbench represent the front of the pack of software tools intended to support archiving the web. It was easy to see that if the Workbench is integrated with Archive-It, it should permit archivists to start paying more attention to the identification of what should be archived rather than figuring out how to do the actual archiving.

SAA2006 Session 305: Extended Archival Description Part III – EAD and TEI

August 20, 2006

Amanda Wilson of the Ohio State University Libraries delivered the final presentation of SAA2006 session 305 (Extended Archival Description: Context and Specificity for Digital Objects), Dynamic Duo: Enhancing Access through Dual Description with EAD and TEI. She described a proof of concept project designed to explore if EAD and TEI can be used to support a humanities professor who has students learning how to digitize and add markup.

She provided the following list of example sites:

Walt Whitman Archive
LEADERS Project ‘Linking EAD to Electronically Retrievable Sources’ – transcriptions, original images and metadata
Barren Lands Digital Collection – University of Toronto, using EAD to enhance item level descriptions.

The professor’s goals for this year’s project are to create a home page that includes a collection description, document the scholarly process and follow markup rules. Amanda got a big cheer for saying she was “not sure if you can keep scholarly process in EAD – but hey, I’ll try anything”.

For each item being digitized the professor wanted to include all of the following:

Markup
Reading View
Diplomatic View – summary of all marks
XML source (aka TEI file – AACR2 bibliographic record could be created from this data)

How can she replicate the process the class was going through to support these goals? First, she picked the software created by DLXS that was already being used on site.

During the course of her research, she had to come up with methods to do all of the following:

Retrieve metadata and images
Convert image to TIFF
Figure out how to connect from EAD to the TEI information
Load TEI and EAD files and images into DLXS

The solution would need to support the community and the work they have already done. Amanda’s vision was to permit addition of item level description to the EAD with no additional editing AND load the TEI with little or no modification. It was a challenge to massage the TEI to validate against the Document Type Definition (DTD). She also wanted to link back to the original site created by the professor’s students in order to retain extra information that had no place in the EAD.

At about this point I was wishing (for the umpteenth time at SAA) that there was internet access for online demos.

There were definitely challenges. For example, DLXS is aimed at eBooks, while TEI has additional fields such as those required to support properties related to “hand” (as in who wrote the scanned content). This makes it hard to change the TEI format to fit into DLXS. Using TEI in DLXS does permit searching for individual items, but they may need to do additional massaging to get the data to line up with the fields expected.

The conclusion of the presentation was that it can be done. It is possible to integrate the professor’s transcriptions using EAD and TEI within DLXS, but there will need to be more discussion with the faculty member about their requirements. The ultimate aim is for a federated search that is integrated into the institution’s central search. Those who are working on similar projects may be interested in moving to using a standard, but only up to the point at which they start losing the data that is important to them for their research focus.

SAA 2006 Poster: Communicating Context in Online Collections

August 15, 2006

I promised a number of people I spoke with at the SAA 2006 conference that I would post information from my poster. I have finally added it on a page here on the blog.

For those of you who didn’t make the conference (or didn’t make it to my mini talk in front of my poster on Friday morning), my poster showed the results of my research into how the web interfaces to various digitized archival collections handled the issues of original order and communication of context. I was very interested to see to what degree websites for digitized collections were doing a good job helping the user understand the relationships between the records as well as the context of the records.

Most people asked me which was my ‘favorite’ – and my answer was always that I liked something about each of the sites I showed on my poster. A perfect site would have the collection overview that the Library of Congress American Memory – Browse Collections page shows, the convenient search result resorting option shown on the Greene & Greene Virtual Archives search result page, the item details display option provided on the Irene Kaufman Settlement Photograph Collection ‘images with full record’ search results page, the clear communication of hierarchy shown in the Yoshiko Uchida example of the GenView MOA2 document viewer and a rich use of audio, images and in-place historical context as is done on the Gilder Lehrman Wartime Love Letters site. The big answer I found from all of this was that planning ahead was key. If you keep metadata related to the order of the records being digitized, it gives you the opportunity to do good things with that information when building your interface.
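As a small illustration of what planning ahead might look like, here is a sketch of capturing original order at scan time. It is only an assumption about how one might record it: the field names (`box`, `folder`, `sequence`) and the filenames are made up, and none of the sites on my poster necessarily work this way.

```python
from typing import NamedTuple

class ScanRecord(NamedTuple):
    filename: str
    box: int
    folder: int
    sequence: int  # position of the item within its folder

def original_order(records):
    """Return the scans sorted back into their physical arrangement."""
    return sorted(records, key=lambda r: (r.box, r.folder, r.sequence))

# Scans rarely come off the scanner in original order
scans = [
    ScanRecord("img_0042.tif", box=2, folder=1, sequence=1),
    ScanRecord("img_0007.tif", box=1, folder=3, sequence=2),
    ScanRecord("img_0019.tif", box=1, folder=3, sequence=1),
]
ordered = [r.filename for r in original_order(scans)]
```

With those three numbers stored for every image, an interface can offer "previous" and "next" links that follow the order of the physical records instead of the order of the filenames.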

On the ‘Poster page’ I have included a list of links to the websites I used as my examples, my key points and a thumbnail of the poster with a link to download a BIG version (you will need to scroll around a good bit – but you should be able to read it in the large version).

If you have questions – just let me know. I can always be reached via email at jeanne AT spellboundblog.com.

SAA2006 Session 305: Extended Archival Description Part II – ArchivesUM

August 11, 2006

The second part of SAA2006 session 305 (Extended Archival Description: Context and Specificity for Digital Objects) was a presentation titled “Understanding from Context: Pairing EAD and Digital Repository Description”. Delivered by Ann Handler and Jennifer O’Brian of the University of Maryland’s ArchivesUM project, this talk explained the general approach being used to tackle the challenges of managing item level description of digitized images at a large, diverse institution.

They described the tension between collection level descriptions and the inheritance of attributes by items. With more images being digitized all the time, storing the images on a network drive made them hard to find or inventory. Different departments at the University of Maryland described images at different levels and using different software. While one department had just folder collection level descriptions, another set of 3000 photos was described at the item level but without adherence to any standard.

A flat system of description was not the answer because it provided no way to include hierarchical description. Their answer was to combine archival description via finding aids with well structured item level description.

As the ArchivesUM project defines them, an item can be part of more than one collection and could be part of a collection that represents the source of the item as well as an online exhibition. The team also wanted to create relationships between the existing repository of finding aids and new digital objects. In order to accomplish everything they required (and in contrast with the Archives of American Art’s choice of ColdFusion), ArchivesUM selected FEDORA, an open source digital object repository, for their development. They created a transitional database using SQL Server to track and store the metadata of newly scanned digital objects so that nothing would be lost while the custom FEDORA system was being developed. The digital object repository uses a rich descriptive standard based on Dublin Core and Visual Resources Association (VRA) Core.
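The idea that one item can live in several collections at once is easier to see in a sketch. The code below is purely my own illustration of that many-to-many relationship. It is not FEDORA's API or the ArchivesUM schema, and the class names, identifiers and the handful of Dublin Core style fields are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalObject:
    identifier: str
    title: str                      # Dublin Core style "title"
    creator: str = ""               # Dublin Core style "creator"
    collections: set = field(default_factory=set)

@dataclass
class Collection:
    identifier: str
    kind: str                       # e.g. "source" or "exhibition"
    members: set = field(default_factory=set)

def link(item, collection):
    """Record the membership on both sides of the relationship."""
    item.collections.add(collection.identifier)
    collection.members.add(item.identifier)

# Hypothetical identifiers, for illustration only
photo = DigitalObject("obj-001", "Campus view")
source = Collection("coll-papers", kind="source")
exhibit = Collection("coll-exhibit", kind="exhibition")
link(photo, source)
link(photo, exhibit)
```

Because membership is stored as a set of identifiers and not as a single parent, the same photo can be found through the collection it came from and through an online exhibition without being described twice.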

In order to connect to other kinds of objects from non-archival collections and other sources they simply add them in the finding aid. To present the finding aid they needed new style sheets. Their improved layout helps users understand the hierarchy of the collection and how to find the individual record they are looking for, lets them move up and down through the tree easily to explore and helps them understand the context of the record they are viewing.

One of the questions that was asked at the end of the session related to the possible drawbacks of linking just one or two digitized documents to a finding aid. The concern was that it might mislead the user into believing that these few documents were the only documents in the collection held by the archives. The response was strong, asserting that the opposite was true – that having the link into the finding aid gave users the context they desperately needed to understand the few digitized records and point them in the right direction for finding the rest of the offline collection. This is especially important in the Google universe. Users may end up viewing a single digitized record from an online collection and, without a link back to the finding aid for the collection, not understand the meaning of the record in question.

The ArchivesUM team is currently setting the stage for migration to the live system. I look forward to exploring the actual user interface after it does.

Question from the Archives of American Art and EAD talk (session 305)

August 9, 2006 2 Comments

At the end of the Extended Archival Description panel, someone in the audience asked if ColdFusion and ASP were used for the Archives of American Art project. The response was interesting. The answer was yes to ColdFusion and no to ASP. That wasn’t the interesting part. The part I was intrigued by was the reasons WHY they had used ColdFusion.

The developer on the project was there and stood to add his 2 cents. He said these were the reasons for the choice of ColdFusion:

The Smithsonian is not enthusiastic about open source software
The Smithsonian is not unfriendly towards ColdFusion
He knew ColdFusion very well

This immediately made me think of a recent post at Creating Passionate Users: When the “best tool for the job”… isn’t. In her post, Kathy Sierra talks about other factors to weigh when choosing a software tool to solve a problem OTHER than what is the best tool for the job based on the features of all the options. She proposes (in what she admits is a sweeping generalization) that enthusiasm for a tool be weighed more heavily than its pure appropriateness for the task when selecting which tool to use.

I am not saying that ColdFusion was necessarily the AAA developer’s first choice – but that it is interesting to remember that there are LOTS of different elements that go into choosing software to address the challenges at the intersection of archives and the internet. One of those things is simply the skills of the people you have to work on a project – and their enthusiasm for the tools at hand.