Menu Close

Category: context

Session 305: Extended Archival Description Part I – Archives of American Art

Session 305 included perspectives from three digital collections which are trying to use EAD and meta data to solve real world problems of navigation and access. This post addresses the presentation by the first speaker, Barbara Aikens from the Archives of American Art at the Smithsonian.

The Archives of American Art (AAA) has over 4,500 collections focusing on the history of American art. They received a 3.6 million dollar grant from the Terra Foundation to fund their 5 year project. They had already been using EAD for their standard in online finding aids since 2004. They also had already looked into digitizing their microfilmed holdings and they believe that the history of microfilming at AAA made the transition to scanning entire collections at the item level easier than it might otherwise have been. So far they have digitized 11 full collections (45 linear feet).

Their organization of the digitized files was based on collection code, box and folder. Basing their template on the EAD Cookbook, AAA used Note Tab Pro to create their XML EAD finding aid. I wonder how they might be able to take advantage of the open source software tools being developed such as Archon and the Archivists’ Toolkit (if you are interested in these packages, keep your eye open for my future post looking at them each in detail). There was some mention of re-purposing DCDs, but I was not clear about what they were describing.

The resulting online finding aid lets you read all the information you would expect to find in a finding aid (see an example), as well as permitting you to drill down into each series or container to view a list of folders. Finally the folder view provides thumbnails on the left and a big image on the right. Note that this item level folder view includes very basic folder meta data and a link back to that folder’s corresponding series page. There is no meta data for any of the images of individual items. This approach for organizing and viewing digitized collections is workable for large collections. The context is well communicated and the user’s experience is very like that of going through a collection while physically visiting an archive. First you use the finding aid to location collections of interest. Next you examine the Series and or Container descriptions to location the types of information for which you are looking. Finally, you can drill down to folders with enticing names to see if you can find what you need.

As an experiment, I tested the ‘Search within Collections/Finding Aids’ option by searching for “Downtown Gallery” and for gallery artist files to see if I was given a link to the new Downtown Gallery Records finding aid. My search for “Downtown Gallery” instead directed me to what appears to be a MARC record in the Smithsonian Archives, Manuscripts and Photographs catalog. Two versions of the finding aid are linked to from this record – with no indication as to how they are different (it turned out one was an old version – the other the new one which includes links to the digitized content). A bit more experimentation showed me that the new online collection finding aids are not integrated into the search. I will have to remember to try this sort of searching in a few months to see what the search experience is like.

What I was hoping for (in a perfect world) would be highlighting of the search terms and deep linking from the search results directly to the series and folder description pages. I wonder what side effects there will be for the accuracy of search results given that the series/folder detail description page does not include all the other text from the main finding aid. (ie New Finding Aid vs New Finding Aid Series Level Page). Oddly enough – the old version of the finding aid for this same collection includes the folder level descriptions on the SAME page (with HTML anchors permitting linking from the side bar Table of Contents to the correct location on the page). So a search for terms that appear in the historical background along with the name of an artist only listed at the folder level WOULD return results (in standard text searching) for the old finding aid but not for the new one. Once the new finding aids are integrated into the search results – it would be very helpful to have an option to only return finding aids that include digitized collections.

While exploring the folder level view, I assumed that the order of the images in the folders is the original order in the analog folder. If so, then that is a fabulous and elegant way of communicating the original order of the records to the user of the digital interface. If NOT – then it is quite misleading because a user could easily assume, as I did, that the order in which they are displayed in the folder view is the original order.

Overall, this is exciting work – and shows how well the EAD can function as a framework for the item level digitization of documents. It also points to some interesting questions about how to handle search within this type of framework.

UPDATE: See the comment below for the clarification that the new finding aids based on the work described in this presentation are NOT online yet – but should be at the end of the month (posted: 08/09/2006). 

Thoughts on Archiving Web Sites

Shortly after my last post, a thread surfaced on the Archives Listserv asking the best way to crawl and record the top few layers of a website. This led to many posts suggesting all sorts of software geared toward this purpose. This post shares some of my thinking on the subject.

Adobe Acrobat can capture a website and convert it into a PDF. As pointed out in the thread above, that would loose the original source HTML – yet there are more issues than that alone. It would also loose any interaction other than links to other pages. It is not clear to me what would happen to a video or flash interface on a site being ‘captured’ by Acrobat. Quoting a lesson for Acrobat7 titled Working with the Web : “Acrobat can download HTML pages, JPEG, PNG, SWF, and GIF graphics (including the last frame of animated GIFs), text files, image maps and form fields. HTML pages can include tables, linkes, frames, background colors, text colors, and forms. Cascading Stylesheets are supported. HTML links are turned into Web links, and HTML forms are turned into PDF forms.”

I looked at a few website HTML capture programs such as Heritrix, Teleport Pro, HTTrack Web and the related ProxyTrack. I hope to take the time to compare each of these options and discover what it does when confronted with something more complicated than HTML, images or cascading style sheets. It also got me thinking about HTML and versions of browsers. It think it safe to say that most people who browse the internet with any regularity have had the experience of viewing a page that just didn’t look right. Not looking right might be anything from strange alignment or odd fonts all the way to a page that is completely illegible. If you are a bit of a geek (like me) you might have gotten clever and tried another browser to see if it looked any better. Sometimes it does – sometimes it doesn’t. Some sites make you install something special (flash or some other type of plugin or local program).

Where does this leave us when archiving websites? A website is much more than just it’s text. If the text were all we worried about I am sure you could crawl and record (or screen scrape) just the text and links and call it a day being fairly confident that text stored as a plain ASCII file (with some special notation for links) would continue to be readable even if browsers disappeared from the world. While keeping the words is useful, it also looses a lot of the intended meaning. Have you read full text journal articles online that don’t have the images? I have – and I hate it. I am a very visually oriented person. It doesn’t help me to know there WAS a diagram after the 3rd paragraph if I can’t actually see it. Keeping all the information on a webpage is clearly important. The full range of content (all the audio, video, images and text on a page) is important to viewing the information in its original context.

Archivists who work with non-print media records that require equipment for access are already in the practice of saving old machines hoping to ensure access to their film, video and audio records. I know there are recommendations for retaining older computers and software to ensure access to data ‘trapped’ in ‘dead’ programs (I will define a dead program here as one which is no longer sold, supported or upgraded – often one that is only guaranteed to run on a dead operating system). My fear is for the websites that ran beautifully on specific old browsers. Are we keeping copies of old browsers? Will the old browsers even run on newer operating systems? The internet and its content is constantly changing – even just keeping the HTML may not be enough. What about those plugins – what about the streaming video or audio. Do the crawlers pull and store that data as well?

One of the most interesting things about reading old newspapers can be the ads. What was being advertised at the time? How much was the sale price for laundry detergent in 1948? With the internet customizing itself to individuals or simply generating random ads how would that sort of snapshot of products and prices be captured? I wonder if there is a place for advertising statistics as archival records. What google ads were most popular on a specific day? Google already has interesting graphs to show the correspondence between specific keyword searches and news stories that google perceives as related to the event. The Internet Archive (IA) could be another interesting source for statistical analysis of advertising for those sites that permit crawling.

What about customization? Only I (or someone looking over my shoulder) can see my MyYahoo page. And it changes each time I view it. It is a conglomeration of the latest travel discounts, my favorite comics, what is on my favorite TV and cable channels tonight, the headlines of the newspapers/blogs I follow and a snapshot of my stock portfolio. Take even a corporate portal inside an intranet. Often a slightly less moving target – but still customizable to the individual. Is there a practical way to archive these customized pages – even if only for a specific user of interest? Would it be worthwhile to be archiving the personalized portal pages of an ‘important’ or ‘interesting’ person on a daily basis – such that their ‘view’ of the world via a customized portal could be examined by researchers later?

A wealth of information can be found on the website for the Joint Workshop on Future-proofing Institutional Websites from January 2006. The one thing most of these presentations agree upon is that ‘future-proofing’ is something that institutions should think about at the time of website design and creation. Standards for creating future-proof websites directs website creators to use and validate against open standards. Preservation Strategies for institutional website content shows insight into NARA‘s approach for archiving US government sites, the results of which can be viewed at A summary of the issues they found can be read in the tidy 11 page web harvesting survey.

I definitely have more work ahead of me to read through all the information available from the International Internet Preservation Consortium and the National Library of Australia’s Preserving Access to Digital Information (PADI). More posts on this topic as I have time to read through their rich resources.

All around, a lot to think about. Interesting challenges for researchers in the future. The choices archivists face today often will depend on the type of site they are archiving. Best practices are evolving both for ‘future-proofing’ sites and for harvesting sites for archiving. Unfortunately, not everyone building a website that may be worth archiving is particularly concerned with validating their sites against open standards. Institutions that KNOW that they want to archive their sites are definitely a step ahead. They can make choices in their design and development to ensure success in archiving at a later date. It is the wild west fringe of the internet that are likely to present the greatest challenge for archivists and researchers.

Paper Calendars, Palm Pilots and Google Calendar

In my intro archives class (LBSC 605 Archival Principles, Practices, and Programs), one of the first ideas that made a light bulb go on over my head related to the theory that archivists want to retain the original order of records. For example, if someone choose to put a series of 10 letters together in a file – then they should be kept that way. A researcher may be able to glean more information from these letters when he/she sees them grouped that way – organized as the person who originally used them organized them.

Our professor went on to explain that seeing what the person who used the records saw was crucial to understanding the original purpose and usage of those records. That took my mind quickly to the world of calendars. Years ago, a CEO of some important organization would have a calendar or datebook of some sort – likely managed by an assistant. Ink or pencil was used to write on paper. Perhaps fresh daily schedules would be typed.

Fast forward to now and the universe of the Palm Pilot and other such handy-dandy hand held and totally customizable devices. If you have one (or have seen those of a friend) you know that how I choose to look at my schedule may be radically different from the way you choose to see your schedule. Mine might have my to-do list shown on the bottom half of the screen. Yours might have little colored icons to show you when you have a conference call. The archivist asked to preserve a born digital calendar will have a lot of hard choices to make.

These days I actually use Google Calendar more often than my Palm. While it has more of a fixed layout (for the moment) – I have the option of including many external calendars (see examples at iCalShare). Right now I have listings of when new movies come out as well as the concert schedule for summer 2006 for the Wolf Trap National Park for the Performing Arts. In the old style paper calendar, a researcher would be able to see related events that the user of the calendar cared about because they would be written down right there. If someone wanted to include my Google calendar in an archive someday (or that of someone much more important!), I suspect they would be left with JUST the records I had added myself into my calendar. When I choose to display the Wolf Trap summer schedule, Google calendar asks me to wait while it loads – presumably from an externally published iCalendar or other public Google calendar source.

This has many implications for the archivist tasked with preserving the records in that Palm Pilot or Google calendar (or any of a laundry list of scheduling applications). This post can do nothing other than list interesting questions at this stage (both ‘this stage’ of my archival education as well as ‘this stage’ of consideration of born digital records in the archival field).

  • How important is it to preserve the appearance of the interface used by the digital calendar user?
  • Might printing or screen capturing a statistical sample (an entire month? an entire year?) help researchers in the future understand HOW the record creator in question interacted with their calendar – what sorts of information they were likely to use in making choices in their scheduling?
  • Could there be a place for preserving publicly shared calendars (like the ones you can choose to access on Google Calendar or Apple’s iCal) such that they would be available to researchers later? What organization would most likely be capable of taking this sort of task on?
  • Could emulators be used to permit easy access to centrally stored born digital calendars? At least one PalmOS Emulator already exists, created mainly for use by those developing software for hardware that runs the Palm operating system it mimics how the tested software would run in the real world. Should archivists be keeping copies of this sort of software as they look to the future of retaining the best access possible to these sorts of records?
  • How can the standard iCalendar format be leveraged by archivists working to preserve born digital calendars?
  • To what degree are the schedules of people whose records will be of interest to archivists someday moving out of private offices (and even out of personally owned computers and handheld devices) and into the centralized storage of web applications such as Google Calendar?

I know that this is just a tiny bite of the kinds of issues being grappled with by Archivists around the world as they begin to accept born digital records into archives. Each type of application (scheduling vs accounting vs business systems) will pose similar issues to those described above – along with special challenges unique to each type. Perhaps if each of the most common classes of applications (such as scheduling) are tackled one by one by a designated team we can save individual archivists the pain of reinventing the wheel. Is this already happening?


My name is Jeanne. I am a graduate student in an Archives program pursuing my MLS (aka, Master of Library Science). I have enjoyed all my classes to date (3) and I love the ideas that those classes have generated. Sometimes I leave class with just as many personal ideas scrawled in the margins of my notebook as class notes written on the main page. I am especially intrigued by the ways in which concepts from different fields intersect. How do ideas from my current field of software development and database design illuminate new issues, questions and concepts in the realm of archival studies?

I am particularly interested in topics related to audio and visual archival materials, digitization, description, meta-data, and retention of context in digitized collections.

So, here we are – you reading and I writing. I hope to make you think about things in a way you may not have before. I hope if you have been down the mental road I am taking and you have noticed something that I have missed, you might take a moment to point it out to me.

Please – ask questions and let me know your thoughts.