digital preservation | Spellbound Blog

Another Thrilling Digital Adventure With Team Digital Preservation

May 6, 2009 5 Comments

Thanks to Archivism.net for this animated gem from DigitalPreservationEurope. Somehow they manage to include digital preservation, trusted data repositories, metadata and refreshing storage media in their story of Team Digital Preservation vs Team Chaos.

I really want a t-shirt with the Bit-Rot guy on it!

SAA2008: Preservation and Experimentation with Analog/Digital Hybrid Literary Collections (Session 203)

September 6, 2008

The official title of Session 203 was Getting Our Hands Dirty (and Liking It): Case Studies in Archiving Digital Manuscripts. The session chair, Catherine Stollar Peters from the New York State Archives and Records Administration, opened the session with a high level discussion of the “Theoretical Foundations of Archiving Digital Manuscripts”. The focus of this panel was preserving hybrid collections of born digital and paper based literary records. The goal was to review new ways to apply archival techniques to digital records. The presenters were all archivists without IT backgrounds who are building on others’ work … and experimenting. She also mentioned that this work impacts researchers, historians, and journalists.

For each of the presenters, I have listed below the top challenges and recommendations. If you attended the session, you can skip forward to my thoughts.

Norman Mailer’s Electronic Records

Speaker: Gabriela Redwine from University of Texas at Austin’s Harry Ransom Center
Featured Collection: Norman Mailer Papers

Challenges & Questions:

3 laptops and nearly 400 disks of correspondence
While the letters might have been dictated or drafted by Mailer, all the typing, organization and revisions done on the computer were done by his assistant Judith McNally. This brings into question issues of who should be identified as the record creator. How do they represent the interaction between Mailer & McNally? Who is the creator? Co-Creators?
All the laptops and disks were held by Judith McNally. When she died all of her possessions were seized by county officials. All the disks from her apartment were eventually recovered over a year later – but it causes issues of provenance. There is no way to know who might have viewed/changed the records.

Revelations and Recommendations:

What is accessioning and processing when dealing with electronic records? What needs to be done?

gain custody
gather information about creator’s (or creators’) use of the electronic records. In March 2007 they interviewed Mailer to understand the process of how they worked together. They learned that the computers were entirely McNally’s domain.
number disks, computers (given letters), other digital media
create disk catalog – to reflect physical information of the disk. Include color of ink, underlining, etc. At this point the disk has never been put into a computer. This captures visual & spatial information
gather this info from each disk: file types, directory structure & file names
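
The last step in the list above (file types, directory structure and file names for each disk) lends itself to a small script. What follows is only a minimal sketch, not the Ransom Center's actual procedure: it assumes the disk contents are already available as a read-only copy or write-protected mount, and the function name `inventory_tree` and the dictionary keys are hypothetical.

```python
import os
from collections import Counter


def inventory_tree(root):
    """List every file under root and count files per extension.

    Only names and paths are read; file contents are never opened.
    """
    entries = []
    type_counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the walk order stable
        rel_dir = os.path.relpath(dirpath, root)
        for name in sorted(filenames):
            # Treat the lower-cased extension as a rough stand-in for file type.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entries.append({"directory": rel_dir, "name": name, "extension": ext})
            type_counts[ext] += 1
    return entries, type_counts
```

Calling `inventory_tree` on the mount point of one disk copy returns a list that can be saved next to the physical disk catalog. An extension is only a hint about the real format, so this would complement rather than replace proper format identification.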

The ideal for future collections of this type is archivist involvement earlier – the earlier the better.

Papers of Peter Ganick

Speaker: Melissa Watterworth
Featured Collection: Papers of Writer and Small Press Publisher Peter Ganick, Thomas J Dodd Research Center, University of Connecticut

Challenges & Questions:

What are the primary sources of our modern world?
How do we acquire and preserve born digital records as trusted custodians?
How do we preserve participatory media – maybe we can learn from those who work on performance art?
How do we incrementally build our collections of electronic records? Should we be preserving the tools?
Timing of acquisition: How actively should we be pursuing personal archives? How can we build trust with creators and get them to understand the challenges?
Personal papers are very contextual – order matters. Does this hold true for born digital personal archives? What does the networking aspect of electronic records mean – how does it impact the idea of order?
First attempt to accession one of Peter Ganick’s laptops and the archivist found nothing she could identify as files.. she found fragments of text – hypertext work and lots of files that had questionable provenance (downloaded from a mailing list? his creations?). She had to sit down next to him and learn about how he worked.
He didn’t understand at first what her challenges were. He could get his head around the idea of metadata and issues of authenticity. He had trouble understanding what she was trying to collect.
How do we arrange and keep context in an online environment?
Biggest tech challenge: are we holding on for too long to ideas of original order and context?
Is there a greater challenge in collecting earlier in the cycle? What if the creator puts restrictions on groupings or chooses to withdraw them?
Do we want to create contracts with donors? Is that practical?

Revelations and Recommendations:

Collect materials that had high value as born digital works but were at a high risk of loss.
Build infrastructure to support preservation of born digital records.
Go back to the record creator to learn more about his creative process. They used to acquire records from Ganick every few years.. that wasn’t frequent enough. He was changing the tools he used and how he worked very quickly. She made sure to communicate that the past 30 years of policy wasn’t going to work anymore. It was going to have to evolve.
Created a ‘submission agreement’ about what kinds of records should be sent to the archive. He submitted them in groupings that made sense to him. She reviewed the records to make sure she understood what she was getting.
Considering using PDF/A to capture snapshots of virtual texts.
Looked to model of ‘self archiving’ – common in the world of professors to do ongoing accruals.
What about ’embedded archivists’? There is a history of this in the performing arts and NGOs and it might be happening more and more.

George Whitmore Papers

Speaker: Michael Forstrom: Beinecke Rare Book and Manuscript Library, Yale University
Featured Collection: George Whitmore Papers

Challenges & Questions:

How do you establish identity in a way that is complete and uncorrupted? How do you know it is authentic? How do you make an authentic copy? Are these requirements unreasonable and unachievable?

Revelations and Recommendations:

Refresh and replicate files on a regular schedule.
They have had good success using Quick View Plus to enable access to many common file formats. On the downside, it doesn’t support everything and since it is proprietary software there are no long term guarantees.
In some cases they had to send CP/M files to a 3rd party to have them converted into WordStar and have the ASCII normalized.
Varied acquisition notes.. and accession records.. loan form with the 3rd party who did the conversion that summarized the request.. they did NOT provide information about what software was used to convert from CP/M to DOS. This would be good information to capture in the future.
Proposed an expansion of the standards to include how electronic records were migrated in the <processinfo> processing notes.
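
The recommendation above to refresh and replicate files, together with the question of how to make an authentic copy, is commonly supported by comparing checksums before and after each copy. This is a minimal sketch of that general technique and not something the session described; the function names `sha256_of` and `replicate_with_fixity` are hypothetical.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_with_fixity(source, destination):
    """Copy source to destination and confirm both have the same digest.

    Returns the digest so it can be recorded alongside the copy.
    """
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise OSError(f"checksum mismatch copying {source} to {destination}")
    return before
```

Recording the returned digest with each refresh gives later custodians something concrete to check a copy against. It shows the bits are unchanged since the digest was taken, but says nothing about what happened to the files before that point.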

Questions & Answers

Question: As part of a writers community, what do we tell people who want to know what they can DO about their records. They want technical information.. they want to know what to keep. Current writers are aware they are creating their legacy.

Answer: Michael: The single best resource is the InterPARES 2 Creator Guidelines. The Beinecke has adapted them to distribute to authors. Melissa: Go back to your collection development policies and make sure to include functions you are trying to document (like process.. distribution networks). Also communities of practice (acid free bits) are talking about formats and guidelines like that. Gabriela: People often want to address ‘value’. Right now we don’t know how to evaluate the value of electronic drafts – it is up to authors.

Question: Cal Lee: Not a question so much as an idea: the world of digital forensics and security and the ‘order of volatility’ dictate that everyone should always be making a full disk copy bit by bit before doing anything else.

Comment: Comment on digital forensic tools – there is lots of historical and editing history of documents in the software… also deleted files are still there.

Question: Have you seen examples of materials that are coming into the archive where the digital materials are working drafts for a final paper version? This is in contrast to others that are electronic experiments.

Answer: Yes, they do think about this. It can affect arrangement and how the records are described. The formats also impact how things are preserved.

Question: Access issues? Are you letting people link to them from the finding aids? How is the documents’ authenticity protected?

Answer: DSpace gives you a new version anytime you want it (the original bitstream) .. lots of cross linking supports people finding things from more than one path. In some cases documents (even electronic) can only be accessed from within the on site reading room.

Question: What is your relationship like with your IT folks?

Answer: Gabriela: Our staff has been very helpful. We use ‘legacy’ machines to access our content. They build us computers. They are also not archivists, so there is a little divide about priorities and the kind of information that I am interested in.. but it has been a very productive conversation.

Question: (For Melissa) Why didn’t you accept Peter’s email (Melissa had said they refused a submission of email from Peter because it didn’t have research value)?

Answer: The emails that included personal medical emails were rejected. The agreement with Peter didn’t include an option to selectively accept (or weed) what was given.

Question: In terms of gathering information from the creators.. do you recommend a formal/recorded interview? Or a more informal arrangement in which you can contact them anytime on an ongoing basis?

Answer: Melissa: We do have more formal methods – ‘documentation study’ style approaches. We might do literature reviews.. Ultimately the submission agreement is the most formal document we have. Gabriela: It depends on what the author is open to.. formal documentation is best.. but if they aren’t willing to be recorded, then you take what you can get!

My Thoughts

I am very curious to see how best practices evolve in this arena. I wonder how stories written using something like Google Documents, which auto-saves and preserves all versions for future examination, will impact how scholars choose to evaluate the evolution of documents. There have already been interesting examinations of the evolution of collaborative documents. Consider this visual overview of the updates to the Wikipedia entry for Sarah Palin created by Dan Cohen and discussed in his blog post Sarah Palin, Crowdsourced. Another great example of this type of visual experience of a document being modified was linked to in the comments of that post: Heavy Metal Umlaut: The Movie. If you haven’t seen this before – take a few minutes to click through and watch the screencast which actually lets you watch as a Wikipedia page is modified over time.

While I can imagine that there will be many things to sort out if we try to start keeping these incredibly frequent snapshot save logs (disk space? quantity of versions? authenticity? author preferences to protect the unpolished versions of their work?) – I still think that being able to watch the creative process this way will be valuable in some situations. I also believe that over time new tools will be created to automate the generation of document evolution visualizations and movies (like the two I link to above) that make it easy for researchers to harness this sort of information.

Perhaps there will be ways for archivists to keep only certain parts of the auto-save versioning. I can imagine an author who does not want anyone to see early drafts of their writing (as is apparently also the case with architects and early drafts of their designs) – but who might be willing for the frequency of updates to be stored. This would let researchers at least understand the rhythm of the writing – if not the low level details of what was being changed.
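
To make the idea of keeping only the frequency of updates concrete, here is a minimal sketch. It assumes a hypothetical revision log in which each entry has an ISO 8601 `saved_at` timestamp and a `text` field; no real auto-save service's export format is implied. The function keeps a per-day count of saves and drops the draft text entirely.

```python
from collections import Counter
from datetime import datetime


def writing_rhythm(revisions):
    """Reduce a revision log to the number of saves per calendar day.

    Only the timestamps are used; the draft text is never copied
    into the result.
    """
    saves_per_day = Counter()
    for revision in revisions:
        saved_at = datetime.fromisoformat(revision["saved_at"])
        saves_per_day[saved_at.date().isoformat()] += 1
    return dict(sorted(saves_per_day.items()))
```

The output is small enough to keep even when the author restricts the drafts themselves, which is the trade-off imagined above: the rhythm survives while the unpolished wording does not.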

I love the photo I found for the top of this post. I admit to still having stacks of 3 1/2 inch floppy disks. I have email from the early days of BITNET. I have poems, unfinished stories, old resumes and SQL scripts. For the moment my disks live in a box on the shelf labeled ‘Old Media’. Lucky me – I at least still have a computer with a floppy drive that can read them!

Image Credit: oh messy disks by Blude via flickr.

As is the case with all my session summaries from SAA2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

Digital Preservation via Emulation – Dioscuri and the Prevention of Digital Black Holes

December 25, 2007 2 Comments

Available Online posted about the open source emulator project Dioscuri back in late September. In the course of researching Thoughts on Digital Preservation, Validation and Community I learned a bit about the Microsoft Virtual PC software. Virtual PC permits users to run multiple operating systems on the same physical computer and can therefore facilitate access to old software that won’t run on your current operating system. That emulator approach pales in comparison with what the folks over at Dioscuri are planning and building.

On the Digital Preservation page of the Dioscuri website I found this paragraph on their goals:

To prevent a digital black hole, the Koninklijke Bibliotheek (KB), National Library of the Netherlands, and the Nationaal Archief of the Netherlands started a joint project to research and develop a solution. Both institutions have a large amount of traditional documents and are very familiar with preservation over the long term. However, the amount of digital material (publications, archival records, etc.) is increasing with a rapid pace. To manage them is already a challenge. But as cultural heritage organisations, more has to be done to keep those documents safe for hundreds of years at least.

They are nothing if not ambitious… they go on to state:

Although many people recognise the importance of having a digital preservation strategy based on emulation, it has never been taken into practice. Of course, many emulators already exist and showed the usefulness and advantages it offer. But none of them have been designed to be digital preservation proof. For this reason the National Library and Nationaal Archief of the Netherlands started a joint project on emulation.

The aim of the emulation project is to develop a new preservation strategy based on emulation.

Dioscuri is part of Planets (Preservation and Long-term Access via NETworked Services) – run by the Planets consortium and coordinated by the British Library. The Dioscuri team has created an open source emulator that can be ported to any hardware that can run a Java Virtual Machine (JVM). Individual hardware components are implemented via separate modules. These modules should make it possible to mimic many different hardware configurations without creating separate programs for every possible combination.
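
The modular idea can be illustrated with a toy sketch. Dioscuri itself is written in Java, and nothing below reflects its actual class design; every class and method name here is hypothetical. The only point is that a machine assembled from interchangeable component modules can represent many hardware configurations without a separate program for each one.

```python
class Module:
    """Base class for one emulated hardware component (hypothetical)."""

    kind = "generic"

    def __init__(self, model):
        self.model = model

    def reset(self):
        # A real module would initialise registers, buffers and so on.
        return f"{self.kind}:{self.model} reset"


class Cpu(Module):
    kind = "cpu"


class Memory(Module):
    kind = "memory"

    def __init__(self, model, size_kb):
        super().__init__(model)
        self.size_kb = size_kb


class FloppyDrive(Module):
    kind = "floppy"


class Machine:
    """A configuration is just the set of modules plugged in."""

    def __init__(self, modules):
        self.modules = {module.kind: module for module in modules}

    def power_on(self):
        return [module.reset() for module in self.modules.values()]
```

Two different configurations are then two different module lists, for example `Machine([Cpu("8086"), Memory("ram", 640)])` versus the same list with a `FloppyDrive("3.5in")` added, rather than two different emulators.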

You can get a taste of the big thinking that is going into this work by reviewing the program overview and slide presentations from the first Emulation Expert Meeting (EEM) on digital preservation that took place on October 20th, 2006.

In the presentation given by Geoffrey Brown from Indiana University titled Virtualizing the CIC Floppy Disk Project: An Experiment in Preservation Using Emulation I found the following simple answer to the question ‘Why not just migrate?’:

Loss of information — e.g. word edits
Loss of fidelity — e.g. WordPerfect to Word isn’t very good
Loss of authenticity — users of migrated document need access to original to verify authenticity
Not always possible — closed proprietary formats
Not always feasible — costs may be too high
Emulation may be necessary to enable migration

After reading through Emulation at the German National Library, presented by Tobias Steinke, I found my way to the kopal website. With their great tagline ‘Data into the future’, they state their goal is “…to develop a technological and organizational solution to ensure the long-term availability of electronic publications.” The real gem for me on that site is what they call the kopal demonstrator. This is a well thought out Flash application that explains the kopal project’s ‘procedures for archiving and accessing materials’ within the OAIS Reference Model framework. But it is more than that – if you are looking for a great way to get your (or someone else’s) head around digital archiving, software and related processes – definitely take a look. They even include a full Glossary.

I liked what I saw in Defining a preservation policy for a multimedia and software heritage collection, a pragmatic attempt from the Bibliothèque nationale de France, a presentation by Grégory Miura, but felt like I was missing some of the guts by just looking at the slides. I was pleased to discover what appears to be a related paper on the same topic presented at IFLA 2006 in Seoul titled: Pushing the boundaries of traditional heritage policy: Maintaining long-term access to multimedia content by introducing emulation and contextualization instead of accepting inevitable loss. Hurrah for NOT ‘accepting inevitable loss’.

Vincent Joguin’s presentation, Emulating emulators for long-term digital objects preservation: the need for a universal machine, discussed a virtual machine project named Olonys. If I understood the slides correctly, the idea behind Olonys is to create a “portable and efficient virtual processor”. This would provide an environment in which to run programs such as emulators, but isolate the programs running within it from the disparities between the original hardware and the actual current hardware. Another benefit to this approach is that only the virtual processor need be ported to new platforms rather than each individual program or emulator.

Hilde van Wijngaarden presented an Introduction to Planets at EEM. I also found another introductory level presentation that was given by Jeffrey van der Hoeven at wePreserve in September of 2007 titled Dioscuri: emulation for digital preservation.

The wePreserve site is a gold mine for presentations on these topics. They bill themselves as “the window on the synergistic activities of DigitalPreservationEurope (DPE), Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval (CASPAR), and Preservation and Long-term Access through NETworked Services (PLANETS).” If you have time and curiosity on the subject of digital preservation, take a glance down their home page and click through to view some of the presentations.

On the site of The International Journal of Digital Curation there is a nice ten page paper that explains the most recent results of the Dioscuri project. Emulation for Digital Preservation in Practice: The Results was published in December 2007. I like being able to see slides from presentations (as linked to above), but without the notes or audio to go with them I am often left staring at really nice diagrams wondering what the author’s main point was. The paper is thorough and provides lots of great links to other reading, background and related projects.

There is a lot to dig into here. It is enough to make me wish I had a month (maybe a year?) to spend just following up on this topic alone. I found my struggle to interpret many of the PowerPoint slide decks that have no notes or audio very ironic. Here I was hunting for information about the preservation of born digital records and I kept finding that the records of the research provided didn’t give me the full picture. With no context beyond the text and images on the slides themselves, I was left to my own interpretation of their intended message. While I know that these presentations are not meant to be the official records of this research, I think that the effort obviously put into collecting and posting them makes it clear that others are as anxious as I am to see this information.

The best digital preservation model in the world will only preserve what we choose to save. I know the famous claim on the web is that ‘content is king’ – but I would hazard to suggest that in the cultural heritage community ‘context is king’.

What does this have to do with Dioscuri and emulators? Just that as we solve the technical problems related to preservation and access, I believe that we will circle back around to realize that digital records need the same careful attention to appraisal, selection and preservation of context as ‘traditional’ records. I would like to believe that the huge hurdles we now face on the technical and process side of things will fade over time due to the immense efforts of dedicated and brilliant individuals. The next big hurdle is the same old hurdle – making sure the records we fight to preserve have enough context that they will mean anything to those in the future. We could end up with just as severe a ‘digital black hole’ due to poorly selected or poorly documented records as we could due to records that are trapped in a format we can no longer access. We need both sides of the coin to succeed in digital preservation.

Did I mention the part about ‘Hurray for open source emulator projects with ambitious goals for digital preservation’? Right. I just wanted to be clear about that.

Image Credit: The image included at the top of this post was taken from a screen shot of Dioscuri itself, the original version of which may be seen here.

The MemoryArchive Affiliate Program: A Wiki Engine for Collecting Memoirs

November 14, 2007 2 Comments

A Beautiful WWW posted A Review of MemoryArchive.org. MemoryArchive, founded by historian Marshall Poe, is a new MediaWiki based website aimed at collecting first person accounts that they term ‘memoirs’. In sharp contrast with the communal authorship approach of most wikis, MemoryArchive locks down edits of each entry after a format review.

What sorts of memoirs are they looking for? In their FAQ they say they want “pretty much anything you remember that someone else might conceivably find interesting, now or in 500 years”.

I spent some time exploring. I read a very moving memorial titled Death by Aids The Goodbye Party, 1992, by Jay Blotcher (ed note: Jay emailed me with the correct title for this memoir). I wandered through some 9/11 memories. Eventually something dawned on me. Maybe it is the fact that I am spending most of my days lately thinking deep thoughts about metadata and classification — or maybe my archives course work is to blame — whatever the reason, I realized that I wanted more information about the storytellers. Right now it appears that each memoir includes Who, What, When and Where data – to whatever degree the contributors choose to furnish such information. Categories are also available and seem to be frequently employed.

But I want to know more about the individuals who are telling the stories. I appreciate that some posts will be made more powerful through anonymity, but for those cases that an individual is willing to share additional biographic information it would be great to have an easy place for that information to be captured.

I think the most interesting aspect of the Memory Archive to the archives community is the Memory Archive Affiliate Program. The theory behind this program is to support the collection and archiving of personal histories online. It is described as being of interest to the following types of organizations:

historical societies (urban, state, or national)
institutions interested in recording their own history (a club, society, or military unit)
educational institutions teaching history (high school or college)
public history projects (oral history gathering, or document collection)

This is a powerful idea. Any time you can accumulate a critical mass of a single type of information on the web (in this case, memoirs) you have the chance of becoming a destination. There is also the added benefit of enabling smaller organizations to launch online memoir collection initiatives without needing to worry about the technology, costs and people-power that would usually be required.

There does need to be an easy way for the Memory Archive Affiliates to download these born digital memoirs for offline use and preservation purposes. This could be accomplished by an ‘export’ or ‘format for printing’ button on each memoir page, or perhaps some form of bulk download for all memoirs collected for a single affiliate’s project. I will say that the default print format isn’t bad. It seems to already do some special reformatting (such as displaying URL links in their entirety). I would still want more metadata, though perhaps the definition of attributes to be collected could be customized per project.

I am curious to see the overall quality of the memoirs a year from now. I suspect that memoirs collected in association with a topically focused program may be more compelling than the average ‘man-on-the-net’ first person narratives. That isn’t to say that there is no value in the memories of someone who feels compelled to share their story – but a collection created around a theme would have the additional power of that common thread. The affiliate program memoirs would also be more likely to come with some contextual background explaining the source and origin of the solicited accounts. I am a fan of the existing thematic memory sites, such as The April 16 Archive and the Hurricane Digital Memory Bank. I love that the Omeka software used to create these two example sites is open source and free. Unfortunately, I don’t think the average small historical society or public history project is likely to have the resources to build and support a site like this even with free software. I think that a program like the Memory Archive Affiliate Program (or something like it) could bridge the gap for these smaller organizations and make the creation of online memoir collection projects a reality.

SAA2007: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building (Session 707)

September 22, 2007 2 Comments

Session 707 of SAA2007 in Chicago discussed many aspects of the project that created Publishers’ Bindings Online (PBO). The full title of this session was The Anatomy of a Collaborative Digital Project and Lessons Learned in the Realms of Access, Outreach, and Creative Success: A Multi-Disciplinary Look at Publishers’ Bindings Online, 1815-1930: The Art of Books. The presenters have kindly posted the full slide deck from their panel online. In this post I attempt to capture the main points of the presentation and Q&A discussion of PBO.

Who Spoke?

Jessica Lacher-Feldman (session chair) – University of Alabama, PBO project manager

Amy Rudersdorf – now at North Carolina State University, Digital production coordinator, NCSU Special Collections, but was at University of Wisconsin, Madison during PBO project

Kristy Dixon – University of Alabama, PBO staff

PBO Project Overview

PBO was made possible by a 3 year Institute of Museum and Library Services (IMLS) grant. Originally awarded in 2003, the grant was extended once (and I think they mentioned additional funding being applied for). The primary grant funded the digitization of 10,000 images from up to 5000 book bindings. Ultimately 10,570 images were added to PBO and made searchable by metadata. The bindings selected included books from 1815-1930, primarily US titles and mostly in English.

Their guiding vision was of “giving something to the world that is both needed and useful” (and really beautiful). And they succeeded! PBO is a lot more than 10,000+ digitized book bindings. The project strived to make the information available in many different ways, including via:

a web-based database
online exhibits & galleries
vodcasts and podcasts
web-based tutorials
virtual and real exhibits
presentations & class lectures
opportunities to adapt the project to other disciplines – history, book arts, librarianship, literature.. K-12 and more

Technology and Processes

The division of labor for PBO was split between the University of Alabama and the University of Wisconsin, Madison.

Many extensions to the OCLC SiteSearch based database were made by the UWDCC (UW Digital Collections Center) digital production center at the University of Wisconsin, Madison.

They went through an overview of the participants and staff – who did what.. what skills were needed and what was brought by the two institutions to the collaboration. They acknowledged their fabulous advisory group including Sue Allen – “the expert on publisher’s bindings”. Individuals from outside their teams contribute based on their special interest and knowledge about a specific individual (this contribution is still ongoing).

Working in collaboration forced them to wrestle with many challenges including:

staff in two locations – most of whom had never met
“long distance relationships are hard”
they had to work hard to ensure that all were ‘equally-valued participants’
standards – you need ground rules from the outset

Collaboration & Description

“Every pair of eyes are different”. PBO tapped into the resource of the ‘young fertile minds’ to power the project out of the local MLS programs at both institutions. Even with a detailed description form – there was confusion over subject headings and overlap – especially when those selecting subject headings were grad students who might not know the official terms for things. For example, the list of terms might include Ouroboros – but the students might not know that it is the term for a snake eating its own tail.

Ultimately they had to do quality control at a single location. They spent a LOT of time on this.

Their top tips for cultivating continuity for virtual project teams:

write into your grants money for travel (they stressed that your grant should include funds to support people meeting each other)
continuous communication is critical
‘shared working group website’ available online
email, conference calls and instant messaging (IM) for communication
regular reporting to each other
being project manager means that you have to be on top of everything – you need to be the glue
focus on the deliverables – use planning tools and timelines

They discovered that IM was key to developing trust between the two institutions.

Metadata – the core of the project

The key to their metadata approach was to consider a book less as a ‘bibliographic object’ and more as an ‘art object’.

They called books in PBO ‘objects’ but still kept the bibliographic metadata. They used Dublin Core by pulling the MARC data into the Dublin Core structure. As part of this they took all the subjects from the bibliographic info, moved them to the Dublin Core description and labeled them ‘book topic’. Then they used the ‘Subjects’ portion of the Dublin Core record to describe the binding and talk about what the images are OF. This is where the subject terms from the controlled vocabulary were added.
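
The mapping described above can be sketched in a few lines. This is only an illustration of the idea as I understood it from the session, not PBO's actual code or schema: the input keys (`title`, `author`, `subjects`) are simplified stand-ins for real MARC fields, and the function name is hypothetical.

```python
def to_binding_record(marc_fields, binding_terms):
    """Build a Dublin Core style record that treats the book as an art object.

    Bibliographic subjects are moved into the description, labeled
    'book topic', and the subject element is reserved for controlled
    vocabulary terms that describe the binding itself.
    """
    record = {
        "title": marc_fields.get("title", ""),
        "creator": marc_fields.get("author", ""),
        "description": [],
        "subject": list(binding_terms),
    }
    for topic in marc_fields.get("subjects", []):
        record["description"].append(f"book topic: {topic}")
    return record
```

For example, a book whose catalog record lists the subject "Whaling" and whose cover shows a snake eating its own tail would end up with "book topic: Whaling" in the description and "Ouroboros" as its subject.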

These are the steps of their metadata workflow process:

selection from collections of note – faculty, consultants and library staff did this step
description – used a paper form, described the books on paper and joined that description to what was in the MARC record – done by the grad students and library staff
metadata entry – entry of data through an online form – done by students (overseen by library staff); it actually ended up being cheaper to enter the MARC data manually than to use automated extraction
quality control – content, grammar, spelling – done by library staff (took a lot more time than anyone expected)
no live update between their working Filemaker Pro database and the final SiteSearch database
record ownership – indicated in the identifier field (with a special code in the identifier) AND in the Submitter field

A lot of description went into this project.

They needed to develop a controlled vocabulary for the project. To do this they first worked with content specialists to develop a list. They used Library of Congress Subject Headings (LCSH) terms where they could, as well as Getty Art and Architecture Thesaurus. Then they added some local terms. The controlled vocabulary list evolved with the project and is the foundation of all teaching, search and more.

The speaker showed an example of the controlled vocabulary – the terms really are a window into the past. Users can browse the controlled vocabulary through the front end.

On the description paper form they had a list of ‘binding themes’ for those doing the description to pick from. A lot of work was done to get the huge list of themes onto a single page. Ultimately they had to provide some fill in the blank extension fields. For example, rather than believing they had listed every useful trade or profession, there was a section on the list labeled: Profession/Trade – _______________ with the expectation that those describing a binding might need to fill in the blank.

Digitization and The Database

Generally two scans were taken from each book, but sometimes as many as five. What did they scan? Front cover, spine, back cover and end papers.

There were two different image reformatting standards at the two institutions – 300 DPI vs 600 DPI. Both used a black background when scanning. All books were presented in as-is condition – some have front/back covers missing. After the scanning they began with master TIFs and then transformed them to JPGs in three sizes at 72 DPI.

The presentation showed screen shots of:

simple search
brief view record in search results — which includes subjects
full record view – including display of all images associated with the book object record
gallery view – thumbnail, title and indication if there are one or more images related to the title
guided search (advanced search)
clickable subject headings

All the images in PBO are freely available for download.

With an eye to digital preservation, all the original uncompressed TIF images are archived in triplicate to digital archive tape and stored in three different locations. The metadata is stored with images in both text and SGML format (which is what SiteSearch works with). The full process documents are available on the project site.
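The talk did not say how (or whether) the three tape copies are compared with each other, so the following is purely my own sketch of one common way to check that copies still agree: compute a checksum for each and flag any copy that differs from the majority. The location names are made up, and the copies are passed in as bytes to keep the example short; real TIFs would be read from storage in chunks.

```python
import hashlib
from collections import Counter


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of one copy as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_copies(copies: dict) -> dict:
    """copies maps a (hypothetical) location name to that copy's bytes.

    Returns the digest shared by most copies and the locations that differ.
    """
    digests = {location: sha256_of(data) for location, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    mismatched = sorted(loc for loc, d in digests.items() if d != majority_digest)
    return {"majority_digest": majority_digest, "mismatched": mismatched}


result = check_copies({
    "site_a": b"master tif bytes",
    "site_b": b"master tif bytes",
    "site_c": b"master tif bytez",  # simulated bit rot
})
print(result["mismatched"])
```

With three copies, a single damaged copy is outvoted by the other two, which is one practical argument for storing in triplicate.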

Future Growth

The PBO team is talking to Louisiana State University (LSU) to figure out how PBO can grow. LSU would need to work and live with the way PBO works and learn their processes. They are talking to other institutions – if you are interested in adding content to PBO, please contact them.

The Richard Minsky Collection has been purchased and is being added to the project. This is a rich collection that was gathered to create a catalog. PBO has the catalog and all of Minsky’s research that goes with the collection. The goal is to feed as much of this rich data into PBO as possible. They are working with individual scholars and collectors to find other avenues for growth.

Value Added Components

One of the focuses of PBO has been to look beyond the digital images themselves to creating value added components for their user community.

A tutorial for users is provided, including information about how to email a record. A comprehensive bibliography has been created and is used by scholars. The page prompts users to submit feedback so the bibliography is a live document.

Over 30 galleries have been created – organizing access to essays and additional info by topic. Types of galleries include:

Galleries on Bindings and Book binding techniques – these are not really related to individual book objects – but give more information, for example Silver & Gold: The Art of Metal Stamping
Galleries on Collections – for example the Wade Hall Collection of Southern History and Culture
Galleries on Artistic Styles and Movements – a narrative approach provides information on the historical roots of the movements and show how the bindings fit into the movements
Galleries on History – they have 11 of these galleries, including major historical events, literature and culture of the time
Galleries on Literature

Links to trusted information outside of PBO’s site are shown whenever possible. For example – links to the full text of books are provided via Project Gutenberg. Throughout the site’s text, links to sources such as the Library of Congress, .gov sites, PBS and so forth can be found.

Canned searches are provided to make it easy for users to explore content. An example of this is the Silver & Gold: The Art of Metal Stamping search that will find every binding with either silver or gold stamping. This is in contrast with making users figure out the right syntax to submit the search criteria themselves.
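To make the idea concrete, here is a small Python sketch of what a canned search amounts to: a saved list of terms that is applied for the user. It has nothing to do with how SiteSearch actually stores or runs these searches; the record layout and the subject terms are hypothetical.

```python
# A minimal sketch, assuming each record carries a list of subject terms.
# The term names below are invented for illustration.

SILVER_AND_GOLD = ("Silver stamping", "Gold stamping")


def canned_search(records, any_of_terms):
    """Return records whose subjects include at least one of the saved terms."""
    wanted = {term.lower() for term in any_of_terms}
    return [
        record for record in records
        if wanted & {subject.lower() for subject in record["subjects"]}
    ]


bindings = [
    {"id": "b1", "subjects": ["Gold stamping", "Flowers"]},
    {"id": "b2", "subjects": ["Blind stamping"]},
    {"id": "b3", "subjects": ["silver stamping"]},
]
hits = canned_search(bindings, SILVER_AND_GOLD)
print([record["id"] for record in hits])
```

The user clicks one link; the "either silver or gold" logic lives in the saved term list rather than in anything they have to type.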

The Teaching Tools portion of the site provides sample lesson plans on all sorts of topics. They worked with some high school history teachers via focus groups and got feedback about what they needed and wanted. The Industrial Revolution lesson plan was created based on that feedback.

The research tools that were created as a result of the PBO project and are made available online are:

glossary – 456 terms defined using ten major authorities
bibliography of print & web resources
controlled vocabulary for subject headings
publishers map – an interactive map that includes 2123 publishers so far
tutorials on various subjects

Signed or Designer bindings is a new resource to which scholars continue to contribute new information.

Through collaboration with teaching faculty they developed presentations such as Indians, the Frontier, and the West in American Bookbindings. This presentation will eventually be podcast on the PBO site. It talks about how these books inspired people to move west and inspired kids to read.

Another podcast is on the way addressing the representation of Uncle Tom’s Cabin. It will discuss how the book was marketed to different groups – Yiddish, German… etc. There already exists a gallery and essay on Uncle Tom’s Cabin.

Conclusions

The team has been very pleased by the tangible scholarly impact of PBO. They have seen extensive collaboration with the university community, new research, and promotion of the use of special collections materials in the classroom using digital resources. They point to PBO as showing a path to preserve these increasingly fragile books by moving out of the general stacks and into special collections – with a result of increased access to the book and decreased handling.

The presenters avowed that PBO could never have been created by their team alone – working with consultants and advisers was the key to their success. They needed input from experts and others to help PBO grow and keep it sustainable. This interaction makes the project strong – it has its own legs and won’t cease to exist when the money disappears.

Publicity and outreach got attention on the PBO project from the very beginning. They made documenting their experiences and making recommendations about how to market digital projects part of the original plan in their grant proposals. These documents were part of their deliverables. They even published a white paper about PBO and outreach.

PBO uses Google Analytics so they can see where their users are coming from. Also it makes cool talking points for your reports and fun things to tell the Dean!

I think the best conclusion to my summary of the presentation portion of this session is the list of points on the final slide titled “Beyond the grant: Room to Grow”:

Potential future contribution from other repositories in the US and abroad…
Potential future collaboration with teaching faculty at UA and beyond
With additional collections, the database and the project will only grow stronger
Potential as a web portal, clearing house, or consortium
Additional potential funding opportunities, scholarship, and ways to highlight collections, resources, knowledge, and abilities

Questions and Answers

Keep in mind throughout this section that I am summarizing and paraphrasing the questions and their answers. Please do not take any statements as full and complete quotes. In cases where I missed too much of the question or answer I generally skipped including it in the list below. If you are anxious to know exactly what was said, you would need to buy and listen to the conference recordings for this session.

Question: Who maintains the website and who makes decisions about how things are going to get updated?
Answer: UA maintains the static web pages and UW maintains the database. The project manager has been in charge, making prototypes of new designs and sending them around for feedback. They have standards for colors in their handbooks.

Question: If the grant funding dried up right now would the project be sustainable?
Answer: There is support from the institutions… for example, it is just one project of many at UW.

Question: How did you get such good scans of the book spines?
Answer: At UW they used blocks or boxes to prop up the books and laid black foam core on top on flatbed scanners. At UA – they used black paper covered blocks in combination with overhead scanners.

Question: How did you get the full cover scans?
Answer: They very carefully lay the cover flat – so the pages are sticking up.

Question: Who customized SiteSearch – OCLC or UW?
Answer: UW did the work – they had one and a half dedicated IT staff to do the customizations.

Question: Have you had to negotiate copyright issues for bindings from the late end of the time range of the project?
Answer: No.

Question: Are you aware of others doing similar projects? Have you been approached and/or are you looking for others who want to contribute?
Answer: Yes. Right now they are working with LSU and are not actively seeking out new participants. There are plans to grow the project eventually.

Question: Did you think about the fact that you were creating your own online publication?
Answer: They didn’t realize it ahead of time – they didn’t realize how powerful the database was going to be to fuel their ability to build further on the work.

Question: Can you search for ‘young people’s covers’ – is there metadata for what age groups might enjoy specific books?
Answer: It depends on if it was part of the descriptive information, but you can search on ‘boys’ or ‘girls’ or ‘juvenile’ and gain useful results.

Question: Can you talk about the work behind the MARC to Dublin Core migration?
Answer: In some ways it was easier than they thought it would be – so many of the fields transfer directly from MARC to Dublin Core.. it was the revelation about the book as art object that made them realize the work they needed to do. Building the controlled vocabularies was where the heavy lifting occurred. It involved going through giant spread sheets with subject terms in alphabetical order looking for typos and working toward consistency (ie, use plurals). The spreadsheet didn’t show how many items used each term – it was hard to know how many changes would be needed.
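The two pain points in that answer (no usage counts, and hunting for typos and singular/plural variants by eye) are the sort of thing a short script can help with. The following is my own sketch, not something the presenters described using. It relies only on Python's standard library; the function names, the sample terms and the similarity cutoff are assumptions chosen for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def term_usage(records):
    """Count how many times each subject term is used across all records."""
    return Counter(term for subjects in records for term in subjects)


def suspicious_pairs(terms, cutoff=0.85):
    """Pair up terms that are nearly identical (likely typos or plural variants)."""
    ordered = sorted(terms)
    pairs = []
    for index, term in enumerate(ordered):
        # Compare only against later terms so each pair is reported once.
        for match in get_close_matches(term, ordered[index + 1:], n=3, cutoff=cutoff):
            pairs.append((term, match))
    return pairs


records = [["Snakes", "Flowers"], ["Snake", "Flowers"], ["Flowrs"]]
usage = term_usage(records)
pairs = suspicious_pairs(usage)
for first, second in pairs:
    print(first, usage[first], "vs", second, usage[second])
```

A person still has to decide which spelling wins, but the counts show up front how many records each correction would touch.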

Question: Do you get hits from the standard online catalog into PBO?
Answer: This is not happening now. They would love to build a better connection between the OPAC and PBO in both institutions.

Question: How did you make decisions when there were disagreements?
Answer: “I don’t remember any more.. it was all so beautiful…” <laughter > . There were no big issues about standards. There were more issues about the grant and things like how many images or books they were supposed to scan. In some cases it was easy because they were in charge of very different project areas – each team had “their own little fiefdom”.

Question: Do you think you might sell images to generate revenue?
Answer: They have considered it. They have made a calendar and a poster, but gave them away. They also have used images for making holiday cards. They don’t see selling images as a main goal right now.

Question: Have you considered pursuing online collaborative methods for work with the scholars and collectors?
Answer: No, but they think that would be useful to explore.

My Thoughts

I loved the energy and connection displayed by the presenters. It was fun to see a team of people who clearly were so proud of their work and pleased by its reception. I was personally intrigued by the highlighted challenge of coming up with (and painstakingly validating) their controlled vocabulary for subjects. I firmly believe that the topic of subject terms and their standardization across repositories will only grow in importance. For those interested in some of what is being done on this front – take a look at both the UK based High Level Thesaurus (HILT) and the Simple Knowledge Organisation Systems Core (SKOS) project. I suspect many will be intrigued by the SKOS use case titled An integrated view to medieval illuminated manuscripts.

Even given the mammoth effort required to create a shared controlled vocabulary, it is clear that the benefits they have reaped from this effort are still being discovered. The speakers mentioned on multiple occasions how pleased (and surprised) they were to realize how powerful their database of metadata has proven to be. All the amazing value added features build on this ‘heavy lifting’.

While it will be rare for such item level attention to be given to most archival documents, PBO sets the bar high for what can be done via collaboration across institutions. Their dedication to sharing their lessons learned is a fine example of what all big projects who are forging new frontiers could be doing. Finally – it is the weight of all the value added elements (galleries, tutorials, lesson plans.. and the list goes on) that have raised what could have been just a set of classified images in a database to being an active community with a growing draw for many types of users from around the world.

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

SAA2007: Preserving Born Digital Records of the Design Community (Session 106)

September 8, 2007 9 Comments

The official title for SAA2007 Session 106 is Constructing Sustainability: Real-World Implementations of Preservation Standards for Born-Digital Design Documentation, but I think it might have been better served to include the word Architecture somewhere in its title. Sponsored by the Architectural Records Roundtable, this session considered issues related to preserving born digital records of “the design community”. The design community in question includes both architects and landscape designers.

Each panelist gave a 5 minute brief about the way in which they are working toward preserving these design community records – and the rest of the session was opened up to Q&A. David Read, the session chair, mentioned how they used a wiki to collect questions and ideas for the session, gave an introduction to each of the panelists and helped guide the Question and Answer portion of the session.

Who was on the panel?

David Read (Session Chair, Information Resources Manager, DiMella Shaffer)
Phil Bernstein (Autodesk, Architect and Technologist)
Carissa Kowalski Dougherty (Art Institute of Chicago, Department of Architecture and Design)
Annemarie van Roessel (Columbia University, Avery Architectural and Fine Arts Library)
Dennis Newman (general manager at PFS Corporation, member of PDF standards working group of ISO)

What is being done?

Phil Bernstein kicked off the 5 minute summaries with a quick history of design technology. He explained how currently there is a shift in progress. Hundreds of years of paper drawings were followed by ten to fifteen years of electronic drawings. The latest development is use of Building Information Modelling (BIM). BIM relies on a database that generates ‘reports’ that are in fact ‘drawings’. These are sometimes referred to as Building Development Information Models. Digital printers can produce physical models directly from the stored BIM data with no need to step through generation of an actual drawing outside the computer.

Phil showed Yale School of Architecture design examples from the BIM world. These were fantastical organically shaped creations that looked more like strange undiscovered plants from under the sea than traditional buildings!

The good news is that the data in the BIM databases are all just text. The bad news is that the generated ‘design artifacts’ are based on the text data and can lead to digitally printed artifacts. There has been an explosion in the various means of representation. The architecture world is catching up to other industries (such as the auto industry) that have been doing this for 25+ years.

Current architects are application agnostic – they don’t care what they use to create their outputs. All the paths and platforms will only grow – what is driving the design process will be increasing in complexity. The building industry is making a fundamental shift from electronic drawing to the Building Information Modeling approach – but there is an unlimited environment for representation. He hoped to discuss the intersection between the archival/record keeping issues and the problems facing the architecture world.

Carissa Kowalski Dougherty’s overview covered the Digital Archive for Architecture (DAArch) project out of the Art Institute of Chicago. The project was based on the 2004 study Collecting, Archiving, and Exhibiting Digital Design Data. They considered how Architecture and Design firms are using software tools to produce and design – but examined these questions from a museum and curatorial perspective.

The recommendation is a two-tiered collection approach.

First tier: Native files – like autocad files – these are going to be preserved at the bit level – but there is no commitment to ensuring access to these files
Second tier: Output formats – only pdf and tif files
PDF: line drawings, vector-based graphic files, text documents, PowerPoint presentations
TIF: renderings, digital photographs

The second tier outputs are what they are committing to “functionally preserve”.
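As a way of picturing the two tiers, here is a rough Python sketch that sorts incoming files by extension. It is not part of the DAArch software. The extension lists are my assumption, drawn from the formats named in this session, and the "needs review" bucket for anything unrecognized is my own addition.

```python
from pathlib import PurePath

# Assumed extension lists; a real accession would need a much longer mapping.
NATIVE_EXTENSIONS = {".dwg", ".dxf", ".dgn"}     # native CAD files
OUTPUT_EXTENSIONS = {".pdf", ".tif", ".tiff"}    # committed output formats


def preservation_tier(filename):
    """Return which tier of the two-tiered approach a file would fall into."""
    extension = PurePath(filename).suffix.lower()
    if extension in OUTPUT_EXTENSIONS:
        return "tier 2: functional preservation"
    if extension in NATIVE_EXTENSIONS:
        return "tier 1: bit-level preservation only"
    return "unclassified: needs review"


for name in ["plan_final.DWG", "elevation.pdf", "rendering.tif", "notes.xyz"]:
    print(name, "->", preservation_tier(name))
```

The distinction that matters is the promise attached to each tier: the bits of a tier 1 file are kept safe, but only tier 2 files come with a commitment that someone will still be able to open them.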

Carissa presented an example of what they accessioned from the Garofalo Architects’ Manilow Residence (2001-2003) project. A lot of what they got were files that no-one (including the small architectural firm itself) could still open.. the software is gone. Another major challenge was poor naming conventions for the files themselves. The final project archive included over 200 native vector 2D files (.dxf, .dgn, .dwg), 145 pdfs.. and more.

From the UrbanLab they sought to preserve their Visitor Information Center Competition Entry from 2001. This was a project that was never built and therefore has little physical output. They mostly used AutoCAD (2D), Maya (3D), FormZ (3D) and Adobe Illustrator (layout).

The DAArch Software highlights:

browser based
DSpace as back end
Dublin Core augmented with CDWA and custom metadata to support architecture data and digital materials
authority records
group and item level cataloging
will be available open source with BSD license via SourceForge (this was a requirement of the funder – that it be open source)

Final lessons and challenges from the DAArch project:

file naming and organization – the biggest challenges at the smaller firms – need outreach to these firms
metadata for digital objects – there is not a lot out there for 3D digital images
software and migration tools – can we/should we preserve the software dependent first tier files? or just the PDF/TIF outputs?
three-dimensional objects, BIM, animations, etc

Annemarie van Roessel discussed Columbia’s major Manhattanville project. Their goal is to make digital records last as long as steel and glass. The Avery Architectural and Fine Arts Library is feeling the pressure to be a leader, so how does Avery document this project? Manhattanville is a 30 year planning, design and build project targeted to be completed in 2030. It will cover 17 acres northwest of the main Columbia campus.

There are many building blocks to the digital design archives: AutoCAD, project management records, collaborative environments (Microsoft SharePoint), images, presentations, websites and movies (ie, more than just “scary CAD drawings”). They are planning staged preservation points. The Avery is committed to developing capacity for digital archiving by 2009. For their metadata they use at minimum the mandatory DACS elements mapped to Dublin Core elements.

Dennis Newman was the final panelist. He has clients who need to preserve/archive finished drawings – such as the documents being sent along to regulatory agencies for final approval. PDF/A-1 was based on ‘electronic paper’ – you lose lots of data when you ‘cut back’ to PDF-A. PDF-E is in its first draft/generation being submitted for version 1. PDF-A didn’t address 3D, complex metadata or moving images. PDF-E is based on Acrobat version 7. Adobe has thrown out PDF to the ISO community. Dennis believes that the final ‘as-built’ drawing is what should be the archived version.

He pointed out that Stage I responders need more information than the regulatory commissions need. Since 9-11 the state requirements have changed about what needs to be in the ‘record’.

As an IT professional he was asked “what can we do” and his answer is “how much do you want to spend?”. IT can do anything – but it takes time and money.

Questions and Answers

Keep in mind throughout this section that I was summarizing the questions and their answers as best I could. Please do not take any statements attributed to the session speakers as full and complete quotes. In cases where I missed too much of the question or answer I generally skipped including it in the list below. If you are anxious to know exactly what someone said, you would need to buy and listen to the conference recordings for this session.

QUESTION: Could a neutral exchange format such as International Alliance for Interoperability’s (IAI) Industry Foundation Classes (IFC) be the foundation or a piece of the next step in preservation of born digital design documentation? Text + data model that could be read by different software (import/export of data). You can do this now with AutoCAD – you can dump into IFC.

Phil: Is a neutral exchange format the answer to the archiving problems? Software is changing so fast that there is no way that a standard could keep up with it. Also – even if all the data in the world could be put in XML – you still need something to ‘read and do something’ with the data. He put the business process diagram on screen from his talk and pointed out that all the different tools and their outputs exist within the CONTEXT of the business process itself.

Carissa (?): IFC is a recommendation of the Art Institute of Chicago

QUESTION: William Reilly from the FACADE project started to ask about the challenges inherent in the fact that the IFC standard only gives you the geometry. There was some back and forth about this idea with voices noting that IFC can capture more than that.. but not everything.

Kristine Fallon: The idea of doing a neutral format for complex information is a complicated thing. Going back probably 20 years, the people working on data exchange standards for engineering … the different software won’t perfectly talk to each other – but what they can do is exchange ‘model views’. The IFC data model is capable of a fairly comprehensive set of model views.

QUESTION: Who is going to keep it up in 20 years? Are the software producers going to keep it up?

Phil: Autodesk spent 5 million dollars in building the IFCs. If the archivists align their needs with the business needs then the business will pay for it and the archivists will get what they need.

Annemarie: The archivists don’t have the money and resources.. even at Columbia they don’t have the money to buy generation after generation of the software to read all the different file formats. Maybe the MIT approach of emulation is a better approach.

David: Will there ever be a day that I will have an emulator on my desktop? That makes me more curious about exporting pure text.. I can get my head around preservation of that.

Annemarie: The Manhattanville project is the first step for Columbia in collecting digital data. Archivists need to reach out to organizations now to explain that they want to preserve what they are creating. I am being honest about the chaos coming down the track when we start getting the data from the 90s.

QUESTION (from the audience): The function of IFC is not for archiving.. it is for different software products to communicate with one another. How do you figure out what artifacts of the design process to keep? How do you extract the ‘important’ parts to keep from what is ‘less’ important?

Phil: What about when there are physical digital models, analytical models and more.. how do you understand all of it?

Carissa: The architectural firms need to be able to get to all of this too. It isn’t just archivists who should be caring about access to all these models. There are legal ramifications and the possibility of renovations later… this needs to come out of the architecture profession.

QUESTION (I asked the following question): The problems in preserving the final products are so challenging – are there any thoughts of trying to preserve the process? With paper there is an easier preservation of the evolution of design.

Annemarie: In the Manhattanville project one of the big challenges is the architect who does lots of self editing. In many cases they don’t want the world to see their interim choices during the design process.

Phil: Digital tools can encourage you to explore useless ideas. Keep in mind that the journal file for the Building Information Model keeps track of every change. It will tell you that on Tuesday at 4:10 pm someone moved this door 5 inches to the left.

Carissa: At the Art Institute, architects and archivists need to work together to figure out what is worth capturing.

David: Two different schools of thought. Archiving the final product or archiving the process. File formats are preserving the final product.

QUESTION: There is danger in keeping everything – the goal of archiving is to keep the best final version. The big hulking databases of the world open the door to keeping an overwhelming set of unimportant data.

Annemarie: The needs of all their different consumers are so broad. Perhaps taking a snapshot should happen more often – thinner slices.

Carissa: 2D snapshots are not going to capture the fullness of a 3D object. But it isn’t capturing as much as it might.

Phil: There could be an interactive digital simulation that generates 3D models.. there could be no ‘final’ product. Can we have an impact on how info is kept 4, 10, 30 years from now – for the future? In a world where you can borrow (or pay for) processing time… someone will keep all the versions of AutoCAD.. you will pay for the 15 seconds of rendering time in AutoCAD 14 from some 3rd party.

Kristine Fallon: There is a real business purpose to sorting this out… the IAI work is very real world.. defining model views can help support business.. but they can also support the goals of archivists.

Kristine Fallon’s Question: Was PDF-E designed to be an archival format?

Dennis: No.. it was designed to be a data interchange format. People who don’t want to give lots of proprietary data to another vendor – they still need to give them a bunch of data to work with them.. that is where PDF-E came out of.

My Thoughts

As seems to be the case with all born digital records, there are no easy answers. While events like 9-11 have had impacts on the types of final products that regulatory agencies and first responders need to evaluate and have easy access to, the speed of innovation and evolution in building design is stunning. It should come as no surprise that architects are more concerned with finding the best tools for their trade than they are with how to preserve the artifacts of their ultimate creations. They will change the tools they use when they find a better tool to manifest their vision.

The most promising option seems to be having archivists get involved in discussions with the software developers, the architects, the builders and government early in the design process. The traditional model of archivists receiving the final products of business processes years after they were completed does not appear to be an answer on which we can depend. I suspect that proactive efforts to plan for preservation from the start will pay off – both for those trying to use the records 10 years from now and for those who want to preserve some subset of the records of the design community for future generations.

Preserving Virtual Worlds – TinyMUD to SecondLife

August 17, 2007 3 Comments

A recent press release from the Library of Congress, Digital Preservation Program Makes Awards to Preserve American Creative Works, describes the newly funded project aimed at the preservation of ‘virtual worlds’:

The Preserving Virtual Worlds project will explore methods for preserving digital games and interactive fiction. Major activities will include developing basic standards for metadata and content representation and conducting a series of archiving case studies for early video games, electronic literature and Second Life, an interactive multiplayer game. Second Life content participants include Life to the Second Power, Democracy Island and the International Spaceflight Museum. Partners: University of Maryland, Stanford University, Rochester Institute of Technology and Linden Lab.

This has gotten a fair amount of coverage from the gaming and humanities sides of the world, but I learned about it via Professor Matthew Kirschenbaum’s blog post Just Funded: Preserving Virtual Worlds.

The How They Got Game 2 post Library of Congress announces grants for preservation of digital games gives a more in depth summary of the Preserving Virtual Worlds project goals:

The main goal of the project is to help develop generalizable mechanisms and methods for preserving digital games and interactive fiction, and to begin to test these mechanism through the archiving of selected test cases. Key deliverables include the development of metadata schema and wrapper recommendations, and the long-term curation of archived cases.

I take this all a bit more personally than most might. I was a frequent denizen of an online virtual world known as TinyMUD (now usually referred to as TinyMUD Classic). TinyMUD was a text based, online, multi-player game that existed for seven months beginning in August of 1989. In practice it was sort of a cross between a chat room and a text based adventure. The players could build new parts of the MUD as they went – in many ways it was an early example of crowdsourcing. There was a passionate core of players who were constantly building new areas for others to explore and experience – not unlike what is currently the case in SecondLife. These types of text based games still exist – see MudMagic for listings.

Apparently August 20, 2007 will be TinyMUD’s 18th Annual Brigadoon Day. It will be celebrated by putting TinyMUD Classic online for access. The page includes careful notes about finding and using a MUD client to access TinyMUD. The existence of an ongoing MUD community of users has kept software like this alive and available almost 20 years later.

With projects like Preserving Virtual Worlds getting grants and gaining momentum, it seems more plausible with each passing day that 18 years from now, parts of 2007’s Second Life will still be available for people to experience. I am thankful to know that a copy of the TinyMUD world I helped build is still out there. I am even more thankful to know that the technology still exists to permit users to access it, even if it is only once a year.

Update: The 20th anniversary TinyMUD Brigadoon Day is set for Thursday, August 20, 2009.

Thoughts on Digital Preservation, Validation and Community

July 6, 2007 2 Comments

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.

Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s, it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world, I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65-page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
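One way to picture a "regression test" for stored data is a fixity check: record a checksum for every file at ingest, then periodically recompute and compare. The sketch below is only an illustration of that idea, not something taken from any of the projects discussed here; the function names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical ingest step)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Compare the archive against its manifest and list any problems found."""
    problems = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file():
            problems.append(f"missing: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"changed: {name}")
    return problems
```

Run on a schedule, an empty list from audit() plays the role of a passing test suite: nothing that was believed to be safely stored has silently changed. It says nothing about whether the files can still be rendered or understood, which is a separate question.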

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. In the Statement of Requirements for Preservation and Maintenance Project Digital Object Store, they list regression testing as a ‘desirable’ requirement (desirable requirements being defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “..replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
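The "replicate, audit, repair" philosophy can be sketched in a few lines. To be clear, this is not the LOCKSS protocol: the paper describes rate-limited sampled voting among peers, which is far more elaborate. What follows is a simple majority vote over in-memory copies, with hypothetical names throughout, just to show the shape of the idea.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a replica's content."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Overwrite replicas that disagree with the majority.

    replicas maps a (hypothetical) peer name to that peer's copy.
    Returns the names of the peers that were repaired.
    """
    votes = Counter(digest(content) for content in replicas.values())
    winner, count = votes.most_common(1)[0]
    if count <= len(replicas) // 2:
        # Without a strict majority there is no basis for trusting any copy.
        raise ValueError("no majority; cannot decide which copy is good")
    good = next(c for c in replicas.values() if digest(c) == winner)
    repaired = []
    for peer, content in replicas.items():
        if digest(content) != winner:
            replicas[peer] = good
            repaired.append(peer)
    return repaired
```

With three copies where one has decayed, the damaged copy is outvoted and replaced. With only two copies that disagree, the function refuses to guess, which is one reason "lots of copies" matters.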

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three are written virtually simultaneously they are all equal in authority. I think it is very interesting that they rely on writing to tapes. There was a mention that it is cheaper – yet due to many issues they might still switch to hard drives.
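The "all three writes must complete" rule is easy to illustrate. The sketch below is my own simplification and not DAITSS code: it writes to ordinary file paths one after another rather than to tape units in parallel, and the function name is hypothetical. The point is only that ingest does not move on until every copy has been written and read back correctly.

```python
import hashlib
from pathlib import Path


def write_all_copies(data: bytes, targets: list[Path]) -> str:
    """Write data to every target and verify each copy before returning.

    Raises IOError if any copy does not read back identically, so the
    caller never proceeds with fewer good copies than it asked for.
    """
    expected = hashlib.sha256(data).hexdigest()
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        written = hashlib.sha256(target.read_bytes()).hexdigest()
        if written != expected:
            raise IOError(f"copy at {target} failed verification")
    return expected
```

Because no copy is derived from another, none of them is "the" master; each was verified against the same original bytes at the moment of writing.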

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002:

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than one year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born’”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have thought given not only to the technicalities of how users may access the objects in the future, but also to users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies’ Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

International Environmental Data Rescue Organization: Rescuing At Risk Weather Records Around the World

June 7, 2007 2 Comments

In the middle of my crazy spring semester a few months back, I got a message about volunteer opportunities at the International Environmental Data Rescue Organization (IEDRO). I get emails from VolunteerMatch.org every so often because I am always curious about virtual volunteer projects (i.e., ways you can volunteer via your computer while in your pajamas). I filed the message away for when I actually had more time to take a closer look and it has finally made it to the top of my list.

A non-profit organization, IEDRO states their vision as being “.. to find, rescue, and digitize all historical environmental data and to make those data available to the world community.” They go on to explain on their website:

Old weather records are indeed worth the paper they are written on…actually tens of thousands times that value. These historic data are of critical importance to the countries within which they were taken, and to the world community as well. Yet, millions of these old records have already perished with the valuable information contained within, lost forever. These unique records, some dating back to the 1500s, now reside on paper at great risk from mold, mildew, fire, vermin, and old age (paper and ink deteriorate) or being tossed away because of lack of storage space. Once these data are lost, they are lost forever. There are no back up sources; nothing in reserve.

Why are these weather records valuable? IEDRO gives lots of great examples. Old weather records can:

inform the construction and engineering community about maximum winds recorded, temperature extremes, rainfall and floods
let farmers know the true frequency of drought, flood, extreme temperatures and in some areas, the amount of sunshine enabling them to better plan crop varieties and irrigation or drainage systems increasing their food production and helping to alleviate hunger.
assist in explaining historical events such as plague and famine, movement of cultures, insect movements (i.e. locusts in Africa), and are used in epidemiological studies.
provide our global climate computer models with baseline information enabling them to better predict seasonal extremes. This provides more accurate real-time forecasts and warnings and a better understanding of global change and validation of global warming.

The IEDRO site includes excellent scenarios in which accurate historical weather data can help save lives. You can read about the subsistence farmer who doesn’t understand the frequency of droughts well enough to make good choices about the kind of rice he plants, the way that weather impacts the vectorization models of diseases such as malaria and about the computer programs that need historical weather data to accurately predict floods. I also found this Global Hazards and Extremes page on the NCDC’s site – and I wonder what sorts of maps they could make about the weather one or two hundred years ago if all the historical climate data records were already available.

There was additional information available on IEDRO’s VolunteerMatch page. Another activity they list for their organization is: “Negotiating with foreign national meteorological services for IEDRO access to their original observations or microfilm/microfiche or magnetic copies of those observations and gaining their unrestricted permission to make copies of those data”.

IEDRO is making it their business to coordinate efforts in multiple countries to find and take digital photos of at risk weather records. They include information on their website about their data rescue process. I love their advice about being tenacious and creative when considering where these weather records might be found. Don’t only look at the national meteorological services! Consider airports, military sites, museums, private homes and church archives. The most unusual location logged so far was a monastery in Chile.

Once the records are located, each record is photographed with a digital camera. They have a special page showing examples of bad digital photos to help those taking the digital photos in the field, as well as a guidelines and procedures document available in PDF (and therefore easy to print and use as reference offline).

The digital images of the rescued records are then sent to NOAA’s National Climatic Data Center (NCDC) in Asheville, North Carolina. The NCDC is part of the National Environmental Satellite, Data and Information Service (NESDIS) which is in turn under the umbrella of the National Oceanic and Atmospheric Administration (NOAA). The NCDC’s website claims they have the “World’s Largest Archive of Climate Data”. The NCDC has people contracted to transcribe the data and ensure the preservation of the digital image copies. Finally, the data will be made available to the world.

IEDRO already lists these ten countries as locations where activities are underway: Kenya, Malawi, Mozambique, Niger, Senegal, Zambia, Chile, Uruguay, Dominican Republic and Nicaragua.

I am fascinated by this organization. On a personal level it brings together a lot of things I am interested in – archives, the environment, GIS data, temporal data and an interesting use of technology. This is such a great example of records that might seem unimportant – but turn out to be crucial to improving lives in the here and now. It shows the need for international cooperation, good technical training and being proactive. I know that a lot of archivists would consider this more of a scientific research mission (the goal here is to get that data for the purposes of research), but no matter what else these are – they are still archival records.

Should we be archiving fonts?

February 9, 2007

I am a fan of beautiful fonts. This is why I find myself on the mailing list of MyFonts.com. I recently received their Winter 2007 newsletter featuring the short article titled ‘A cast-iron investment’. It starts out with:

Of all the wonderful things about fonts, there’s one that is rarely mentioned by us font sellers. It’s this: fonts last for a very long time. Unlike almost all the other software you may have bought 10 or 15 years ago, any fonts you bought are likely still working well, waiting to be called back into action when you load up that old newsletter or greetings card you made!

Interesting. The article goes on to point out:

But, of course, foundries make updates to their fonts every now and then, with both bug fixes and major upgrades in features and language coverage.

All this leaves me wondering if there is a place in the world for a digital font archive. A single source of digital font files for use by archives around the world. Of course, there would be a number of hurdles:

How do you make sure that the fonts are only available for use in documents that used the fonts legally?
How do you make sure that the right version of the font is used in the document to show us how the document appeared originally?

You could say this is made moot by using something like Adobe’s PDF/A format. It is also likely that we won’t be running the original word processing program that used the fonts a hundred years from now.

Hurdles aside, somehow it feels like a clever thing to do. We can’t know how we might enable access to documents that use fonts in the future. What we can do is keep the font files so we have the option to do clever things with them in the future.

I would even make a case for the fact that fonts are precious in their own right and deserve to be preserved. My mother spent many years as a graphic designer. From her I inherited a number of type specimen books – including one labeled “Adcraft Typographers, Inc”. Google led me to two archival collections that include font samples from Adcraft:

University of Delaware Library Special Collections: J. Ben Lieberman Papers – Series VII: Type Specimens and Commercial Type Directories, 1900s
University of Central Florida: Sol and Sadie Malkoff Papers – listed as “including over a hundered typography and font specimens”

Another great reason for a digital font archive is the surge in individual foundries creating new fonts every day. What once was an elite craft now has such a low point of entry that anyone can download some software and hang out their shingle as a font foundry. Take a look around MyFonts.com. Read about selling your fonts on MyFonts.com.

While looking for a good page about type foundries I discovered the site for Precision Type which shows this on their only remaining page:

For the last 12 years, Precision Type has sought to provide our customers with convenient access to a large and diverse range of font software products. Our business grew as a result of the immense impact that digital technology had in the field of type design. At no other time in history had type ever been available from so many different sources. Precision Type was truly proud to play a part in this exciting evolution.

Unfortunately however, sales of font software for Precision Type and many others companies in the font business have been adversely affected in recent years by a growing supply of free font software via the Internet. As a result, we have decided to discontinue our Precision Type business so that we can focus on other business opportunities.

I have to go back to May 23, 2004 in the Internet Archive Wayback Machine to see what Precision Type’s site used to look like.

There are more fonts than ever before. Amateurs are driving professionals out of business. Definitely sounds like digital fonts and their history are a worthy target for archival preservation.

Category: digital preservation
