software | Spellbound Blog

The MemoryArchive Affiliate Program: A Wiki Engine for Collecting Memoirs

November 14, 2007 2 Comments

A Beautiful WWW posted A Review of MemoryArchive.org. MemoryArchive, founded by historian Marshall Poe, is a new MediaWiki based website aimed at collecting first person accounts that they term ‘memoirs’. In sharp contrast with the communal authorship approach of most wikis, MemoryArchive locks down edits of each entry after a format review.

What sorts of memoirs are they looking for? In their FAQ they say they want “pretty much anything you remember that someone else might conceivably find interesting, now or in 500 years”.

I spent some time exploring. I read a very moving memorial titled Death by Aids The Goodbye Party, 1992, by Jay Blotcher (ed note: Jay emailed me with the correct title for this memoir). I wandered through some 9/11 memories. Eventually something dawned on me. Maybe it is the fact that I am spending most of my days lately thinking deep thoughts about metadata and classification — or maybe my archives course work is to blame — whatever the reason, I realized that I wanted more information about the storytellers. Right now it appears that each memoir includes Who, What, When and Where data – to whatever degree the contributors choose to furnish such information. Categories are also available and seem to be frequently employed.

But I want to know more about the individuals who are telling the stories. I appreciate that some posts will be made more powerful through anonymity, but for those cases that an individual is willing to share additional biographic information it would be great to have an easy place for that information to be captured.

I think the most interesting aspect of the Memory Archive to the archives community is the Memory Archive Affiliate Program. The theory behind this program is to support the collection and archiving of personal histories online. It is described as being of interest to the following types of organizations:

historical societies (urban, state, or national)
institutions interested in recording their own history (a club, society, or military unit)
educational institutions teaching history (high school or college)
public history projects (oral history gathering, or document collection)

This is a powerful idea. Any time you can accumulate a critical mass of a single type of information on the web (in this case, memoirs) you have the chance of becoming a destination. There is also the added benefit of enabling smaller organizations to launch online memoir collection initiatives without needing to worry about the technology, costs and people-power that would usually be required.

There does need to be an easy way for the Memory Archive Affiliates to download these born digital memoirs for offline use and preservation purposes. This could be accomplished by an ‘export’ or ‘format for printing’ button on each memoir page, or perhaps some form of bulk download for all memoirs collected for a single affiliate’s project. I will say that the default print format isn’t bad. It seems to already do some special reformatting (such as displaying URL links in their entirety). I still also would want more metadata, though perhaps the definition of attributes to be collected could be customized per project.

I am curious to see the overall quality of the memoirs a year from now. I suspect that memoirs collected in association with a topically focused program may be more compelling than the average ‘man-on-the-net’ first person narratives. That isn’t to say that there is no value in the memories of someone who feels compelled to share their story – but a collection created around a theme would have the additional power of that common thread. The affiliate program memoirs would also be more likely to come with some contextual background explaining the source and origin of the solicited accounts. I am a fan of the existing thematic memory sites, such as The April 16 Archive and the Hurricane Digital Memory Bank. I love that the Omeka software used to create these two example sites is open source and free. Unfortunately, I don’t think the average small historical society or public history project is likely to have the resources to build and support a site like this even with free software. I think that a program like the Memory Archive Affiliate Program (or something like it) could bridge the gap for these smaller organizations and make the creation of online memoir collection projects a reality.

SAA2007: Publishers’ Bindings Online – Digitization, Collaboration, Standardization and Community Building (Session 707)

September 22, 2007 2 Comments

Session 707 of SAA2007 in Chicago discussed many aspects of the project that created Publishers’ Bindings Online (PBO). The full title of this session was The Anatomy of a Collaborative Digital Project and Lessons Learned in the Realms of Access, Outreach, and Creative Success: A Multi-Disciplinary Look at Publishers’ Bindings Online, 1815-1930: The Art of Books. The presenters have kindly posted the full slide deck from their panel online. In this post I attempt to capture the main points of the presentation and Q&A discussion of PBO.

Who Spoke?

Jessica Lacher-Feldman (session chair) – University of Alabama, PBO project manager

Amy Rudersdorf – now at North Carolina State University, Digital production coordinator, NCSU Special Collections, but was at University of Wisconsin, Madison during PBO project

Kristy Dixon – University of Alabama, PBO staff

PBO Project Overview

PBO was made possible by a 3 year Institute of Museum and Library Services (IMLS) grant. Originally awarded in 2003, the grant was extended once (and I think they mentioned additional funding being applied for). The primary grant funded the digitization of 10,000 images from up to 5000 book bindings. Ultimately 10,570 images were added to PBO and made searchable by metadata. The bindings selected included books from 1815-1930, primarily US titles and mostly in English.

Their guiding vision was of “giving something to the world that is both needed and useful” (and really beautiful). And they succeeded! PBO is a lot more than 10,000+ digitized book bindings. The project strived to make the information available in many different ways, including via:

a web-based database
online exhibits & galleries,
vodcasts and podcasts
web-based tutorials
virtual and real exhibits
presentations & class lectures
opportunities to adapt the project to other disciplines – history, book arts, librarianship, literature.. K-12 and more

Technology and Processes

The division of labor for PBO was split between the University of Alabama and the University of Wisconsin, Madison.

Many extensions to the OCLC SiteSearch based database were made by the UWDCC (UW Digital Collections Center) digital production center at the University of Wisconsin, Madison.

They went through an overview of the participants and staff – who did what.. what skills were needed and what was brought by the two institutions to the collaboration. They acknowledged their fabulous advisory group including Sue Allen – “the expert on publisher’s bindings”. Individuals from outside their teams contribute based on their special interest and knowledge about a specific individual (this contribution is still ongoing).

Working in collaboration forced them to wrestle with many challenges including:

staff in two locations – most of whom had never met
“long distance relationships are hard”
they had to work hard to ensure that all were ‘equally-valued participants’
standards – you need ground rules from the outset

Collaboration & Description

“Every pair of eyes are different”. PBO tapped into the resource of the ‘young fertile minds’ to power the project out of the local MLS programs at both institutions. Even with a detailed description form – there was confusion over subject headings and overlap – especially when those selecting subject headings were grad students who might not know the official terms for things. For example, the list of terms might include Ouroboros – but the students might not know that it is the term for a snake eating its own tail.

Ultimately they had to do quality control at a single location. They spent a LOT of time on this.

Their top tips for cultivating continuity for virtual project teams:

write into your grants money for travel (they stressed that your grant should include funds to support people meeting each other)
continuous communication is critical
‘shared working group website’ available online
email, conference calls and instant messaging (IM) for communication
regular reporting to each other
being project manager means that you have to be on top of everything – you need to be the glue
focus on the deliverables – use planning tools and timelines

They discovered that IM was key to developing trust between the two institutions.

Metadata – the core of the project

The key to their metadata approach was to consider a book less as a ‘bibliographic object’ and more as an ‘art object’.

They called books in PBO ‘objects’ but still kept the bibliographic metadata. They used Dublin Core by pulling the MARC data into the Dublin Core structure. As part of this they took all the subjects from the bibliographic info and moved it to the Dublin Core description and labeled it ‘book topic’. Then they used the ‘Subjects’ portion of the Dublin Core record to describe the binding and talk about what the images are OF. This is where the subject terms from the controlled vocabulary were added.
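To make the mapping concrete, here is a minimal sketch of the idea as I understood it: the bibliographic subjects move into the description under a ‘book topic’ label, and the subject field is kept for terms describing the binding. The function name, the simplified MARC-like dictionary and the sample values are my own assumptions for illustration, not PBO’s actual schema or code.

```python
# Hypothetical sketch: field names and sample values are assumptions,
# not the schema or code used by the PBO project.

def marc_to_pbo_dublin_core(marc, binding_subjects):
    """Map a simplified MARC-like dict to a Dublin Core-like dict.

    marc: dict keyed by MARC tag ("245" title, "100" author,
          "260" publication info, "650" list of topical subjects).
    binding_subjects: controlled vocabulary terms describing the binding.
    """
    book_topics = marc.get("650", [])
    description = ""
    if book_topics:
        # Bibliographic subjects move into the description, labeled 'book topic'.
        description = "Book topic: " + "; ".join(book_topics)
    return {
        "title": marc.get("245", ""),
        "creator": marc.get("100", ""),
        "publisher": marc.get("260", ""),
        "description": description,
        # The subject field is reserved for what the binding images are OF.
        "subject": sorted(binding_subjects),
    }


record = marc_to_pbo_dublin_core(
    {"245": "Example Title", "100": "Example Author",
     "650": ["Frontier and pioneer life"]},
    ["Ouroboros", "Gold stamping"],
)
print(record["description"])
print(record["subject"])
```

The point of the split is that a search on subject finds bindings by what they depict, while the book’s own topic stays available in the description.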

These are the steps of their metadata workflow process:

selection from collections of note – faculty, consultants and library staff did this step
description – used a paper form, described the books on paper and joined that description to what was in the MARC record – done by the grad students and library staff
metadata entry – entry of data through an online form – done by students (overseen by library staff). It actually ended up being cheaper to manually enter the MARC data (rather than automated extraction)
quality control – content, grammar, spelling – done by library staff (took a lot more time than anyone expected)
no live update between their working Filemaker Pro database and the final SiteSearch database
record ownership – indicated in the identifier field (with a special code in the identifier) AND in the Submitter field

A lot of description went into this project.

They needed to develop a controlled vocabulary for the project. To do this they first worked with content specialists to develop a list. They used Library of Congress Subject Headings (LCSH) terms where they could, as well as Getty Art and Architecture Thesaurus. Then they added some local terms. The controlled vocabulary list evolved with the project and is the foundation of all teaching, search and more.
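As a rough illustration of the kind of checking a controlled vocabulary invites, the sketch below counts how many times each subject term is used and flags terms that are not on the approved list. The sample terms, record identifiers and function name are hypothetical; I do not know what tooling the PBO team actually used beyond spreadsheets.

```python
from collections import Counter

# Hypothetical sketch: the vocabulary, record ids and terms are made up.

def audit_subject_terms(records, vocabulary):
    """Return (usage count per term, sorted terms missing from the vocabulary)."""
    usage = Counter(term for terms in records.values() for term in terms)
    known = {entry.casefold() for entry in vocabulary}
    unknown = sorted(term for term in usage if term.casefold() not in known)
    return usage, unknown


vocabulary = ["Ouroboros", "Snakes", "Gold stamping"]
records = {
    "item-1": ["Ouroboros", "Gold stamping"],
    "item-2": ["Snake", "Gold stamping"],  # singular form is not on the list
}
usage, unknown = audit_subject_terms(records, vocabulary)
print(usage["Gold stamping"], unknown)
```

Knowing the usage count for each term gives a sense of how many records a correction would touch before it is made.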

The speaker showed an example of the controlled vocabulary – the terms really are a window into the past. Users can browse the controlled vocabulary through the front end.

On the description paper form they had a list of ‘binding themes’ for those doing the description to pick from. A lot of work was done to get the huge list of themes onto a single page. Ultimately they had to provide some fill in the blank extension fields. For example, rather than believing they had listed every useful trade or profession, there was a section on the list labeled: Profession/Trade – _______________ with the expectation that those describing a binding might need to fill in the blank.

Digitization and The Database

Generally two scans were taken from each book, but sometimes as many as five. What did they scan? Front cover, spine, back cover and end papers.

There were two different image reformatting standards at the two institutions – 300 DPI vs 600 DPI. Both used a black background when scanning. All books were presented in as-is condition – some have front/back covers missing. After the scanning they began with master TIFs and then transformed them to JPGs in three sizes at 72 DPI.
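The arithmetic behind producing several derivative sizes from one master scan is simple enough to sketch. The three size names and pixel targets below are my own assumptions (the presenters did not give the actual dimensions), and the code only computes dimensions rather than converting any images.

```python
# Hypothetical sketch: the derivative names and pixel targets are assumptions.

def derivative_dimensions(width, height, long_edge):
    """Scale (width, height) so the longer side equals long_edge, never upscaling."""
    longest = max(width, height)
    if longest <= long_edge:
        return width, height
    scale = long_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


SIZES = {"thumbnail": 150, "medium": 600, "large": 1200}

# An 8 x 12 inch cover scanned at 300 DPI yields a 2400 x 3600 pixel master.
master = (2400, 3600)
derivatives = {
    name: derivative_dimensions(*master, edge) for name, edge in SIZES.items()
}
print(derivatives)
```

The same cover scanned at 600 DPI would give a 4800 x 7200 pixel master, but the derivatives come out at identical pixel sizes, which is one reason a mixed 300/600 DPI project can still present uniform web images.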

The presentation showed screen shots of:

simple search
brief view record in search results — which includes subjects
full record view – including display of all images associated with the book object record
gallery view – thumbnail, title and indication if there are one or more images related to the title
guided search (advanced search)
clickable subject headings

All the images in PBO are freely available for download.

With an eye to digital preservation, all the original uncompressed TIF images are archived in triplicate to digital archive tape and stored in three different locations. The metadata is stored with images in both text and SGML format (which is what SiteSearch works with). The full process documents are available on the project site.
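Keeping three copies is most useful when there is a way to tell whether one of them has gone bad. The presenters did not describe how (or whether) they verify their copies, so the following is only a sketch of one common approach: compute a checksum for each copy and flag any copy that disagrees with the majority. The file names are invented for the demonstration.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical sketch: not a description of PBO's actual archiving process.

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_copies(paths):
    """Return the paths whose checksum differs from the most common checksum."""
    sums = {str(path): sha256_of_file(path) for path in paths}
    majority = Counter(sums.values()).most_common(1)[0][0]
    return sorted(path for path, value in sums.items() if value != majority)


# Demonstration with three temporary 'copies', one of which is damaged.
with tempfile.TemporaryDirectory() as tmp:
    copies = [Path(tmp) / f"copy{i}.tif" for i in (1, 2, 3)]
    for copy in copies:
        copy.write_bytes(b"master image bytes")
    copies[2].write_bytes(b"damaged image bytes")
    damaged = [Path(path).name for path in mismatched_copies(copies)]
print(damaged)
```

With three copies, a single damaged file is outvoted by the other two, which is what makes the majority comparison workable.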

Future Growth

The PBO team is talking to Louisiana State University (LSU) to figure out how PBO can grow. LSU would need to work and live with the way PBO works and learn their processes. They are talking to other institutions – if you are interested in adding content to PBO, please contact them.

The Richard Minsky Collection has been purchased and is being added to the project. This is a rich collection that was gathered to create a catalog. PBO has the catalog and all of Minsky’s research that goes with the collection. The goal is to feed as much of this rich data into PBO as possible. They are working with individual scholars and collectors to find other avenues for growth.

Value Added Components

One of the focuses of PBO has been to look beyond the digital images themselves to creating value added components for their user community.

A tutorial for users is provided, including information about how to email a record. A comprehensive bibliography has been created and is used by scholars. The page prompts users to submit feedback so the bibliography is a live document.

Over 30 galleries have been created – organizing access to essays and additional info by topic. Types of galleries include:

Galleries on Bindings and Book binding techniques – these are not really related to individual book objects – but give more information, for example Silver & Gold: The Art of Metal Stamping
Galleries on Collections – for example the Wade Hall Collection of Southern History and Culture
Galleries on Artistic Styles and Movements – a narrative approach provides information on the historical roots of the movements and shows how the bindings fit into the movements
Galleries on History – they have 11 of these galleries, including major historical events, literature and culture of the time
Galleries on Literature

Links to trusted information outside of PBO’s site are shown whenever possible. For example – links to the full text of books are provided via Project Gutenberg. Throughout the site’s text, links to sources such as the Library of Congress, .gov sites, PBS and so forth can be found.

Canned searches are provided to make it easy for users to explore content. An example of this is the Silver & Gold: The Art of Metal Stamping search that will find every binding with either silver or gold stamping. This is in contrast with making users figure out the right syntax to submit the search criteria themselves.
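A canned search is essentially a saved link with the search criteria already filled in. As a sketch only, and with an invented base URL and invented parameter names (I do not know the real SiteSearch query syntax), building such a link might look like this:

```python
from urllib.parse import urlencode

# Hypothetical sketch: the base URL and parameter names are placeholders,
# not the actual PBO or SiteSearch query interface.
BASE_URL = "https://example.org/search"


def canned_search_url(field, terms):
    """Build a link that searches one field for any of the given terms."""
    query = " OR ".join(terms)
    return BASE_URL + "?" + urlencode({"field": field, "query": query})


url = canned_search_url("subject", ["gold stamping", "silver stamping"])
print(url)
```

The user just clicks the link; the OR logic and the field name stay hidden in the URL.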

The Teaching Tools portion of the site provides sample lesson plans on all sorts of topics. They worked with some high school history teachers via focus groups and got feedback about what they needed and wanted. The Industrial Revolution lesson plan was created based on that feedback.

The research tools that were created as a result of the PBO project and are made available online are:

glossary – 456 terms defined using ten major authorities
bibliography of print & web resources
controlled vocabulary for subject headings
publishers map – an interactive map that includes 2123 publishers so far
tutorials on various subjects

Signed or Designer bindings is a new resource to which scholars continue to contribute new information.

Through collaboration with teaching faculty they developed presentations such as Indians, the Frontier, and the West in American Bookbindings. This presentation will eventually be podcast on the PBO site. It talks about how these books inspired people to move west and inspired kids to read.

Another podcast is on the way addressing the representation of Uncle Tom’s Cabin. It will discuss how the book was marketed to different groups – Yiddish, German… etc. There already exists a gallery and essay on Uncle Tom’s Cabin.

Conclusions

The team has been very pleased by the tangible scholarly impact of PBO. They have seen extensive collaboration with the university community, new research, and promotion of the use of special collections materials in the classroom using digital resources. They point to PBO as showing a path to preserve these increasingly fragile books by moving out of the general stacks and into special collections – with a result of increased access to the book and decreased handling.

The presenters avowed that PBO could never have been created by their team alone – working with consultants and advisers was the key to their success. They needed input from experts and others to help PBO grow and keep it sustainable. This interaction makes the project strong – it has its own legs and won’t cease to exist when the money disappears.

Publicity and outreach got attention on the PBO project from the very beginning. They made documenting their experiences and making recommendations about how to market digital projects part of the original plan in their grant proposals. These documents were part of their deliverables. They even published a white paper about PBO and outreach.

PBO uses Google Analytics so they can see where their users are coming from. Also it makes cool talking points for your reports and fun things to tell the Dean!

I think the best conclusion to my summary of the presentation portion of this session is the list of points on the final slide titled “Beyond the grant: Room to Grow”:

Potential future contribution from other repositories in the US and abroad…
Potential future collaboration with teaching faculty at UA and beyond
With additional collections, the database and the project will only grow stronger
Potential as a web portal, clearing house, or consortium
Additional potential funding opportunities, scholarship, and ways to highlight collections, resources, knowledge, and abilities

Questions and Answers

Keep in mind throughout this section that I am summarizing and paraphrasing the questions and their answers. Please do not take any statements as full and complete quotes. In cases where I missed too much of the question or answer I generally skipped including it in the list below. If you are anxious to know exactly what was said, you would need to buy and listen to the conference recordings for this session.

Question: Who maintains the website and who makes decisions about how things are going to get updated?
Answer: UA maintains the static web pages and UW maintains the database. The project manager has been in charge.. made prototypes of new design and sent it around for feedback. They have standards for colors in their handbooks.

Question: If the grant funding dried up right now would the project be sustainable?
Answer: There is support from the institutions… for example, it is just one project of many at UW.

Question: How did you get such good scans of the book spines?
Answer: At UW they used blocks or boxes to prop up the books and laid black foam core on top on flatbed scanners. At UA – they used black paper covered blocks in combination with overhead scanners.

Question: How did you get the full cover scans?
Answer: They very carefully lay the cover flat – so the pages are sticking up.

Question: Who customized SiteSearch – OCLC or UW?
Answer: UW did the work – they had one and a half dedicated IT staff to do the customizations.

Question: Have you had to negotiate copyright issues for bindings from the late end of the time range of the project?
Answer: No.

Question: Are you aware of others doing similar projects? Have you been approached and/or are you looking for others who want to contribute?
Answer: Yes. Right now they are working with LSU and are not actively seeking out new participants. There are plans to grow the project eventually.

Question: Did you think about the fact that you were creating your own online publication?
Answer: They didn’t realize it ahead of time – they didn’t realize how powerful the database was going to be to fuel their ability to build further on the work.

Question: Can you search for ‘young people’s covers’ – is there metadata for what age groups might enjoy specific books?
Answer: It depends on if it was part of the descriptive information, but you can search on ‘boys’ or ‘girls’ or ‘juvenile’ and gain useful results.

Question: Can you talk about the work behind the MARC to Dublin Core migration?
Answer: In some ways it was easier than they thought it would be – so many of the fields transfer directly from MARC to Dublin Core.. it was the revelation about the book as art object that made them realize the work they needed to do. Building the controlled vocabularies was where the heavy lifting occurred. It involved going through giant spread sheets with subject terms in alphabetical order looking for typos and working toward consistency (ie, use plurals). The spreadsheet didn’t show how many items used each term – it was hard to know how many changes would be needed.

Question: Do you get hits from the standard online catalog into PBO?
Answer: This is not happening now. They would love to build a better connection between the OPAC and PBO in both institutions.

Question: How did you make decisions when there were disagreements?
Answer: “I don’t remember any more.. it was all so beautiful…” <laughter>. There were no big issues about standards. There were more issues about the grant and things like how many images or books they were supposed to scan. In some cases it was easy because they were in charge of very different project areas – each team had “their own little fiefdom”.

Question: Do you think you might sell images to generate revenue?
Answer: They have considered it. They have made a calendar and a poster, but gave them away. They also have used images for making holiday cards. They don’t see selling images as a main goal right now.

Question: Have you considered pursuing online collaborative methods for work with the scholars and collectors?
Answer: No, but they think that would be useful to explore.

My Thoughts

I loved the energy and connection displayed by the presenters. It was fun to see a team of people who clearly were so proud of their work and pleased by its reception. I was personally intrigued by the highlighted challenge of coming up with (and painstakingly validating) their controlled vocabulary for subjects. I firmly believe that the topic of subject terms and their standardization across repositories will only grow in importance. For those interested in some of what is being done on this front – take a look at both the UK based High Level Thesaurus (HILT) and the Simple Knowledge Organisation Systems Core (SKOS) project. I suspect many will be intrigued by the SKOS use case titled An integrated view to medieval illuminated manuscripts.

Even given the mammoth effort required to create a shared controlled vocabulary, it is clear that the benefits they have reaped from this effort are still being discovered. The speakers mentioned on multiple occasions how pleased (and surprised) they were to realize how powerful their database of metadata has proven to be. All the amazing value added features build on this ‘heavy lifting’.

While it will be rare for such item level attention to be given to most archival documents, PBO sets the bar high for what can be done via collaboration across institutions. Their dedication to sharing their lessons learned is a fine example of what all big projects who are forging new frontiers could be doing. Finally – it is the weight of all the value added elements (galleries, tutorials, lesson plans.. and the list goes on) that have raised what could have been just a set of classified images in a database to being an active community with a growing draw for many types of users from around the world.

As is the case with all my session summaries from SAA2007, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

SAA2007: Preserving Born Digital Records of the Design Community (Session 106)

September 8, 2007 9 Comments

The official title for SAA2007 Session 106 is Constructing Sustainability: Real-World Implementations of Preservation Standards for Born-Digital Design Documentation, but I think it might have been better served to include the word Architecture somewhere in its title. Sponsored by the Architectural Records Roundtable, this session considered issues related to preserving born digital records of “the design community”. The design community in question includes both architects and landscape designers.

Each panelist gave a 5 minute brief about the way in which they are working toward preserving these design community records – and the rest of the session was opened up to Q&A. David Read, the session chair, mentioned how they used a wiki to collect questions and ideas for the session, gave an introduction to each of the panelists and helped guide the Question and Answer portion of the session.

Who was on the panel?

David Read (Session Chair, Information Resources Manager, DiMella Shaffer)
Phil Bernstein (Autodesk, Architect and Technologist)
Carissa Kowalski Dougherty (Art Institute of Chicago, Department of Architecture and Design)
Annemarie van Roessel (Columbia University, Avery Architectural and Fine Arts Library)
Dennis Newman (general manager at PFS Corporation, member of PDF standards working group of ISO)

What is being done?

Phil Bernstein kicked off the 5 minute summaries with a quick history of design technology. He explained how currently there is a shift in progress. Hundreds of years of paper drawings were followed by ten to fifteen years of electronic drawings. The latest development is use of Building Information Modelling (BIM). BIM relies on a database that generates ‘reports’ that are in fact ‘drawings’. These are sometimes referred to as Building Development Information Models. Digital printers can produce physical models directly from the stored BIM data with no need to step through generation of an actual drawing outside the computer.

Phil showed Yale School of Architecture design examples from the BIM world. These were fantastical organically shaped creations that looked more like strange undiscovered plants from under the sea than traditional buildings!

The good news is that the data in the BIM databases are all just text. The bad news is that the generated ‘design artifacts’ are based on the text data and can lead to digitally printed artifacts. There has been an explosion in the various means of representation. The architecture world is catching up to other industries (such as the auto industry) that have been doing this for 25+ years.

Current architects are application agnostic – they don’t care what they use to create their outputs. All the paths and platforms will only grow – what is driving the design process will be increasing in complexity. The building industry is making a fundamental shift from electronic drawing to the Building Information Modeling approach – but there is an unlimited environment for representation. He hoped to discuss the intersection between the archival/record keeping issues and the problems facing the architecture world.

Carissa Kowalski Dougherty’s overview covered the Digital Archive for Architecture (DAArch) project out of the Art Institute of Chicago. The project was based on the 2004 study Collecting, Archiving, and Exhibiting Digital Design Data. They considered how Architecture and Design firms are using software tools to produce and design – but examined these questions from a museum and curatorial perspective.

The recommendation is a two-tiered collection approach.

First tier: Native files – like autocad files – these are going to be preserved at the bit level – but there is no commitment to ensuring access to these files
Second tier: Output formats – only pdf and tif files
PDF: line drawings, vector-based graphic files, text documents, PowerPoint presentations
TIF: renderings, digital photographs

The second tier outputs are what they are committing to “functionally preserve”.
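The two-tier rule is easy to express as a simple sort by file type. In this sketch the extension lists and the wording of the tier labels are my own assumptions for illustration; DAArch’s real accession rules may well be more nuanced.

```python
from pathlib import PurePath

# Hypothetical sketch: extension lists and labels are assumptions,
# not the DAArch project's actual rules.
NATIVE_EXTENSIONS = {".dwg", ".dxf", ".dgn"}
OUTPUT_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def collection_tier(filename):
    """Assign a file to a preservation tier based on its extension."""
    extension = PurePath(filename).suffix.lower()
    if extension in OUTPUT_EXTENSIONS:
        return "tier 2: functionally preserved output"
    if extension in NATIVE_EXTENSIONS:
        return "tier 1: bit-level preservation only"
    return "unclassified"


for name in ["plan.DWG", "elevation.pdf", "render.tif", "notes.xyz"]:
    print(name, "->", collection_tier(name))
```

Anything that lands in ‘unclassified’ would need a person to decide which tier, if any, it belongs in.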

Carissa presented an example of what they accessioned from the Garofalo Architects’ Manilow Residence (2001-2003) project. A lot of what they got were files that no-one (including the small architectural firm itself) could still open.. the software is gone. Another major challenge was poor naming conventions for the files themselves. The final project archive included over 200 native vector 2D files (.dxf, .dgn, .dwg), 145 pdfs.. and more.

From the UrbanLab they sought to preserve their Visitor Information Center Competition Entry from 2001. This was a project that was never built and therefore has little physical output. They mostly used autoCAD (2D), Maya (3D), FormZ (3D) and Adobe Illustrator (layout).

The DAArch Software highlights:

browser based
DSpace as back end
Dublin Core augmented with CDWA and custom metadata to support architecture data and digital materials
authority records
group and item level cataloging
will be available open source with BSD license via SourceForge (this was a requirement of the funder – that it be open source)

Final lessons and challenges from the DAArch project:

file naming and organization – the biggest challenges at the smaller firms – need outreach to these firms
metadata for digital objects – there is not a lot out there for 3D digital images
software and migration tools – can we/should we preserve the software dependent first tier files? or just the PDF/TIF outputs?
three-dimensional objects, BIM, animations, etc

Annemarie van Roessel discussed Columbia’s major Manhattanville project. Their goal is to make digital records last as long as steel and glass. The Avery Architectural and Fine Arts Library is feeling the pressure to be a leader, so how does Avery document this project? Manhattanville is a 30 year planning, design and build project targeted to be completed in 2030. It will cover 17 acres northwest of the main Columbia campus.

There are many building blocks to the digital design archives: autoCAD, project management records, collaborative environments (sharepoint – Microsoft), images, presentations, websites and movies (ie, more than just “scary CAD drawings”). They are planning staged preservation points. The Avery is committed to developing capacity for digital archiving by 2009. For their metadata they use at minimum the mandatory DACS elements mapped to Dublin Core elements.

Dennis Newman was the final panelist. He has clients who need to preserve/archive finished drawings – such as the documents being sent along to regulatory agencies for final approval. PDF/A-1 was based on ‘electronic paper’ – you lose lots of data when you ‘cut back’ to PDF-A. PDF-E is in its first draft/generation being submitted for version 1. PDF-A didn’t address 3D, complex metadata or moving images. PDF-E is based on Acrobat version 7. Adobe has thrown out PDF to the ISO community. Dennis believes that the final ‘as-built’ drawing is what should be the archived version.

He pointed out that Stage I responders need more information than the regulator commissions need. Since 9-11 the state requirements have changed about what needs to be in the ‘record’.

As an IT professional he was asked “what can we do” and his answer is “how much do you want to spend?”. IT can do anything – but it takes time and money.

Questions and Answers

Keep in mind throughout this section that I was summarizing the questions and their answers as best I could. Please do not take any statements attributed to the session speakers as full and complete quotes. In cases where I missed too much of the question or answer I generally skipped including it in the list below. If you are anxious to know exactly what someone said, you would need to buy and listen to the conference recordings for this session.

QUESTION : Could a neutral exchange format such as International Alliance for Interoperability‘s (IAI) Industry Foundation Classes (IFC) be the foundation or a piece of the next step in preservation of born digital design documentation? Text + data model that could be read by different software (import/export of data). You can do this now with AutoCAD – you can dump into IFC.

Phil: Is a neutral exchange format the answer to the archiving problems? Software is changing so fast that there is no way that a standard could keep up with it. Also – even if all the data in the world could be put in XML – you still need something to ‘read and do something’ with the data. He put the business process diagram on screen from his talk and pointed out that all the different tools and their outputs exist within the CONTEXT of the business process itself.

Carissa (?): IFC is a recommendation of the Art Institute of Chicago

QUESTION: William Reilly from the FACADE project started to ask about the challenges inherent in the fact that the IFC standard only gives you the geometry. There was some back and forth about this idea with voices noting that IFC can capture more than that.. but not everything.

Kristine Fallon: The idea of doing a neutral format for complex information is a complicated thing. Going back probably 20 years, the people working on data exchange standards for engineering … the different software won’t perfectly talk to each other – but what they can do is exchange ‘model views’. The IFC data model is capable of a fairly comprehensive set of model views.

QUESTION : Who is going to keep it up in 20 years? Are the software producers going to keep it up?

Phil: Autodesk spent 5 million dollars in building the IFCs. If the archivists align their needs with the business needs then the business will pay for it and the archivists will get what they need.

Annemarie: The archivists don’t have the money and resources.. even at Columbia they don’t have the money to buy generation after generation of the software to read all the different file formats. Maybe the MIT approach of emulation is a better approach.

David: Will there ever be a day that I will have an emulator on my desktop? That makes me more curious about exporting pure text.. I can get my head around preservation of that.

Annemarie: The Manhattanville project is the first step for Columbia in collecting digital data. Archivists need to reach out to organizations now to explain that they want to preserve what they are creating. I am being honest about the chaos coming down the track when we start getting the data from the 90s.

QUESTION (from the audience): The function of IFC is not for archiving.. it is for different software products to communicate with one another. How do you figure out which artifacts of the design process to keep? How do you extract the ‘important’ parts to keep from what is ‘less’ important?

Phil : What about when there are physical digital models, analytical models and more.. how do you understand all of it?

Carissa: The architectural firms need to be able to get to all of this too. It isn’t just archivists who should be caring about access to all these models. There are legal ramifications and the possibility of renovations later… this needs to come out of the architecture profession.

QUESTION (I asked the following question): Given that preserving the final products is so challenging, has any thought been given to trying to preserve the process? With paper, the evolution of a design is easier to preserve.

Annemarie: In the Manhattanville project one of the big challenges is the architect who does lots of self editing. In many cases they don’t want the world to see their interim choices during the design process.

Phil: Digital tools can encourage you to explore useless ideas. Keep in mind that the journal file for the Building Information Model keeps track of every change. It will tell you that on Tuesday at 4:10 pm someone moved this door 5 inches to the left.

Carissa: At the Art Institute, architects and archivists need to work together to figure out what is worth capturing.

David: Two different schools of thought. Archiving the final product or archiving the process. File formats are preserving the final product.

QUESTION : There is danger in keeping everything – the goal of archiving is to keep the best final version. The big hulking databases of the world open the door to keeping an overwhelming set of unimportant data.

Annemarie: The needs of all their different consumers are so broad. Perhaps taking a snapshot should happen more often, in thinner slices.

Carissa : 2D snapshots are not going to capture the fullness of a 3D object. But it isn’t capturing as much as it might.

Phil: There could be an interactive digital simulation that generates 3D models.. there could be no ‘final’ product. Can we have an impact on how info is kept 4, 10, 30 years from now – for the future? In a world where you can borrow (or pay for) processing time… someone will keep all the versions of AutoCAD.. you will pay for the 15 seconds of rendering time in AutoCAD 14 from some 3rd party.

Kristine Fallon: There is a real business purpose to sorting this out… the IAI work is very real world.. defining model views can help support business.. but they can also support the goals of archivists.

Kristine Fallon‘s Question : Was PDF-E designed to be an archival format?

Dennis: No.. it was designed to be a data interchange format. People who don’t want to give lots of proprietary data to another vendor – they still need to give them a bunch of data to work with them.. that is where PDF-E came out of.

My Thoughts

As seems to be the case with all born digital records, there are no easy answers. While events like 9-11 have had impacts on the types of final products that regulatory agencies and first responders need to evaluate and have easy access to, the speed of innovation and evolution in building design is stunning. It should come as no surprise that architects are more concerned with finding the best tools for their trade than they are with how to preserve the artifacts of their ultimate creations. They will change the tools they use when they find a better tool to manifest their vision.

The most promising option seems to be having archivists get involved in discussions with the software developers, the architects, the builders and government early in the design process. The traditional model of archivists receiving the final products of business processes years after they were completed does not appear to be an answer on which we can depend. I suspect that proactive efforts to plan for preservation from the start will pay off – both for those trying to use the records 10 years from now and for those who want to preserve some subset of the records of the design community for future generations.

Metadata World Building: Freebase.com and OpenLibrary.org

July 17, 2007 10 Comments

I find it interesting to have discovered both Freebase.com (an open shared database of the world’s knowledge) and the Open Library Demo Site (ultimately intended to be a library of all the world’s information about all the world’s books) in the same week. They are both counting on crowdsourcing to populate large databases of information – all of which will be available for use by the general public. Freebase.com’s data is all licensed under Creative Commons CC-BY while Open Library describes that the data must be “a product of the people: letting them create and curate its catalog, contribute to its content, participate in its governance, and have full, free access to its data”.

For the Open Library project, the creators “wrote a new type of wiki that lets users enter structured data”. They have a page showing the data structure for bibliographic items on a Schema page. They also document their creation of a database framework called ThingDB which was designed to hold huge quantities of records, “hold arbitrary semi-structured data” and handle history/versioning.

For Freebase, the data structure itself is evolving – and alpha users have a major hand in guiding that evolution. They have three major building blocks: Topics, Properties and Types. A Topic may be anything about which you want to create a Freebase entry: a person, place, thing, or idea. Properties are exactly what they sound like – such as the name of a person, the location of a place or the material a thing is made of. The Types are what make it all interesting. A Freebase Type is a set of properties. For any Topic that is about a person it makes sense that you want an easy way to add all the ‘person’ related Properties to that Topic. But a Topic can have more than one Type associated with it – and a Property can be populated by values that themselves are (or become) Topics. I would give you links to examples in Freebase, but to get your fingers in the mix at this point you need to put your name in the hat and wait for them to invite you to play with the alpha version.

Let us consider Books. Freebase already has the notion of an Author. Any Topic associated with the Author Type automagically can have books the person wrote associated with them as values for a multi-value property — and each of the books added will instantly become a new Topic (or you can pick them from a list if Freebase already knows about the book) with all the Properties you need for a book (and yes – it already knows who the Author is if you added the book from the Author’s page).
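
To make the Topic/Type/Property relationship concrete, here is a minimal Python sketch. It is not Freebase's actual schema or API: the class names, the short property lists and the add_book helper are all hypothetical, chosen only to illustrate how a Topic gathers Properties from several Types and how a property value can itself become a Topic.

```python
from dataclasses import dataclass, field


@dataclass
class FreebaseType:
    """A Type is a named set of Properties (hypothetical model, not Freebase's API)."""
    name: str
    properties: list[str]


@dataclass
class Topic:
    """A Topic can carry several Types; property values may themselves be Topics."""
    name: str
    types: list[FreebaseType] = field(default_factory=list)
    values: dict[str, list] = field(default_factory=dict)

    def available_properties(self) -> list[str]:
        # Union of the properties contributed by every Type on this Topic.
        seen: list[str] = []
        for t in self.types:
            for prop in t.properties:
                if prop not in seen:
                    seen.append(prop)
        return seen


# Illustrative Types with a tiny subset of properties.
PERSON = FreebaseType("Person", ["Date of Birth", "Place of Birth"])
AUTHOR = FreebaseType("Author", ["Books Written"])
BOOK = FreebaseType("Book", ["Author"])


def add_book(author: Topic, title: str) -> Topic:
    """Adding a book from an author's page creates a Book Topic that already knows its author."""
    book = Topic(title, types=[BOOK], values={"Author": [author]})
    author.values.setdefault("Books Written", []).append(book)
    return book


twain = Topic("Mark Twain", types=[PERSON, AUTHOR])
huck = add_book(twain, "Adventures of Huckleberry Finn")
```

Attaching a further Type to twain would simply extend what available_properties() returns, which is the flexibility described above.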

At the time I wrote this post, the Freebase topic page for Mark Twain shows the following associated types:

Person (People)
Film Writer (Film)
Author (Publishing)
Deceased Person (People)
Influence Node (mikelove’s types)
Book Subject (Publishing)

The values in parentheses above are the Domains to which a Type belongs. You will note that the ‘Influence Node’ Type is associated with a domain called mikelove’s types. This is because Freebase lets individuals create new Types. These types can later be promoted for use by anyone – and if I am reading the help properly, can be ‘published’ for others to use even before they are ‘promoted’ to belong to an official Domain.

For Mark Twain, each of the Types listed above brings the opportunity to populate various Properties of structured data. Here are all the structured data Properties available for population for Mark Twain (some with their current values):

Name: Mark Twain
Description: currently imported from Wikipedia
Also known as: Samuel Langhorne Clemens
Gender: Male
Date of Birth: Nov 30, 1835
Place of Birth: Florida, Missouri
Country Of Nationality: United States
Profession
Religion
Spouse(s)
Parents: Jane Lampton Clemens, John Marshall Clemens
Children
Sibling(s)
Height (meters)
Weight (kg)
Date of Death: 1910
Place of Death: Redding, Connecticut
Cause Of Death
Date of burial
Place of burial
Web Links
Employment History
Education
Quotations
Film Writing Credits
Books Written
Short Stories Written
Essays/Articles Written
Influenced By
Peers
Influenced
Books About This Topic
Short Works of Non-Fiction About This Topic

In contrast – take a look at the Open Library page for Mark Twain – these are the structured data elements I spotted (populated or otherwise) on that page:

Name
Text Entry
November 30, 1835 – April 21, 1910
Genres
Related Authors
Website
Location: Elmira, New York (buried)
Alternate Name: Samuel Clemens
Books by this author

I wonder how much of the book/publishing/literary world of data will be duplicated between Open Library and Freebase. They are both very young (‘alpha’ for Freebase vs ‘early technology preview’ for Open Library) and both are throwing open their arms for help. Open Library has a special page about the librarianship and another about how you can help. If you can get the secret pass into Freebase Alpha, there are plenty of discussions, examples, demos, help and opportunities to contribute. (I set the value of the Gender Property on the Mark Twain record while pondering it for this blog post).

The biggest difference between these two projects relates to what each is trying to accomplish. Freebase wants to be a super flexible universal database of knowledge that can be used to power applications (and they have tons of tools and APIs aimed at developers). Open Library is all about books, books, books – and all things related to them. Freebase has many Topics assigned the Book Type (8,749), but has even more associated with Person (355,359), Restaurant (99,971), City/Town (59,848), Company (20,405), Film (22,641) and many more.

Freebase is the growing creation of Metaweb (made up of veterans of Netscape, The Internet Archive, Alexa, Tellme, Intel and Broderbund). The Open Library Demo includes a tidy list of the people who created it. I noticed Alexa and The Internet Archive on both lists – small world.

I love the idea of Open Library (and will enthusiastically follow its progress), but personally I am more excited to work on Archives related data in Freebase. I am pondering the creation of a new Archives Type and an Archival Collection or Archival Finding Aid Type. Sound like fun? Ask for an Alpha account and let me know when you get one … I would love to brainstorm this with others of a like mind.

Thoughts on Digital Preservation, Validation and Community

July 6, 2007 2 Comments

The preservation of digital records is on the mind of the average person more with each passing day. Consider the video below from the recent BBC article Warning of data ticking time bomb.

Microsoft UK Managing Director Gordon Frazer running Windows 3.1 on a Vista PC
(Watch video in the BBC News Player)

The video discusses Microsoft’s Virtual PC program that permits you to run multiple operating systems via a Virtual Console. This is an example of the emulation approach to ensuring access to old digital objects – and it seems to be done in a way that the average user can get their head around. Since a big part of digital preservation is ensuring you can do something beyond reading the 1s and 0s – it is a promising step. It also pleased me that they specifically mention the UK National Archives and how important it is to them that they can view documents as they originally appeared – not ‘converted’ in any way.

Dorothea Salo of Caveat Lector recently posted Hello? Is it me you’re looking for?. She has a lot to say about digital curation, IR (which I took to stand for Information Repositories rather than Information Retrieval) and librarianship. Coming, as I do, from the software development and database corners of the world I was pleased to find someone else who sees a gap between the standard assumed roles of librarians and archivists and the reality of how well suited librarians’ and archivists’ skills are to “long-term preservation of information for use” – be it digital or analog.

I skimmed through the 65 page Joint Information Systems Committee (JISC) report Dorothea mentioned (Dealing with data: Roles, rights, responsibilities and relationships). A search on the term ‘archives’ took me to this passage on page 22:

There is a view that so-called “dark archives” (archives that are either completely inaccessible to users or have very limited user access), are not ideal because if data are corrupted over time, this is not realised until point of use. (emphasis added)

For those acquainted with software development, the term regression testing should be familiar. It involves the creation of automated suites of test programs that ensure that as new features are added to software, the features you believe are complete keep on working. This was the first idea that came to my mind when reading the passage above. How do you do regression testing on a dark archive? And thinking about regression testing, digital preservation and dark archives fueled a fresh curiosity about what existing projects are doing to automate the validation of digital preservation.
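
One way to picture ‘regression testing’ for a dark archive is a scheduled fixity check: record a checksum for every file at ingest, then periodically recompute and compare, so that corruption is noticed before the point of use. The sketch below is only an illustration under that assumption. The function names and the manifest layout are hypothetical, and it is not drawn from any of the projects discussed in this post.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """At ingest: map each file's path (relative to root) to its checksum."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Later: list files that are missing or no longer match the manifest."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Run on a schedule, an empty result from audit() plays the role of a passing test suite; anything else is a prompt to restore from another copy. A check like this only proves the bits are unchanged, not that anyone can still interpret them.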

A bit of Googling found me the UK National Archives requirements document for The Seamless Flow Preservation and Maintenance Project. They list regression testing as a ‘desirable’ requirement in the Statement of Requirements for Preservation and Maintenance Project Digital Object Store (defined as “those that should be included, but possibly as part of a later phase of development”). Of course it is very hard to tell if this regression testing is for the software tools they are building or for access to the data itself. I would bet the former.

Next I found my way to the website for LOCKSS (Lots of Copies Keep Stuff Safe). While their goals relate to the preservation of electronically published scholarly assets on the web, their approach to ensuring the validity of their data over time should be interesting to anyone thinking about long term digital preservation.

In the paper Preserving Peer Replicas By Rate-Limited Sampled Voting they share details of how they manage validation and repair of the data they store in their peer-to-peer architecture. I was bemused by the categories and subject descriptors assigned to the paper itself: H.3.7 [Information Storage and Retrieval]: Digital Libraries; D.4.5 [Operating Systems]: Reliability. Nothing about preservation or archives.

It is also interesting to note that you can view most of the original presentation at the 19th ACM Symposium on Operating Systems Principles (SOSP 2003) from a video archive of webcasts of the conference. The presentation of the LOCKSS paper begins about halfway through the 2nd video on the video archive page.

The start of the section on design principles explains:

Digital preservation systems have some unusual features. First, such systems must be very cheap to build and maintain, which precludes high-performance hardware such as RAID, or complicated administration. Second, they need not operate quickly. Their purpose is to prevent rather than expedite change to data. Third, they must function properly for decades, without central control and despite possible interference from attackers or catastrophic failures of storage media such as fire or theft.

Later they declare the core of their approach as “..replicate all persistent storage across peers, audit replicas regularly and repair any damage they find.” The paper itself has lots of details about HOW they do this – but for the purpose of this post I was more interested in their general philosophy on how to maintain the information in their care.
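
The ‘replicate, audit, repair’ philosophy can be illustrated with a deliberately simplified majority vote over in-memory replicas. This is not the LOCKSS protocol, which the paper describes as rate-limited sampled voting among peers. The function name and the data layout below are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Repair replicas that disagree with the majority; return the peers repaired.

    Simplified sketch: every peer votes and a strict majority is required.
    """
    digests = {peer: hashlib.sha256(data).hexdigest() for peer, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise ValueError("no majority agreement; cannot repair safely")
    # Take any copy that matches the winning digest as the repair source.
    good_copy = next(data for peer, data in replicas.items() if digests[peer] == winner)
    repaired = [peer for peer in replicas if digests[peer] != winner]
    for peer in repaired:
        replicas[peer] = good_copy
    return repaired
```

With three peers holding b"x", b"x" and b"y", the third is overwritten with the majority copy. With an even split the function refuses to guess, which is the safer behavior for an archive.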

DAITSS (Dark Archive in the Sunshine State) was built by the Florida Center for Library Automation (FCLA) to support their own needs when creating the Florida Center for Library Automation Digital Archive (Florida Digital Archive or FDA). In mid May of 2007, FCLA announced the release of DAITSS as open source software under the GPL license.

In the document The Florida Digital Archive and DAITSS: A Working Preservation Repository Based on Format Migration I found:

… the [Florida Digital Archive] is configured to write three copies of each file in the [Archival Information Package] to tape. Two copies are written locally to a robotic tape unit, and one copy is written in real time over the Internet to a similar tape unit in Tallahassee, about 130 miles away. The software is written in such a way that all three writes must complete before processing can continue.

Similar to LOCKSS, DAITSS relies on what they term ‘multiple masters’. There is no concept of a single master. Since all three are written virtually simultaneously they are all equal in authority. I think it is very interesting that they rely on writing to tapes. There was a mention that it is cheaper – yet due to many issues they might still switch to hard drives.
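
The rule that all three writes must complete before processing can continue can be sketched as follows. This is an illustration only, not DAITSS code. The store_everywhere function and the idea of passing storage targets as callables are assumptions standing in for the local and remote tape units, and the writes here happen one after another rather than virtually simultaneously.

```python
from typing import Callable, Sequence

# A storage target is modeled as a callable taking (name, data).
Writer = Callable[[str, bytes], None]


def store_everywhere(name: str, data: bytes, writers: Sequence[Writer]) -> None:
    """Write one file to every storage target; raise if any write fails.

    The caller only moves on when this returns, so no copy is treated as
    'the master': either every copy was written or processing stops.
    """
    failures = []
    for index, write in enumerate(writers):
        try:
            write(name, data)
        except OSError as error:
            failures.append((index, error))
    if failures:
        raise RuntimeError(f"{len(failures)} of {len(writers)} writes failed: {failures}")
```

Because a failure on any target stops the pipeline, the copies never drift apart silently, which is what lets each one be treated as equal in authority.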

With regard to formats and ensuring accessibility, the same document quoted above states on page 2:

Since most content was expected to be documentary (image, text, audio and video) as opposed to executable (software, games, learning modules), FCLA decided to implement preservation strategies based on reformatting rather than emulation….Full preservation treatment is available for twelve different file formats: AIFF, AVI, JPEG, JP2, JPX, PDF, plain text, QuickTime, TIFF, WAVE, XML and XML DTD.

The design of DAITSS was based on the Reference Model for an Open Archival Information System (OAIS). I love this paragraph from page 10 of the formal specifications for OAIS adopted as ISO 14721:2002.

The information being maintained has been deemed to need Long Term Preservation, even if the OAIS itself is not permanent. Long Term is long enough to be concerned with the impacts of changing technologies, including support for new media and data formats, or with a changing user community. Long Term may extend indefinitely. (emphasis added)

Another project implementing the OAIS reference model is CASPAR – Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval. This project appears much greater in scale than DAITSS. It started a bit more than 1 year ago (April 1, 2006) with a projected duration of 42 months, 17 partners and a projected budget of 16 million Euros (roughly 22 million US Dollars at the time of writing). Their publications section looks like it could sidetrack me for weeks! On page 25 of the CASPAR Description of Work, in a section labeled Validation, a distinction is made between “here and now validation” and “the more fundamental validation techniques on behalf of the ‘not yet born'”. What eloquent turns of phrase!

Page 7 found me another great tidbit in a list of digital preservation metrics that are expected:

2) Provide a practical demonstration by means of what may be regarded as “accelerated lifetime” tests. These should involve demonstrating the ability of the Framework and digital information to survive:
a. environment (including software, hardware) changes: Demonstration to the External Review Committee of usability of a variety of digitally encoded information despite changes in hardware and software of user systems, and such processes as format migration for, for example, digital science data, documents and music
b. changes in the Designated Communities and their Knowledge Bases: Demonstration to the External Review Committee of usability of a variety of digitally encoded information by users of different disciplines

Here we have thought not only about the technicalities of how users may access the objects in the future, but consideration of users who might not have the frame of reference or understanding of the original community responsible for creating the object. I haven’t seen any explicit discussion of this notion before – at least not beyond the basic idea of needing good documentation and contextual background to support understanding of data sets in the future. I love the phrase ‘accelerated lifetime’ but I wonder how good a job we can do at creating tests for technology that does not yet exist (consider the Ladies Home Journal predictions for the year 2000 published in 1900).

What I love about LOCKSS, DAITSS and CASPAR (and no, it isn’t their fabulous acronyms) is the very diverse groups of enthusiastic people trying to do the right thing. I see many technical and research oriented organizations listed as members of the CASPAR Consortium – but I also see the Università degli studi di Urbino (noted as “created in 1998 to co-ordinate all the research and educational activities within the University of Urbino in the area of archival and library heritage, with specific reference to the creation, access, and preservation of the documentary heritage”) and the Humanities Advanced Technology and Information Institute, University of Glasgow (noted as having “developed a cutting edge research programme in humanities computing, digitisation, digital curation and preservation, and archives and records management”). LOCKSS and DAITSS have both evolved in library settings.

Questions relating to digital archives, preservation and validation are hard ones. New problems and new tools (like Microsoft’s Virtual PC shown in the video above) are appearing all the time. Developing best practices to support real world solutions will require the combined attention of those with the skills of librarians, archivists, technologists, subject matter specialists and others whose help we haven’t yet realized we need. The challenge will be to find those who have experience in multiple areas and pull them into the mix. Rather than assuming that one group or another is the best choice to solve digital preservation problems, we need to remember there are scores of problems – most of which we haven’t even confronted yet. I vote for cross pollination of knowledge and ideas rather than territorialism. I vote for doing your best to solve the problems you find in your corner of the world. There are more than enough hard questions to answer to keep everyone who has the slightest inclination to work on these issues busy for years. I would hate to think that any of those who want to contribute might have to spend energy to convince people that they have the ‘right’ skills. Worse still – many who have unique viewpoints might not be asked to share their perspectives because of general assumptions about the ‘kind’ of people needed to solve these problems. Projects like CASPAR give me hope that there are more examples of great teamwork than there are of people being left out of the action.

There is so much more to read, process and understand. Know of a digital preservation project with a unique approach to validation that I missed? Please contact me or post a comment below.

WordPress Blog Magic – A look under the hood

June 22, 2007 9 Comments

Spellbound Blog is served to you via the fabulous open source software that is WordPress. Last night I finally created a page about the major WordPress plugins I have used to customize this blog. If you are interested in such things, take a look at my new WordPress Customizations page.

Book Review: Dreaming in Code (a book about why software is hard)

May 24, 2007 1 Comment

Dreaming in Code: Two Dozen Programmers, Three Years, 4,732 Bugs, and One Quest for Transcendent Software
(or “A book about why software is hard”) by Scott Rosenberg

Before I dive into my review of this book – I have to come clean. I must admit that I have lived and breathed the world of software development for years. I have, in fact, dreamt in code. That is NOT to say that I was programming in my dream, rather that the logic of the dream itself was rooted in the logic of the programming language I was learning at the time (they didn’t call it Oracle Bootcamp for nothing).

With that out of the way I can say that I loved this book. This book was so good that I somehow managed to read it cover to cover while taking two graduate school courses and working full time. Looking back, I am not sure when I managed to fit in all 416 pages of it (ok, there are some appendices and such at the end that I merely skimmed).

Rosenberg reports on the creation of an open source software tool named Chandler. He got permission to report on the project much as an embedded journalist does for a military unit. He went to meetings. He interviewed team members. He documented the ups and downs and real-world challenges of building a complex software tool based on a vision.

If you have even a shred of interest in the software systems that are generating records that archivists will need to preserve in the future – read this book. It is well written – and it might just scare you. If there is that much chaos in the creation of these software systems (and such frequent failure in the process), what does that mean for the archivist charged with the preservation of the data locked up inside these systems?

I have written about some of this before (see Understanding Born Digital Records: Journalists and Archivists with Parallel Challenges), but it bears repeating: If you think preserving records originating from standardized packages of off-the-shelf software is hard, then please consider that really understanding the meaning of all the data (and business rules surrounding its creation) in custom built software systems is harder still by a factor of 10 (or 100).

It is interesting for me to feel so pessimistic about finding (or rebuilding) appropriate contextual information for electronic records. I am usually such an optimist. I suspect it is a case of knowing too much for my own good. I also think that so many attempts at preservation of archival electronic records are in their earliest stages – perhaps in that phase in which you think you have all the pieces of the puzzle. I am sure there are others who have gotten further down the path only to discover that their map to the data does not bear any resemblance to the actual records they find themselves in charge of describing and arranging. I know that in some cases everything is fine. The records being accessioned are well documented and thoroughly understood.

My fear that in many cases we won’t know that we don’t have all the pieces we need to decipher the data until many years down the road leads me to an even darker place. While I may sound alarmist, I don’t think I am overstating the situation. This comes from my first hand experience in working with large custom built databases. Often (back in my life as a software consultant) I would be assigned to fix or add on to a program I had not written myself. This often feels like trying to crawl into someone else’s brain.

Imagine being told you must finish a 20 page paper tonight – but you don’t get to start from scratch and you have no access to the original author. You are provided a theoretically almost complete 18 page paper and piles of books with scraps of paper stuck in them. The citations are only partly done. The original assignment leaves room for original ideas – so you must discern the topic chosen by the original author by reading the paper itself. You decide that writing from scratch is foolish – but are then faced with figuring out what the person who originally was writing this was trying to say. You find 1/2 finished sentences here and there. It seems clear they meant to add entire paragraphs in some sections. The final thorn in your side is being forced to write in a voice that matches that of the original author – one that is likely odd sounding and awkward for you. About halfway through the evening you start wishing you had started from scratch – but now it is too late to start over, you just have to get it done.

So back to the archivist tasked with ensuring that future generations can make use of the electronic records in their care. The challenges are great. This sort of thing is hard even when you have the people who wrote the code sitting next to you available to answer questions and a working program with which to experiment. It just makes my head hurt to imagine piecing together the meaning of data in custom built databases long after the working software and programmers are well beyond reach.

Does this sound interesting or scary or relevant to your world? Dreaming in Code is really a great read. The people are interesting. The issues are interesting. The author does a good job of explaining the inner workings of the software world by following one real world example and grounding it in the landscape of the history of software creation. And he manages to include great analogies to explain things to those looking in curiously from outside of the software world. I hope you enjoy it as much as I did.

ArchivesZ: Visualizing Archival Collections

May 13, 2007 8 Comments

Announcing ArchivesZ – a tool for visualizing archival collections. This prototype is the final project for my information visualization class. It is a web based tool designed to support exploration of aggregated data about archival collections – inspired by the availability of structured data in EAD encoded finding aids.

For visual thinkers who just want to see what this is about – take a look at our 5 minute video demonstration. I don’t yet have a version online for folks to play with – but that is in the works.

This is the official blurb we came up with to describe the project:

ArchivesZ is an information visualization tool designed to support search, understanding and exploration of archival and manuscript collections. The tool addresses one of the major challenges facing those who work with archival records – the need to understand the scope and quantity of available records. Since archival collections are unique, vary dramatically in record quantity and are organized based on the records creators it can be a great challenge for users to gain perspective concerning the available records across multiple collections. ArchivesZ leverages a unique dual sided histogram to support exploration of the multiple subjects assigned to each collection. As subject terms are selected, the dual sided histogram chart is generated to display related subjects. The tool combines the dual sided histogram with a more traditional histogram displaying year data to permit tightly coupled, multi-dimensional browsing of subject and time period metadata. By representing the distribution of subjects and time periods using the metric of total aggregate linear feet of associated collections, ArchivesZ permits users to get a better sense of total available research materials than they would by viewing a standard search result list.

If you are curious about what a ‘dual-sided histogram’ actually is (or just want to read more about the process and ideas that led us to the current incarnation of ArchivesZ) take a look at our final paper about ArchivesZ.

There is a very long list of features I would like to add or improve but of course there is only so much you can do in the few weeks available for a project like this. Some of our ideas are detailed at the end of the paper I linked to above. I plan to continue working on ArchivesZ and I welcome all feedback – either as comments to this post or via email to jeanne AT spellboundblog.com.

RSS and Mainstream News Outlets

May 3, 2007 3 Comments

Recently posted on the FP Passport blog, The truth about RSS gives an overview of the results of a recent RSS study that looks at the RSS feeds produced by 19 major news outlets. The complete study (and its results) can be found here: International News and Problems with the News Media’s RSS Feeds.

If you are interested in my part in all this, read the Study Methodology section (which describes my role under the heading ‘How the Research Team Operated’) and the What is RSS? page (which I authored and which describes both the basics of RSS and some other web based tools we used in the study – Yahoo Pipes and Google Docs).

Why should you care about RSS? RSS feeds are becoming more common on archives websites. RSS should be treated as just another tool in the outreach toolbox for making sure that your archives maintains or improves its visibility online. To get an idea of how feeds are being used, consider the example of the UK National Archives. They currently publish three RSS feeds:

Latest news: Get the latest news and events for The National Archives.
New document releases: Highlights of new document releases from The National Archives.
Podcasts: Listen to talks, lectures and other events presented by The National Archives.

The results of the RSS study I link to above shed light on the kinds of choices that are made by content providers who publish feeds – and on the expectations of those who use them. If you don’t know what RSS is – this is a great intro. If you use and love (or hate) RSS already – I would love to know your thoughts on the study’s conclusions.

Visualizing Archival Collections

April 8, 2007 1 Comment

As I mentioned earlier, I am taking an Information Visualization class this term. For our final class project I managed to inspire two other classmates to join me in creating a visualization tool based on the structured data found in the XML version of EAD finding aids.

We started with the XML of the EAD finding aids from University of Maryland’s ArchivesUM and the Library of Congress Finding Aids. My teammates have written a parser that extracts various things from the XML such as title, collection size, inclusive dates and subjects. Our goal is to create an innovative way to improve the exploration and understanding of archival collections using an interactive visualization.
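
To make the extraction step concrete, here is a minimal Python sketch. It is not the team's actual parser. The sample finding aid is made up, the function names are hypothetical, and the element paths assume a simple, namespace-free file that uses the standard EAD elements unittitle, unitdate, physdesc/extent and controlaccess/subject. Real finding aids often nest extra markup inside these elements or add namespaces, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET

# A tiny, invented EAD-style fragment used purely for illustration.
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitdate type="inclusive">1890-1945</unitdate>
      <physdesc><extent>12.5 linear feet</extent></physdesc>
    </did>
    <controlaccess>
      <subject>Maryland</subject>
      <subject>Agriculture</subject>
    </controlaccess>
  </archdesc>
</ead>
"""


def text_of(root, path):
    """Return the stripped text of the first element at path, or None."""
    node = root.find(path)
    if node is None or not node.text:
        return None
    return node.text.strip()


def parse_finding_aid(xml_text):
    """Pull title, inclusive dates, extent and subjects from one finding aid."""
    root = ET.fromstring(xml_text)
    return {
        "title": text_of(root, "./archdesc/did/unittitle"),
        "dates": text_of(root, "./archdesc/did/unitdate"),
        "extent": text_of(root, "./archdesc/did/physdesc/extent"),
        # Collect every subject term, wherever it appears in the file.
        "subjects": [s.text.strip() for s in root.iter("subject") if s.text],
    }


record = parse_finding_aid(SAMPLE_EAD)
print(record)
```

The extent comes back as free text here ("12.5 linear feet"). Turning it into a number that can be summed across collections is a separate cleanup step, and one that gets messy quickly when finding aids describe size in boxes, items or reels.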

Our main targets right now are to use a combination of subjects, years and collection size to give users a better impression of the quantity of archival materials that fit various search criteria. I am a bit obsessed about using the collection size as a metric for helping users understand the quantity of materials. If you do a search for a book in a library’s catalog – getting 20 hits usually means that you are considering 20 books. If you consider archival collections – 20 hits could mean 20 linear feet (20 collections each of which is 1 linear foot in size) or it could mean 2000 linear feet (20 collections each of which is 100 linear feet in size). Understanding this difference is something that visualization can help us with. Rather than communicating only the number of results – the visualization will communicate the total size of collections assigned each of the various subjects.
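
The difference between counting hits and totaling linear feet can be shown in a few lines of Python. This is only a sketch of the idea: the record layout (a "subjects" list and a "linear_feet" number per collection) and the function name are assumptions made for the example, not the structure ArchivesZ uses.

```python
from collections import defaultdict


def summarize_by_subject(collections):
    """Return {subject: (number of collections, total linear feet)}."""
    hits = defaultdict(int)
    feet = defaultdict(float)
    for collection in collections:
        # set() guards against a subject listed twice on one collection.
        for subject in set(collection["subjects"]):
            hits[subject] += 1
            feet[subject] += collection["linear_feet"]
    return {subject: (hits[subject], feet[subject]) for subject in hits}


# Two invented result sets with the same number of hits.
small = [{"subjects": ["Maryland"], "linear_feet": 1.0} for _ in range(20)]
large = [{"subjects": ["Maryland"], "linear_feet": 100.0} for _ in range(20)]

print(summarize_by_subject(small)["Maryland"])  # (20, 20.0)
print(summarize_by_subject(large)["Maryland"])  # (20, 2000.0)
```

Both searches return 20 hits, but the second represents a hundred times as much material. That second number is the one the visualization is meant to surface.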

I have uploaded 2 preliminary screen mockups (one here and the second here) to try to get at my ideas for how this might work.

Not reflected in the mock-ups is what could happen when a user clicks on the ‘related subject’ bars. Depending on where they click, one of two things could happen. If they click on the ‘related subject’ bar WITHIN the boundaries of the selected subject (in the case above, that would mean within the ‘Maryland’ box), then the search would filter further to show only those collections that have both the ‘Maryland’ tag and the newly added tag. The ‘related subjects’ list and displayed year distribution would change accordingly as well. If, instead, the user clicks on a ‘related subject’ bar OUTSIDE the boundary of the selected subject – then that subject would become the new (and only) selected subject and the displayed collections, related subjects and years would change accordingly.
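
The two click behaviors described above boil down to "add to the selection" versus "replace the selection". Here is a rough Python sketch of that logic. The function names, the boolean flag standing in for where the click landed, and the sample collections are all hypothetical. Nothing here reflects how the prototype is actually built.

```python
def click_related_subject(selected, clicked, inside_selected_box):
    """Return the new set of selected subjects after a click on a related bar."""
    if inside_selected_box:
        # Click inside the selected subject's box: narrow the search further.
        return selected | {clicked}
    # Click outside the box: the clicked subject becomes the only selection.
    return frozenset({clicked})


def matching_collections(collections, selected):
    """Keep only the collections tagged with every selected subject."""
    return [c for c in collections if selected <= set(c["subjects"])]


# Invented sample data.
collections = [
    {"title": "A", "subjects": ["Maryland", "Agriculture"]},
    {"title": "B", "subjects": ["Maryland"]},
    {"title": "C", "subjects": ["Agriculture"]},
]

selected = frozenset({"Maryland"})
narrowed = click_related_subject(selected, "Agriculture", inside_selected_box=True)
replaced = click_related_subject(selected, "Agriculture", inside_selected_box=False)

print([c["title"] for c in matching_collections(collections, narrowed)])  # ['A']
print([c["title"] for c in matching_collections(collections, replaced)])  # ['A', 'C']
```

After either kind of click, the related subjects list and the year distribution would be recomputed from whatever `matching_collections` returns.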

So that is what we have so far. If you want to keep an eye on our progress, our team has a page up on our class wiki about this project. I have a ton of ideas of other things I would love to add to this (my favorite being a map of the world with indications of where the largest amount of archival materials can be found based on a keyword or subject search) – but we have to keep our feet on the ground long enough to actually build something for our class project. This is probably a good thing. Smaller goals make for a greater chance of success.

Category: software