
Category: access

Archival Transcriptions: for the public, by the public

There is a recent thread on the archives listserv about transcription – specifically for small projects or those with little financial support. There is even a case in which there is no easy OCR answer, due to the state of the digitized microfilm records.
One of the suggestions was to use some combination of human effort to read the documents – either into a program that would transcribe them, or to another human who would do the typing. It made me wonder what it would look like to make a place online where people who wanted to could volunteer their transcription time. In the case where the records are already digitized and viewable, this seems like an interesting approach.

Something like this already exists for the genealogy world over at the USGenWeb Archives Project. They maintain a long list of projects here. Though the interface is a bit confusing, the spirit of the effort is clear – many hands make light work. Precious genealogical resources can be digitized, transcribed and added to this archive to support the research of many by anyone – anywhere in the world.

Of course in the case of transcribing archival records there are challenges to be overcome. How do you validate what is transcribed? How do you provide guidance and training for people working from anywhere in the world? If I have figured out that a particular shape is a capital S in a specific set of documents, that could help me (or an OCR program) as I progress through the documents, but if I only see one page from a series – I will have to puzzle through that one page without the support of my past experience. Perhaps that would encourage people to keep helping with a specific set of records? Maybe you give people a few sample pages with validated transcriptions to practice with? And many records won’t be that hard to read – easy for a human’s eye but still a challenge for an OCR program.
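One common quality-control approach would be double-keying: have two volunteers transcribe the same page independently, then flag the lines where they disagree for review. Here is a minimal, hypothetical sketch of that comparison (the function name and sample text are mine, not from any existing project, and it assumes both volunteers produced the same number of lines):

```python
def diff_transcriptions(version_a: str, version_b: str):
    """Compare two independent transcriptions of the same page and
    return (line number, text A, text B) wherever they disagree."""
    a_lines = version_a.splitlines()
    b_lines = version_b.splitlines()
    disagreements = []
    # zip assumes both volunteers kept the page's line breaks;
    # a real system would also handle added/dropped lines.
    for i, (a, b) in enumerate(zip(a_lines, b_lines), start=1):
        if a.strip() != b.strip():
            disagreements.append((i, a, b))
    return disagreements

# Example: two volunteers read an old "long s" differently.
page_a = "James Smith, aged 34\nborn in Suffex County"
page_b = "James Smith, aged 34\nborn in Sussex County"
for line_no, a, b in diff_transcriptions(page_a, page_b):
    print(f"line {line_no}: {a!r} vs {b!r}")
```

Pages where the two versions agree could be accepted automatically; only the flagged lines would need an experienced reviewer.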

The optimist in me hopes that it could be a tempting task for those who want to volunteer but don’t have time to come in during the normal working day. Transcribing digitized records can be done in the middle of the night in your pajamas from anywhere in the world. Talk about increasing your pool of possible volunteers! I would think that it could even be an interesting project for high school and college students – a chance to work with primary sources. With careful design, I can even imagine providing an option to select from a preordained set of subjects or tags (or, in a folksonomy-friendly environment, the option to add any tags that the transcriber deems appropriate) – though that may be another topic worthy of its own exploration independent of transcription.

The initial investment for a project like this would come from building a framework to support a distributed group of volunteers. You would need an easy way to serve up a record or group of records to a volunteer and prevent duplication of effort – but this is an old problem with good solutions from the configuration management world of software development and other collaboration work environments.
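A minimal sketch of that "serve and lock" idea (every name here is hypothetical, not an existing tool): each volunteer checks out the next unclaimed page, the page is locked to them for a fixed window, and abandoned locks expire back into the pool so no effort is duplicated:

```python
import time

class TranscriptionQueue:
    """Hypothetical sketch: hand each volunteer the next unclaimed page
    and lock it to them, so two people never transcribe the same page."""

    def __init__(self, page_ids, lock_seconds=3600):
        self.pending = list(page_ids)
        self.checked_out = {}   # page_id -> (volunteer, lock deadline)
        self.completed = {}     # page_id -> transcription text
        self.lock_seconds = lock_seconds

    def check_out(self, volunteer):
        now = time.time()
        # Return abandoned pages (expired locks) to the pool first.
        for page, (_who, deadline) in list(self.checked_out.items()):
            if deadline < now:
                del self.checked_out[page]
                self.pending.append(page)
        if not self.pending:
            return None
        page = self.pending.pop(0)
        self.checked_out[page] = (volunteer, now + self.lock_seconds)
        return page

    def submit(self, volunteer, page, text):
        who, _deadline = self.checked_out.get(page, (None, None))
        if who != volunteer:
            raise ValueError("page is not checked out to this volunteer")
        del self.checked_out[page]
        self.completed[page] = text
```

This is essentially the check-out/check-in pattern from configuration management, applied to pages instead of source files.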

It makes a nice picture in my mind – a slow, but steady, team effort to transcribe collections like the Colorado River Bed Case (2,125 pages of digitized microfilm at the University of Utah’s J. Willard Marriott Library) – mostly done from people’s homes on their personal computers in the middle of the night. A central website for managing digitized archival transcriptions could give the research community the ability to vote on the next collection that warrants attention. Admit it – you would type a page or two yourself, wouldn’t you?

Records Speaking to the Present: Voices Not Silenced

When I composed my main essay for my application to the University of Maryland’s MLS program, I wrote about why I was drawn to their Archives Program. I told them that I revel in hearing the voices of the past speak through records, and that I love the power records can wield – especially when they can be accessed digitally from anywhere in the world. It is this sort of power that let me see the ship manifests and the names of the boats on which my grandparents came to this country (such as the Finland).

All this came rushing back to me while reading the September 18th article 2 siblings reunited after being separated in Holocaust. The grandsons of a Holocaust survivor looked up their grandmother in Yad Vashem’s central database of Shoah Victims’ Names – and found an entry stating that she had died during the Holocaust. One thing led to another – and two siblings that thought they had lost each other 65 years earlier were reunited.

The fact that access to records can bring people together across time speaks to me at a very primal level. So now you know – I am a romantic and an optimist (okay, if you have been reading my blog already – this shouldn’t come as any surprise). I want to believe that people who were separated long ago can be reunited – either through words or in person. This isn’t the first story like this – a quick search in Google News turned up others – such as this Holocaust reunion story from 2003.

This led me to do more research into how archival records are being used to find people lost during the Holocaust.

The Red Cross Holocaust Tracing Center has researched 28,000 individuals – and found over 1,000 of them alive since 1990. The FAQ on their website states that they believe there to be over 280,000 Holocaust survivors and family members in the United States alone and that they believe their work may continue for many years. As much as I love the idea of finding a way to provide access to digitized records – it is easy to see why the Tracing Center isn’t going away anytime soon. First of all – consider their main data sources – lots of private information that likely does NOT belong someplace where it can be read by just anyone:

While the American Red Cross has been providing tracing for victims of WWII and the Nazi regime since 1939, impetus for the creation of the center occurred in 1989 with the release of files on 130,000 people detained for forced labor and 46 death books containing 74,000 names from Auschwitz. Microfilm copies released to the International Committee of the Red Cross (ICRC) by the Soviet Union provided the single largest source of information since the end of WWII.

The staff of the center have also forged strong ties with the ICRC’s International Tracing Service in Arolsen, Germany – and get rapid turnaround times for their queries as a result. They have access to many organizations, archives and museums around the world in their hunt for evidence of what happened to individuals. They use all the records they can find to discover the answers to the questions they are asked – to be the detectives that families need to discover what happened to their loved ones. To answer the questions that have never been answered.

The USC Shoah Foundation Institute for Visual History and Education holds 52,000 testimonies of survivors and other witnesses to the Holocaust, collected in 56 countries and 32 languages from 1994 through 2000. These video testimonies document experiences before, during and after the Holocaust. It is the sort of first-hand documentation that just could not have existed without the vision and efforts of many. They say on their FAQ page:

Now that this unmatched archive has been amassed, the Shoah Foundation is engaged in a new and equally urgent mission: to overcome prejudice, intolerance, and bigotry – and the suffering they cause – through the educational use of the Foundation’s visual history testimonies… Currently, the Foundation is committed to making these videotaped testimonies accessible to the public as an international educational resource. Simultaneously, an intensive program of cataloguing and indexing the testimonies is underway. This process will eventually enable researchers and the general public to access information about specific people, places, and experiences mentioned in the testimonies in much the same way as an index permits a reader to find specific information in a book.

The testimonies also serve as a basis for a series of educational materials such as interactive web exhibits, documentary films, and classroom videos developed by the Shoah Foundation.

I guess I am not sure where I am going with this – other than to point out a dramatic array of archives that are touching the lives of people right now. Consider this post a fan letter to all the amazing people who have shepherded these collections (and in some cases their digital counterparts) into the twenty-first century where they will continue to help people hear the voices of their ancestors.

I have more ideas brewing on how these records compare and contrast with those about the survivors and those who were lost to 9/11, the Asian Tsunami, and Katrina. How do these types of records compare with the Asian Tsunami Web Archive or the Hurricane Digital Memory Bank? Where will the grandchildren of those who lost their homes to Katrina go in 30 years to find out what street the family home used to be on? Who will give witness to the people lost in Asia to the Tsunami? Lots to think about.

My New Daydream: A Hosting Service for Digitized Collections

In her post Predictions, Merrilee asked “Where do you predict that universities, libraries, archives, and museums will be irresistibly drawn to pooling their efforts?” after reading this article.

And I say: what if there were an organization that created a free (or inexpensive fee-based) framework for hosting collections of digitized materials? What I am imagining is a large group of institutions conspiring to no longer be in charge of designing, building, installing, upgrading and supporting the websites that are the vehicle for sharing digital historical or scholarly materials. I am coming at this from the archivist’s perspective (also having just pondered the need for something like this in my recent post: Promise to Put It All Online) – so I am imagining a central repository that would support the upload of digitized records, customizable metadata and a way to manage privacy and security.

The hurdles I imagine this dream solution removing are those that are roughly the same for all archival digitization projects. Lack of time, expertise and ongoing funding are huge challenges to getting a good website up and keeping it running – and that is even before you consider the effort required to digitize and map metadata to records or collections of records. It seems to me that if a central organization of some sort could build a service that everyone could use to publish their content – then the archivists and librarians and other amazing folks of all different titles could focus on the actual work of handling, digitizing and describing the records.

Being the optimist I am, I of course imagine this service as providing easy-to-use software with the flexibility for building custom DTDs for metadata, and security to protect those records that cannot (yet or ever) be available to the public. My background as a software developer drives me to imagine a dream team of talented analysts, designers and programmers building an elegant web-based solution that supports everything needed by the archival community. The architecture of deployment and support would be managed by highly skilled technology professionals who would guarantee uptime and redundant storage.

I think the biggest difference between this idea and the wikipedias of the world is that there would be some step required for an institution to ‘join’ such that they could use this service. The service wouldn’t control the content (in fact would need to be super careful about security and the like considering all the issues related to privacy and copyright) – rather it would provide the tools to support the work of others. While I know that some institutions would not be willing to let ‘control’ of their content out of their own IT department and their own hard drives, I think others would heave a huge sigh of relief.

There would still be a place for the Archons and the Archivists’ Toolkits of the world (and any and all other fabulous open-source tools people might be building to support archivists’ interactions with computers), but the manifestation of my dream would be the answer for those who want to digitize their archival collection and provide access easily without being forced to invent a new wheel along the way.

If you read my GIS daydreams post, then you won’t be surprised to know that I would want GIS incorporated from the start so that records could be tied into a single map of the world. The relationships among records related to the same geographic location could be found quickly and easily.

Somehow I feel a connection in these ideas to the work that the Internet Archive is doing with Archive-It. In that case, producers of websites want them archived. They don’t want to figure out how to make that happen. They don’t want to figure out how to make sure that they have enough copies in enough far flung locations with enough bandwidth to support access – they just want it to work. They would rather focus on creating the content they want Archive-It to keep safe and accessible. The first line on Archive-It’s website says it beautifully: “Internet Archive’s new subscription service, Archive-It, allows institutions to build, manage and search their own web archive through a user friendly web application, without requiring any technical expertise.”

So, the tag line for my new dream service would be “DigiCollection’s new subscription service, Digitize-It, allows institutions to upload, manage and search their own digitized collections through a user friendly web application, without requiring any technical expertise.”

GIS, Access, Archives and Daydreams

Today in my Information Structure class, our topic was Entity Relationship Modeling. While this is a technique that I have used frequently over the many years I have been designing Oracle databases, it was interesting to see a slightly different spin on the ideas. The second half of class was an exercise to take a stab (as a class) at coming up with a preliminary data model for a mythical genealogical database system.

While deciding if we should model PLACE as an entity, a woman in our class who is a genealogy specialist told us that only one database she has ever worked with tries to do any validation of location – but that it is virtually impossible due to the scale of the problem. Since the borders and names of places on earth have changed so rapidly over time, and often with little remaining documentation, it is hard to correlate place names from archival records with fixed locations on the planet. Anyone who has waded through the fabulous ship records on the Ellis Island website hunting for information about their grandparents or great-grandparents has struggled with trying to understand how the place names on those records relate to the physical world we live in.
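To make the modeling question concrete, here is a hypothetical sketch (my own, not the schema our class produced) of what promoting PLACE to an entity might look like: the place itself is a fixed point on the planet, while its names are a separate entity, each valid only for a span of years.

```python
import sqlite3

# Hypothetical ER sketch: PERSON, EVENT, and PLACE as entities, where a
# PLACE can carry several historical names, each tied to a range of years.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (
    place_id   INTEGER PRIMARY KEY,
    latitude   REAL,
    longitude  REAL
);
CREATE TABLE place_name (
    place_id   INTEGER REFERENCES place(place_id),
    name       TEXT NOT NULL,
    valid_from INTEGER,   -- first year the name was in use (NULL = unknown)
    valid_to   INTEGER    -- last year (NULL = still current)
);
CREATE TABLE person (
    person_id  INTEGER PRIMARY KEY,
    full_name  TEXT NOT NULL
);
CREATE TABLE event (
    event_id   INTEGER PRIMARY KEY,
    person_id  INTEGER REFERENCES person(person_id),
    place_id   INTEGER REFERENCES place(place_id),
    event_type TEXT,      -- e.g. 'birth', 'emigration'
    year       INTEGER
);
""")

# The same physical place under two successive names (Stettin became
# Szczecin in 1945).
conn.execute("INSERT INTO place VALUES (1, 53.43, 14.55)")
conn.executemany("INSERT INTO place_name VALUES (?, ?, ?, ?)",
                 [(1, "Stettin", None, 1945), (1, "Szczecin", 1945, None)])

# Which name would appear on a record from 1900?
row = conn.execute("""
    SELECT name FROM place_name
    WHERE place_id = 1
      AND (valid_from IS NULL OR valid_from <= 1900)
      AND (valid_to   IS NULL OR valid_to   >= 1900)
""").fetchone()
print(row[0])  # Stettin
```

A record bearing either name resolves to the same `place_id`, which is exactly the correlation the genealogy databases struggle with at scale.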

So – now to my daydream. Imagine if we could somehow work towards a consolidated GIS database that included place names and boundary information throughout history. Each GIS layer would relate to specific years or eras in time. Imagine if you could connect any set of archival records that contained location data to this GIS database and not only visualize the records via a map – but visualize the records with the ability to change the layers so you could see how the boundaries and place names changed. And view the relationship between records that have different place names on them from different eras – but are actually from the same location.
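One tiny sketch of the record-linking half of that daydream (the gazetteer and function here are hypothetical, and the year ranges are only approximate): resolve whatever name appears on a record to a canonical place, then list every name that place has carried over time.

```python
# Hypothetical historical gazetteer: name -> (canonical place id,
# (first year, last year) the name was in use; None = still current).
# Example: Oslo was Christiania, then Kristiania, until 1925.
HISTORICAL_GAZETTEER = {
    "Christiania": ("osl", (1624, 1877)),
    "Kristiania":  ("osl", (1877, 1925)),
    "Oslo":        ("osl", (1925, None)),
}

def names_for_same_place(name):
    """Given a place name from an archival record, return every name
    the same physical location has carried, so records from different
    eras can be related to each other."""
    entry = HISTORICAL_GAZETTEER.get(name)
    if entry is None:
        return []
    place_id, _years = entry
    return sorted(n for n, (pid, _) in HISTORICAL_GAZETTEER.items()
                  if pid == place_id)

print(names_for_same_place("Kristiania"))
# → ['Christiania', 'Kristiania', 'Oslo']
```

In the full daydream each canonical place would also carry geometry per era, so those related records could be drawn on a map whose boundaries shift as you change the time layer.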

I poked around to see what people are already doing – and found a number of existing projects along these lines.

I know it is a daydream – but I believe in my heart of hearts that it will exist someday as computing power increases, the price of storing data decreases and more data sources converge. I do foresee another issue related to the challenges presented by different versions of borders and place names from the same time period – but there are ways to address that too. It could happen – believe with me!