Archival Transcriptions: for the public, by the public

A recent thread on the archives listserv discusses transcriptions – specifically for small projects or those that have little financial support. One case even has no easy OCR answer because of the state of the digitized microfilm records.

One of the suggestions was to use some combination of human effort to read the documents aloud – either into a program that would transcribe them, or to another human who would do the typing. It made me wonder what it would look like to make a place online where people who wanted to could volunteer their transcription time. In the case where the records are already digitized and viewable, this seems like an interesting approach.

Something like this already exists for the genealogy world over at the USGenWeb Archives Project. They maintain a long list of different projects. Though the interface is a bit confusing, the spirit of the effort is clear – many hands make light work. Anyone, anywhere in the world, can digitize and transcribe precious genealogical resources and add them to this archive to support the research of many.

Of course, in the case of transcribing archival records there are challenges to be overcome. How do you validate what is transcribed? How do you provide guidance and training for people working from anywhere in the world? If I have figured out that a particular shape is a capital S in a specific set of documents, that could help me (or an OCR program) as I progress through the documents, but if I only see one page from a series – I will have to puzzle through that one page without the support of my past experience. Perhaps that would encourage people to keep helping with a specific set of records? Maybe you give people a few sample pages with validated transcriptions to practice with? And many records won’t be that hard to read – easy for the human eye but still a challenge for an OCR program.

The optimist in me hopes that it could be a tempting task for those who want to volunteer but don’t have time to come in during the normal working day. Transcribing digitized records can be done in the middle of the night in your pajamas from anywhere in the world. Talk about increasing your pool of possible volunteers! I would think that it could even be an interesting project for high school and college students – a chance to work with primary sources. With careful design, I can even imagine providing an option to select from a predefined set of subjects or tags (or, in a folksonomy-friendly environment, the option to add any tags that the transcriber deems appropriate) – though that may be another topic worthy of its own exploration independent of transcription.

The initial investment for a project like this would come from building a framework to support a distributed group of volunteers. You would need an easy way to serve up a record or group of records to a volunteer and prevent duplication of effort – but this is an old problem with good solutions from the configuration management world of software development and other collaboration work environments.
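
To make the "serve up a record and prevent duplication of effort" idea concrete, here is a minimal sketch of a checkout-with-lease scheme, loosely modeled on how version control systems lock files. Everything in it is an assumption made for illustration: the `PageQueue` class, its method names, the page identifiers and the one-hour lease are all hypothetical, and a real service would keep this state in a database rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PageQueue:
    """Hypothetical in-memory queue that gives each page to one volunteer at a time."""

    pages: List[str] = field(default_factory=list)
    lease_seconds: float = 3600.0  # assumed default: one hour per checkout
    # page id -> (volunteer, time at which the lease expires)
    leases: Dict[str, Tuple[str, float]] = field(default_factory=dict)
    # page id -> submitted transcription text
    done: Dict[str, str] = field(default_factory=dict)

    def checkout(self, volunteer: str, now: float) -> Optional[str]:
        """Return the next page nobody is working on, or None if there is none."""
        for page in self.pages:
            if page in self.done:
                continue  # already transcribed
            lease = self.leases.get(page)
            if lease is not None and lease[1] > now:
                continue  # still checked out to someone
            # Free, or the earlier lease expired: hand the page out (again).
            self.leases[page] = (volunteer, now + self.lease_seconds)
            return page
        return None

    def submit(self, volunteer: str, page: str, text: str, now: float) -> bool:
        """Accept a transcription only from the volunteer holding a live lease."""
        lease = self.leases.get(page)
        if lease is None or lease[0] != volunteer or lease[1] <= now:
            return False
        self.done[page] = text
        del self.leases[page]
        return True


queue = PageQueue(pages=["page-0001", "page-0002"], lease_seconds=60)
first = queue.checkout("alice", now=0)   # alice receives page-0001
second = queue.checkout("bob", now=10)   # bob receives page-0002, not a duplicate
```

The expiring lease is the design choice that matters here: a volunteer who wanders off in the middle of the night does not block a page forever, because it returns to the pool once the lease runs out.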

It makes a nice picture in my mind – a slow, but steady, team effort to transcribe collections like the Colorado River Bed Case (2,125 pages of digitized microfilm at the University of Utah’s J. Willard Marriott Library) – mostly done from people’s homes on their personal computers in the middle of the night. A central website for managing digitized archival transcriptions could give the research community the ability to vote on the next collection that warrants attention. Admit it – you would type a page or two yourself, wouldn’t you?
Posted on 12th October 2006
Under: access, digitization, historical research, outreach, preservation, search, transcription, what if

11 Responses to “Archival Transcriptions: for the public, by the public”

  1. Dorothea Says:

    Similar efforts include Project Gutenberg’s Distributed Proofreading and the Puerto Rico Census transcription efforts.

    As for accuracy, the usual method is double-keying (assigning the same page to at least two transcribers) and resolving errors afterward.

  2. Jeanne Says:

    Thanks! Double-keying makes a lot of sense. It could easily be automated such that exact matches between the two copies could be auto-published without further human intervention.

    The Distributed Proofreaders site looks very interesting – and I love their tag line “Preserving History One Page at a Time”.

  3. Paco Says:

    Very nice post (as the entire blog is, of course)!
    I think it could be very useful for European archives’ Early Modern manuscript collections, as well as for students of palaeography, who could work, as you said, with primary sources. Wikipaleography? Wikitranscription? In any case, a very good idea for building Finding Aids 2.0.

  4. Paleografía 2.0: reflexiones sobre posibles proyectos de transcripción de documentos de archivo « @rchivista Says:

    […] In a recent post by Jeanne Kramer-Smyth on her Spellbound Blog (highly recommended), the author reflected on the possibility of creating websites “where people who wanted to could volunteer their time for the transcription [of documents]”. That is, a kind of wiki or collaborative site where anyone can publish transcriptions of archival documents (or edit/correct those already published), citing (as I understand it) the holding archive and reference code and, if possible, linking to the corresponding descriptions of those documents and/or their digitized images, hosted in the online finding aids of the various archives. […]

  5. Jeanne Says:


    Glad you like the idea. I hope to post more on it when I have had time to think it through a bit more. I suspect that if the infrastructure were done well that the idea could support many applications. One of the more interesting problems is how to make it easy for those who submit their records for transcription to get the actual final work back into their local repository such that it is associated properly with the right original record and accessible via their local access interface.

    I wish I could read your blog as easily as you appear to read mine. I have been enjoying the international reach of my blog but I wish I could read a dozen languages (or that Babel Fish or Google Translate did a better job). Ah ha – there is a neat idea! RSS feeds that are automatically translated. When I add your blog to Bloglines, one of the configuration options should be what language to translate from and to. Does this already exist?

  6. - an interesting option for putting collections online - - ponderings of an archives student Says:

    […] Okay – remember those big dreams of mine? Specifically relating to A Hosting Service for Digitized Collections and Archival Transcriptions? Well looks like an interesting option to explore with these ideas in mind. […]

  7. reCAPTCHA: crowdsourcing transcription comes to life - - spellbound by archival science and information technology in the digital age Says:

    […] have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like […]

  8. LOC + Flickr equals Crowdsourced Tagging - - spellbound by archival science and information technology in the digital age Says:

    […] have posted before about the potential of crowdsourcing. I am in favor of it. Yes, all the tags won’t be perfect. Yes, there will be seven different […]

  9. THATCamp » Blog Archive » Crowdsourcing Transcriptions Says:

    […] to talk about is the idea of distributed document transcription as I explain it in my blog post: Archival Transcriptions: for the public, by the public.  While I do love what reCaptcha does at the word level and does with locations, […]

  10. Early Modern Notes » Interactive digital history Says:

    […] Archival transcriptions: for the public, by the public […]

  11. Crowdsourcing Manuscript Transcription « Document Says:

    […] generated by a Spellbound Blog post. Possibly related posts: (automatically generated)OutlineDigital Media and Learning CompetitionAn […]
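
The double-keying approach raised in the first two comments (give the same page to two transcribers, auto-publish exact matches, and resolve the rest by hand) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `normalize`, `reconcile` and `disagreements` are hypothetical, and treating whitespace differences as non-errors is a policy choice that a real project might not share.

```python
import difflib
import unicodedata
from typing import List, Optional, Tuple


def normalize(text: str) -> str:
    """Collapse whitespace and Unicode variants so trivial differences are ignored."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def reconcile(first: str, second: str) -> Optional[str]:
    """Return the agreed text if both keyings match, else None (needs human review)."""
    a, b = normalize(first), normalize(second)
    return a if a == b else None


def disagreements(first: str, second: str) -> List[Tuple[str, str]]:
    """List the word runs on which the two keyings differ, for a reviewer to resolve."""
    a, b = normalize(first).split(), normalize(second).split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


keying_one = "the river bed case"
keying_two = "the river bad case"
if reconcile(keying_one, keying_two) is None:
    # The two transcribers disagree somewhere; show only the conflicting words.
    print(disagreements(keying_one, keying_two))
```

Only pages where `reconcile` returns `None` would need a third pair of eyes, which keeps the human review load proportional to the number of genuinely hard passages.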
