The THATCamp session officially titled ‘Crowdsourcing’ on the schedule was actually aimed at discussing the intersection of crowdsourced transcription and collaborative annotation. The group was small – just six of us – and Ben Brumfield got us started by giving an overview of transcription software and projects:
- The FamilySearch Indexing Project is an LDS church project put out by the FamilySearch Labs. Their goals: “Volunteers extract family history information from digital images of historical documents to create searchable indexes that assist everyone in finding their ancestors.”
- The Manuscript Transcription Assistant is based at Worcester Polytechnic Institute (WPI) and is described as “a tool to assist transcribers in creating transcriptions, and incorporate meta-data about each image and transcription that can then be used to search through an electronic library of transcriptions”. I found mention in the FAQ of the desire to create a community so that “transcribers will be able to collaborate their work by rating the quality of other user’s transcriptions. By ranking the transcriptions, specific versions of transcriptions will emerge as an authority for that manuscript.” Unfortunately, many of the links on that site are broken, and my attempt to register produced an error. It is not clear to me that this project is still active.
- Soldier Studies is a website dedicated to posting transcriptions of Civil War letters and diaries. This is not a tool for transcribing, but rather a repository specifically for transcriptions (see their Mission Statement for more information).
- Oh No Robot is a comics transcription and search tool. It provides a page to find comics needing transcription and a great page to explain how transcription works on their site.
After examining what was out there, Ben concluded that what he wanted didn’t exist – so he started to build it himself. He gave us a demo of his “very beta” software. His goal is to build a web-based tool to support collaborative manuscript transcription and annotation by individuals without a strong technical background. In its current (and private beta) state the software supports transcription, an innovative approach to linking individual words or phrases to collection-defined subjects, and some basic community tools to let his virtual team discuss transcription issues. Ben is working hard on the software – if you are interested in his project, definitely get in touch with him.
Travis Brown showed us his creation: eComma. eComma aims to “enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online”. He showed us how users can tag or comment on individual words or phrases of a loaded text. Take a look at the eComma page for Sonnet 18 by William Shakespeare. The words highlighted in blue are those which are tagged or have comments associated with them. If you highlight ‘the eye of heaven’ in line 5 you will see that it is tagged as a metaphor. Travis reported that he will have two other programmers working on eComma with him this summer and has his eye on improving some interface issues and adding a few more features.
We also talked about ways to display transcription. Elena Razlogova guided us over to the DoHistory website. There she showed us the Magic Lens interface, which displays the transcription of a handwritten diary page via a lens-style overlay that you can move with your mouse. This reminded me of the Gilder Lehrman Battle Lines: Letters from America’s Wars interface that I found when doing research for my Communicating Context in Online Collections poster. If you haven’t seen it before – go examine the page showing the transcription of (turn down your speaker if a reader’s voice will disturb those around you) Nathanael Greene’s letter to Catherine Greene dated July 17, 1778.
While on the DoHistory site I also found the Try Your Hand At Transcribing page. It shows the challenge of transcribing handwritten documents by giving you the chance to try it yourself, then lets you check your transcription with the click of a button.
We talked a bit about the technology behind eComma (forgive me Travis for not having enough details in my notes to explain your current architecture here) and the challenges inherent in wanting to annotate overlapping sets of words. Though he isn’t using it in the current implementation of eComma, Travis mentioned the Layered Markup Annotation Language (LMNL) which the tutorial page explains as:
…LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.
I can definitely see how LMNL might be an interesting framework for building transcription and annotation software.
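To make the overlapping-range problem concrete, here is a small standoff-annotation sketch of my own (the text, layer names, and labels are purely illustrative – this is not eComma’s or LMNL’s actual data model). The idea, as in LMNL, is that annotations reference character offsets in the text rather than nesting inside it, so two ranges can overlap freely in a way strictly nested markup cannot express:

```python
# Standoff annotation sketch: annotations point at character offsets
# in the source text instead of being embedded in it, so overlapping
# ranges are unproblematic.

text = "Shall I compare thee to a summer's day?"

# Each annotation: (start, end, layer, label). The 'layer' field echoes
# LMNL's idea of dividing markup into layers reflecting different views.
annotations = [
    (0, 23, "syntax", "clause"),     # "Shall I compare thee to"
    (16, 39, "rhetoric", "simile"),  # "thee to a summer's day?" -- overlaps the first
]

def spans(text, annotations):
    """Return each annotated substring with its layer and label."""
    return [(text[s:e], layer, label) for s, e, layer, label in annotations]

for snippet, layer, label in spans(text, annotations):
    print(f"[{layer}:{label}] {snippet!r}")
```

In nested XML-style markup the second range would have to break the first one apart; with standoff (or LMNL-style layered) ranges, both simply coexist.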
Krissy O’Hare brought up the challenges of transcribing audio and video that she has faced working on oral history projects at Concordia University. This led to Travis (I think?) mentioning the Texas German Dialect Project (TGDP) and the CMU Sphinx Group Speech Recognition Engine. TGDP has an online archive of recorded interviews along with their transcriptions and translations. CMU Sphinx’s introduction explains that their software tools are targeted at expert users wanting to build speech-using applications.
This was a great session. The small group gave everyone a chance to contribute and take over the keyboard in order to show off their favorite sites. It was immediately after the Text Mining session, so our minds were already full of all the great things one could do with text once it is transcribed.
I am excited to watch the evolution of group transcription and annotation software. If you know of other transcription or annotation tools or projects – please post them to the comments.
Image credit: Free pencils by zone41 via flickr
As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.
- Archival Transcriptions: for the public, by the public
- reCAPTCHA: crowdsourcing transcription comes to life
- THATCamp 2008: Text Mining and the Persian Carpet Effect
- SEO Evaluation of an Archival Website: Looking at UMBC’s Digital Collections