The THATCamp session officially titled ‘Crowdsourcing’ on the schedule was actually aimed at discussing the intersection of crowdsourced transcription and collaborative annotation. The group was small – just six of us – and Ben Brumfield got us started by giving an overview of transcription software and projects:
- The FamilySearch Indexing Project is an LDS church project put out by the FamilySearch Labs. Their goals: “Volunteers extract family history information from digital images of historical documents to create searchable indexes that assist everyone in finding their ancestors.”
- The Manuscript Transcription Assistant is based at Worcester Polytechnic Institute (WPI) and is described as “a tool to assist transcribers in creating transcriptions, and incorporate meta-data about each image and transcription that can then be used to search through an electronic library of transcriptions”. I found mention in the FAQ of the desire to create a community so that “transcribers will be able to collaborate their work by rating the quality of other user’s transcriptions. By ranking the transcriptions, specific versions of transcriptions will emerge as an authority for that manuscript.” Unfortunately, many of the links on that site are broken, and my attempt to register produced an error. It is not clear to me that this project is still active.
- Soldier Studies is a website dedicated to posting transcriptions of Civil War letters and diaries. This is not a tool for transcribing, but rather a repository specifically for transcriptions (see their Mission Statement for more information).
- Oh No Robot is a comics transcription and search tool. It provides a page to find comics needing transcription and a great page to explain how transcription works on their site.
After examining what was out there, Ben concluded that what he wanted didn’t exist – so he started to build it himself. He gave us a demo of his “very beta” software. His goal is to build a web-based tool to support collaborative manuscript transcription and annotation by individuals without a strong technical background. In its current (and private beta) state the software supports transcription, an innovative approach to linking individual words or phrases to collection-defined subjects, and some basic community tools to let his virtual team discuss transcription issues. Ben is working hard on the software – if you are interested in his project, definitely get in touch with him.
Travis Brown showed us his creation: eComma. eComma aims to “enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online”. He showed us how users can tag or comment on individual words or phrases of a loaded text. Take a look at the eComma page for Sonnet 18 by William Shakespeare. The words highlighted in blue are those which are tagged or have comments associated with them. If you highlight ‘the eye of heaven’ in line 5 you will see that it is tagged as a metaphor. Travis reported that he will have two other programmers working on eComma with him this summer and has his eye on improving some interface issues and adding a few more features.
We also talked about ways to display transcription. Elena Razlogova guided us over to the DoHistory website. There she showed us the Magic Lens interface, which displays the transcription of a handwritten diary page via a lens-style overlay that you can move with your mouse. This reminded me of the Gilder Lehrman Battle Lines: Letters from America’s Wars interface that I found when doing research for my Communicating Context in Online Collections poster. If you haven’t seen it before – go examine the page showing the transcription of (turn down your speaker if a reader’s voice will disturb those around you) Nathanael Greene’s letter to Catherine Greene dated July 17, 1778.
While on the DoHistory site I also found the Try Your Hand At Transcribing page. It shows the challenge of transcribing handwritten documents by giving you the chance to try it yourself, then lets you check your transcription with the click of a button.
We talked a bit about the technology behind eComma (forgive me Travis for not having enough details in my notes to explain your current architecture here) and the challenges inherent in wanting to annotate overlapping sets of words. Though he isn’t using it in the current implementation of eComma, Travis mentioned the Layered Markup Annotation Language (LMNL) which the tutorial page explains as:
…LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.
I can definitely see how LMNL might be an interesting framework for building transcription and annotation software.
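To make the overlapping-range problem concrete, here is a small standoff-annotation sketch of my own (the text, layer names, and labels are purely illustrative – this is not eComma’s or LMNL’s actual data model). The idea, as in LMNL, is that annotations reference character offsets in the text rather than nesting inside it, so two ranges can overlap freely in a way strictly nested markup cannot express:

```python
# Standoff annotation sketch: annotations point at character offsets
# in the source text instead of being embedded in it, so overlapping
# ranges are unproblematic.

text = "Shall I compare thee to a summer's day?"

# Each annotation: (start, end, layer, label). The 'layer' field echoes
# LMNL's idea of dividing markup into layers reflecting different views.
annotations = [
    (0, 23, "syntax", "clause"),     # "Shall I compare thee to"
    (16, 39, "rhetoric", "simile"),  # "thee to a summer's day?" -- overlaps the first
]

def spans(text, annotations):
    """Return each annotated substring with its layer and label."""
    return [(text[s:e], layer, label) for s, e, layer, label in annotations]

for snippet, layer, label in spans(text, annotations):
    print(f"[{layer}:{label}] {snippet!r}")
```

In nested XML-style markup the second range would have to break the first one apart; with standoff (or LMNL-style layered) ranges, both simply coexist.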
Krissy O’Hare brought up the challenges of transcribing audio and video that she has faced working on oral history projects at Concordia University. This led to Travis (I think?) mentioning the Texas German Dialect Project (TGDP) and the CMU Sphinx Group Speech Recognition Engine. TGDP has an online archive of recorded interviews along with their transcriptions and translations. CMU Sphinx’s introduction explains that their software tools are targeted at expert users wanting to build speech-using applications.
This was a great session. The small group gave everyone a chance to contribute and take over the keyboard in order to show off their favorite sites. It was immediately after the Text Mining session, so our minds were already full of all the great things one could do with text once it is transcribed.
I am excited to watch the evolution of group transcription and annotation software. If you know of other transcription or annotation tools or projects – please post them to the comments.
Image credit: Free pencils by zone41 via flickr
As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, overly simplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.
- Archival Transcriptions: for the public, by the public
- reCAPTCHA: crowdsourcing transcription comes to life
- THATCamp 2008: Text Mining and the Persian Carpet Effect
- SEO Evaluation of an Archival Website: Looking at UMBC’s Digital Collections