transcription | Spellbound Blog

Harnessing The Power of We: Transcription, Acquisition and Tagging

October 15, 2012 2 Comments

In honor of Blog Action Day 2012 and its theme of ‘The Power of We’, I would like to highlight a number of successful crowdsourced projects focused on transcription, acquisition and tagging of archival materials. Nothing I can think of embodies ‘the power of we’ more clearly than the work being done by many hands from across the Internet.

Transcription:

Old Weather Records: “Old Weather volunteers explore, mark, and transcribe historic ship’s logs from the 19th and early 20th centuries. We need your help because this task is impossible for computers, due to diverse and idiosyncratic handwriting that only human beings can read and understand effectively. By participating in Old Weather you’ll be helping advance research in multiple fields. Data about past weather and sea-ice conditions are vital for climate scientists, while historians value knowing about the course of a voyage and the events that transpired. Since many of these logs haven’t been examined since they were originally filled in by a mariner long ago you might even discover something surprising.”
FromThePage: “FromThePage is free software that allows volunteers to transcribe handwritten documents on-line.” A number of different projects are using this software, including the San Diego Museum of Natural History’s project to transcribe the field notes of herpetologist Laurence M. Klauber and Southwestern University’s project to transcribe the Mexican War Diary of Zenas Matthews.
National Archives Transcription: as part of the National Archives Citizen Archivist program, individuals have the opportunity to transcribe a variety of records. As described on the transcription home page: “letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files”.

Acquisition:

Archive Team: Archive Team describes itself as “a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage.” Here is an example of the information gathered, shared and collaborated on by Archive Team, focused on saving content from Friendster. The rescued data is (whenever possible) uploaded to the Internet Archive and can be found here:

Springing into action, Archive Team began mirroring Friendster accounts, downloading all relevant data and archiving it, focusing on the first 2-3 years of Friendster’s existence (for historical purposes and study) as well as samples scattered throughout the site’s history – in all, roughly 20 million of the 112 million accounts of Friendster were mirrored before the site rebooted.

Tagging:

National Archives Tagging: another part of the Citizen Archivist project encourages tagging of a variety of records, including images of the Titanic, architectural drawings of lighthouses and the Petition Against the Annexation of Hawaii from 1898.
Flickr Commons: throughout the Flickr Commons, archives and other cultural heritage institutions encourage tagging of images.

This is just a taste of the crowdsourced efforts currently being experimented with across the internet. Did I miss your favorite? Please add it below!

Blog Action Day 2009: IEDRO and Climate Change

October 16, 2009

In honor of Blog Action Day 2009’s theme of Climate Change, I am revisiting the subject of a post I wrote back in the summer of 2007: International Environmental Data Rescue Organization (IEDRO). This non-profit’s goal is to rescue and digitize at-risk weather and climate data from around the world. In the past two years, IEDRO has been hard at work. Their website has gotten a great face-lift, but even more exciting is to see how much progress they have made!

Weather balloon observations received from Lilongwe, Malawi (Africa) from 1968-1991: all the red on these charts represents data rescued by IEDRO — an increase from only 30% of the data available to over 90%.
Data rescue statistics from around the world

They do this work for many reasons – to improve understanding of weather patterns to prevent starvation and the spread of disease, to ensure that structures are built to properly withstand likely extremes of weather in the future and to help understand climate change. Since the theme for the day is climate change, I thought I would include a few excerpts from their detailed page on climate change:

“IEDRO’s mandate is to gather as much historic environmental data as possible and provide for its digitization so that researchers, educators and operational professionals can use those data to study climate change and global warming. We believe, as do most scientists, that the greater the amount of data available for study, the greater the accuracy of the final result.

If we do not fully understand the causes of climate change through a lack of detailed historic data evaluation, there is no opportunity for us to understand how humankind can either assist our environment to return to “normal” or at least mitigate its effects. Data is needed from every part of the globe to determine the extent of climate change on regional and local levels as well as globally. Without these data, we continue to guess at its causes in the dark and hope that adverse climate change will simply not happen.”

So, what does this data rescue look like? Take a quick tour through their process – from organizing papers and photographing each page to transcribing all the data and finally uploading it to NOAA’s central database. These data rescue efforts span the globe and take the dedicated effort of many volunteers along the way. If you would like to volunteer to help, take a look at the IEDRO listings on VolunteerMatch.

Sunshine Week 2009: Archives, Records and Other Online Government Information

March 18, 2009

Sunshine Week 2009 is a national initiative spearheaded by journalists to “open a dialogue about the importance of open government and freedom of information”. The Electronic Frontier Foundation (EFF) chose to mark Sunshine Week this year by announcing the release of their new tool for searching EFF’s FOIA documents. Learn more about EFF’s efforts to make open government a reality in this EFF call to action.

The Sunshine Week blog announced the release of a 2009 Survey Of State Government Information Online. The survey results explain:

Using a standardized worksheet surveyors rated each section on its usability, looking at factors such as whether the information was clearly linked, if full reports or only summaries were available, whether viewing and/or downloading was free, and whether the data were current. The categories for the survey were selected for generally serving the overall public good — the kind of information people need for their own health and well-being and that of the community.

See the worksheet for details on the categories selected for inclusion in the survey and the results for lots of interesting tidbits about exactly which states provide access (or not) to various public information online. A few very randomly selected highlights:

Maryland: Nursing home information, mhcc.maryland.gov/consumerinfo/nhguide, got high marks for facilitating online search and for allowing users to “compare data in a variety of ways.”
Iowa: The state auditor’s office reportedly offers online more than 5,000 full reports of all its audits dating back to 2001. The audits are easily accessible from tabs on the main Web page, www.auditor.iowa.gov.
Colorado: Bridge inspection reports in Colorado are considered public, but they are not published online. Anyone who wants to see the reports is advised to file an FOI request.

All of this made me recall my blog post about the parallel goals of journalists and archivists when considering digital public records and databases. I wanted to celebrate Sunshine Week by looking for other online sources of government information. My first stop was the website of the Council of State Archivists (CoSA). They had a couple of great resources including:

A 2007 status report on the state of State Records (and it looks like a new report should be out soon – their 2008 survey just closed at the end of January 2009)
Directory of State Archives and Records Programs
Details on their Local Government Project

A bit further afield we find GovernmentDocs.org advertised as a “community government document reviewer system”. On their about page we read:

With the GovernmentDocs.org system, citizen reviewers can engage in the government accountability process like never before. Registered users can review and comment on documents, adding their insights and expertise to the work of the national nonprofit organizations which are partnering on this project. This new information then becomes instantly searchable. The text of each document is searchable, as well, thanks to a powerful Optical Character Recognition (OCR) functionality.

GovernmentDocs.org adds a powerful layer to government transparency and accountability by indexing documents in a user-friendly manner that is remarkably easy to share. Every page of every document has its own unique url, allowing you and other users to link to that page on blogs, send emails about the documents to friends, and expose the information to a wider audience.

Here is an example GovernmentDocs page taken from a request submitted by CREW (Citizens for Responsibility and Ethics in Washington) regarding the Endangered Species Act. Each GovernmentDocs page has a unique URL, full text transcription of the page and supports comments and reviews. The possibility of building up a community around these records is very real. I am curious to see how many citizen reviewers and comments are associated with these documents a year from now.

Please help celebrate Sunshine Week by exploring all these amazing resources!

Library of Congress Inauguration 2009 Audio and Video Project

January 14, 2009

Amazing how much can change in 100 years. In March of 1909, the stereograph above shows African Americans driving the carriage that carried President and Mrs. Taft from the Capitol to lead the inauguration parade to the White House. On January 20th of 2009, Barack Obama will be the guest of honor. The American Folklife Center’s Inauguration 2009 Sermons and Orations Project aims to collect recordings, transcriptions and ephemera of speeches addressing the significance of the inauguration of Barack Obama as the first African American president.

It is expected that such sermons and orations will be delivered at churches, synagogues, mosques and other places of worship, as well as before humanist congregations and other secular gatherings. The American Folklife Center is seeking as wide a representation of orations as possible.

The Inauguration 2009 project is modeled after prior Library of Congress collection projects. Two great examples of these earlier projects are:

“Man-on-the-Street” Interviews Following the Attack on Pearl Harbor – features audio recordings of the reactions of more than 200 people to the Japanese attack on Pearl Harbor.
September 11, 2001, Documentary Project – includes 200 audio recordings collected between September 13, 2001 and February 13, 2002 in cities across the United States.

If you want to organize a local recording, here are the basics:

Recordings must be made between Friday, January 16th and Sunday, January 25th, 2009 and postmarked by February 27, 2009.
The project website provides the required Participant Release Form for speakers, photographers and those making the recordings.
The project is accepting audio recordings, video recordings, and written texts of sermons (see their detailed specifications page for information about accepted formats). Also accepted will be accompanying ephemera such as photographs and printed programs.
If you are sending materials to the Library of Congress, they encourage you to use FedEx, UPS, or DHL because of the danger of damage due to security screening done to USPS packages.

If you want to get a taste of other recordings held by the Library of Congress, you can spend some time browsing the fantastic list of Collections in the Archive of Folk Culture Containing Sermons and Orations provided on the project site.

So spread the word. Honor the Library of Congress’s goals by helping this collection include the perspectives of as many communities as possible. Your local religious or secular leader could have their point of view preserved as part of a snapshot of our country’s response to the Inauguration of 2009. While they hope for audio and video recordings, they are also accepting text transcriptions – so this doesn’t have to be a high tech endeavor. That said, perhaps this is the inspiration you have been waiting for to learn how to make an audio or video recording!

THATCamp 2008: Crowdsourced Transcription and Collaborative Annotation

June 5, 2008 13 Comments

The THATCamp session officially titled ‘Crowdsourcing’ on the schedule was actually aimed at discussing the intersection of crowdsourced transcription and collaborative annotation. The group was small – just six of us – and Ben Brumfield got us going by giving us an overview of transcription software and projects:

The FamilySearch Indexing Project is an LDS church project put out by the FamilySearch Labs. Their goals: “Volunteers extract family history information from digital images of historical documents to create searchable indexes that assist everyone in finding their ancestors.”
The Manuscript Transcription Assistant is based at Worcester Polytechnic Institute (WPI) and is described as “a tool to assist transcribers in creating transcriptions, and incorporate meta-data about each image and transcription that can then be used to search through an electronic library of transcriptions”. I found mention in the FAQ of the desire to create a community so that “transcribers will be able to collaborate their work by rating the quality of other user’s transcriptions. By ranking the transcriptions, specific versions of transcriptions will emerge as an authority for that manuscript.” Unfortunately, a lot of the links on that site are broken and my attempt to register gave me an error. It is not clear to me that this project is actually still active.
Soldier Studies is a website dedicated to posting transcriptions of Civil War letters and diaries. This is not a tool for transcribing, but is clearly a repository specifically targeting transcriptions (see their Mission Statement for more information).
Oh No Robot is a comics transcription and search tool. It provides a page to find comics needing transcription and a great page to explain how transcription works on their site.

After examining what was out there, Ben concluded that what he wanted didn’t exist – so he started to build it himself. He gave us a demo of his “very beta” software. His goal is to build a web based tool to support collaborative manuscript transcription and annotation by individuals without a strong technical background. In its current (and private beta) state the software supports transcription, an innovative approach to linking individual words or phrases to collection defined subjects and some basic community tools to let his virtual team discuss transcription issues. Ben is working hard on the software – if you are interested in his project, definitely get in touch with him.

Travis Brown showed us his creation: eComma. eComma aims to “enable groups of students, scholars, or general readers to build collaborative commentaries on a text and to search, display, and share those commentaries online”. He showed us how users could tag or add comments on individual words or phrases of a loaded text. Take a look at the eComma page for Sonnet 18 by William Shakespeare. The words highlighted in blue are those which are tagged or have comments associated with them. If you highlight ‘the eye of heaven’ in line 5 you will see that it is tagged as a metaphor. Travis reported that he will have 2 other programmers working on eComma with him this summer and has his eye on improving some interface issues and adding a few more features.

We also talked about ways to display transcription. Elena Razlogova guided us over to the DoHistory website. There she showed us the Magic Lens interface. This interface displays the transcription of a handwritten diary page via a lens style overlay that you can move with your mouse. This reminded me of the Gilder Lehrman Battle Lines: Letters from America’s Wars interface that I found when doing research for my Communicating Context in Online Collections Poster. If you haven’t seen it before – go examine the page showing the transcription of (turn down your speaker if a reader’s voice will disturb those around you) Nathanael Greene’s letter to Catherine Greene dated July 17, 1778.

While on the DoHistory site I also found the Try Your Hand At Transcribing page. This page shows the challenge of transcribing handwritten documents by giving you the chance to try it yourself and then lets you check your transcription with the click of a button.

We talked a bit about the technology behind eComma (forgive me Travis for not having enough details in my notes to explain your current architecture here) and the challenges inherent in wanting to annotate overlapping sets of words. Though he isn’t using it in the current implementation of eComma, Travis mentioned the Layered Markup Annotation Language (LMNL) which the tutorial page explains as:

…LMNL documents contain character data which is marked up using named and occasionally overlapping ranges. Ranges can have annotations, which can themselves be annotated and can have structured content. To support authoring, especially collaborative authoring, markup is namespaced and divided into layers, which might reflect different views on the text.

I can definitely see how LMNL might be an interesting framework for building transcription and annotation software.
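As a rough illustration of why overlapping ranges matter, here is a small Python sketch that stores annotations as character offsets alongside the text rather than nested inside it. This is not LMNL syntax and not how eComma is built; the Range class, the layer names and the helper functions are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    """A named span over the text, stored as character offsets [start, end)."""
    name: str
    start: int
    end: int
    layer: str = "default"
    annotations: dict = field(default_factory=dict)


def ranges_at(ranges, offset):
    """Return the names of every range covering the given character offset."""
    return [r.name for r in ranges if r.start <= offset < r.end]


def overlaps(a, b):
    """True when two ranges share characters but neither contains the other."""
    shares = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return shares and not (a_in_b or b_in_a)


text = "Sometime too hot the eye of heaven shines"
phrase = "the eye of heaven"
start = text.index(phrase)

ranges = [
    Range("line", 0, len(text), layer="structure"),
    Range("metaphor", start, start + len(phrase), layer="commentary",
          annotations={"tagged_by": "reader-1"}),
    # A second reader's comment that begins before the metaphor and ends inside it.
    Range("comment", text.index("too hot"), text.index("eye") + len("eye"),
          layer="commentary"),
]

# The word "eye" sits inside all three ranges at once.
print(ranges_at(ranges, text.index("eye")))
# The metaphor and the comment overlap without either one nesting in the other.
print(overlaps(ranges[1], ranges[2]))
```

Because the ranges live outside the text, two readers can mark up spans that cross each other, which a strictly nested markup like XML cannot express directly.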

Krissy O’Hare brought up the challenges of transcribing audio and video that she has faced working on oral history projects at Concordia University. This led to Travis (I think?) mentioning the Texas German Dialect Project (TGDP) and the CMU Sphinx Group Speech Recognition Engine. TGDP has an online archive of recorded interviews along with their transcriptions and translations. CMU Sphinx’s introduction explains that their software tools are targeted at expert users wanting to build speech-using applications.

This was a great session. The small group gave everyone a chance to contribute and take over the keyboard in order to show off their favorite sites. It was immediately after the Text Mining session, so our minds were already full of all the great things one could do with text once it is transcribed.

I am excited to watch the evolution of group transcription and annotation software. If you know of other transcription or annotation tools or projects – please post them to the comments.

Image credit: Free pencils by zone41 via flickr

As is the case with all my session summaries from THATCamp 2008, please accept my apologies in advance for any cases in which I misquote, oversimplify or miss points altogether in the post above. These sessions move fast and my main goal is to capture the core of the ideas presented and exchanged. Feel free to contact me about corrections to my summary either via comments on this post or via my contact form.

reCAPTCHA: crowdsourcing transcription comes to life

May 28, 2007 5 Comments

With a tag-line like ‘Stop Spam, Read Books’ – how can you not love reCAPTCHA? You might have already read about it on Boing Boing, NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.

Haven’t heard about reCAPTCHA yet? OK... have you ever filled out an online form that made you look at an image and type the letters or numbers that you see? These ‘verify you are a human’ sorts of challenges are used everywhere from on-line concert ticket purchase sites that don’t want scalpers to get too many of the tickets to blogs that are trying to prevent spam. What reCAPTCHA has done is harness this user effort to assist in the transcription of hard to OCR text from digitized books in the Internet Archive. Their website has a great explanation about what they are doing – and they include this great graphic below to show why human intervention is needed.

reCAPTCHA shows two words for each challenge – one that it knows the transcription of and a second that needs human verification. Slowly but surely all the words OCR doesn’t understand get transcribed and made available for indexing and search.
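To make the two-word idea concrete, here is a minimal Python sketch of how such a scheme could work. It is only an illustration of the description above, not reCAPTCHA's actual code: the function names, the image ids and the three-vote threshold are all assumptions of mine.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers we want before accepting a word.
VOTES_NEEDED = 3

# Votes collected so far for each unknown word image, keyed by an image id.
votes = defaultdict(Counter)


def submit(known_answer, typed_known, unknown_id, typed_unknown):
    """Check the control word; if it matches, count the other answer as a vote.

    Returns True when the person passes the challenge.
    """
    if typed_known.strip().lower() != known_answer.lower():
        return False  # failed the control word, so ignore the second answer
    votes[unknown_id][typed_unknown.strip().lower()] += 1
    return True


def accepted_transcription(unknown_id):
    """Return the leading answer once it has enough votes, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= VOTES_NEEDED else None
```

The point of the design is that the known word does double duty: it keeps the spam bots out, and it gives a reason to trust what the same person typed for the word nobody has transcribed yet.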

I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.

Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.

Archival Transcriptions: for the public, by the public

October 12, 2006 11 Comments

There is a recent thread on the archives listserv that talks about transcriptions – specifically for small projects or those that have little financial support. There is even a case in which there is no easy OCR answer due to the state of the digitized microfilm records.

One of the suggestions was to use some combination of human effort to read the documents – either into a program that would transcribe them, or to another human who would do the typing. It made me wonder what it would look like to make a place online where people who wanted to could volunteer their transcription time. In the case where the records are already digitized and viewable, this seems like an interesting approach.

Something like this already exists for the genealogy world over at the USGenWeb Archives Project. They have a long list of different projects listed here. Though the interface is a bit confusing, the spirit of the effort is clear – many hands make light work. Precious genealogical resources can be digitized, transcribed and added to this archive to support the research of many by anyone – anywhere in the world.

Of course in the case of transcribing archival records there are challenges to be overcome. How do you validate what is transcribed? How do you provide guidance and training for people working from anywhere in the world? If I have figured out that a particular shape is a capital S in a specific set of documents, that could help me (or an OCR program) as I progress through the documents, but if I only see one page from a series – I will have to puzzle through that one page without the support of my past experience. Perhaps that would encourage people to keep helping with a specific set of records? Maybe you give people a few sample pages with validated transcriptions to practice with? And many records won’t be that hard to read – easy for a human’s eye but still a challenge for an OCR program.
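One possible answer to the validation question is to have more than one volunteer transcribe the same page and only trust the result when they agree. The Python sketch below shows that idea using the standard library's difflib. The function names and the 0.95 threshold are assumptions for illustration, not a tested recommendation.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Collapse whitespace and case so trivial differences are not counted."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity score between 0.0 and 1.0 for two transcriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def needs_review(transcriptions, threshold=0.95):
    """Flag a page when any two independent transcriptions disagree too much.

    The 0.95 threshold is an arbitrary placeholder, not a recommended value.
    """
    if len(transcriptions) < 2:
        return True  # a single transcription has nothing to be checked against
    return any(similarity(a, b) < threshold
               for a, b in combinations(transcriptions, 2))
```

The same similarity function could score a new volunteer's attempt at a practice page against its validated transcription.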

The optimist in me hopes that it could be a tempting task for those who want to volunteer but don’t have time to come in during the normal working day. Transcribing digitized records can be done in the middle of the night in your pajamas from anywhere in the world. Talk about increasing your pool of possible volunteers! I would think that it could even be an interesting project for high school and college students – a chance to work with primary sources. With careful design, I can even imagine providing an option to select from a preordained set of subjects or tags (or in a folksonomy-friendly environment, the option to add any tags that the transcriber deems appropriate) – though that may be another topic worthy of its own exploration independent of transcription.

The initial investment for a project like this would come from building a framework to support a distributed group of volunteers. You would need an easy way to serve up a record or group of records to a volunteer and prevent duplication of effort – but this is an old problem with good solutions from the configuration management world of software development and other collaboration work environments.
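Here is a minimal Python sketch of the "check out a page" idea borrowed from that world. Everything in it is hypothetical, including the PageQueue class and the one-hour lease, and it keeps its state in memory where a real site would use a database.

```python
import time


class PageQueue:
    """Hands out pages one at a time so two volunteers never get the same page."""

    def __init__(self, page_ids, lease_seconds=3600):
        self.lease_seconds = lease_seconds
        self.available = list(page_ids)   # pages nobody is working on
        self.checked_out = {}             # page id -> (volunteer, lease expiry)
        self.done = {}                    # page id -> submitted transcription

    def _reclaim_expired(self, now):
        """Return abandoned pages to the pool once their lease runs out."""
        for page, (_, expires) in list(self.checked_out.items()):
            if expires <= now:
                del self.checked_out[page]
                self.available.append(page)

    def checkout(self, volunteer, now=None):
        """Give the volunteer the next free page, or None if none are left."""
        now = time.time() if now is None else now
        self._reclaim_expired(now)
        if not self.available:
            return None
        page = self.available.pop(0)
        self.checked_out[page] = (volunteer, now + self.lease_seconds)
        return page

    def submit(self, volunteer, page, text):
        """Accept a transcription only from whoever currently holds the page."""
        holder = self.checked_out.get(page)
        if holder is None or holder[0] != volunteer:
            return False
        del self.checked_out[page]
        self.done[page] = text
        return True
```

The lease is there because volunteers wander off: a page someone opened at 2am and never finished goes back into the pool instead of staying locked forever.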

It makes a nice picture in my mind – a slow, but steady, team effort to transcribe collections like the Colorado River Bed Case (2,125 pages of digitized microfilm at the University of Utah’s J. Willard Marriott Library) – mostly done from people’s homes on their personal computers in the middle of the night. A central website for managing digitized archival transcriptions could give the research community the ability to vote on the next collection that warrants attention. Admit it – you would type a page or two yourself, wouldn’t you?

Category: transcription