reCAPTCHA: crowdsourcing transcription comes to life

May 28, 2007 5 Comments

With a tagline like ‘Stop Spam, Read Books’ – how can you not love reCAPTCHA? You might have already read about it on Boing Boing, NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.

Haven’t heard about reCAPTCHA yet? OK, have you ever filled out an online form that made you look at an image and type the letters or numbers that you see? These ‘verify you are a human’ challenges are used everywhere, from online concert ticket sites that don’t want scalpers to get too many of the tickets to blogs that are trying to prevent spam. What reCAPTCHA has done is harness this user effort to assist in the transcription of hard-to-OCR text from digitized books in the Internet Archive. Their website has a great explanation of what they are doing – and they include a great graphic to show why human intervention is needed.

reCAPTCHA shows two words for each challenge – one whose transcription it already knows and a second that needs human verification. Slowly but surely, all the words OCR doesn’t understand get transcribed and made available for indexing and search.
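
To make the two-word idea concrete, here is a minimal Python sketch of how such a challenge could work. This is my own toy model, not reCAPTCHA's actual code: the class name, the method names, the exact-match comparison and the `min_votes` agreement threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class TwoWordChallenge:
    """Toy model of a two-word challenge: one known word, one unknown word."""

    def __init__(self, min_votes=3):
        # min_votes is an assumed threshold, not a documented reCAPTCHA value.
        self.min_votes = min_votes
        # Maps an unknown word's id to a tally of what people typed for it.
        self.votes = defaultdict(Counter)

    def submit(self, known_answer, typed_known, unknown_id, typed_unknown):
        """Return True if the user passes; record their guess for the unknown word."""
        # Assumption: exact, case-sensitive match on the known word.
        if typed_known.strip() != known_answer:
            return False  # failed the known word, so the other answer is not trusted
        self.votes[unknown_id][typed_unknown.strip()] += 1
        return True

    def transcription(self, unknown_id):
        """Return the agreed transcription, or None if agreement is not yet reached."""
        counts = self.votes[unknown_id]
        if not counts:
            return None
        answer, n = counts.most_common(1)[0]
        return answer if n >= self.min_votes else None
```

In this sketch, the known word is what proves you are human, and only people who get it right contribute a vote toward the unknown word. Once enough of them agree, `transcription()` returns the agreed text.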

I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.

Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.

Posted in access, digitization, open source, transcription

5 Comments

Stephen Francoeur
June 6, 2007 at 6:57 am

This is so cool! I had heard about this last week, but I’m excited to be trying it out. One question: since we’re supposed to be helping with the OCR of scanned texts, do we need to worry about being case-sensitive? One of my two words below is capitalized; I’ve typed it in with the capital, but it’s not clear to me if that is required.
Jeanne Post author
June 7, 2007 at 10:57 am

Stephen,

That is a great question – and I do not know the answer. My hunch is that it is case-sensitive, because the text results of OCR definitely include capital letters – but I will post any definitive answer I can locate after some research.

Jeanne
Pingback: THATCamp » Blog Archive » Crowdsourcing Transcriptions
Lauren G.
August 25, 2008 at 6:07 am

I just found your blog, and from this first post I can’t wait to catch up on the rest of it. As for crowdsourcing, I’m a little scared of it. Seems like an emotional response, tho. Hackers? Goof-offs? Or maybe I’m a stick in the mud. Maybe it just seems too good to be true, too obvious. I’m posting so I can try it out.
Thanks, and looking forward to more…
Lauren
JrzyGirl
August 26, 2008 at 1:23 pm

This is a wonderful example of the power of the Interwebs – lots of people giving a little each to combine into a greater good. I work for the feds, so I can’t put it on our website, but I can go to the reCAPTCHA website for a few minutes each day and help out. I don’t know where I’ve been for the last year, but thanks for catching me up. – Jill.
