reCAPTCHA: crowdsourcing transcription comes to life

With a tagline like ‘Stop Spam, Read Books’, how can you not love reCAPTCHA? You might have already read about it on Boing Boing, NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.

Haven’t heard about reCAPTCHA yet? OK, have you ever filled out an online form that made you look at an image and type the letters or numbers you see? These ‘verify you are a human’ challenges are used everywhere, from online concert ticket sites that don’t want scalpers to grab too many tickets to blogs that are trying to prevent spam. reCAPTCHA harnesses this user effort to help transcribe hard-to-OCR text from digitized books in the Internet Archive. Their website has a great explanation of what they are doing – and they include the graphic below to show why human intervention is needed.

[Image: Why we need reCAPTCHA]

reCAPTCHA shows two words for each challenge – one whose transcription it already knows and a second that needs human verification. Slowly but surely, all the words OCR doesn’t understand get transcribed and made available for indexing and search.
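For the curious, here is a rough sketch in Python of how a two-word challenge like this could be scored. It is only my illustration of the idea described above, not reCAPTCHA's actual code: the function names, the `VOTES_NEEDED` threshold and the case-insensitive check on the known word are all assumptions of mine.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching human answers we want before
# accepting a transcription. The real service's rule is not stated in this post.
VOTES_NEEDED = 3

# unknown word id -> tally of the transcriptions people have proposed for it
votes = defaultdict(Counter)


def submit(control_answer, control_guess, unknown_id, unknown_guess):
    """Check the known (control) word; if it passes, record a vote for the unknown word."""
    # Assumption: the control word is compared case-insensitively.
    if control_guess.strip().lower() != control_answer.lower():
        return False  # failed the human test, so the second answer is ignored
    votes[unknown_id][unknown_guess.strip()] += 1
    return True


def accepted(unknown_id):
    """Return the agreed transcription, or None if there is no consensus yet."""
    tally = votes.get(unknown_id)
    if not tally:
        return None
    text, count = tally.most_common(1)[0]
    return text if count >= VOTES_NEEDED else None
```

The point of the design is that the known word does the spam-stopping, while the second word collects independent answers from many people until enough of them agree.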

I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.

Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.


Posted on 28th May 2007
Under: access, digitization, open source, transcription | 11 Comments

11 Responses to “reCAPTCHA: crowdsourcing transcription comes to life”

  1. Stephen Francoeur Says:

    This is so cool! I had heard about this last week, but I’m excited to be trying it out. One question: since we’re supposed to be helping with the OCR of scanned texts, do we need to worry about being case-sensitive? One of my two words below is capitalized; I’ve typed it in with the capital, but it’s not clear to me if that is required.

  2. Jeanne Says:

    Stephen,

    That is a great question – and I do not know the answer. My hunch is that it is case-sensitive, because the text results of OCR definitely include capital letters – but I will post any definitive answer I can locate after some research.

    Jeanne

  3. Sarah Says:

    Hi Jeanne,

    Thanks for cross-posting about this great use of human power. I’ve been enjoying your blog for only a few weeks now, but I am impressed and excited about many of the projects that you have mentioned (both your own and others’).

  4. Jeanne Says:

    Glad you are enjoying the blog. Positive feedback is a wonderful thing — thank you!

  5. Earthceuticals Says:

    This captcha is a good thing, but honestly I have failed some of these tests a couple of times. Maybe my vision isn’t what it used to be. I think having words and font faces that we know cause OCR errors *is* the way to go here. At least we can more easily interpret the captchas.

  6. Crowdish Says:

    kinda frustrating as well when you are trying to buy tickets at 10:01 am for a concert 😉

  7. THATCamp » Blog Archive » Crowdsourcing Transcriptions Says:

    […] it in my blog post: Archival Transcriptions: for the public, by the public.  While I do love what reCaptcha does at the word level and Footnote.com does with locations, names and dates – I still think there […]

  8. Lauren G. Says:

    I just found your blog, and from this first post I can’t wait to catch up on the rest of it. As for crowdsourcing, I’m a little scared of it. Seems like an emotional response, tho. Hackers? Goof-offs? Or maybe I’m a stick in the mud. Maybe it just seems too good to be true, too obvious. I’m posting so I can try it out.
    Thanks, and looking forward to more…
    Lauren

  9. JrzyGirl Says:

    This is a wonderful example of the power of the Interwebs – lots of people giving a little each to combine into a greater good. I work for the feds, so I can’t put it on our website, but I can go to the reCAPTCHA website for a few minutes each day and help out. I don’t know where I’ve been for the last year, but thanks for catching me up. – Jill.

  10. Crowded Says:

    I struggle with the captchas sometimes, as well.

  11. Sean Rock Says:

    Yeah, I agree. They seem to be rather tedious and take a lot of time to get right.
