With a tag-line like ‘Stop Spam, Read Books’ – how can you not love reCAPTCHA? You might have already read about it on Boing Boing , NetworkWorld.com or digitizationblog – but I just couldn’t let it go by without talking about it.
Haven’t heard about reCAPTCHA yet? Ok.. have you ever filled out an online form that made you look at an image and type the letters or numbers that you see? These ‘verify you are a human’ sorts of challenges are used everywhere from on-line concert ticket purchase sites who don’t want scalpers to get too many of the tickets to blogs that are trying to prevent spam. What reCAPTCHA has done is harness this user effort to assist in the transcription of hard to OCR text from digitized books in the Internet Archive. Their website has a great explanation about what they are doing – and they include this great graphic below to show why human intervention is needed.
reCAPTCHA shows two words for each challenge – one that it knows the transcription of and a second that needs human verification. Slowly but surely all the words OCR doesn’t understand get transcribed and made available for indexing and search.
I have posted before about ideas for transcription using the power of many hands and eyes (see Archival Transcriptions: for the public, by the public) – but my ideas were more along the lines of what the genealogists are doing on sites like USGenWeb. It is so exciting to me that a version of this is out there – and I LOVE their take on it. Rather than find people who want to do transcription, they have taken an action lots of folks are already used to performing and given it more purpose. The statistics behind this are powerful. Apparently 60 million of these challenges are entered every DAY.
Want to try it? Leave a comment on this post (or any post in my blog) and you will get to see and use reCAPTCHA. I can also testify that the installation of this on a WordPress blog is well documented, fast and easy.
- Phoenix DVD destined for Mars
- THATCamp 2008: Text Mining and the Persian Carpet Effect
- Interesting Interface for exploring E-mail: TrampolineSystem SONAR Platform
- Archival Transcriptions: for the public, by the public