How Not to invent the next-gen CAPTCHA

blahedo · on Oct 20, 2009

Yeah, the problem with most captchas is that they're visual and they're working in an area where computers are starting to get really good. The main thing that computers are still pretty bad at is understanding completely arbitrary text, so the way to go has to be in that direction (which also has the benefit of being handicap-accessible). Ask questions, in text, that have textual answers. Throw in a random-number generator so that the questions aren't just a memorisable list. And most important, make EVERY USER of the captcha system ("user" here meaning the owner of the blog or the site, not the visitor to the site) able to edit the question list. Any system that is uniform across all its users will become a target for spam-hackers to break. But if you can change the question? Maybe even just _rephrase_ the question? Way, way harder. People are still smarter than computers; we just have to give them the chance to actually do their thing.

I wrote a plugin years ago for MovableType as a proof of concept (which I still use on my own blog): http://www.blahedo.org/botblock/ Even a user that doesn't want to touch any of the code (even though it's pretty easy) can always edit "Add one to this number:" to "What number comes after this number:" or somesuch. To solve the "more humans on this end" problem, it seems like you have to let them modify the very questions themselves.
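To make the idea concrete, here is a minimal Python sketch of an owner-editable question list with a random number filled in. It is not the botblock plugin's code; the names `QUESTIONS`, `make_challenge`, and `check_answer` are hypothetical and exist only for illustration.

```python
import random

# Hypothetical question list that a site owner could edit or rephrase freely.
# Each entry pairs a prompt template with a function computing the answer.
QUESTIONS = [
    ("Add one to this number: {n}", lambda n: n + 1),
    ("What number comes after this number: {n}", lambda n: n + 1),
]


def make_challenge(rng=random):
    """Pick one question at random and fill in a fresh random number."""
    template, answer_fn = rng.choice(QUESTIONS)
    n = rng.randint(1, 1000)
    return template.format(n=n), str(answer_fn(n))


def check_answer(expected, submitted):
    """Compare the visitor's reply to the expected answer, ignoring whitespace."""
    return submitted.strip() == expected


if __name__ == "__main__":
    prompt, expected = make_challenge()
    print(prompt)
```

Rephrasing a question only means editing a string in the list, which is the kind of per-site variation described above.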

jerf · on Oct 20, 2009

I spent a couple of years in college working on a learning content management system that had a large bank of randomized questions. Believe me, that's perfectly defeatable. I have existence proofs.

The question "RAND(1,100) + RAND(1,100) = ?" may represent 10,000 distinct questions, but the effort to answer them is only marginally greater than the effort you spent in writing it. Basically every CAPTCHA approach based on "I'll just have a bank of X" (questions, images, etc.) will fail, because the spammers can classify faster than you can add to the set.
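As a rough illustration of how little effort that takes, here is a Python sketch of a solver for that one question family. The exact prompt format ("37 + 58 = ?") and the name `solve_addition` are assumptions made for the example, not something taken from a real system.

```python
import re

# Assumed prompt shape: two integers, a plus sign, then "= ?".
ADDITION = re.compile(r"^\s*(\d+)\s*\+\s*(\d+)\s*=\s*\?\s*$")


def solve_addition(prompt):
    """Return the sum if the prompt matches the assumed pattern, else None."""
    match = ADDITION.match(prompt)
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))


if __name__ == "__main__":
    print(solve_addition("37 + 58 = ?"))
```

One pattern covers all 10,000 number pairs, so randomizing the operands adds essentially nothing to the attacker's cost.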

Note that the "conventional" CAPTCHA, which has stood the test of time, doesn't have a "bank" of anything; it generates fresh stuff all the time. ReCAPTCHA has a bank, but it's structured to be way larger than any set of questions you will ever pull, and is also cleverly set up so that they still benefit a bit even if it is "broken".

blahedo · on Oct 20, 2009

Arguing about "RAND(1,100)..." is beside the point. First of all, adding two-digit numbers is _not_ something that's trivially easy for a lot of humans. But more importantly, it's not leveraging any level of textual natural language understanding. It's true that any spammer that decides to target your question will be able to write a rule for their rulebase that will defeat your captcha, but the whole thing that makes the spammer's task economical is that they spend zero or a very tiny amount of person-time per spammed site. This throws a spanner in the works.

I'm also not sure I'd say that the "conventional" variety has "stood the test of time". I still see sites using them, but many of them are now so hard for humans to make out that you have to make multiple tries. And that's if you're a human with good eyesight and full cognition. The audio captchas out there are loud and obnoxious and mostly incomprehensible.

sjs382 · on Oct 20, 2009

Completely off-topic but: really!? Comic Sans!?

blahedo · on Oct 20, 2009

Heh. The main style for the site was set up for my blog, which may be the one actual legitimate use of Comic Sans. Then I got lazy and didn't bother separately styling the rest. :)

swombat · on Oct 20, 2009

Interesting to note that the article Louis refers to has been pulled. I guess I'd pull that article too after it was pointed out how monumentally stupid the whole thing was...

dpcan · on Oct 20, 2009

The real solution might be to have Captcha technology built right into the browser that can physically detect whether or not a human is interfacing with the form.

docmach · on Oct 20, 2009

How would this work? Spammers aren't going to use a web browser with this feature, so it would only bother real people.

jm4 · on Oct 20, 2009

There are just as many holes. How does a browser detect whether a human is using it? Even if it could, how would the site be made aware of this? Does the browser send some special information with the request? If so, can't it be reproduced by some script?
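A toy Python model of that last question, under loud assumptions: the `X-Human-Verified` header and all three function names are invented for this sketch, and no real browser or server works this way as far as I know. It only shows that a server cannot tell who produced a client-supplied marker.

```python
def browser_request(form_data):
    """What a hypothetical 'human-detecting' browser might send."""
    return {"headers": {"X-Human-Verified": "1"}, "body": form_data}


def script_request(form_data):
    """What a spam script can send: the same marker, with no human involved."""
    return {"headers": {"X-Human-Verified": "1"}, "body": form_data}


def server_accepts(request):
    """The server sees only the request contents, not who produced them."""
    return request["headers"].get("X-Human-Verified") == "1"


if __name__ == "__main__":
    form = {"comment": "hello"}
    print(server_accepts(browser_request(form)))
    print(server_accepts(script_request(form)))
```

Both requests are identical from the server's side, so the check passes either way.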

tetsuo13 · on Oct 20, 2009

The browser can be scripted to mimic human behavior -- clicks, keyboard entry, mouse movements, etc. How do you determine that a human is behind the browser?