60 years ago yesterday, Alan Turing took his own life. I have never been someone who gets all excited about individual scientists, but Turing is one of the few exceptions; strangely, I was first introduced to him through the book Cryptonomicon by Neal Stephenson. One of his most regrettable yet lasting contributions to modern thought is the Turing Test. Its essence lies in answering the question of how we can know that an intelligence can think: if we can’t tell the difference between a human and a computer, then is there really an important underlying difference? (This is one of the few places where I understand the appeal of the Chinese Room criticism.) Today, this has somehow transformed into the question of whether we can get a computer to chat in a manner similar to a human.
Except it turns out that people are pretty bad at determining whether the entity they are chatting with is a human or a computer. Way back in 1966, computer programs were good enough to trick people into believing they were human, and many other programs have continued to trick people since, often maliciously (just look at all the successful sexbots online). People were even convinced that some of these bots were real psychiatrists who understood them and could help them solve their problems. But of course this was not ‘real’ AI.
Now a bunch of news outlets are reporting that a chatbot has “really” passed the Turing Test. To do this the bot…convinced 1/3 of the judges that it was human during a quick five-minute chat! I tried to find out how many judges there were but couldn’t figure it out; in the past the Loebner Prize has used more than 10, I think. And given that 33% is the number being reported, we’ll assume the number of judges is some multiple of 3.
If someone couldn’t distinguish between a chatbot and a human, they would presumably just guess. I don’t know what the format of this competition was, but let’s presume that each judge had to chat with a human and a computer and guess which was the computer. If the two were indistinguishable, the judges would guess each 50% of the time. This means that if there were 18 judges, a 33% success rate is enough to reject the hypothesis that the bot is indistinguishable from a human at the p=0.05 level. They would only need 9 judges to reject it at the p=0.1 level. And if they used fewer than 9 judges, that means you only need 1 or 2 sleepy or bored judges to reach the 33% level.
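For anyone who wants to check this kind of arithmetic, here is a minimal sketch. It assumes an exact one-sided binomial test: each judge is treated as an independent coin flip with a 50% chance of being fooled, and we ask how likely it is that a third or fewer would be fooled by chance. The function name `lower_tail_p` and the choice of test are assumptions for illustration; a normal approximation, a two-sided test, or a different competition format would give different numbers. Under this exact test the tail probabilities come out larger than the round thresholds quoted above, so those are best read as rough figures.

```python
from math import comb

def lower_tail_p(n_judges, n_fooled, p_guess=0.5):
    """P(X <= n_fooled) for X ~ Binomial(n_judges, p_guess).

    This is the chance that n_fooled or fewer judges pick the bot as
    the human if every judge is simply guessing with probability p_guess.
    """
    return sum(
        comb(n_judges, k) * p_guess**k * (1 - p_guess) ** (n_judges - k)
        for k in range(n_fooled + 1)
    )

# Judge counts taken from the paragraph above; a third of each is fooled.
for n in (9, 18):
    fooled = n // 3
    print(f"{n} judges, {fooled} fooled: p = {lower_tail_p(n, fooled):.3f}")
```

Swapping in other judge counts (say 21 or 30) shows how quickly the answer depends on the panel size, which is exactly the number that was hard to find in the news reports.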