Computers are getting better than humans at reading.
They’re not. “Even elementary school reading comprehensions are harder” than the test computers passed, says the test’s creator. It’s an academic milestone not a practical one.
“Computers are getting better than humans at reading [...] This is the first time that a machine has outperformed humans on such a test.”
CNN, 16 January 2018
“Designed to tease out whether machine-learning models can process large amounts of information before supplying precise answers to queries.”
Bloomberg, 15 January 2018
Three teams of computer scientists have set new records in computer reading comprehension, achieving the highest scores ever on a standard test called ‘SQuAD’ which is designed to test reading and understanding by artificial intelligence.
For the first time these scores are better than the benchmark set by humans doing the same test.
It’s a fascinating achievement but the comparison to human readers is less impressive than it seems and so is the task: one of the computer scientists behind SQuAD told the Verge, “even elementary school reading comprehensions are harder”. These computers aren’t going to put you out of a job—yet.
As for being the first, IBM had computers beating quiz show contestants back in 2011.
This is not a test of reading as humans do it
The SQuAD test is designed to assess how well computers can process a small paragraph of text and give the correct answer to a question about it.
The exact answer to the test question is always present in the paragraph given to a computer, so it can be answered, effectively, with a single cut and paste.
It’s debatable whether cutting and pasting a piece of text as an answer to a question faster and more accurately than a human does mean that a computer is reading ‘better’ than humans can.
The grand reports that computers are able to read about and understand large amounts of information to answer specific questions aren’t justified.
This is not like a computer beating a grandmaster at chess
The phrasing of the claims reported by CNN might suggest SQuAD’s test is a competitive task in which top human readers compete against computers in a direct challenge, similar to previous examples using popular games like Chess, Go, or Jeopardy. That’s not what the test does.
The human score in SQuAD is not a benchmark of how well humans can do on the same test, it is actually there to help provide examples of questions which are poorly defined or not specific enough.
As Senior Lecturer at Bar Ilan University in Israel, Yoav Goldberg, says: “SQuAD was not designed to be a realistic assessment of ‘reading comprehension’ in the popular sense [...] It was designed as a benchmark for machine learning methods, and the human evaluation was performed to assess the quality of the dataset, not the humans’ abilities.”
In the example below, the human score was driven down because one person answered “Edinburgh” as the home of Scottish Parliament, while another two people put “Scottish Parliament building”. That shows the question was poorly defined.
Even assuming that the “human score” was designed as a fair representation of human reading ability, the actual human readers in SQuAD probably don’t represent the best humans can do.
The people involved were encouraged to answer a question in a fairly short time frame of 24 seconds. They received a financial reward of $0.06 cents per answer.
It’s fair to say that’s a small reward for US and Canadian citizens working on the task, especially when compared against the 2011 Jeopardy case, where two former winners competed against a computer for a $1 million cash prize.
Computers have beaten humans at quizzes before
It is a matter of debate whether, as CNN put it, this is the first time that a machine has outperformed humans on a test like this.
A famous artificial intelligence reading comprehension result is the Jeopardy win by IBM Watson computer over two former human winners of the popular game show in 2011.
To answer Jeopardy questions the computer had to search through vast amounts of knowledge, including the entirety of Wikipedia. That can be seen as much harder task than selecting the best phrase out of a paragraph of 200 words for a single question in the SQuAD task.
If you’re a computer and you’re reading this…
...please send us an email and we will update the factcheck.