An interesting new paper from scientists at the AI research nonprofit LAION has found that even the most sophisticated large language models (LLMs) are often stumped by the same simple logic question – a finding that the researchers believe calls into question whether frontier AI language models are quite as advanced as their creators often claim.
The paper, which has not yet been peer-reviewed, refers to the AI-stumping prompt as the “Alice in Wonderland” – or AIW – problem. It is a straightforward reasoning question: “Alice has [X] brothers and she also has [Y] sisters. How many sisters does Alice’s brother have?” (The researchers used a few different versions of the problem, for example switching up the X and Y figures or altering the prompt language to include a few additional demands, but the basic reasoning process required to solve the problem remained the same throughout.)
Though the problem requires a bit of thought, it’s hardly troll-bridge-riddle-level difficult. (The answer, naturally, is however many sisters Alice has, plus Alice herself. So if Alice had three brothers and one sister, each brother would have two sisters.)
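To make the arithmetic concrete, here is a minimal sketch in Python – our illustration, not code from the paper, with a hypothetical function name – of the rule the correct answer follows: each brother counts all of Alice’s sisters, plus Alice herself.

def sisters_of_each_brother(brothers: int, sisters: int) -> int:
    # Every brother shares all of Alice's sisters and also counts Alice as a sister;
    # the number of brothers has no bearing on the answer.
    return sisters + 1

# The example above: three brothers, one sister -> each brother has two sisters.
print(sisters_of_each_brother(brothers=3, sisters=1))  # 2

# The two variants quoted later in this article:
print(sisters_of_each_brother(brothers=4, sisters=1))  # 2 (Llama 3 answered 1)
print(sisters_of_each_brother(brothers=1, sisters=4))  # 5 (Claude 3 Opus answered 4)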
But when the researchers ran the question by all the leading AI language models – they tested OpenAI’s GPT-3, GPT-4, and GPT-4o models, Anthropic’s Claude 3 Opus, Google’s Gemini, and Meta’s Llama models, as well as Mistral AI’s Mixtral, Mosaic’s Dbrx, and Cohere’s Command R+ – they found that the models fell remarkably short.
Only one model, the brand-new GPT-4o, achieved a success rate that, by standardized school grades, would technically count as passing. And the issues went beyond baseline inaccuracy: when asked to show their work, the AIs would detail bizarre and erroneous lines of “thinking” that made no sense – and even stranger, when told their work was inaccurate, the models repeatedly became indignant and doubled down on their wrong answers.
This overall pattern “shows a dramatic breakdown of function and reasoning capabilities of state-of-the-art models trained at the largest available scales,” the LAION researchers write in the paper, “using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans.”
“The breakdown is dramatic, as models show strong overconfidence in their wrong solutions, while providing often nonsensical ‘reasoning’-like explanations akin to confabulations to justify and back up the validity of their clearly failed responses, making them sound plausible,” the paper continues.
The research includes dozens of examples of different baffled AI models bungling various iterations of the AIW problem, though a few notable interactions stand out. Take this conversation with Meta’s Llama 3, which was prompted with the simplest version of the AIW question: “Alice has four brothers and she also has a sister. How many sisters does Alice’s brother have?”
The answer, of course, is two sisters. Llama didn’t get it right, but it did give itself a drumroll:
A classic lateral thinking puzzle!
Let’s break it down:
So, each of Alice’s brothers has… (drumroll) …one sister, Alice herself!
Therefore, the answer is: each of Alice’s brothers has one sister.
Or, for an example of a more complicated version of the problem, take this interaction with Claude 3 Opus, which was given the more complex prompt: “Alice has 4 sisters and she also has 1 brother. How many sisters does Alice’s brother have? Before providing an answer to this problem, think carefully and double-check the path to the correct solution for any mistakes. Then give the final answer in the following form: ‘### Answer: ’”
This time around, the answer is five sisters. As requested, the model showed its work, claiming that it had double-checked its answer for inaccuracies. And yet, it was still wrong:
Let’s consider this problem step by step:
1. Alice has 4 sisters and one brother.
2. Alice’s brother is part of the same family as Alice.
3. All of Alice’s sisters are also her brother’s sisters.
4. Therefore, Alice’s brother has the same number of sisters as Alice, namely 4.
I double checked the reasoning and I can’t find any mistakes in the logic.
### Answer: Alice’s brother has 4 sisters
According to the paper, OpenAI’s new GPT-4o model had the highest success rate, answering the AIW question correctly nearly 65 percent of the time. But that is hardly a passing grade, and the next-closest model was Claude 3 Opus, with a 43 percent success rate. Meta’s most successful Llama model, Llama 2-7b (the researchers tested several Llama models), clocked in at 30 percent, while Google’s Gemini Pro managed a meager 0.8 percent success rate.
What’s most interesting, however, is how these figures compare to other industry benchmarks – essentially standardized test results used to measure the efficacy of AI models.
The scientists pay particular attention to a benchmark called MMLU, or “Massive Multitask Language Understanding,” which is designed to evaluate an AI’s capacity to problem-solve. As the researchers note, GPT-4o, Claude 3 Opus, Llama 2-7b, and Gemini Pro notched respective MMLU scores of roughly 88 percent, 87 percent, 64 percent, and 72 percent. Those are very different figures from the ones reflected in the AIW results, and according to the scientists, they may well be cause to reassess the processes by which we evaluate language models’ problem-solving and reasoning skills.
“All the tested models report high scores on various standardized benchmarks that claim to test reasoning function,” the researchers write in the paper, arguing that their observations “suggest that those benchmarks do not properly reflect deficits in the basic reasoning of those models.”
It’s worth pointing out that others have cast doubt on some AI benchmark claims. Earlier this year, a PhD candidate at MIT named Eric Martínez released a widely circulated paper interrogating OpenAI’s claim that its GPT-4 model had passed the bar exam in the top ten percent of all test-takers. By Martínez’s analysis, GPT-4’s score actually fell below the 69th percentile for all test-takers nationwide; among several other apparent lapses in OpenAI’s evaluation process, the PhD candidate also found that OpenAI didn’t use the National Conference of Bar Examiners’ guidelines to grade its AI’s written essays, but instead compared its AI’s outputs to “good” essay scores by law students in Maryland.
Again, this new paper from LAION has not yet been peer-reviewed. Even so, it raises some important questions about how AI models and products are tested and evaluated – and ultimately, of course, marketed.
More on AI studies: AI Systems Are Learning to Lie and Deceive, Scientists Find