
Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending novel math problems.

These problems typically take doctorate-level mathematicians hours to days to solve, according to the research institute Epoch AI. But in the new test, the most advanced AI models on the market got right answers on less than 2% of these problems.


The researchers tested six state-of-the-art AI models against the new benchmark and the best score registered by a single system was 2%.

In the past decade, a number of AI tests have been developed to determine whether the answers these models generate are actually correct. In many cases, AI models now breeze through these benchmarks.

For example, in the commonly used Measuring Massive Multitask Language Understanding (MMLU) benchmark test, today's AI models answer 98% of math problems correctly.

Most of these benchmarks are geared toward testing AI's ability to do high-school- and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted to the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)


Related: Scientists design new 'AGI benchmark' that indicates whether any future AI model could cause 'catastrophic harm'

The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of math professors, including some winners of the Fields Medal, perhaps the most prestigious prize in mathematics. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.

" These are passing intriguing , " 2006 Fields Medal winnerTerence Tao , a mathematician at UCLA , write in a limited review of the problems for Epoch AI . " I think that in the near term fundamentally the only way to solve them , curtly of having a real domain expert in the arena , is by a compounding of a semi - expert like a graduate student in a related to field , maybe paired with some compounding of a modern AI and slew of other algebra bundle . "


The problems were also new and unpublished, a step taken to ensure that none of them were already in the AI models' training data. When complex reasoning problems are included in the training data, the AI may appear to solve the problem, but in reality, it already has a "cheat sheet," since it has been trained on the solution.

The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002), Anthropic's Claude 3.5 Sonnet, OpenAI's o1-preview, o1-mini and GPT-4o, and xAI's Grok-2 Beta. Gemini and Claude managed to solve 2% of the problems, which was just slightly better than the showings from o1-preview, o1-mini and GPT-4o at 1%. Grok-2 Beta failed to get any problems right.

However, these rankings are misleading, because the low success rates mean that a single correct solution can have an outsized impact on each model's overall score, the researchers caution.
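The arithmetic behind that caution is simple: when almost every answer is wrong, one extra correct answer shifts the headline percentage substantially. A minimal sketch, using a hypothetical benchmark size of 100 problems (the actual FrontierMath problem count is not stated here):

```python
# Illustrative only: the problem count is hypothetical, not FrontierMath's real size.

def accuracy(correct: int, total: int) -> float:
    """Return accuracy as a percentage."""
    return 100.0 * correct / total

TOTAL = 100  # hypothetical benchmark size

one_right = accuracy(1, TOTAL)   # a "1%" model
two_right = accuracy(2, TOTAL)   # one more solved problem doubles the score
print(one_right, two_right)
```

At these scales, the gap between a "1%" and a "2%" model is a single problem, so small ranking differences say little about relative capability.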


— Claude 3 Opus has stunned AI researchers with its intellect and 'self-awareness' — does this mean it can think for itself?

— New Chinese AI model 'better than industry leaders' in key metrics

— 'Student of Games' is the first AI that can master different types of games, like chess and poker


" [ E]ven when a good example obtained the correct result , this does not think that its abstract thought was correct , " the report author wrote . " For instance , on one of these problems running a few simple simulation was sufficient to make precise guesses without any deeper numerical understanding . However , models ' miserable overall truth shows that such guessing scheme do not work on the overwhelming bulk of FrontierMath problems . "

The findings show that, right now, AI models do not possess research-level math reasoning, Epoch AI's collaborators concluded. However, as AI models advance, these benchmark tests will provide a way to find out whether their reasoning abilities are deepening.

" By regularly measure state - of - the - art exemplar and collaborate with the AI inquiry community , " the squad compose in the assertion , " we take aim to intensify our apprehension of AI ’s capacity and limitations . "
