Mathematicians have stumped the most advanced generative artificial intelligence (AI) models with a series of mind-bending new math problems.
These problems typically take doctorate-level mathematicians hours to days to solve, according to the research institute Epoch AI. But in the new test, the most advanced AI models on the market got correct answers on less than 2% of these problems.
The researchers tested six state-of-the-art AI models against the new benchmark, and the best score registered by a single system was 2%.
Over the past decade, a number of AI tests have been developed to determine whether the answers these models generate are actually correct. In many cases, AI models now breeze through these benchmarks.
For example, in the commonly used Measuring Massive Multitask Language Understanding (MMLU) benchmark test, today's AI models answer 98% of math problems correctly.
Most of these benchmarks are geared toward testing AI's ability to do high-school- and college-level math, Elliot Glazer, a mathematician at Epoch AI, and colleagues wrote in a new paper posted on the preprint database arXiv. (The paper has not yet been peer-reviewed or published in a scientific journal.)
Related: Scientists design new 'AGI benchmark' that indicates whether any future AI model could cause 'catastrophic harm'
The new set of benchmarks, called FrontierMath, aims for a higher level of reasoning. Epoch AI developed the questions with the help of math professors, including some winners of the Fields Medal, perhaps the most prestigious prize in math. The problems cover a wide range of subfields, from number theory to algebraic geometry, and are available on Epoch AI's website.
" These are passing intriguing , " 2006 Fields Medal winnerTerence Tao , a mathematician at UCLA , write in a limited review of the problems for Epoch AI . " I think that in the near term fundamentally the only way to solve them , curtly of having a real domain expert in the arena , is by a compounding of a semi - expert like a graduate student in a related to field , maybe paired with some compounding of a modern AI and slew of other algebra bundle . "
The problems were also new and unpublished, a step taken to ensure that none of them were already in the AI models' training data. When complex reasoning problems are included in the training data, the AI may appear to solve the problem, but in reality it already has a "cheat sheet," since it has been trained on the solution.
The researchers tested six state-of-the-art AI models: Google's Gemini 1.5 Pro (002), Anthropic's Claude 3.5 Sonnet, OpenAI's o1-preview, o1-mini and GPT-4o, and xAI's Grok-2 Beta. Gemini and Claude managed to solve 2%, which was just slightly better than the showings from o1-preview, o1-mini and GPT-4o at 1%. Grok-2 Beta failed to get any problems right.
However, these rankings are misleading because the low success rates mean that a single correct solution can have an outsize impact on each model's overall score, the researchers caution.
— Claude 3 Opus has stunned AI researchers with its intellect and 'self-awareness' — does this mean it can think for itself?
— New Chinese AI model 'better than industry leaders' in key metric
— 'Student of Games' is the first AI that can master different types of games, like chess and poker
" [ E]ven when a good example obtained the correct result , this does not think that its abstract thought was correct , " the report author wrote . " For instance , on one of these problems running a few simple simulation was sufficient to make precise guesses without any deeper numerical understanding . However , models ' miserable overall truth shows that such guessing scheme do not work on the overwhelming bulk of FrontierMath problems . "
The findings show that, right now, AI models don't possess research-level math reasoning, Epoch AI's collaborators concluded. However, as AI models advance, these benchmark tests will provide a way to find out whether their reasoning abilities are deepening.
" By regularly measure state - of - the - art exemplar and collaborate with the AI inquiry community , " the squad compose in the assertion , " we take aim to intensify our apprehension of AI ’s capacity and limitations . "