GPT-4 didn’t actually score in the top 10% on the bar exam after all, new research suggests.
OpenAI, the company behind the large language model (LLM) that powers its chatbot ChatGPT, made the claim in March last year, and the announcement sent shock waves around the web and the legal profession.
Now, a new study has revealed that the much-hyped 90th-percentile figure was actually skewed toward repeat test-takers who had already failed the exam one or more times, a much lower-scoring group than those who generally take the test. The researcher published his findings March 30 in the journal Artificial Intelligence and Law.
" It seems the most exact comparison would be against first - time test takers or to the extent that you think that the centile should reflect GPT-4 ’s execution as compare to an actual lawyer ; then the most accurate comparability would be to those who pass the exam , " study authorEric Martínez , a doctoral student at MIT ’s Department of Brain and Cognitive Sciences , said at aNew York State Bar Association continuing sound education course of study .
Related: AI can 'fake' empathy but also encourage Nazism, disturbing study suggests
To arrive at its claim, OpenAI used a 2023 study in which researchers made GPT-4 answer questions from the Uniform Bar Examination (UBE). The AI model’s results were impressive: It scored 298 out of 400, which placed it in the top 10% of exam takers.
But it turns out the artificial intelligence (AI) model only scored in the top 10% when compared with repeat test takers. When Martínez compared the model’s performance more generally, the LLM scored in the 69th percentile of all test takers and in the 48th percentile of those taking the test for the first time.
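To see why the choice of comparison group matters so much, here is a minimal Python sketch using invented score distributions (the means and spreads below are assumptions for illustration, not real UBE statistics): the same score of 298 lands near the top of a lower-scoring pool but near the middle of a higher-scoring one.

```python
# Hypothetical illustration (not real UBE data): the same raw score lands at
# very different percentiles depending on the comparison group.
import numpy as np
from scipy.stats import percentileofscore

rng = np.random.default_rng(seed=42)

# Invented score distributions: assume repeat takers score lower on average
# than first-time takers. The means and spreads are made up for demonstration.
repeat_takers = rng.normal(loc=250, scale=25, size=10_000)
first_timers = rng.normal(loc=295, scale=25, size=10_000)

gpt4_score = 298  # GPT-4's reported UBE score (out of 400)

print(f"vs. repeat test takers: {percentileofscore(repeat_takers, gpt4_score):.0f}th percentile")
print(f"vs. first-time takers:  {percentileofscore(first_timers, gpt4_score):.0f}th percentile")
```

Run against these assumed distributions, the first comparison puts 298 in roughly the top few percent while the second leaves it near the median, which is the crux of Martínez's critique.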
Martínez’s study also suggested that the model’s results ranged from mediocre to below average in the essay-writing section of the test. It landed in the 48th percentile of all exam takers and in the 15th percentile of those taking the test for the first time.
To investigate the results further, Martínez made GPT-4 retake the exam according to the parameters set by the authors of the original study. The UBE typically consists of three components: the multiple-choice Multistate Bar Examination (MBE); the Multistate Performance Test (MPT), which makes examinees perform various lawyering tasks; and the written Multistate Essay Examination (MEE).
Martínez was able to replicate GPT-4’s score on the multiple-choice MBE but spotted "several methodological issues" in the grading of the MPT and MEE portions of the exam. He noted that the original study did not use the essay-grading guidelines set by the National Conference of Bar Examiners, which administers the bar exam. Instead, the researchers simply compared GPT-4’s answers to "good answers" from test takers in the state of Maryland.
This is significant. Martínez said the essay-writing section is the closest proxy in the bar exam to the tasks performed by a practicing lawyer, and it was the section in which the AI performed worst.
" Although the leap from GPT-3.5 was doubtless impressive and very much worthy of aid , the fact that GPT-4 peculiarly struggled on essay written material compare to rehearse attorney indicates that heavy language models , at least on their own , shinny on chore that more close resemble what a attorney does on a daily basis , " Martínez said .
The minimum passing score varies from state to state between 260 and 272, so GPT-4’s essay score would have to be abysmal for it to fail the overall exam; its 298 clears even the highest cutoff by 26 points. But a drop in its essay score of just nine points would drag its score to the bottom quarter of MBE takers and below the fifth percentile of licensed attorneys, according to the study.
— Scientists create 'toxic AI' that is rewarded for thinking up the worst possible questions we could imagine
— Claude 3 Opus has stunned AI researchers with its intellect and 'self-awareness' — does this mean it can think for itself?
— Researchers gave AI an 'inner monologue,' and it massively improved its performance
Martínez said his findings revealed that, while undoubtedly still impressive, current AI systems should be carefully evaluated before they are used in legal settings "in an unintentionally harmful or catastrophic way."
The warning seems to be timely. Despite their tendency to produce hallucinations (fabricating facts or connections that don’t exist), AI systems are being considered for multiple applications in the legal world. For example, on May 29, a federal appeals court judge suggested that AI programs could help interpret the contents of legal texts.
In response to an email about the study’s findings, an OpenAI spokesperson referred Live Science to "Appendix A on page 24" of the GPT-4 technical paper. The relevant line there reads: "The Uniform Bar Exam was run by our collaborators at CaseText and Stanford CodeX."