It generated a wave of media excitement last year when OpenAI claimed that its GPT-4 model had outperformed 90% of trainee lawyers on the bar exam. But a new study suggests those claims were probably overstated: GPT-4 may not have scored in the top 10% after all.
The large language model (LLM) that powers OpenAI’s chatbot ChatGPT was announced in March of last year, and the announcement stunned the legal community and the internet at large.
According to the new research, the much-discussed 90th-percentile figure was skewed toward repeat test-takers, people who had already failed the exam at least once and who score significantly lower, on average, than the typical test-taking population. The findings were published March 30 in the journal Artificial Intelligence and Law.
Study author Eric Martínez, a doctoral student in MIT’s Department of Brain and Cognitive Sciences, said at a New York State Bar Association continuing legal education course: “It seems the most accurate comparison would be against first-time test takers, or, to the extent that you think the percentile should reflect GPT-4’s performance as compared to an actual lawyer, then the most accurate comparison would be to those who pass the exam.”
OpenAI based its claim on a 2023 study in which researchers had GPT-4 answer Uniform Bar Examination (UBE) questions. The model performed impressively, scoring 298 out of 400, which placed it in the top 10% of exam takers.
However, the model landed in the top 10% only when compared with repeat test-takers, those retaking the exam after failing it. When Martínez measured its overall performance against broader groups, the LLM scored in the 69th percentile among all test takers and in just the 48th percentile among first-time takers.
The model fared worse still on the exam’s essay-writing component, according to Martínez’s research: it scored in the 48th percentile among all test takers and in just the 15th percentile among first-time takers.
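To see why a single raw score can translate into such different percentiles, consider the minimal Python sketch below. The score distributions are invented purely for illustration, not real UBE data; they are shaped only to echo the pattern Martínez describes, in which the weaker the comparison group, the higher the same score ranks.

```python
def percentile_of(score: float, population: list[float]) -> float:
    """Percentage of the population scoring strictly below `score`."""
    below = sum(1 for s in population if s < score)
    return 100 * below / len(population)

# Hypothetical UBE score samples (out of 400) for three reference groups.
# These numbers are made up for illustration only.
repeat_takers = [210, 225, 240, 250, 255, 260, 265, 270, 280, 310]  # weakest group
all_takers    = [230, 250, 260, 270, 280, 290, 295, 305, 315, 330]
first_timers  = [255, 270, 285, 290, 295, 305, 310, 320, 330, 345]  # strongest group

gpt4_score = 298  # GPT-4's reported overall UBE score

for name, group in [("repeat takers", repeat_takers),
                    ("all test takers", all_takers),
                    ("first-time takers", first_timers)]:
    print(f"vs {name:17}: {percentile_of(gpt4_score, group):.0f}th percentile")
# vs repeat takers    : 90th percentile
# vs all test takers  : 70th percentile
# vs first-time takers: 50th percentile
```

The same fixed score of 298 ranks near the top of the weak hypothetical group but only around the middle of the strong one, which is the crux of Martínez’s objection to the original 90th-percentile framing.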
To explore the findings more thoroughly, Martínez had GPT-4 retake the exam under the rules established by the original study’s authors. The UBE typically comprises three parts: the multiple-choice Multistate Bar Examination (MBE), the written Multistate Essay Examination (MEE) and the Multistate Performance Test (MPT), which asks examinees to complete a variety of lawyering tasks.
Martínez was able to replicate GPT-4’s score on the multiple-choice MBE, but he found “several methodological issues” in the scoring of the MEE and MPT sections. He pointed out that the original study did not apply the essay-grading guidelines of the National Conference of Bar Examiners, the organization that administers the exam. Instead, the researchers simply compared GPT-4’s answers against “good answers” from the state of Maryland.
That distinction matters: according to Martínez, the essay-writing portion of the bar exam is the component that most closely resembles the work of an actual lawyer, and it was also where the AI performed worst.
“Although the leap from GPT-3.5 was undoubtedly impressive and very much worthy of attention, the fact that GPT-4 particularly struggled on essay writing compared to practicing lawyers indicates that large language models, at least on their own, struggle on tasks that more closely resemble what a lawyer does on a daily basis,” Martínez said.
Because state minimum passing scores range from 260 to 272, GPT-4’s essay score would have had to be disastrously low for the model to fail the exam outright. Even so, the analysis found that a drop of just nine points in its essay score would push it into the bottom quarter of test takers and below the fifth percentile of licensed attorneys.
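A quick back-of-the-envelope check, using only the figures quoted in this article, shows how wide that passing margin actually is:

```python
# Figures from the article: GPT-4's reported overall UBE score and the
# range of state minimum passing scores.
gpt4_score = 298         # reported overall UBE score (out of 400)
thresholds = (260, 272)  # most lenient and strictest state passing scores

for threshold in thresholds:
    print(f"passing threshold {threshold}: GPT-4 clears it by "
          f"{gpt4_score - threshold} points")
# passing threshold 260: GPT-4 clears it by 38 points
# passing threshold 272: GPT-4 clears it by 26 points
```

A cushion of 26 to 38 points explains why passing was never in doubt, even as the model’s percentile ranking proved far more fragile.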
According to Martínez, his research shows that however impressive current AI systems may be, they should be rigorously assessed before being deployed in legal settings, lest they be used “in an unintentionally harmful or catastrophic manner.”
The warning seems well placed. AI systems are being examined for a range of legal applications despite their propensity to hallucinate, that is, to fabricate facts or connections that don’t exist. On May 29, for instance, a federal appeals court judge proposed that AI tools could assist in the interpretation of legal texts.
In response to an email about the study’s conclusions, a representative for OpenAI directed Live Science to “Appendix A on page 24” of the GPT-4 technical report. The pertinent line there reads: “The Uniform Bar Exam was run by our collaborators at CaseText and Stanford CodeX.”