Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Nearly all US jurisdictions require a professional license exam (the Bar Exam) to practice law
- To sit for the exam, applicants must complete 7 years of post-secondary education, including 3 years at an accredited law school
- Despite significant investment of time and capital, 1 in 5 test-takers still fail on their first try
- OpenAI’s GPT-3.5 model was tested on the MBE section of the exam
- Hyperparameter optimization and prompt engineering positively impacted GPT-3.5’s zero-shot performance
- GPT-3.5 achieved a headline correct rate of 50.3% on a complete NCBE MBE practice exam, significantly higher than the 25% baseline guessing rate
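
To make the gap between 50.3% and the 25% guessing baseline concrete, the sketch below computes the exact binomial probability of scoring at least that well by uniform guessing. The exam length `n_questions = 200` is an assumption chosen only for illustration (the summary does not state how many questions the practice exam contained), and the helper `binomial_tail` is a hypothetical name.

```python
from math import comb

def binomial_tail(n: int, k: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_questions = 200      # assumption: illustrative exam length, not from the paper
observed_rate = 0.503  # headline correct rate quoted in the abstract
chance_rate = 0.25     # four answer choices, uniform random guessing

# Convert the headline rate into a whole number of correct answers.
n_correct = round(observed_rate * n_questions)
p_value = binomial_tail(n_questions, n_correct, chance_rate)
print(f"{n_correct}/{n_questions} correct; "
      f"P(at least this many by guessing) = {p_value:.2e}")
```

Under this assumed exam length, the probability is vanishingly small, which is consistent with the abstract's description of the result as significantly above chance.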
Paper Content
Introduction
- The legal system is becoming increasingly complex.
- Technology is needed to assist with the quantity, quality, and accessibility of legal services.
- Artificial intelligence and process engineering have been used to help non-professional and professional users of legal systems.
- Research and development has gone into use cases like search and legal aid, automated argumentation, pre- and post-execution contract processes, due diligence and e-discovery, and judicial analysis.
- Legal language is complex and requires a lot of education and training to understand and generate.
- Legal language has a different grammar than normal language and is full of semantic nuance and history.
- Neural network research and transformer architectures have revolutionized machine learning research.
- OpenAI’s GPT-3 is an autoregressive language model with 175 billion parameters.
- OpenAI’s APIs offer text completion, code completion, image generation, and embedding generation endpoints.
- OpenAI’s ChatGPT is a public-facing chatbot version of GPT-3.5.
- GPT-3.5 is trained on a combination of curated CommonCrawl data and high-quality reference data.
- GPT-3.5 was tested on the multistate multiple choice section of the Bar Exam.
Data
- Professional licensure exams are common across many professional fields, including law, medicine, dentistry, pharmacy, accounting, and engineering.
- The National Conference of Bar Examiners (NCBE) is responsible for designing most of the bar examination materials used across the United States.
- To pass the bar exam, test-takers need to have a large amount of theoretical knowledge and the ability to understand and answer exam-specific questions.
- Most test-takers are required to complete at least seven years of post-secondary education and post-graduation Bar preparation training.
- Approximately one out of every five test-takers fail to pass the exam on their initial attempt.
- The Uniform Bar Examination (UBE) features three components: a multiple choice test, an essay test, and a scenario-based performance test.
- The multiple choice component, referred to as the Multistate Bar Examination or MBE, is typically worth 50% of an overall bar exam score.
- The MBE tests legal knowledge and reading comprehension skills.
- The NCBE provides a public sample of an MBE question on their website.
Methods
- Implemented an experiment using zero-shot prompts for the text-davinci-003 text completion API
- Experiment included design and iteration of prompts, API hyperparameters, and an attempt at fine-tuning the model
- Prompt engineering is critical to replication of studies involving large language models
- Experimented with 7 different prompt types
- Last prompt type improved model correctness substantially
- Prompts related to traditional textual entailment tasks
- Problem formulated relative to another statement or body of knowledge
- In some cases, multiple choices may be correct from an entailment perspective
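
As a rough illustration of what a zero-shot multiple-choice prompt for a text completion model can look like, here is a minimal sketch. The template wording, the function name `build_zero_shot_prompt`, and the placeholder question are all hypothetical; they are not the paper's actual prompts or a real MBE item.

```python
from typing import Mapping

def build_zero_shot_prompt(question: str, choices: Mapping[str, str]) -> str:
    """Format one multiple-choice question as a zero-shot completion prompt.

    No solved examples are included, which is what makes it zero-shot.
    """
    lines = [
        "Please answer the following Bar Exam question.",
        "",
        f"Question: {question}",
    ]
    # List the choices in label order so the prompt is deterministic.
    for label in sorted(choices):
        lines.append(f"({label}) {choices[label]}")
    lines += ["", "Rank your top three choices, best first.", "Answer:"]
    return "\n".join(lines)

# Placeholder content, invented purely for this example.
prompt = build_zero_shot_prompt(
    "Which option is the placeholder answer?",
    {"A": "First option", "B": "Second option",
     "C": "Third option", "D": "Fourth option"},
)
print(prompt)
```

Ending the prompt with `Answer:` nudges a completion model to continue with the answer itself rather than restating the question.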
(Hyper)parameters for GPT-3
- Machine learning and computational research results are sensitive to model parameters and hyperparameters.
- Evaluated how hyperparameters like model “temperature” impacted the performance of the model.
- Tested values for temperature, top p, best of, and max tokens.
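
A sweep over these four settings can be organized as a simple grid. The sketch below only enumerates configurations; it does not call any API. The parameter names match the options of OpenAI's text completion endpoint, but the specific values in `search_space` are illustrative assumptions, not necessarily the ones the authors tested.

```python
from itertools import product

# Assumed, illustrative values for each hyperparameter.
search_space = {
    "temperature": [0.0, 0.5, 1.0],
    "top_p": [0.75, 1.0],
    "best_of": [1, 2, 4],
    "max_tokens": [16, 32],
}

def expand_grid(space):
    """Return every combination of values as a list of keyword dicts."""
    names = list(space)
    combos = product(*(space[name] for name in names))
    return [dict(zip(names, values)) for values in combos]

grid = expand_grid(search_space)
print(len(grid), "configurations; first:", grid[0])
```

Each dict in `grid` could then be passed as keyword arguments to whatever function issues the completion request, so that every exam run is tagged with the exact settings that produced it.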
Fine-tuning
- LLMs like GPT-3.5 have good zero-shot or few-shot performance
- OpenAI API allows for some control of the training process
- Attempted to fine-tune text-davinci-003 with simulated MBE questions
- Altered training prompts, responses, batch size, learning rate, and prompt weighting
- Fine-tuned model significantly underperformed text-davinci-003
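
For readers unfamiliar with how such training data is typically packaged, the sketch below serializes question/answer pairs as JSON Lines, assuming the `prompt`/`completion` record layout used by OpenAI's legacy fine-tuning endpoint. The helper names, the `Answer:` separator, and the placeholder examples are hypothetical; the paper's actual training files are not reproduced here.

```python
import json

def to_finetune_record(question_text: str, correct_label: str) -> dict:
    """Build one training record: the prompt ends with a fixed separator,
    and the completion starts with a space followed by the answer label."""
    return {
        "prompt": question_text.strip() + "\n\nAnswer:",
        "completion": " " + correct_label,
    }

def to_jsonl(examples) -> str:
    """Serialize (question, label) pairs, one JSON object per line."""
    return "\n".join(json.dumps(to_finetune_record(q, a)) for q, a in examples)

# Placeholder examples, invented purely for this sketch.
simulated = [
    ("Placeholder question one? (A) ... (B) ... (C) ... (D) ...", "B"),
    ("Placeholder question two? (A) ... (B) ... (C) ... (D) ...", "D"),
]
print(to_jsonl(simulated))
```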
Results
- 107 sample exams were executed
- Prompt style #7 performed best
- GPT is not yet passing the overall multiple choice exam
- GPT is exceeding the baseline random chance rate of 25%
- GPT is trailing human test-takers by approximately 17%
- GPT’s second-best answer is highly correlated with correctness
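
One way to quantify how informative the model's lower-ranked answers are is top-k accuracy: the share of questions whose correct answer appears among the model's first k ranked choices. The sketch below uses made-up rankings and an invented answer key purely to show the calculation; none of the numbers come from the paper.

```python
def top_k_accuracy(ranked_answers, answer_key, k):
    """Fraction of questions whose correct label is in the top k ranked choices."""
    hits = sum(
        1
        for ranking, correct in zip(ranked_answers, answer_key)
        if correct in ranking[:k]
    )
    return hits / len(answer_key)

# Toy data: each inner list is a model's ranking for one question, best first.
ranked = [
    ["A", "C", "B"],
    ["B", "A", "D"],
    ["D", "B", "A"],
    ["C", "D", "A"],
]
key = ["A", "A", "A", "B"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, key, k):.2f}")
```

With four answer choices, random ranking would give expected top-1, top-2, and top-3 rates of 25%, 50%, and 75%, so a model's top-k rates are only meaningful relative to those baselines.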
Conclusion and future work
- GPT-3.5 outperformed the baseline rate of random guessing
- GPT-3.5 achieved a passing rate on two categories of the Bar (Evidence and Torts) and achieved parity with human test-takers on one (Evidence)
- GPT-3.5’s rank-ordering of possible choices is strongly correlated with correctness
- GPT-3.5 exceeded expectations for performance on this task
- GPT-3.5 is at parity with humans for Evidence questions, but has a gap of up to 36% for other categories