Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Nearly all US jurisdictions require a professional license exam (the Bar Exam) to practice law
To sit for the exam, applicants must complete 7 years of post-secondary education, including 3 years at an accredited law school
Despite significant investment of time and capital, 1 in 5 test-takers still fail on their first try
OpenAI’s GPT-3.5 model was tested on the MBE section of the exam
Hyperparameter optimization and prompt engineering positively impacted GPT-3.5’s zero-shot performance
GPT-3.5 achieved a headline correct rate of 50.3% on a complete NCBE MBE practice exam, significantly higher than the 25% baseline guessing rate

The legal system is becoming increasingly complex.
Technology is needed to assist with the quantity, quality, and accessibility of legal services.
Artificial intelligence and process engineering have been used to help non-professional and professional users of legal systems.
Research and development has gone into use cases like search and legal aid, automated argumentation, pre- and post-execution contract processes, due diligence and e-discovery, and judicial analysis.
Legal language is complex and requires a lot of education and training to understand and generate.
Legal language has a different grammar than normal language and is full of semantic nuance and history.
Neural network research and transformer architectures have revolutionized machine learning research.
OpenAI’s GPT-3 is an autoregressive language model with 175 billion parameters.
OpenAI’s APIs offer text completion, code completion, image generation, and embedding generation endpoints.
OpenAI’s ChatGPT is a public-facing chatbot version of GPT-3.5.
GPT-3.5 is trained on a combination of curated CommonCrawl data and high-quality reference data.
GPT-3.5 was tested on the multistate multiple choice section of the Bar Exam.

Professional licensure exams are common across many professional fields, including law, medicine, dentistry, pharmacy, accounting, and engineering.
The National Conference of Bar Examiners (NCBE) is responsible for designing most of the bar examination materials used across the United States.
To pass the bar exam, test-takers need to have a large amount of theoretical knowledge and the ability to understand and answer exam-specific questions.
Most test-takers are required to complete at least seven years of post-secondary education and post-graduation Bar preparation training.
Approximately one in five test-takers fails the exam on the first attempt.
The Uniform Bar Examination (UBE) features three components: a multiple choice test, an essay test, and a scenario-based performance test.
The multiple choice component, referred to as the Multistate Bar Examination or MBE, is typically worth 50% of an overall bar exam score.
The MBE tests legal knowledge and reading comprehension skills.
The NCBE provides a public sample of an MBE question on their website.

Implemented an experiment using zero-shot prompts for the text-davinci-003 text completion API
Experiment included design and iteration of prompts, API hyperparameters, and an attempt at fine-tuning the model
Prompt engineering is critical to replication of studies involving large language models
Experimented with 7 different prompt types
Last prompt type improved model correctness substantially
Prompts related to traditional textual entailment tasks
Problem formulated relative to another statement or body of knowledge
In some cases, multiple choices may be correct from an entailment perspective
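As a rough illustration of what a zero-shot prompt for one MBE-style question might look like, the sketch below formats a question and its answer choices into a single completion prompt that asks for a rank-ordering of choices. The function name `build_mbe_prompt`, the instruction wording, and the `top_k` parameter are all hypothetical; this is not one of the paper's seven prompt types.

```python
from typing import Mapping


def build_mbe_prompt(question: str, choices: Mapping[str, str], top_k: int = 3) -> str:
    """Format one multiple-choice question as a zero-shot completion prompt.

    The instruction wording is an assumption for illustration only.
    """
    if not 1 <= top_k <= len(choices):
        raise ValueError("top_k must be between 1 and the number of choices")
    lines = [
        "Please answer the following Bar Exam question.",
        f"Rank your top {top_k} choices from most to least likely to be correct.",
        "",
        f"Question: {question.strip()}",
    ]
    # One line per answer choice, e.g. "(A) The statement is hearsay."
    for label, text in choices.items():
        lines.append(f"({label}) {text.strip()}")
    # End with an open "Answer:" cue so the model completes the ranking.
    lines.extend(["", "Answer:"])
    return "\n".join(lines)
```

Because the prompt contains no worked examples, it stays zero-shot; asking for a ranking rather than a single letter lets second and third choices be scored as well.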

Machine learning and computational research results are sensitive to model parameters and hyperparameters.
Evaluated how hyperparameters like model “temperature” impacted the performance of the model.
Tested values for temperature, top p, best of, and max tokens.
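A sweep over these four hyperparameters can be sketched as a plain grid of request payloads. The candidate values in `GRID` are placeholders chosen for illustration, not the values the authors tested, and `completion_requests` is a hypothetical helper; it only builds dictionaries using the completion API's request field names and does not send anything.

```python
from itertools import product

# Placeholder candidate values (assumptions, not the paper's settings).
GRID = {
    "temperature": [0.0, 0.5, 1.0],
    "top_p": [0.75, 1.0],
    "best_of": [1, 2, 4],
    "max_tokens": [16, 32],
}


def completion_requests(prompt, model="text-davinci-003", grid=GRID):
    """Yield one request payload per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield {"model": model, "prompt": prompt, **dict(zip(names, values))}
```

Each payload could then be sent to the completion endpoint and the resulting answers scored, so that every question is evaluated under every setting.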

LLMs like GPT-3.5 have good zero-shot or few-shot performance
The OpenAI API allows some control over the training process through fine-tuning
Attempted to fine-tune text-davinci-003 with simulated MBE questions
Altered training prompts, responses, batch size, learning rate, and prompt weighting
Fine-tuned model significantly underperformed text-davinci-003
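For readers unfamiliar with how such training data is typically prepared, the sketch below serializes question/answer pairs as JSON Lines records with `prompt` and `completion` fields, the shape used by prompt/completion style fine-tuning. The helper name `to_finetune_jsonl`, the input keys, and the leading space before each completion are assumptions for illustration; the authors' actual training files are not described here.

```python
import json


def to_finetune_jsonl(examples):
    """Serialize examples as JSON Lines, one prompt/completion record per line.

    Each example is a dict with hypothetical keys "prompt" and "answer".
    """
    lines = []
    for example in examples:
        record = {
            "prompt": example["prompt"],
            # Leading space before the answer: an assumed tokenization convention.
            "completion": " " + example["answer"].strip(),
        }
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"
```

Varying the prompt and response text in these records is one way the training prompts and responses mentioned above could be altered between runs.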

GPT-3.5 outperformed the baseline rate of random guessing
GPT-3.5 achieved a passing rate on two categories of the Bar and achieved parity with human test-takers on one
GPT-3.5’s rank-ordering of possible choices is strongly correlated with correctness
GPT-3.5 exceeded expectations for performance on this task
GPT-3.5 is at parity with humans for Evidence questions, but has a gap of up to 36% for other categories
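The comparisons above (accuracy by category, the value of the model's rank-ordering, and performance against the 25% guessing baseline) can be sketched with two small helpers. Both function names and the `(category, ranked_choices, correct_label)` record layout are hypothetical, and no figures from the paper are reproduced; the binomial tail is simply one standard way to ask how likely a given score would be under random guessing among four choices.

```python
from collections import defaultdict
from math import comb


def top_k_accuracy(results, k=1):
    """Per-category share of questions whose correct label is in the top k.

    results: iterable of (category, ranked_choices, correct_label) tuples,
    where ranked_choices lists the model's choices from most to least likely.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, ranked, correct in results:
        totals[category] += 1
        if correct in ranked[:k]:
            hits[category] += 1
    return {category: hits[category] / totals[category] for category in totals}


def binomial_tail(n, k, p=0.25):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more correct by guessing."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

Comparing `top_k_accuracy(results, k=1)` with `k=2` or `k=3` shows how often the correct answer appears among the model's lower-ranked choices, which is what a correlation between rank-ordering and correctness would predict.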