Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Nearly all US jurisdictions require a professional license exam (the Bar Exam) to practice law
  • To sit for the exam, applicants must complete 7 years of post-secondary education, including 3 years at an accredited law school
  • Despite significant investment of time and capital, 1 in 5 test-takers still fail on their first try
  • OpenAI’s GPT-3.5 model was tested on the MBE section of the exam
  • Hyperparameter optimization and prompt engineering positively impacted GPT-3.5’s zero-shot performance
  • GPT-3.5 achieved a headline correct rate of 50.3% on a complete NCBE MBE practice exam, significantly higher than the 25% baseline guessing rate

Paper Content

Introduction

  • The legal system is becoming increasingly complex.
  • Technology is needed to assist with the quantity, quality, and accessibility of legal services.
  • Artificial intelligence and process engineering have been used to help non-professional and professional users of legal systems.
  • Research and development has gone into use cases like search and legal aid, automated argumentation, pre-and post-execution contract processes, due diligence and e-discovery, and judicial analysis.
  • Legal language is complex and requires a lot of education and training to understand and generate.
  • Legal language has a different grammar than normal language and is full of semantic nuance and history.
  • Neural network research and transformer architectures have revolutionized machine learning research.
  • OpenAI’s GPT-3 is an autoregressive language model with 175 billion parameters.
  • OpenAI’s APIs offer text completion, code completion, image generation, and embedding generation endpoints.
  • OpenAI’s ChatGPT is a public-facing chatbot version of GPT-3.5.
  • GPT-3.5 is trained on a combination of curated CommonCrawl data and high-quality reference data.
  • GPT-3.5 was tested on the multistate multiple choice section of the Bar Exam.

Data

  • Professional licensure exams are common across many professional fields, including law, medicine, dentistry, pharmacy, accounting, and engineering.
  • The National Conference of Bar Examiners (NCBE) is responsible for designing most of the bar examination materials used across the United States.
  • To pass the bar exam, test-takers need to have a large amount of theoretical knowledge and the ability to understand and answer exam-specific questions.
  • Most test-takers are required to complete at least seven years of post-secondary education and post-graduation Bar preparation training.
  • Approximately one out of every five test-takers fail to pass the exam on their initial attempt.
  • The Uniform Bar Examination (UBE) features three components: a multiple choice test, an essay test, and a scenario-based performance test.
  • The multiple choice component, referred to as the Multistate Bar Examination or MBE, is typically worth 50% of an overall bar exam score.
  • The MBE tests legal knowledge and reading comprehension skills.
  • The NCBE provides a public sample of an MBE question on their website.

Methods

  • Implemented an experiment using zero-shot prompts for the text-davinci-003 text completion API
  • Experiment included design and iteration of prompts, API hyperparameters, and an attempt at fine-tuning the mode
  • Prompt engineering is critical to replication of studies involving large language models
  • Experimented with 7 different prompt types
  • Last prompt type improved model correctness substantially
  • Prompts related to traditional textual entailment tasks
  • Problem formulated relative to another statement or body of knowledge
  • In some cases, multiple choices may be correct from an entailment perspective

(hyper)parameters for gpt-3

  • Machine learning and computational research results are sensitive to model parameters and hyperparameters.
  • Evaluated how hyperparameters like model “temperature” impacted the performance of the model.
  • Tested values for temperature, top p, best of, and max tokens.

Fine-tuning

  • LLMs like GPT-3.5 have good zero-shot or few-shot performance
  • OpenAI API allows for some control of the training process
  • Attempted to fine tune text-davinci-003 with simulated MBE bar exam questions
  • Altered training prompts, responses, batch size, learning rate, and prompt weighting
  • Fine-tuned model significantly underperformed text-davinci-003

Results

  • 107 sample exams were executed
  • Prompt style #7 performed best
  • GPT is not yet passing the overall multiple choice exam
  • GPT is exceeding the baseline random chance rate of 25%
  • GPT is trailing human test-takers by approximately 17%
  • GPT’s second best answer is highly correlated with correctness

Conclusion and future work

  • GPT-3.5 outperformed the baseline rate of random guessing
  • GPT-3.5 achieved a passing rate on two categories of the Bar and achieved parity with human test-takers on one
  • GPT-3.5’s rank-ordering of possible choices is strongly correlated with correctness
  • GPT-3.5 exceeded expectations for performance on this task
  • GPT-3.5 is at parity with humans for Evidence questions, but has a gap of up to 36% for other categories