Abstract

  • Global economy is increasingly dependent on knowledge workers
  • AICPA developed Uniform CPA Examination to measure capability readiness for knowledge workers
  • OpenAI’s text-davinci-003 and prior versions of GPT are evaluated on a sample Regulation (REG) exam and on an assessment of multiple-choice questions
  • text-davinci-003 achieves 14.4% correct rate on sample REG exam section, underperforming human capabilities
  • text-davinci-003 approaching human-level performance on Remembering & Understanding and Application skill levels
  • Recent generations of GPT-3 demonstrate material improvements on the multiple-choice assessment, rising from 30% to 57%

Paper Content

Introduction

  • Knowledge work is an important part of the global economy
  • Leading management theorists have studied knowledge workers for nearly seven decades
  • Hundreds of millions to billions of people are considered knowledge workers
  • Organizations require knowledge workers to demonstrate their preparedness through assessments
  • Public accounting is a multidisciplinary practice that requires legal, financial, accounting, auditing, technology, and ethical knowledge and skills
  • The CPA Exam is the most comprehensive assessment of knowledge work readiness
  • The CPA Exam is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation
  • Historically, AI has not been able to perform knowledge work
  • Recent research has shown potential to address capability gaps
  • GPT-3 has demonstrated state-of-the-art performance on a wide range of tasks
  • GPT-3 was evaluated on the Bar Exam and achieved near-parity with human test-takers
  • GPT-3 was evaluated on the CPA Exam to evaluate its usefulness for knowledge work
  • Analysis suggests areas where GPT-3 may be useful and areas where research is still required

AICPA exam

  • The Uniform CPA Examination is a computerized assessment based on psychometric and statistical techniques.
  • It is a dynamic, adaptive exam.
  • It is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation.
  • It is designed to assess candidates on their readiness across a broad range of concepts and skill levels.

Data

  • Research exists on quantitative reasoning with fine-tuning or few-shot contexts
  • Results in this study are constrained to zero-shot prompts
  • Two separate assessments are prepared to isolate arithmetic or quantitative capabilities from other elements of the Exam

Assessment 1: sample exam - Regulation

  • Assessment is intended to approximate the real Uniform CPA Examination
  • Uses the REG section because it contains the most balanced distribution of skill types and of quantitative and qualitative reasoning
  • Transcribed on January 3rd, 2023, including correct answers
  • 40 test questions across five testlets
  • 15 multiple-choice questions with four to six options each
  • 24 questions require the test-taker to indicate the correct financial amount
  • 1 question requires the test-taker to research authoritative material made available within the exam

Assessment 2: synthetic MCQ assessment

  • The Uniform CPA Examination is organized around Bloom’s cognitive taxonomy.
  • The taxonomy is divided into six levels.
  • The AICPA has adapted these skill levels into four simpler groups.
  • The authors reviewed material from the AICPA, McGraw-Hill Education, and Becker Professional Education.

Methods

  • Evaluated OpenAI’s models using MCQ assessments
  • Stripped and reformatted answers for automated scoring
  • Generated zero-shot prompts for text-davinci-003 API
  • Fully open-sourced source code and questions for Assessment 2
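
The prompt-and-score loop described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the `MCQ` structure, the prompt wording, and the function names are all hypothetical, and no API call is made.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

LETTERS = "ABCDEF"  # the MCQs described here have at most six options


@dataclass
class MCQ:
    """One multiple-choice question (hypothetical structure, not the authors')."""
    stem: str
    choices: List[str]  # answer options in display order
    correct: str        # letter of the keyed answer, e.g. "B"


def build_zero_shot_prompt(q: MCQ) -> str:
    """Format the question with lettered options and no worked examples."""
    lines = ["Question: " + q.stem]
    lines += ["({}) {}".format(LETTERS[i], c) for i, c in enumerate(q.choices)]
    lines.append("Answer with the letter of the single best choice.")
    lines.append("Answer:")
    return "\n".join(lines)


def extract_choice(response: str) -> Optional[str]:
    """Reduce a free-text completion to its leading option letter, if any.

    Limitation: a response that opens with the article "A " is read as option A.
    """
    m = re.match(r"\(?([A-Fa-f])\)?(?:[\s.:,]|$)", response.strip())
    return m.group(1).upper() if m else None


def correct_rate(questions: List[MCQ], responses: List[str]) -> float:
    """Fraction of responses whose extracted letter matches the answer key."""
    hits = sum(extract_choice(r) == q.correct for q, r in zip(questions, responses))
    return hits / len(questions)
```

Anchoring the letter match at the start of the response keeps scoring strict: a completion that never commits to an option is counted as wrong rather than guessed at.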

Prompt engineering and responses

  • Limited scientific understanding of large language models
  • Proprietary nature of OpenAI’s models
  • Writing prompts is referred to as “prompt engineering”
  • Experimented with answer types, contextualization, and justification in prompt engineering

Model (hyper)parameters

  • LLMs are sensitive to small changes in inputs and parameters.
  • Evaluated how altering model parameters impacts performance.
  • Tested different temperature and best_of values.

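
A parameter sweep of this kind can be sketched as a simple grid. The specific values below are assumptions chosen for illustration, since the summary does not list the ones actually tested; `temperature` and `best_of` are the parameter names used by OpenAI's completions endpoint.

```python
from itertools import product

# Hypothetical values for illustration; the tested values are not given here.
TEMPERATURES = [0.0, 0.5, 1.0]
BEST_OF_VALUES = [1, 2, 4]


def parameter_grid(temperatures, best_of_values):
    """Return one settings dict per (temperature, best_of) combination."""
    return [
        {"temperature": t, "best_of": b}
        for t, b in product(temperatures, best_of_values)
    ]


grid = parameter_grid(TEMPERATURES, BEST_OF_VALUES)
print(len(grid))  # 9
```

Each dict could then be passed as keyword arguments to whatever client function issues the completion request, with one full assessment session run per setting.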
Fine-tuning and historical models

  • OpenAI provides an API for fine-tuning models
  • This paper focuses on the zero-shot performance of the model
  • Fine-tuning at small sample sizes would not improve performance
  • Fine-tuning can result in unexplained model degradation
  • Tested with other models provided by OpenAI API

Results

  • Asked over 50,000 questions across 700 independent assessment sessions
  • Performance values summarized in Table 6

Assessment 1

  • GPT-3.5 performed poorly on Assessment 1, with average correct rates between 5.7% and 9.4%.
  • GPT-3.5 also struggled with arithmetic on the 15 MCQs on Assessment 1, scoring only 4-6% above the baseline rate.
  • Future research could improve GPT-3.5’s performance by expanding the prompt to include “scratchpads”.
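
Because the Assessment 1 MCQs have four to six options each, the random-guessing baseline is not a flat 25%. A minimal sketch of how such a baseline could be computed is shown below; the split of option counts is hypothetical, since the actual per-question counts are not given here.

```python
def baseline_rate(option_counts):
    """Expected correct rate when guessing uniformly at random on every question."""
    if not option_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / k for k in option_counts) / len(option_counts)


# Hypothetical split of the 15 MCQs; the real option counts are not given here.
example_counts = [4] * 10 + [5] * 3 + [6] * 2
print(round(baseline_rate(example_counts), 3))
```

The more five- and six-option questions an assessment contains, the lower the baseline falls below 25%.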

Assessment 2

  • Created 208 MCQs to evaluate GPT-3.5
  • Baseline guessing rate is 25%
  • Ran the assessment on GPT-3.5 180 times
  • Performance ranged between 51.1% and 56.9%
  • GPT-3.5 achieved 70% in Business Environment and Concepts, 57% in Auditing and Attestation, 53% in Regulation, and 51% in Financial Accounting and Reporting
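
To put these figures in context, each correct rate can be restated as a spread over the 25% guessing baseline. The sketch below uses only the numbers reported above; the function name is hypothetical.

```python
BASELINE = 25.0  # percent; the baseline guessing rate stated above

# Section-level correct rates (percent) as reported above.
section_rates = {
    "Business Environment and Concepts": 70.0,
    "Auditing and Attestation": 57.0,
    "Regulation": 53.0,
    "Financial Accounting and Reporting": 51.0,
}


def spread_over_baseline(rate, baseline=BASELINE):
    """Percentage points by which a correct rate exceeds random guessing."""
    return round(rate - baseline, 1)


for section, rate in section_rates.items():
    print(f"{section}: +{spread_over_baseline(rate)} points")

# Spread for the overall range of 51.1% to 56.9%.
overall = (spread_over_baseline(51.1), spread_over_baseline(56.9))
print(overall)  # (26.1, 31.9)
```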

GPT model progression

  • Prior work showed improvements in GPT models
  • Table 8 and Figure 2 show similar results for the Bar Exam
  • Text-davinci-001 was able to follow instructions and answer above random chance
  • Spread over random guessing increased from less than 5% to over 30%

Conclusion and future work

  • Developed two assessments of knowledge worker readiness based on AICPA’s Uniform CPA Examination Blueprints
  • Assessment 1 includes quantitative reasoning and calculations
  • Assessment 2 covers foundational skill levels, excluding quantitative reasoning and calculations
  • Assessments cover a broad, practical curriculum including law, finance, accounting, and technology
  • Experimentally evaluated GPT-3.5 on two assessments
  • GPT-3.5 achieved 14.4% correct rate on Assessment 1
  • GPT-3.5 achieved 57% correct rate on Assessment 2
  • GPT-3.5 is approaching or on par with anecdotal test-taker performance
  • GPT-3.5 demonstrates strong non-entailment capabilities and improving explanation capabilities
  • GPT-3.5 has potential to transform quality and efficiency of knowledge work