Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Global economy is increasingly dependent on knowledge workers
AICPA developed Uniform CPA Examination to measure capability readiness for knowledge workers
OpenAI’s text-davinci-003 and prior versions of GPT evaluated on sample Regulation exam and assessment of multiple-choice questions
text-davinci-003 achieves 14.4% correct rate on sample REG exam section, underperforming human capabilities
text-davinci-003 approaching human-level performance on Remembering & Understanding and Application skill levels
Recent generations of GPT-3 demonstrate material improvements, rising from 30% to 57%

Paper Content

Introduction

Knowledge work is an important part of the global economy
Leading management theorists have studied knowledge workers for nearly seven decades
Hundreds of millions to billions of people are considered knowledge workers
Organizations require knowledge workers to demonstrate their preparedness through assessments
Public accounting is a multidisciplinary practice that requires legal, financial, accounting, auditing, technology, and ethical knowledge and skills
The CPA Exam is the most comprehensive assessment of knowledge work readiness
The CPA Exam is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation
AI has not been able to perform knowledge work
Recent research has shown potential to address capability gaps
GPT-3 has demonstrated state-of-the-art performance on a wide range of tasks
GPT-3 was evaluated on the Bar Exam and achieved near-parity with human test-takers
GPT-3 was evaluated on the CPA Exam to evaluate its usefulness for knowledge work
Analysis suggests areas where GPT-3 may be useful and areas where research is still required

Aicpa exam

The Uniform CPA Examination is a computerized assessment based on psychometric and statistical techniques.
It is a dynamic, adaptive exam.
It is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation.
It is designed to assess candidates on their readiness across a broad range of concepts and skill levels.

Data

Research exists on quantitative reasoning with fine-tuning or few-shot contexts
Results in this study are constrained to zero-shot prompts
Two separate assessments are prepared to isolate arithmetic or quantitative capabilities from other elements of the Exam

Assessment 1: sample exam -regulation

Assessment is intended to approximate the real Uniform CPA Examination
Utilize the REG section as it contains the most balanced distribution of skill types and quantitative and qualitative reasoning
Transcribed on January 3rd, 2023, including correct answers
40 test questions across five testlets
15 multiple-choice questions with four to six options each
24 questions require the test-taker to indicate the correct financial amount
1 question requires the test-taker to research authoritative material made available within the exam

Assessment 2: synthetic mcq assessment

The Uniform CPA Examination is organized around Bloom’s cognitive taxonomy.
The taxonomy is divided into six levels.
The AICPA has adapted these skill levels into four simpler groups.
The authors reviewed material from the AICPA, McGraw-Hill Education, and Becker Professional Education.

Methods

Evaluated OpenAI’s models using MCQ assessments
Stripped and reformatted answers for automated scoring
Generated zero-shot prompts for text-davinci-003 API
Fully open-sourced source code and questions for Assessment 2

Prompt engineering and responses

Limited scientific understanding of large language models
Proprietary nature of OpenAI’s models
Writing prompts is referred to as “prompt engineering”
Experimented with answer types, contextualization, and justification in prompt engineering

Model (hyper)parameters

LLMs are sensitive to small changes in inputs and parameters.
Evaluated how altering model parameters impacts performance.
Tested temperature and best of values.

Fine-tuning and historical models

OpenAI provides an API for fine-tuning models
This paper focuses on the zero-shot performance of the model
Fine-tuning at small sample sizes would not improve performance
Fine-tuning can result in unexplained model degradation
Tested with other models provided by OpenAI API

Results

Conducted over 50,000 questions in 700 independent assessment sessions
Performance values summarized in Table 6

Assessment 1

GPT-3.5 performed poorly on Assessment 1, with an average range of 5.7-9.4%.
GPT-3.5 also struggled with arithmetic on the 15 MCQs on Assessment 1, scoring only 4-6% above the baseline rate.
Future research could improve GPT-3.5’s performance by expanding the prompt to include “scratchpads”.

Assessment 2

Created 208 MCQs to evaluate GPT-3.5
Baseline guessing rate is 25%
Assessed GPT-3.5 180 times
Performance ranged between 51.1% and 56.9%
GPT-3.5 achieved 70% in Business Environment and Concepts, 57% in Auditing and Attestation, 53% in Regulation, and 51% in Financial Accounting and Reporting

Gpt model progression

Prior work showed improvements in GPT models
Table 8 and Figure 2 show similar results for Bar Exam
Text-davinci-001 was able to follow instructions and answer above random chance
Spread over random guessing increased from less than 5% to over 30%

Conclusion and future work

Developed two assessments of knowledge worker readiness based on AICPA’s Uniform CPA Examination Blueprints
Assessment 1 includes quantitative reasoning and calculations
Assessment 2 covers foundational skill levels, excluding quantitative reasoning and calculations
Assessments cover a broad, practical curriculum including law, finance, accounting, and technology
Experimentally evaluated GPT-3.5 on two assessments
GPT-3.5 achieved 14.4% correct rate on Assessment 1
GPT-3.5 achieved 57% correct rate on Assessment 2
GPT-3.5 approaching or on par with anecdotal testtaker performance
GPT-3.5 demonstrates strong non-entailment capabilities and improving explanation capabilities
GPT-3.5 has potential to transform quality and efficiency of knowledge work

Link to paper#

Abstract#

Paper Content#

Introduction#

Aicpa exam#

Data#

Assessment 1: sample exam -regulation#

Assessment 2: synthetic mcq assessment#

Methods#

Prompt engineering and responses#

Model (hyper)parameters#

Fine-tuning and historical models#

Results#

Assessment 1#

Assessment 2#

Gpt model progression#

Conclusion and future work#

Link to paper

Abstract

Paper Content

Introduction

Aicpa exam

Data

Assessment 1: sample exam -regulation

Assessment 2: synthetic mcq assessment

Methods

Prompt engineering and responses

Model (hyper)parameters

Fine-tuning and historical models

Results

Assessment 1

Assessment 2

Gpt model progression

Conclusion and future work