Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Global economy is increasingly dependent on knowledge workers
- AICPA developed Uniform CPA Examination to measure capability readiness for knowledge workers
- OpenAI’s
text-davinci-003
and prior versions of GPT evaluated on sample Regulation exam and assessment of multiple-choice questions text-davinci-003
achieves 14.4% correct rate on sample REG exam section, underperforming human capabilitiestext-davinci-003
approaching human-level performance on Remembering & Understanding and Application skill levels- Recent generations of GPT-3 demonstrate material improvements, rising from 30% to 57%
Paper Content
Introduction
- Knowledge work is an important part of the global economy
- Leading management theorists have studied knowledge workers for nearly seven decades
- Hundreds of millions to billions of people are considered knowledge workers
- Organizations require knowledge workers to demonstrate their preparedness through assessments
- Public accounting is a multidisciplinary practice that requires legal, financial, accounting, auditing, technology, and ethical knowledge and skills
- The CPA Exam is the most comprehensive assessment of knowledge work readiness
- The CPA Exam is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation
- AI has not been able to perform knowledge work
- Recent research has shown potential to address capability gaps
- GPT-3 has demonstrated state-of-the-art performance on a wide range of tasks
- GPT-3 was evaluated on the Bar Exam and achieved near-parity with human test-takers
- GPT-3 was evaluated on the CPA Exam to evaluate its usefulness for knowledge work
- Analysis suggests areas where GPT-3 may be useful and areas where research is still required
Aicpa exam
- The Uniform CPA Examination is a computerized assessment based on psychometric and statistical techniques.
- It is a dynamic, adaptive exam.
- It is divided into four sections: Auditing and Attestation, Business Environment and Concepts, Financial Accounting and Reporting, and Regulation.
- It is designed to assess candidates on their readiness across a broad range of concepts and skill levels.
Data
- Research exists on quantitative reasoning with fine-tuning or few-shot contexts
- Results in this study are constrained to zero-shot prompts
- Two separate assessments are prepared to isolate arithmetic or quantitative capabilities from other elements of the Exam
Assessment 1: sample exam -regulation
- Assessment is intended to approximate the real Uniform CPA Examination
- Utilize the REG section as it contains the most balanced distribution of skill types and quantitative and qualitative reasoning
- Transcribed on January 3rd, 2023, including correct answers
- 40 test questions across five testlets
- 15 multiple-choice questions with four to six options each
- 24 questions require the test-taker to indicate the correct financial amount
- 1 question requires the test-taker to research authoritative material made available within the exam
Assessment 2: synthetic mcq assessment
- The Uniform CPA Examination is organized around Bloom’s cognitive taxonomy.
- The taxonomy is divided into six levels.
- The AICPA has adapted these skill levels into four simpler groups.
- The authors reviewed material from the AICPA, McGraw-Hill Education, and Becker Professional Education.
Methods
- Evaluated OpenAI’s models using MCQ assessments
- Stripped and reformatted answers for automated scoring
- Generated zero-shot prompts for text-davinci-003 API
- Fully open-sourced source code and questions for Assessment 2
Prompt engineering and responses
- Limited scientific understanding of large language models
- Proprietary nature of OpenAI’s models
- Writing prompts is referred to as “prompt engineering”
- Experimented with answer types, contextualization, and justification in prompt engineering
Model (hyper)parameters
- LLMs are sensitive to small changes in inputs and parameters.
- Evaluated how altering model parameters impacts performance.
- Tested temperature and best of values.
Fine-tuning and historical models
- OpenAI provides an API for fine-tuning models
- This paper focuses on the zero-shot performance of the model
- Fine-tuning at small sample sizes would not improve performance
- Fine-tuning can result in unexplained model degradation
- Tested with other models provided by OpenAI API
Results
- Conducted over 50,000 questions in 700 independent assessment sessions
- Performance values summarized in Table 6
Assessment 1
- GPT-3.5 performed poorly on Assessment 1, with an average range of 5.7-9.4%.
- GPT-3.5 also struggled with arithmetic on the 15 MCQs on Assessment 1, scoring only 4-6% above the baseline rate.
- Future research could improve GPT-3.5’s performance by expanding the prompt to include “scratchpads”.
Assessment 2
- Created 208 MCQs to evaluate GPT-3.5
- Baseline guessing rate is 25%
- Assessed GPT-3.5 180 times
- Performance ranged between 51.1% and 56.9%
- GPT-3.5 achieved 70% in Business Environment and Concepts, 57% in Auditing and Attestation, 53% in Regulation, and 51% in Financial Accounting and Reporting
Gpt model progression
- Prior work showed improvements in GPT models
- Table 8 and Figure 2 show similar results for Bar Exam
- Text-davinci-001 was able to follow instructions and answer above random chance
- Spread over random guessing increased from less than 5% to over 30%
Conclusion and future work
- Developed two assessments of knowledge worker readiness based on AICPA’s Uniform CPA Examination Blueprints
- Assessment 1 includes quantitative reasoning and calculations
- Assessment 2 covers foundational skill levels, excluding quantitative reasoning and calculations
- Assessments cover a broad, practical curriculum including law, finance, accounting, and technology
- Experimentally evaluated GPT-3.5 on two assessments
- GPT-3.5 achieved 14.4% correct rate on Assessment 1
- GPT-3.5 achieved 57% correct rate on Assessment 2
- GPT-3.5 approaching or on par with anecdotal testtaker performance
- GPT-3.5 demonstrates strong non-entailment capabilities and improving explanation capabilities
- GPT-3.5 has potential to transform quality and efficiency of knowledge work