Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Computational notebooks are used by data scientists to perform data wrangling and analytic tasks.
  • ARCADE is a benchmark of 1082 code generation problems using the pandas data analysis framework.
  • PaChiNCo is a 62B code language model for Python computational notebooks.
  • Few-shot prompting strategies are explored to improve the diversity and explainability of model predictions.

Paper Content

Introduction

  • Data science is the process of extracting insights from data
  • It has become an integral part of decision making and knowledge discovery
  • Data scientists and ML practitioners use computational notebooks
  • Data scientists spend a significant amount of time on data wrangling and EDA
  • Computational notebooks present unique challenges to LLMs
  • Existing datasets usually contain independent tasks with isolated contexts
  • ARCADE is a new benchmark for code generation for data wrangling and EDA tasks in computational notebooks
  • It consists of 1,082 problems spanning across 136 notebooks for 106 ML datasets
  • 67% of problems in ARCADE require complex solutions using 5 or more pandas API calls
  • Annotators are instructed to phrase their intents as they would when interacting with an AI system
  • Ambiguity in NL intents is mitigated by providing extra specification
  • Re-purposing notebooks in the wild for the benchmark is not easy
  • 660 problems are created from scratch using 70 tabular ML datasets
  • Annotators create tasks that require at least 5 pandas API calls

Dataset analysis

  • ARCADE consists of 1,082 NL-to-code problems from 133 notebooks
  • NL intents are often under-specified
  • Around 55% of intents are unambiguous
  • 30% lack description of target columns in output DataFrames
  • 25% imply complex schema
  • 25% have lists of entities as answers
  • 20% require complex structures or post-processing
  • 30% of under-specified intents can be disambiguated by referring to notebook context
  • New Tasks split has more challenging problems than Existing Tasks
  • 67% of New Tasks require at least 5 API calls
  • ARCADE has manually annotated, concise NL intents
  • ARCADE contains more interrelated problems in a single notebook
  • ARCADE challenges LLMs with grounded language understanding
  • Number of pandas APIs used in each problem is on par with DS-1000
  • 70% of problems and notebooks are curated from scratch targeting recent ML datasets
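
The API-count statistics above can be approximated with Python's standard `ast` module. The following is a minimal sketch, not the benchmark's actual tooling: the function name `complexity_stats` is hypothetical, and it counts every method-style call (any `obj.method(...)`), which only approximates "pandas API calls".

```python
import ast


def complexity_stats(code: str) -> dict:
    """Return rough complexity statistics for a Python snippet."""
    tree = ast.parse(code)
    nodes = list(ast.walk(tree))
    # Collect call expressions whose callee is an attribute access,
    # e.g. df.groupby(...). This is a proxy for pandas API usage.
    method_calls = [
        node.func.attr
        for node in nodes
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)
    ]
    return {
        "ast_size": len(nodes),
        "num_method_calls": len(method_calls),
        "methods": sorted(method_calls),
    }


solution = "df.groupby('city')['price'].mean().sort_values(ascending=False).head(5)"
stats = complexity_stats(solution)
print(stats["num_method_calls"], stats["methods"])
```

The snippet is only parsed, never executed, so `df` does not need to exist.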

Evaluation by fuzzy output matching

  • Recent code generation datasets have shifted from surface form evaluation metrics to functional correctness
  • Evaluation methods assume model generates code that strictly follows the signature of unit tests
  • Aim to evaluate accuracy of code LLMs in synthesizing code for data science notebooks
  • Use heuristics to match execution output of predicted program with reference
  • Heuristics cover two scenarios: canonicalizing variables to same type and partial matching between complex DataFrame variables
  • Evaluation metric is reliable in identifying solutions with alternative output structures
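
The two heuristics above can be illustrated with a small sketch. It is an assumption-laden approximation rather than the paper's evaluation code: `to_frame` and `fuzzy_match` are hypothetical names, values are compared in row order, and a prediction is accepted when every reference column appears among its columns.

```python
import pandas as pd


def to_frame(value):
    """Canonicalize common output types to a DataFrame (assumed heuristic)."""
    if isinstance(value, pd.DataFrame):
        return value.reset_index(drop=True)
    if isinstance(value, pd.Series):
        return value.reset_index(drop=True).to_frame()
    if isinstance(value, (list, tuple)):
        return pd.DataFrame({"value": list(value)})
    return pd.DataFrame({"value": [value]})


def fuzzy_match(predicted, reference) -> bool:
    """Partial match: every reference column must occur in the prediction."""
    pred, ref = to_frame(predicted), to_frame(reference)
    if len(pred) != len(ref):
        return False
    # Compare by values only, so differing column names do not matter.
    pred_columns = [pred.iloc[:, i].tolist() for i in range(pred.shape[1])]
    return all(
        ref.iloc[:, i].tolist() in pred_columns for i in range(ref.shape[1])
    )


reference = pd.Series([3, 1], name="count")
predicted = pd.DataFrame({"city": ["A", "B"], "n": [3, 1]})
print(fuzzy_match(predicted, reference))
```

Here the prediction carries an extra `city` column and a different column name, yet it still matches the reference because the reference values are contained in it.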

Experiments

Models and setup

  • PACHINCO is compared to existing large code language models
  • CODEGEN is a code LM trained on NL and code data from GitHub
  • INCODER is a code LM for left-to-right generation and infilling tasks
  • PALM is also evaluated, together with a Python code LM obtained by fine-tuning it on code
  • Inference uses nucleus sampling with a top-p of 0.95 and a temperature of 0.8
  • Performance is measured using the pass@k metric, estimated by drawing 50 samples for each problem
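
For reference, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that pass. The sketch below assumes that estimator is the one intended here; the summary only states that 50 samples are drawn per problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = 50 samples per problem, as in the setup above.
print(pass_at_k(n=50, c=5, k=1))
print(pass_at_k(n=50, c=5, k=30))
```

The benchmark score is then the mean of this value over all problems.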

LM prompting strategies

  • Conducted two types of prompting experiments
  • Prompt consists of notebook context and extra exemplars
  • Three prompt prefixes for each of four different styles
  • Prompt prefixes have four exemplars
  • Little effort was put into prompt engineering
  • Motivation to nudge LM to generate code with different styles
  • Focus on prompting LM to generate code with multi-line, step-by-step decomposition structure
  • Found to improve accuracy of natural language reasoning and program induction tasks
  • Improves diversity of predicted code solutions
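
A prompt of this shape (exemplars plus notebook context plus the current intent) might be assembled as sketched below. The cell delimiter `# In[ ]:`, the ordering of exemplars before context, and the function name `build_prompt` are all assumptions made for illustration; the exemplar shows the step-by-step style as commented sub-steps.

```python
def build_prompt(exemplars, context_cells, intent):
    """Concatenate few-shot exemplars, notebook context, and the new intent."""
    parts = []
    for ex in exemplars:
        # Each exemplar pairs an NL intent with a step-by-step solution.
        parts.append(f"# In[ ]:\n# {ex['intent']}\n{ex['solution']}")
    for cell in context_cells:
        parts.append(f"# In[ ]:\n{cell}")
    # The final cell holds only the intent; the LM completes the code.
    parts.append(f"# In[ ]:\n# {intent}\n")
    return "\n\n".join(parts)


exemplars = [
    {
        "intent": "How many rows are there?",
        "solution": "# Step 1: count the rows\nnum_rows = len(df)",
    }
]
context_cells = ["import pandas as pd", "df = pd.read_csv('data.csv')"]
prompt = build_prompt(
    exemplars, context_cells, "What is the average price per city?"
)
print(prompt)
```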

Results

  • PACHINCO model registers best performance on both Existing and New Tasks splits
  • Prompts are truncated to at most 900 tokens using the PALM tokenizer
  • PACHINCO outperforms most public code LMs
  • Having 3 context cells lifts pass@30 to 72% and 36% on the two splits
  • Results plateau after including 5-7 context cells
  • More context helps reduce schema understanding errors
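
Truncating the notebook context from the left keeps the cells closest to the current problem. The sketch below makes two simplifying assumptions: a whitespace split stands in for the PALM tokenizer, and whole cells are dropped rather than cutting in the middle of a cell. The names `left_truncate` and `whitespace_token_count` are hypothetical.

```python
def whitespace_token_count(text: str) -> int:
    """Stand-in token counter; a real setup would use the model tokenizer."""
    return len(text.split())


def left_truncate(cells, max_tokens=900, count_tokens=whitespace_token_count):
    """Keep the most recent cells whose total token count fits the budget."""
    kept = []
    budget = max_tokens
    # Walk backwards from the cell nearest to the current problem.
    for cell in reversed(cells):
        cost = count_tokens(cell)
        if cost > budget:
            break
        kept.append(cell)
        budget -= cost
    return list(reversed(kept))


cells = ["a b c", "d e", "f"]
print(left_truncate(cells, max_tokens=3))
```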

Few-shot prompting beyond test notebook contexts

  • Diminishing returns after including more notebook context cells
  • Use few-shot prompting with additional exemplars to teach the model to perform challenging tasks
  • Evaluate functional correctness and metrics evaluating the style of model-predicted code
  • PACHINCO used due to prompt length exceeding limit of public code LMs
  • Step-by-Step Prompting (SbS) improves pass@30 over baseline
  • SbS prompting is especially helpful for problems without adequate preceding notebook context
  • SbS prompting leads to more diverse solutions and solution approaches
  • SbS prompting improves accuracy, code style, and explainability
  • Diverse predictions are effective for post-hoc reranking such as self-consistency decoding

Error analysis

  • LLMs make errors on challenging problems on ARCADE
  • Errors decreased after two-stage code fine-tuning
  • 50 randomly sampled incorrect predictions analyzed
  • Errors grouped into 5 categories
  • Complex problems most common cause of errors
  • Errors caused by under-specified intents
  • Fuzzy output matching effective in identifying alternative solutions
  • Step-wise explanations helpful in improving accuracy and understanding code

Related work

  • LLMs have been trained on code
  • Generalist LMs also have source code as part of their training data
  • LLMs have registered impressive performance on code benchmarks
  • Automating the data science lifecycle is needed
  • Automating feature engineering and finding best-performing models is a central theme of AutoML research
  • Automating data wrangling and EDA tasks is also important
  • Context-driven code generation is used to map utterances to programs

Conclusion

  • Presented ARCADE, a code generation benchmark for data wrangling and EDA tasks in computational notebooks
  • Developed PACHINCO, a 62B LM tailored for data science
  • PACHINCO outperforms public code LMs on ARCADE
  • Few-shot learning is used to improve code style and solution diversity

Appendix

  • Annotators are asked to write simple, short intents that describe the desired outputs without much ambiguity
  • Annotators create NL intents and code solutions for interesting data wrangling and EDA tasks
  • Tasks should be complex and require multiple steps and pandas API functions
  • PALM 62B model finetuned on Python data outperforms PALM-CODER 540B model on all benchmarks
  • Notebooks are linearized to Python source code using nbconvert
  • Evaluate PALM 62B model on HumanEval, MBPP and Transcoder
  • At inference time, left-truncate notebook context up to 900 tokens
  • PACHINCO significantly outperforms CODEX-cushman-001 API
  • AST size correlates better with pass@k than number of API usage
  • Using more notebook context cells reduces NameErrors and KeyErrors
  • Few-shot prompting still effective compared to baseline on Existing Tasks
  • Analysis of diversity in solution patterns shows step-by-step prompting leads to increased diversity
  • The top execution error is a KeyError associated with the pandas groupby API call
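
The kind of `KeyError` mentioned above typically arises when generated code refers to a column name that does not exist in the DataFrame, for example because of wrong capitalization. The sketch below reproduces this on toy data and shows one possible guard; the data and the helper `safe_group_mean` are hypothetical and not taken from the benchmark.

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "A", "B"], "price": [10, 20, 30]})


def safe_group_mean(frame, key, value):
    """Check column names against the schema before grouping."""
    missing = [col for col in (key, value) if col not in frame.columns]
    if missing:
        raise ValueError(
            f"unknown column(s) {missing}; available: {list(frame.columns)}"
        )
    return frame.groupby(key)[value].mean()


# A prediction that misremembers the schema ("City" instead of "city").
try:
    df.groupby("City")["price"].mean()
    raised = False
except KeyError:
    raised = True

means = safe_group_mean(df, "city", "price")
print(raised, means.to_dict())
```

This also suggests why additional notebook context helps: cells that display the DataFrame expose the exact column names to the model.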