Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Computational notebooks are used by data scientists to perform data wrangling and analytic tasks.
- ARCADE is a benchmark of 1082 code generation problems using the pandas data analysis framework.
- PaChiNCo is a 62B code language model for Python computational notebooks.
- Few-shot prompting strategies are explored to improve the diversity and explainability of model predictions.
Paper Content
Introduction
- Data science is the process of extracting insights from data
- It has become an integral part of decision making and knowledge discovery
- Data scientists and ML practitioners use computational notebooks
- Data scientists spend significant amount of time on data wrangling and EDA
- Computational notebooks present unique challenges to LLMs
- Existing datasets usually contain independent tasks with isolated contexts
- ARCADE is a new benchmark for code generation for data wrangling and EDA tasks in computational notebooks
- It consists of 1,082 problems spanning across 136 notebooks for 106 ML datasets
- 67% of problems in ARCADE require complex solutions using 5 or more pandas API calls
- Annotators are instructed to frame their intents when interacting with an AI system
- Mitigate ambiguity in NL intents by providing extra specification
- Re-purposing notebooks in the wild for the benchmark is not easy
- 660 problems are created from scratch using 70 tabular ML datasets
- Annotators create tasks that require at least 5 pandas API calls
Dataset analysis
- ARCADE consists of 1,082 NL-to-code problems from 133 notebooks
- NL intents are often under-specified
- Around 55% of intents are unambiguous
- 30% lack description of target columns in output DataFrames
- 25% imply complex schema
- 25% have lists of entities as answers
- 20% require complex structures or post-processing
- 30% of under-specified intents can be disambiguated by referring to notebook context
- New Tasks split has more challenging problems than Existing Tasks
- 67% of New Tasks require at least 5 API calls
- ARCADE has manually annotated, concise NL intents
- ARCADE contains more interrelated problems in a single notebook
- ARCADE challenges LLMs with grounded language understanding
- Number of pandas APIs used in each problem is on par with DS-1000
- 70% of problems and notebooks are curated from scratch targeting recent ML datasets
Evaluation by fuzzy output matching
- Recent code generation datasets have shifted from surface form evaluation metrics to functional correctness
- Evaluation methods assume model generates code that strictly follows the signature of unit tests
- Aim to evaluate accuracy of code LLMs in synthesizing code for data science notebooks
- Use heuristics to match execution output of predicted program with reference
- Heuristics cover two scenarios: canonicalizing variables to same type and partial matching between complex DataFrame variables
- Evaluation metric is reliable in identifying solutions with alternative output structures
Experiments
Models and setup
- PACHINCO is compared to existing large code language models
- CODEGEN is a code LM trained on NL and code data from GitHub
- INCODER is a code LM for left-to-right generation and infilling tasks
- PALM is also evaluated, as well as a Python code LM from code finetuning
- Inference is done using nucleus sampling with a top probability of 0.95 and a temperature 0.8
- Performance is measured using the pass@k metric, estimated by drawing 50 samples for each problem
Lm prompting strategies
- Conducted two types of prompting experiments
- Prompt consists of notebook context and extra exemplars
- Three prompt prefixes for each of four different styles
- Prompt prefixes have four exemplars
- Prompting engineering not put much effort into
- Motivation to nudge LM to generate code with different styles
- Focus on prompting LM to generate code with multi-line, step-by-step decomposition structure
- Found to improve accuracy of natural language reasoning and program induction tasks
- Improves diversity of predicted code solutions
Results
- PACHINCO model registers best performance on both Existing and New Tasks splits
- Truncate prompt up to 900 tokens using PALM tokenizer
- PACHINCO outperforms most public code LMs
- Having 3 context cells lifts pass@30 to 72% and 36% on the two splits
- Results plateau after including 5-7 context cells
- More context helps reduce schema understanding errors
Few-shot prompting beyond test notebook contexts
- Diminishing returns after including more notebook context cells
- Use few-shot prompting with additional exemplars to teach the model to perform challenging task
- Evaluate functional correctness and metrics evaluating the style of model-predicted code
- PACHINCO used due to prompt length exceeding limit of public code LMs
- Step-by-Step Prompting (SbS) improves pass@30 over baseline
- SbS prompting is especially helpful for problems without adequate preceding notebook context
- SbS prompting leads to more diverse solutions
- SbS prompting improves accuracy, code style, and explainability
- SbS prompting produces more diverse solution approaches
- Diverse predictions are effective for post-hoc reranking such as self-consistency decoding
Error analysis
- LLMs make errors on challenging problems on ARCADE
- Errors decreased after two-stage code fine-tuning
- 50 randomly sampled incorrect predictions analyzed
- Errors grouped into 5 categories
- Complex problems most common cause of errors
- Errors caused by under-specified intents
- Fuzzy output matching effective in identifying alternative solutions
- Step-wise explanations helpful in improving accuracy and understanding code
Related works
- LLMs have been trained on code
- Generalist LMs also have source code as part of their training data
- LLMs have registered impressive performance on code benchmarks
- Automating the data science lifecycle is needed
- Automating feature engineering and finding best-performing models is a central theme of AutoML research
- Automating data wrangling and EDA tasks is also important
- Context-driven code generation is used to map utterances to programs
Conclusion
- Presented ARCADE, a code generation benchmark for data wrangling and EDA tasks in computational notebooks
- Developed PACHINCO, a 62B LM tailed for data science
- PACHINCO outperforms public code LMs on AR-CADE
- Few-shot learning to improve code style and solution diversity
- Create simple, short intents while describing the desired outputs without much ambiguity
- Annotators create NL intents and code solutions for interesting data wrangling and EDA tasks
- Tasks should be complex and require multiple steps and pandas API functions
- PALM 62B model finetuned on Python data outperforms PALM-CODER 540B model on all benchmarks
- Linearize notebooks to Python source code using nbconverter
- Evaluate PALM 62B model on HumanEval, MBPP and Transcoder
- At inference time, left-truncate notebook context up to 900 tokens
- PACHINCO significantly outperforms CODEX-cushman-001 API
- AST size correlates better with pass@k than number of API usage
- Using more notebook context cells reduces NameErrors and KeyErrors
- Few-shot prompting still effective compared to baseline on Existing Tasks
- Analysis of diversity in solution patterns shows step-by-step prompting leads to increased diversity
- Top execution error is KeyError associated with pd.groupby API call