Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

LLMs struggle with hierarchical multi-step reasoning like generating complex programs.
Parsel is a framework that enables automatic implementation and validation of complex algorithms with code LLMs.
Parsel can be used across domains requiring hierarchical reasoning, e.g. code synthesis, theorem proving, and robotic planning.
Parsel allows problem-solving with high-level algorithmic designs, benefiting both students and professional programmers.

Paper Content

Introduction

Computer scientists believe it is easier to verify a solution than to write one
Generating solutions is often the bottleneck for large language models
As the number of parts in a program grows, exponentially more samples are needed for a chance of a working program
Human programmers have strategies for implementing complex programs, such as making a high-level algorithmic design
Parsel is a compiler framework to implement and extend a high-level algorithmic design
Parsel aims to perform algorithmic problem-solving by mirroring a common pattern in human reasoning
Parsel can generate code satisfying the original unit tests
Parsel solutions are 76-90% shorter than the original code

Specifying programs in parsel

Developed a simple language for Parsel
Language designed considering programmers, code LLMs, and students
Each line contains a description, constraint, or reference to a previous description
Scoping rules and nuances per target language

Descriptions

Function description is represented as a function name, input arguments, and a description
Function generated from a description can call functions or references below the description in the function graph

Constraints

Constraints are represented as a function’s input values and expected output
Complex constraints can be applied to higher-order functions
Constraints can only be applied following descriptions, not to references
Satisfying constraints to validate a program varies from language to language

References

A reference is the name of a function in the current scope.
A reference allows and encourages the parent function to use the referenced function.
A reference without a description in scope is not valid.
The Collatz conjecture is an example of a reference.
The call graph and its strongly connected components (SCC) can be visualized.

Scoping

Parsel uses indentation to define scope.
Scope of a function includes the set of functions that can be used as a reference.
Indentations between the current function and the referenced function must be decreasing.

Variations due to target language requirements

Different languages have different nuances and representations of constraints.
Different languages have different evaluation functions.
Different languages require different prompts for language models.

Implementing programs in parsel

Compiling a Parsel program requires several steps
High-level pseudocode is provided in Figure 3

Constrained implementation

Generate functions with code LLM and test them to identify an implementation that satisfies constraints
Prompt code language model with description and function signature
Treat implementation as “fixed” and leverage it in functions calling it
User is responsible for ensuring correctness of implementation

Leaves and merges

Parsel implements functions by either querying a code language model or referencing the implementations of its direct children.
Parsel creates a dependency graph and identifies strongly connected components.
Parsel compiles SCCs by finding an implementation string that satisfies the functions’ constraints.

Sequential case

Parsel programs can have functions with constraints and no recursive dependencies.
Parsel implements functions with post-order traversal from the root.
Without any cycles in the call graph, leaf functions are implemented first, then their parents, etc.

Strongly connected components of functions

Parsel requires functions to have accompanying constraints
Recursion requires functions to be implemented jointly with cyclic dependencies
Automatic “elaboration” requires functions to be defined without constraints
Solution is to consider all possible implementations of functions until a valid one is found
Solution is inspired by Cocke & Kennedy (1977)
Strongly connected components of the function call graph are identified
All possible sets of function implementations are considered until one satisfies all constraints
Multiprocessing with a timeout is used
Functions with no constraints are reformulated as a “test dependency graph”

Caching

Reduce number of queries necessary
Keep programs generated mostly stable

Headers

Programs can have headers to create global contexts.
A line of special characters (e.g. “###”) is used to separate the body and text.
This line is passed as a prefix to all prompts.

Automatic function infilling

Function generated by language model may call a function that is not yet implemented
Option to automatically generate and implement it based on its usage
Function incorporated into call graph as unit-test-less child of the function which calls it
Limit number of times process can be applied to a function to avoid infinite recursion and inefficient use of language model quota

Automatic decomposition

Language models are capable of decomposing tasks into steps in natural language.
Parsel has an optional flag that prompts a language model to generate code for any additional subfunctions needed.
Parsel can be used to recursively decompose tasks into steps.

Experiments

Apps dataset

LLMs can generate Parsel programs with few-shot prompts
Existing code in other languages can be used to construct datasets of Parsel programs
Call graph representation is broadly convenient
APPS dataset is a collection of coding challenges with solutions in Python
Parsel solutions can be automatically generated from APPS solutions
Parsel code is substantially shorter than other languages

Case studies

Example function included in Figure 5
Corresponding Python code in appendix
58 non-whitespace lines of code
17 lines of comments, 2 asserts, 39 lines implementing functions
Automatic decomposition discussed in Subsection 3.8

Theorem proving in lean

Framework can generate proofs in formal theorem-proving languages such as Lean
Translated version included in appendix
Running Lean on proof with no errors/warnings indicates proof is correct, but not a guarantee that proof statement matches claim

Robotic planning in virtualhome

Parsel can be used for complex robotic planning
VirtualHome is a simulation environment with objects and agents to interact with
Parsel can generate programs to solve tasks in the VirtualHome environment
Parsel can generate action plans in natural language
Parsel can translate natural language action plans to formal VirtualHome instructions
Parsel can successfully execute and decompose tasks in VirtualHome

Step by step problem solving with lms

Step-by-step reasoning improves LLM performance
Guidance and tool use can further improve performance
Humans tend to provide step-by-step hierarchical descriptions with verification steps
Problems can be learned efficiently when decomposed

Program synthesis

Program synthesis is the challenge of generating programs from high-level specifications
Library learning is a way to make progress on complex programs by inducing a library from simpler synthesis problems
Parsel synthesizes complex programs by decomposing them into smaller functions, but the user specifies the decomposition

Lms for formal environment multi-step planning

Huang et al. and Brohan et al. used language models to generate step-by-step algorithms for robotic agents
Jiang et al. proposed a framework to generate formal proofs from informal proofs
Cheng et al. explored a language model-based evaluation function
Beurer-Kellner et al. explored a related SQL-style LM-querying language
Dohan et al. presented an inference-focused framework for language models

Testing code language model outputs

Assert statements can be used to constrain the generation space of LLMs for code on individual functions
Merrill et al. (2021) proved essential constraints on what can be learned from assertions alone

Implications

Parsel is a natural language compiler framework that bridges the gap between natural language and programming language
Programming language models are limited to individual functions
Parsel allows code language models to stay closer to natural language when generating code
Parsel allows complex but standard methods to be described concisely
Parsel can generate solutions recursively

For students

Traditional programming languages are less robust to variations in wording than natural languages.
Traditional programming languages require many tokens for syntactic details and can take many lines to express something that can be expressed more simply.
LLMs have shown remarkable algorithmic generalization ability.
Conversational code generation has been explored, but primarily focuses on imperative programming structures.

Limitations

Parsel relies on a code language model to generate implementations of individual functions, which can vary in quality depending on the model used and the complexity of the function.
Complex functions can be broken down into simpler ones to improve the quality of the generated code.
Generated code may be less efficient or less readable than manually written code.

Conclusion and future work

Parsel provides a language for robust code generation without the need to evaluate the underlying code
Parsel allows the teaching of algorithmic reasoning with less emphasis on syntax and more emphasis on problem-solving
Parsel allows for a general framework for hierarchical task decomposition
Parsel could be used for automatic unit test generation
Parsel could incorporate ways of varying the “confidence threshold” of the language model
Parsel could incorporate value functions to allow decomposition to be done more methodically
Parsel could reference already-defined functions
Parsel could use a code language model to determine which function to next evaluate
Parsel could incorporate multiple base tools with different kinds of specialized models
Parsel could use a calculator to solve more complex math word problems
Parsel could generate task-specific language model cascades
Parsel could integrate Synchromesh to ensure generated words are possible within the grammar of the given formal language
Parsel could be used as a framework for bootstrapping increasingly complex program generation
Parsel could generate a new cost matrix with a new node corresponding to the sky
Parsel could return an adjacency matrix corresponding to the minimum spanning tree
Parsel could return a list of the nodes connected to the final node
Parsel could return a list of cities that should have airports built in them to minimize the total cost
Parsel could return the longest palindrome in a string
Parsel could return a string that is either “kebab”, “snake”, “camel”, or “none” depending on the input string

Link to paper#

Abstract#

Paper Content#

Introduction#

Specifying programs in parsel#

Descriptions#

Constraints#

References#

Scoping#

Variations due to target language requirements#

Implementing programs in parsel#

Constrained implementation#

Leaves and merges#

Sequential case#

Strongly connected components of functions#

Caching#

Headers#

Automatic function infilling#

Automatic decomposition#

Experiments#

Apps dataset#

Case studies#

Theorem proving in lean#

Robotic planning in virtualhome#

Related works#

Step by step problem solving with lms#

Program synthesis#

Lms for formal environment multi-step planning#

Testing code language model outputs#

Implications#

For students#

Limitations#

Conclusion and future work#

Link to paper

Abstract

Paper Content

Introduction

Specifying programs in parsel

Descriptions

Constraints

References

Scoping

Variations due to target language requirements

Implementing programs in parsel

Constrained implementation

Leaves and merges

Sequential case

Strongly connected components of functions

Caching

Headers

Automatic function infilling

Automatic decomposition

Experiments

Apps dataset

Case studies

Theorem proving in lean

Robotic planning in virtualhome

Related works

Step by step problem solving with lms

Program synthesis

Lms for formal environment multi-step planning

Testing code language model outputs

Implications

For students

Limitations

Conclusion and future work