Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- LLMs struggle with hierarchical multi-step reasoning like generating complex programs.
- Parsel is a framework that enables automatic implementation and validation of complex algorithms with code LLMs.
- Parsel can be used across domains requiring hierarchical reasoning, e.g. code synthesis, theorem proving, and robotic planning.
- Parsel allows problem-solving with high-level algorithmic designs, benefiting both students and professional programmers.
Paper Content
Introduction
- Computer scientists believe it is easier to verify a solution than to write one
- Generating solutions is often the bottleneck for large language models
- As the number of parts in a program grows, exponentially more samples are needed for a chance of a working program
- Human programmers have strategies for implementing complex programs, such as making a high-level algorithmic design
- Parsel is a compiler framework to implement and extend a high-level algorithmic design
- Parsel aims to perform algorithmic problem-solving by mirroring a common pattern in human reasoning
- Parsel can generate code satisfying the original unit tests
- Parsel solutions are 76-90% shorter than the original code
Specifying programs in parsel
- Developed a simple language for Parsel
- Language designed considering programmers, code LLMs, and students
- Each line contains a description, constraint, or reference to a previous description
- Scoping rules and nuances per target language
Descriptions
- Function description is represented as a function name, input arguments, and a description
- Function generated from a description can call functions or references below the description in the function graph
Constraints
- Constraints are represented as a function’s input values and expected output
- Complex constraints can be applied to higher-order functions
- Constraints can only be applied following descriptions, not to references
- Satisfying constraints to validate a program varies from language to language
References
- A reference is the name of a function in the current scope.
- A reference allows and encourages the parent function to use the referenced function.
- A reference without a description in scope is not valid.
- The Collatz conjecture is an example of a reference.
- The call graph and its strongly connected components (SCC) can be visualized.
Scoping
- Parsel uses indentation to define scope.
- Scope of a function includes the set of functions that can be used as a reference.
- Indentations between the current function and the referenced function must be decreasing.
Variations due to target language requirements
- Different languages have different nuances and representations of constraints.
- Different languages have different evaluation functions.
- Different languages require different prompts for language models.
Implementing programs in parsel
- Compiling a Parsel program requires several steps
- High-level pseudocode is provided in Figure 3
Constrained implementation
- Generate functions with code LLM and test them to identify an implementation that satisfies constraints
- Prompt code language model with description and function signature
- Treat implementation as “fixed” and leverage it in functions calling it
- User is responsible for ensuring correctness of implementation
Leaves and merges
- Parsel implements functions by either querying a code language model or referencing the implementations of its direct children.
- Parsel creates a dependency graph and identifies strongly connected components.
- Parsel compiles SCCs by finding an implementation string that satisfies the functions’ constraints.
Sequential case
- Parsel programs can have functions with constraints and no recursive dependencies.
- Parsel implements functions with post-order traversal from the root.
- Without any cycles in the call graph, leaf functions are implemented first, then their parents, etc.
Strongly connected components of functions
- Parsel requires functions to have accompanying constraints
- Recursion requires functions to be implemented jointly with cyclic dependencies
- Automatic “elaboration” requires functions to be defined without constraints
- Solution is to consider all possible implementations of functions until a valid one is found
- Solution is inspired by Cocke & Kennedy (1977)
- Strongly connected components of the function call graph are identified
- All possible sets of function implementations are considered until one satisfies all constraints
- Multiprocessing with a timeout is used
- Functions with no constraints are reformulated as a “test dependency graph”
Caching
- Reduce number of queries necessary
- Keep programs generated mostly stable
Headers
- Programs can have headers to create global contexts.
- A line of special characters (e.g. “###”) is used to separate the body and text.
- This line is passed as a prefix to all prompts.
Automatic function infilling
- Function generated by language model may call a function that is not yet implemented
- Option to automatically generate and implement it based on its usage
- Function incorporated into call graph as unit-test-less child of the function which calls it
- Limit number of times process can be applied to a function to avoid infinite recursion and inefficient use of language model quota
Automatic decomposition
- Language models are capable of decomposing tasks into steps in natural language.
- Parsel has an optional flag that prompts a language model to generate code for any additional subfunctions needed.
- Parsel can be used to recursively decompose tasks into steps.
Experiments
Apps dataset
- LLMs can generate Parsel programs with few-shot prompts
- Existing code in other languages can be used to construct datasets of Parsel programs
- Call graph representation is broadly convenient
- APPS dataset is a collection of coding challenges with solutions in Python
- Parsel solutions can be automatically generated from APPS solutions
- Parsel code is substantially shorter than other languages
Case studies
- Example function included in Figure 5
- Corresponding Python code in appendix
- 58 non-whitespace lines of code
- 17 lines of comments, 2 asserts, 39 lines implementing functions
- Automatic decomposition discussed in Subsection 3.8
Theorem proving in lean
- Framework can generate proofs in formal theorem-proving languages such as Lean
- Translated version included in appendix
- Running Lean on proof with no errors/warnings indicates proof is correct, but not a guarantee that proof statement matches claim
Robotic planning in virtualhome
- Parsel can be used for complex robotic planning
- VirtualHome is a simulation environment with objects and agents to interact with
- Parsel can generate programs to solve tasks in the VirtualHome environment
- Parsel can generate action plans in natural language
- Parsel can translate natural language action plans to formal VirtualHome instructions
- Parsel can successfully execute and decompose tasks in VirtualHome
Related works
Step by step problem solving with lms
- Step-by-step reasoning improves LLM performance
- Guidance and tool use can further improve performance
- Humans tend to provide step-by-step hierarchical descriptions with verification steps
- Problems can be learned efficiently when decomposed
Program synthesis
- Program synthesis is the challenge of generating programs from high-level specifications
- Library learning is a way to make progress on complex programs by inducing a library from simpler synthesis problems
- Parsel synthesizes complex programs by decomposing them into smaller functions, but the user specifies the decomposition
Lms for formal environment multi-step planning
- Huang et al. and Brohan et al. used language models to generate step-by-step algorithms for robotic agents
- Jiang et al. proposed a framework to generate formal proofs from informal proofs
- Cheng et al. explored a language model-based evaluation function
- Beurer-Kellner et al. explored a related SQL-style LM-querying language
- Dohan et al. presented an inference-focused framework for language models
Testing code language model outputs
- Assert statements can be used to constrain the generation space of LLMs for code on individual functions
- Merrill et al. (2021) proved essential constraints on what can be learned from assertions alone
Implications
- Parsel is a natural language compiler framework that bridges the gap between natural language and programming language
- Programming language models are limited to individual functions
- Parsel allows code language models to stay closer to natural language when generating code
- Parsel allows complex but standard methods to be described concisely
- Parsel can generate solutions recursively
For students
- Traditional programming languages are less robust to variations in wording than natural languages.
- Traditional programming languages require many tokens for syntactic details and can take many lines to express something that can be expressed more simply.
- LLMs have shown remarkable algorithmic generalization ability.
- Conversational code generation has been explored, but primarily focuses on imperative programming structures.
Limitations
- Parsel relies on a code language model to generate implementations of individual functions, which can vary in quality depending on the model used and the complexity of the function.
- Complex functions can be broken down into simpler ones to improve the quality of the generated code.
- Generated code may be less efficient or less readable than manually written code.
Conclusion and future work
- Parsel provides a language for robust code generation without the need to evaluate the underlying code
- Parsel allows the teaching of algorithmic reasoning with less emphasis on syntax and more emphasis on problem-solving
- Parsel allows for a general framework for hierarchical task decomposition
- Parsel could be used for automatic unit test generation
- Parsel could incorporate ways of varying the “confidence threshold” of the language model
- Parsel could incorporate value functions to allow decomposition to be done more methodically
- Parsel could reference already-defined functions
- Parsel could use a code language model to determine which function to next evaluate
- Parsel could incorporate multiple base tools with different kinds of specialized models
- Parsel could use a calculator to solve more complex math word problems
- Parsel could generate task-specific language model cascades
- Parsel could integrate Synchromesh to ensure generated words are possible within the grammar of the given formal language
- Parsel could be used as a framework for bootstrapping increasingly complex program generation
- Parsel could generate a new cost matrix with a new node corresponding to the sky
- Parsel could return an adjacency matrix corresponding to the minimum spanning tree
- Parsel could return a list of the nodes connected to the final node
- Parsel could return a list of cities that should have airports built in them to minimize the total cost
- Parsel could return the longest palindrome in a string
- Parsel could return a string that is either “kebab”, “snake”, “camel”, or “none” depending on the input string