Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large language models have enabled progress in multi-step reasoning over text.
- When applied to text generation from semi-structured data, these methods suffer from low semantic coverage, hallucination, and logical inconsistency.
- MURMUR is a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
- MURMUR uses neural and symbolic modules, a grammar, and value functions to generate reasoning paths.
- Experiments on two data-to-text generation tasks show MURMUR obtains significant improvements over baselines and comparable performance to GPT-2.
- Human evaluation shows MURMUR generates highly faithful and correct reasoning paths.
Paper Content
Introduction
- Data-to-text generation is the task of generating summaries of semi-structured data
- Data can be represented in diverse structures, like meaning representations, graphs, or tables
- Text generation from such data is challenging because it requires various reasoning and compositionality skills
- Recent works fine-tune pre-trained language models as the de-facto standard for building supervised data-to-text generation systems
- Few-shot prompting has been successful in multi-step reasoning over text
- Data-to-text generation is posed as multi-step reasoning over data
- Challenges include generation quality and transformation-invariance
- MURMUR is a modular multi-step reasoning approach to text generation from data
- MURMUR has three features: modularity, grammar, and value functions
- MURMUR can perform multi-step generative reasoning on simple to complex semi-structured data-to-text generation tasks
- MURMUR obtains significant improvements in semantic coverage and hallucinations of generated summaries over other few-shot baselines
- MURMUR demonstrates good out-of-domain generalizability
- MURMUR significantly improves the logical consistency of summaries over direct prompting
Definitions: reasoning step and path
- Reasoning Step is a triple (M, X , y) where a module M performs a certain skill by conditioning on an input X to generate an output y
- Reasoning Path is a sequence of Reasoning Steps
- MURMUR consists of four components: modules, grammar, value function(s), and a search algorithm
- MURMUR generates textual summaries from semi-structured data by constructing reasoning paths
- Two data-to-text generation tasks: WebNLG and LogicNLG
- Neural modules for linguistic skills and symbolic modules for logical operations
- Surface Realization module converts structured data to unstructured text
- Text Fusion module combines two pieces of text into a coherent text
- Symbolic modules perform logical operations over tables
Grammar over modules
- Grammar is used to determine possible modules for reasoning steps
- Production rules of grammar define multiple permissible modules
Value functions
- MURMUR introduces value functions to assess the quality of each plausible reasoning step
- Value functions measure fluency and semantic consistency of generated text
- Value functions also assess correctness of intermediate reasoning paths for table-to-text generation
- MURMUR uses a best-first search algorithm to generate reasoning paths
- Algorithm takes modules, grammar, value function, and number of reasoning paths as input
- Algorithm scores, ranks, and selects top-b paths, returns top-p paths and corresponding summaries
Graph-to-text generation
- WebNLG uses RDF triples from DBPedia
- Test split consists of two parts, one with seen categories and one with unseen categories
- Implemented modules as few-shot neural models
- Value function uses fluency score and entailment probability
- Mixing ratio of 0.05 between two scorers
- MURMUR scores and ranks intermediate generations in the queue
Table-to-text generation
- Implemented logical modules with PYTHON functions
- Used BERT-base model to classify tables and partial reasoning paths as correct or incorrect
- Obtained training data from Logic2Text dataset
- Created 1500 correct and incorrect training samples from 221 (table, reasoning path) pairs
- Beam size of search set to 20
Experiments on graph-to-text generation
- MURMUR is compared to several state-of-the-art supervised methods
- Surface realization step is left unchanged
- Saliency metric is removed from MURMUR
- Few-shot methods use 1 randomly chosen demonstration from training data
- BLEU scores are used to compare methods
- Human evaluations are conducted for logical correctness of MURMUR
- All reasoning paths generated by MURMUR are valid due to grammar component
- Performance drops when search algorithm is replaced with fine-tuned BART model
Human evaluation of final generations and intermediate reasoning steps
- Conducted two steps of human evaluations
- Compared final summaries generated by DP and MURMUR
- Evaluated faithfulness and correctness of individual reasoning steps of MURMUR
- Evaluated generation at each reasoning step for grammaticality, module faithfulness, and correctness
- Both modules generate outputs that are almost always grammatical
- Module faithfulness is significantly high
- 64% of fusion generations are fully correct
Effect of number of demonstrations
- DP performance improves with more demonstrations
- MURMUR performance only marginally improves with more demonstrations
- DP implicitly learns step-wise reasoning process
- MURMUR captures reasoning process with one demonstration
- MURMUR is robust to variations in demonstrations
Experiments on table-to-text generation
- LogicNLG dataset was studied
- Results of the study were found
Human evaluation of logical correctness
- Conducted human evaluation to assess logical correctness of generations from Direct Prompting and MURMUR
- 40 randomly chosen generations from 8 different tables were annotated by two NLP experts
- Annotators classified each generation into ungrammatical, incorrect, partially correct, or fully correct
- For correct generations, annotators noted whether they involved any underlying logical operations or were surface realizations of the table content
- Results showed MURMUR generated 26% more correct outputs and 95% of those involved some logical operations
Multi-step reasoning over text
- Recent developments in large language models have enabled progress in few-shot methods for logical reasoning tasks
- Chain-of-thought prompting encourages language models to output intermediate reasoning steps
- Lack of explicit conditioning between steps can lead to unfaithful reasoning
- MURMUR develops granular modules to explicitly condition on previous reasoning steps
- Similar to Selection-Inference architecture
- Concurrent works propose neuro-symbolic approaches for reasoning over text
Modular reasoning over text
- Neural Module Networks (NMN) learn and execute programs over modules.
- Prior works have used text-in text-out modules with input and output data types as strings.
- MURMUR’s modules are a generalization of text-in text-out modules, able to capture operations with diverse signatures.
- Transition from data to text is clearly represented through compositions of modules.
- Interpretability of attention maps-based modules has been debated.
Data-to-text generation
- Supervised methods use seq2seq pre-trained language models
- Pipeline approaches use different modules
- Few-shot methods use data augmentation or retrieving similar examples
- MURMUR uses few-shot neural or symbolic modules without manual intervention
- MURMUR works with as few as one demonstration, no unlabeled corpus needed
Discussion and conclusion
- MURMUR is a neuro-symbolic modular reasoning approach for data-to-text generation
- MURMUR outperforms few-shot baselines and achieves comparable performance to fine-tuned LMs
- MURMUR generates significantly more logical summaries
- MURMUR breaks a task down into sub-problems and solves them through separate modules
- MURMUR utilizes the power of large language models in solving linguistic subtasks
- MURMUR generalizes the concept of modules by treating them as functions
- MURMUR introduces a grammar for explaining module compositions
- MURMUR can be extended for text generation tasks involving multiple modalities
- Training data is created by perturbing a gold path
- Varying the amount of supervision affects metric accuracy and downstream performance
- Direct Prompting and Chain-of-Thought Prompting are used for WebNLG and LogicNLG
- MURMUR generates logically consistent summaries
- MURMUR uses modules to perform logical operations over tables
- Grammars are defined for WebNLG and LogicNLG
- Direct Prompting summaries include logical inconsistencies and hallucinations
- MURMUR generates reasoning paths and converts them to logically consistent summaries
- Few-shot methods are compared on WebNLG and LogicNLG
- Beam size and amount of supervision affect BLEU scores