Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large language models have enabled progress in multi-step reasoning over text.
When applied to text generation from semi-structured data, these methods suffer from low semantic coverage, hallucination, and logical inconsistency.
MURMUR is a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
MURMUR uses neural and symbolic modules, a grammar, and value functions to generate reasoning paths.
Experiments on two data-to-text generation tasks, WebNLG and LogicNLG, show MURMUR obtains significant improvements over few-shot baselines and comparable performance to fine-tuned GPT-2.
Human evaluation shows MURMUR generates highly faithful and correct reasoning paths.

Paper Content

Introduction

Data-to-text generation is the task of generating summaries of semi-structured data
Data can be represented in diverse structures, like meaning representations, graphs, or tables
Text generation from such data is challenging because it requires various reasoning and compositionality skills
Recent works fine-tune pre-trained language models as the de-facto standard for building supervised data-to-text generation systems
Few-shot prompting has been successful in multi-step reasoning over text
Data-to-text generation is posed as multi-step reasoning over data
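Direct few-shot prompting faces two challenges: generation quality (low semantic coverage, hallucination, logical inconsistency) and transformation-invariance (the output should not change under meaning-preserving transformations of the input data)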
MURMUR is a modular multi-step reasoning approach to text generation from data
MURMUR has three features: modularity, grammar, and value functions
MURMUR can perform multi-step generative reasoning on simple to complex semi-structured data-to-text generation tasks
MURMUR obtains significant improvements in semantic coverage and hallucinations of generated summaries over other few-shot baselines
MURMUR demonstrates good out-of-domain generalizability
MURMUR significantly improves the logical consistency of summaries over direct prompting

Definitions: reasoning step and path

A Reasoning Step is a triple (M, X, y) in which a module M performs a certain skill by conditioning on an input X to generate an output y
A Reasoning Path is a sequence of Reasoning Steps
MURMUR consists of four components: modules, grammar, value function(s), and a search algorithm
MURMUR generates textual summaries from semi-structured data by constructing reasoning paths
MURMUR is applied to two data-to-text generation tasks: WebNLG (graph-to-text) and LogicNLG (table-to-text)
Neural modules handle linguistic skills and symbolic modules handle logical operations
Surface Realization module converts structured data to unstructured text
Text Fusion module combines two pieces of text into a coherent text
Symbolic modules perform logical operations over tables
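The definitions above can be illustrated with a minimal Python sketch. The class name `ReasoningStep`, the helper `run_step`, and the two toy module functions are hypothetical stand-ins introduced here for illustration; in MURMUR the neural modules are few-shot language models, not string templates.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class ReasoningStep:
    """One reasoning step (M, X, y): module M maps input X to output y."""
    module: str              # name of the module M
    inputs: Tuple[Any, ...]  # input X
    output: Any              # output y


def run_step(name: str, fn: Callable[..., Any], *inputs: Any) -> ReasoningStep:
    """Execute a module on its inputs and record the resulting step."""
    return ReasoningStep(module=name, inputs=inputs, output=fn(*inputs))


# Toy template-based stand-ins for the neural modules (assumption).
def surface_realization(triple: Tuple[str, str, str]) -> str:
    subj, rel, obj = triple
    return f"{subj} {rel} {obj}."


def text_fusion(left: str, right: str) -> str:
    return f"{left.rstrip('.')}, and {right}"


def build_path() -> List[ReasoningStep]:
    """A reasoning path is simply an ordered list of reasoning steps."""
    s1 = run_step("surface_realization", surface_realization,
                  ("Alan Bean", "was born in", "Wheeler"))
    s2 = run_step("surface_realization", surface_realization,
                  ("Alan Bean", "worked as", "a test pilot"))
    # The fusion step conditions explicitly on the outputs of earlier steps.
    s3 = run_step("text_fusion", text_fusion, s1.output, s2.output)
    return [s1, s2, s3]
```

Recording the inputs of every step makes the path inspectable: each intermediate output can be traced back to the module and data that produced it.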

Grammar over modules

Grammar is used to determine possible modules for reasoning steps
Production rules of grammar define multiple permissible modules
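One way to picture such a grammar is as a table of typed production rules. The sketch below is an assumption made for illustration: the rule table `RULES`, the type names `TRIPLE` and `TEXT`, and the function `permissible_modules` do not come from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical production rules: module -> (required input types, output type).
RULES: Dict[str, Tuple[Tuple[str, ...], str]] = {
    "surface_realization": (("TRIPLE",), "TEXT"),
    "text_fusion": (("TEXT", "TEXT"), "TEXT"),
}


def permissible_modules(available: List[str]) -> List[str]:
    """Return the modules whose input types are all present among the
    currently available items (counting multiplicity)."""
    have = Counter(available)
    allowed = []
    for module, (input_types, _output_type) in RULES.items():
        need = Counter(input_types)
        if all(have[t] >= n for t, n in need.items()):
            allowed.append(module)
    return allowed
```

Because only grammar-approved modules are ever proposed, any path assembled this way is well-typed by construction.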

Value functions

MURMUR introduces value functions to assess the quality of each plausible reasoning step
Value functions measure fluency and semantic consistency of generated text
Value functions also assess correctness of intermediate reasoning paths for table-to-text generation
MURMUR uses a best-first search algorithm to generate reasoning paths
The algorithm takes the modules, grammar, value function, and number of reasoning paths as input
At each step it scores and ranks candidate paths and keeps the top b; at the end it returns the top p paths and their corresponding summaries
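The search loop can be sketched as follows. This is a simplified, level-by-level variant written for illustration, not the paper's exact algorithm; the function `search` and its parameters `expand`, `is_complete`, `value`, `beam_size`, `num_paths`, and `max_steps` are hypothetical names. In MURMUR, `expand` would apply grammar-permitted modules and `value` would be the value function.

```python
import heapq
from typing import Callable, List, Sequence, Tuple

Path = Tuple[str, ...]


def search(
    start: Path,
    expand: Callable[[Path], Sequence[Path]],
    is_complete: Callable[[Path], bool],
    value: Callable[[Path], float],
    beam_size: int,
    num_paths: int,
    max_steps: int = 10,
) -> List[Path]:
    """Expand paths step by step, keep the top `beam_size` partial paths
    at each step, and return the top `num_paths` complete paths."""
    frontier: List[Path] = [start]
    finished: List[Path] = []
    for _ in range(max_steps):
        candidates: List[Path] = []
        for path in frontier:
            for nxt in expand(path):
                if is_complete(nxt):
                    finished.append(nxt)
                else:
                    candidates.append(nxt)
        if not candidates:
            break
        # Score, rank, and keep only the best partial paths.
        frontier = heapq.nlargest(beam_size, candidates, key=value)
    return heapq.nlargest(num_paths, finished, key=value)


# Toy usage: paths are tuples of tokens, complete at length 3,
# and the value function counts occurrences of "a".
def toy_expand(path: Path) -> List[Path]:
    return [path + ("a",), path + ("b",)]


best = search(
    start=(),
    expand=toy_expand,
    is_complete=lambda p: len(p) == 3,
    value=lambda p: float(p.count("a")),
    beam_size=2,
    num_paths=2,
)
```

A small beam trades completeness for speed: low-scoring partial paths are discarded early and cannot be recovered later.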

Graph-to-text generation

WebNLG uses RDF triples from DBPedia
Test split consists of two parts, one with seen categories and one with unseen categories
Modules are implemented as few-shot neural models
The value function combines a fluency score and an entailment probability
The two scorers are combined with a mixing ratio of 0.05
MURMUR scores and ranks intermediate generations in the queue
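A convex combination is one plausible reading of the mixing ratio. The sketch below assumes the ratio weights the fluency score and the remainder weights the entailment probability; this summary does not say which scorer receives 0.05, so that assignment, along with the names `combined_score` and `rank`, is an assumption.

```python
from typing import List, Tuple


def combined_score(fluency: float, entailment: float, alpha: float = 0.05) -> float:
    """Mix two scores in [0, 1]. Assumption: alpha weights fluency and
    (1 - alpha) weights the entailment probability."""
    return alpha * fluency + (1.0 - alpha) * entailment


def rank(candidates: List[Tuple[str, float, float]], alpha: float = 0.05) -> List[str]:
    """Sort (text, fluency, entailment) candidates from best to worst."""
    ordered = sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], alpha),
        reverse=True,
    )
    return [text for text, _, _ in ordered]
```

Under this assumed weighting, a fluent but unsupported generation ranks below a plainer one that is entailed by the input.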

Table-to-text generation

Logical modules are implemented as Python functions
A BERT-base model is used to classify tables and partial reasoning paths as correct or incorrect
Training data is obtained from the Logic2Text dataset
1500 correct and incorrect training samples are created from 221 (table, reasoning path) pairs
The beam size of the search is set to 20
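To give a feel for what symbolic modules over tables might look like, here is a minimal sketch. The function names `filter_eq`, `count`, `argmax`, and `hop`, and the toy table, are hypothetical and chosen for illustration; they are not the paper's module inventory.

```python
from typing import Dict, List, Union

Cell = Union[str, int, float]
Row = Dict[str, Cell]
Table = List[Row]


def filter_eq(table: Table, column: str, value: Cell) -> Table:
    """Keep the rows whose `column` equals `value`."""
    return [row for row in table if row[column] == value]


def count(table: Table) -> int:
    """Number of rows in the (possibly filtered) table."""
    return len(table)


def argmax(table: Table, column: str) -> Row:
    """Row with the largest value in `column`."""
    return max(table, key=lambda row: row[column])


def hop(row: Row, column: str) -> Cell:
    """Read one cell from a row."""
    return row[column]


# Toy table with made-up values.
toy_table: Table = [
    {"team": "red", "wins": 10, "city": "north"},
    {"team": "blue", "wins": 14, "city": "south"},
    {"team": "green", "wins": 7, "city": "north"},
]

north_teams = count(filter_eq(toy_table, "city", "north"))
top_team = hop(argmax(toy_table, "wins"), "team")
```

Because these operations are executed rather than generated, their results are exact; a neural module can then verbalize them, for example as "Two teams are from the north".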

Experiments on graph-to-text generation

MURMUR is compared to several state-of-the-art supervised methods
Surface realization step is left unchanged
Saliency metric is removed from MURMUR
Few-shot methods use 1 randomly chosen demonstration from training data
BLEU scores are used to compare methods
Human evaluations are conducted for logical correctness of MURMUR
All reasoning paths generated by MURMUR are valid because of the grammar component
Performance drops when the search algorithm is replaced with a fine-tuned BART model

Human evaluation of final generations and intermediate reasoning steps

Conducted two steps of human evaluations
Compared final summaries generated by Direct Prompting (DP) and MURMUR
Evaluated faithfulness and correctness of individual reasoning steps of MURMUR
Evaluated generation at each reasoning step for grammaticality, module faithfulness, and correctness
Both modules (Surface Realization and Text Fusion) generate outputs that are almost always grammatical
Module faithfulness is high
64% of fusion generations are fully correct

Effect of number of demonstrations

DP performance improves with more demonstrations
MURMUR performance only marginally improves with more demonstrations
DP implicitly learns step-wise reasoning process
MURMUR captures reasoning process with one demonstration
MURMUR is robust to variations in demonstrations

Experiments on table-to-text generation

MURMUR is evaluated on the LogicNLG dataset and compared with few-shot baselines

Human evaluation of logical correctness

Conducted human evaluation to assess logical correctness of generations from Direct Prompting and MURMUR
40 randomly chosen generations from 8 different tables were annotated by two NLP experts
Annotators classified each generation into ungrammatical, incorrect, partially correct, or fully correct
For correct generations, annotators noted whether they involved any underlying logical operations or were surface realizations of the table content
Results showed MURMUR generated 26% more correct outputs than Direct Prompting, and 95% of those involved some logical operations

Multi-step reasoning over text

Recent developments in large language models have enabled progress in few-shot methods for logical reasoning tasks
Chain-of-thought prompting encourages language models to output intermediate reasoning steps
Lack of explicit conditioning between steps can lead to unfaithful reasoning
MURMUR develops granular modules to explicitly condition on previous reasoning steps
This design is similar to the Selection-Inference architecture
Concurrent works propose neuro-symbolic approaches for reasoning over text

Modular reasoning over text

Neural Module Networks (NMN) learn and execute programs over modules.
Prior works have used text-in text-out modules with input and output data types as strings.
MURMUR’s modules are a generalization of text-in text-out modules, able to capture operations with diverse signatures.
Transition from data to text is clearly represented through compositions of modules.
The interpretability of attention-map-based modules has been debated.

Data-to-text generation

Supervised methods use seq2seq pre-trained language models
Pipeline approaches use different modules
Few-shot methods use data augmentation or retrieving similar examples
MURMUR uses few-shot neural or symbolic modules without manual intervention
MURMUR works with as few as one demonstration, no unlabeled corpus needed

Discussion and conclusion

MURMUR is a neuro-symbolic modular reasoning approach for data-to-text generation
MURMUR outperforms few-shot baselines and achieves comparable performance to fine-tuned LMs
MURMUR generates significantly more logical summaries
MURMUR breaks a task down into sub-problems and solves them through separate modules
MURMUR utilizes the power of large language models in solving linguistic subtasks
MURMUR generalizes the concept of modules by treating them as functions
MURMUR introduces a grammar for explaining module compositions
MURMUR can be extended for text generation tasks involving multiple modalities
Training data is created by perturbing a gold path
Varying the amount of supervision affects metric accuracy and downstream performance
Direct Prompting and Chain-of-Thought Prompting are used for WebNLG and LogicNLG
MURMUR generates logically consistent summaries
MURMUR uses modules to perform logical operations over tables
Grammars are defined for WebNLG and LogicNLG
Direct Prompting summaries include logical inconsistencies and hallucinations
MURMUR generates reasoning paths and converts them to logically consistent summaries
Few-shot methods are compared on WebNLG and LogicNLG
Beam size and amount of supervision affect BLEU scores
