Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Large language models can generate step-by-step reasoning to justify their final answers.
It is difficult to objectively study the correctness of the reasoning steps without reliable methods.
ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality.
ROSCOE is empirically verified on five human annotated and six programmatically perturbed diagnostics datasets.

Paper Content

Introduction

Scaling language models has improved performance on NLP benchmarks
Large language models (LLMs) perform well as few-shot learners
LLMs struggle with tasks including math word problems, symbolic manipulation, and commonsense reasoning
Recent work has shown that prompting or fine-tuning LLMs can lead to improvements on reasoning tasks
ROSCOE is a suite of interpretable and fine-grained step-by-step generation evaluation metrics

Contributions.

Proposed new taxonomy for reasoning errors
Proposed new suite of metrics for step-by-step reasoning analysis
Comparative analysis of 11 datasets of varied complex reasoning problems

Free-form natural language explanations of model decisions should enable accurate representation of the reasoning process.
Qualitative assessment of NL explanations with correctness labels collected from human judges was presented in (Camburu et al., 2018).
Automatic metrics for NLG evaluation exist, but are not equipped to measure logical inconsistencies or information gain.
Human evaluation is expensive, domain specific, and time-consuming.

Error type definition

Grammar mistakes
Factual inaccuracies

Hallucination

Problem statement does not provide necessary information
Explanation contains redundant information

Missing step

Content of generated reasoning is incomplete and lacks required information to produce correct answer.
Alignment scores measure grounding of token and step-wise reasoning with respect to source text.
BARTScore measures probability of generated text from source to target.
CTC unifies different perspectives of tasks into information alignment.

Coherency

Steps can contradict each other or not make sense
Commonsense Model lacks knowledge of general world
Two metrics introduced to evaluate similarity between texts
Metrics focus on determining type of error in reasoning path

Semantic alignment metrics (roscoe-sa)

ROSCOE semantic alignment 2 metrics measure the grounding of step-wise reasoning with respect to the source text
12 error perturbation rules are applied to construct diagnostics datasets
Human judged datasets are selected from complex reasoning tasks
GPT-3 LLM is prompted with few-shot in-context examples to obtain step-by-step reasoning sequences
Error types in a taxonomy are used as human evaluation perspectives of reasoning errors
SimCSE is finetuned on multi-step reasoning datasets
Text generation evaluation metrics are used as baseline metrics
Somers’ D is used to meta-evaluate each scorer against synthetic and human scores

Experimental results

ROSCOE outperforms all other reference-free methods on 6 diagnostic datasets
ROSCOE-SS gains are more pronounced in 4 out of 6 datasets
ROSCOE is specifically focused on reference-free settings
Finetuning SimCSE gives highest improvements on ASDIV dataset
Repetition* scores can separate perturbed and non-perturbed chains
Finetuning helps to improve ROSCOE scores

Analysis

ROSCOE metrics are evaluated for sensitivity to level of errors
Errors are injected into MATH and EntailmentBank datasets
ROSCOE-SA and ROSCOE-SS show consistent behavior
Baseline metrics perform better on EntailmentBank
ROSCOE-LC and baseline metrics have minimal impact on MATH
Scores are not dataset-agnostic and require calibration for drifts

Conclusion

Introduced ROSCOE, a suite of interpretable, unsupervised metrics for evaluating step-by-step reasoning generations of LMs

Link to paper#

Abstract#

Paper Content#

Introduction#

Contributions.#

Related work#

Error type definition#

Hallucination#

Missing step#

Coherency#

Semantic alignment metrics (roscoe-sa)#

Experimental results#

Analysis#

Conclusion#

Link to paper

Abstract

Paper Content

Introduction

Contributions.

Related work

Error type definition

Hallucination

Missing step

Coherency

Semantic alignment metrics (roscoe-sa)

Experimental results

Analysis

Conclusion