Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Large language models can generate step-by-step reasoning to justify their final answers.
  • It is difficult to objectively study the correctness of the reasoning steps without reliable methods.
  • ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
  • ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality.
  • ROSCOE is empirically verified on five human annotated and six programmatically perturbed diagnostics datasets.

Paper Content

Introduction

  • Scaling language models has improved performance on NLP benchmarks
  • Large language models (LLMs) perform well as few-shot learners
  • LLMs struggle with tasks including math word problems, symbolic manipulation, and commonsense reasoning
  • Recent work has shown that prompting or fine-tuning LLMs can lead to improvements on reasoning tasks
  • ROSCOE is a suite of interpretable and fine-grained step-by-step generation evaluation metrics

Contributions.

  • Proposed new taxonomy for reasoning errors
  • Proposed new suite of metrics for step-by-step reasoning analysis
  • Comparative analysis of 11 datasets of varied complex reasoning problems
  • Free-form natural language explanations of model decisions should enable accurate representation of the reasoning process.
  • Qualitative assessment of NL explanations with correctness labels collected from human judges was presented in (Camburu et al., 2018).
  • Automatic metrics for NLG evaluation exist, but are not equipped to measure logical inconsistencies or information gain.
  • Human evaluation is expensive, domain specific, and time-consuming.

Error type definition

  • Grammar mistakes
  • Factual inaccuracies

Hallucination

  • Problem statement does not provide necessary information
  • Explanation contains redundant information

Missing step

  • Content of generated reasoning is incomplete and lacks required information to produce correct answer.
  • Alignment scores measure grounding of token and step-wise reasoning with respect to source text.
  • BARTScore measures probability of generated text from source to target.
  • CTC unifies different perspectives of tasks into information alignment.

Coherency

  • Steps can contradict each other or not make sense
  • Commonsense Model lacks knowledge of general world
  • Two metrics introduced to evaluate similarity between texts
  • Metrics focus on determining type of error in reasoning path

Semantic alignment metrics (roscoe-sa)

  • ROSCOE semantic alignment 2 metrics measure the grounding of step-wise reasoning with respect to the source text
  • 12 error perturbation rules are applied to construct diagnostics datasets
  • Human judged datasets are selected from complex reasoning tasks
  • GPT-3 LLM is prompted with few-shot in-context examples to obtain step-by-step reasoning sequences
  • Error types in a taxonomy are used as human evaluation perspectives of reasoning errors
  • SimCSE is finetuned on multi-step reasoning datasets
  • Text generation evaluation metrics are used as baseline metrics
  • Somers’ D is used to meta-evaluate each scorer against synthetic and human scores

Experimental results

  • ROSCOE outperforms all other reference-free methods on 6 diagnostic datasets
  • ROSCOE-SS gains are more pronounced in 4 out of 6 datasets
  • ROSCOE is specifically focused on reference-free settings
  • Finetuning SimCSE gives highest improvements on ASDIV dataset
  • Repetition* scores can separate perturbed and non-perturbed chains
  • Finetuning helps to improve ROSCOE scores

Analysis

  • ROSCOE metrics are evaluated for sensitivity to level of errors
  • Errors are injected into MATH and EntailmentBank datasets
  • ROSCOE-SA and ROSCOE-SS show consistent behavior
  • Baseline metrics perform better on EntailmentBank
  • ROSCOE-LC and baseline metrics have minimal impact on MATH
  • Scores are not dataset-agnostic and require calibration for drifts

Conclusion

  • Introduced ROSCOE, a suite of interpretable, unsupervised metrics for evaluating step-by-step reasoning generations of LMs