Abstract
- Large language models can generate step-by-step reasoning to justify their final answers.
- It is difficult to objectively study the correctness of the reasoning steps without reliable methods.
- ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
- ROSCOE can measure semantic consistency, logicality, informativeness, fluency, and factuality.
- ROSCOE is empirically verified on five human-annotated and six programmatically perturbed diagnostic datasets.
Paper Content
Introduction
- Scaling language models has improved performance on NLP benchmarks
- Large language models (LLMs) perform well as few-shot learners
- LLMs struggle with tasks including math word problems, symbolic manipulation, and commonsense reasoning
- Recent work has shown that prompting or fine-tuning LLMs can lead to improvements on reasoning tasks
- ROSCOE is a suite of interpretable and fine-grained step-by-step generation evaluation metrics
Contributions.
- Proposed new taxonomy for reasoning errors
- Proposed new suite of metrics for step-by-step reasoning analysis
- Comparative analysis of 11 datasets of varied complex reasoning problems
Related work
- Free-form natural language explanations of model decisions should enable accurate representation of the reasoning process.
- A qualitative assessment of NL explanations, with correctness labels collected from human judges, was presented by Camburu et al. (2018).
- Automatic metrics for NLG evaluation exist, but are not equipped to measure logical inconsistencies or information gain.
- Human evaluation is expensive, domain specific, and time-consuming.
Error type definition
- Grammar mistakes
- Factual inaccuracies
Hallucination
- Problem statement does not provide necessary information
- Explanation contains redundant information
Missing step
- Content of generated reasoning is incomplete and lacks required information to produce correct answer.
- Alignment scores measure grounding of token and step-wise reasoning with respect to source text.
- BARTScore measures probability of generated text from source to target.
- CTC unifies different perspectives of tasks into information alignment.
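As a rough illustration of an alignment score, the sketch below scores each reasoning step by its best cosine match among the source sentences and averages over the steps. The toy embedding vectors, the function names `cosine`, `step_alignment`, and `chain_alignment`, and the rescaling of cosine similarity to [0, 1] are assumptions made for illustration, not necessarily the paper's exact definitions. In practice the vectors would come from a sentence encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def step_alignment(step_vec, source_vecs):
    """Best match of one step against the source sentences, rescaled to [0, 1]."""
    best = max(cosine(step_vec, s) for s in source_vecs)
    return (1.0 + best) / 2.0

def chain_alignment(step_vecs, source_vecs):
    """Mean step alignment over the whole reasoning chain."""
    scores = [step_alignment(h, source_vecs) for h in step_vecs]
    return sum(scores) / len(scores)

# Hypothetical 2-d embeddings standing in for encoder outputs.
source = [[1.0, 0.0], [0.0, 1.0]]
grounded_chain = [[1.0, 0.0], [0.0, 2.0]]      # every step matches a source sentence
ungrounded_chain = [[1.0, 0.0], [-1.0, -1.0]]  # second step points away from the source

grounded_score = chain_alignment(grounded_chain, source)
ungrounded_score = chain_alignment(ungrounded_chain, source)
```

With these toy vectors the grounded chain scores higher than the ungrounded one, which is the behaviour an alignment-based faithfulness score is meant to capture.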
Coherency
- Steps can contradict each other or not make sense
- Commonsense: the model lacks general knowledge about the world
- Two metrics introduced to evaluate similarity between texts
- Metrics focus on determining type of error in reasoning path
Semantic alignment metrics (roscoe-sa)
- ROSCOE semantic alignment metrics measure the grounding of step-wise reasoning with respect to the source text
- 12 error perturbation rules are applied to construct diagnostics datasets
- Human-judged datasets are selected from complex reasoning tasks
- GPT-3 LLM is prompted with few-shot in-context examples to obtain step-by-step reasoning sequences
- Error types in a taxonomy are used as human evaluation perspectives of reasoning errors
- SimCSE is finetuned on multi-step reasoning datasets
- Text generation evaluation metrics are used as baseline metrics
- Somers’ D is used to meta-evaluate each scorer against synthetic and human scores
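Somers' D measures the ordinal association between a metric's scores and the reference labels. A minimal pure-Python sketch of one common form, D(Y|X) = (concordant - discordant) / (pairs not tied on X), is given below. The function name `somers_d`, the choice of the label as the conditioning variable, and the toy data are assumptions for illustration; the paper's evaluation code may differ in direction or tie handling.

```python
def somers_d(x, y):
    """Somers' D of y given x: (concordant - discordant) / pairs not tied on x."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = untied_x = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                continue  # pairs tied on x are excluded from the denominator
            untied_x += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
            # pairs tied only on y count in the denominator, not the numerator
    if untied_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_x

# Toy data: label 1 = chain judged correct, 0 = chain contains an error.
labels = [0, 0, 1, 1]
scores = [0.2, 0.6, 0.5, 0.9]
d = somers_d(labels, scores)
```

A value near 1 means the scorer ranks correct chains above erroneous ones almost always, a value near 0 means the ranking is close to uninformative, and a negative value means the ranking is inverted.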
Experimental results
- ROSCOE outperforms all other reference-free methods on 6 diagnostic datasets
- ROSCOE-SS gains are more pronounced in 4 out of 6 datasets
- ROSCOE is specifically focused on reference-free settings
- Finetuning SimCSE gives highest improvements on ASDIV dataset
- Repetition* scores can separate perturbed and non-perturbed chains
- Finetuning helps to improve ROSCOE scores
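One way to see why a repetition score can separate perturbed from unperturbed chains is to compare every step with all earlier steps and penalise the closest match. The sketch below does this with toy vectors. The function name `repetition_step`, the embeddings, and the rescaling to [0, 1] are illustrative assumptions rather than the paper's exact definition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def repetition_step(step_vecs):
    """Lower values mean some step closely repeats an earlier one (exact repeat -> 0.0)."""
    if len(step_vecs) < 2:
        return 1.0  # assumption: a single step cannot repeat anything
    # Highest similarity between any step and any step that precedes it.
    closest = max(
        cosine(step_vecs[i], step_vecs[j])
        for i in range(1, len(step_vecs))
        for j in range(i)
    )
    return (1.0 - closest) / 2.0

# Hypothetical step embeddings; the perturbed chain repeats the second step.
clean = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
perturbed = clean + [clean[1]]

clean_score = repetition_step(clean)
perturbed_score = repetition_step(perturbed)
```

Here the duplicated step drives the perturbed chain's score to the bottom of the range, while the clean chain stays higher.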
Analysis
- ROSCOE metrics are evaluated for sensitivity to level of errors
- Errors are injected into MATH and EntailmentBank datasets
- ROSCOE-SA and ROSCOE-SS show consistent behavior
- Baseline metrics perform better on EntailmentBank
- ROSCOE-LC and baseline metrics are minimally affected by the error level on MATH
- Scores are not dataset-agnostic and need calibration to account for drift across datasets
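The sensitivity analysis can be pictured as injecting a controlled number of errors into a reference chain and then checking how a score responds as the count grows. The sketch below uses a single hypothetical perturbation (duplicating a step). The function name `inject_repeats`, the seed handling, and the example chain are assumptions, and the paper's perturbation rules cover more error types than this.

```python
import random

def inject_repeats(steps, num_errors, seed=0):
    """Return a copy of `steps` with `num_errors` duplicated steps inserted."""
    if num_errors > 0 and not steps:
        raise ValueError("cannot perturb an empty chain")
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    perturbed = list(steps)    # work on a copy, leave the reference intact
    for _ in range(num_errors):
        idx = rng.randrange(len(perturbed))
        # Repeat the chosen step immediately after itself.
        perturbed.insert(idx + 1, perturbed[idx])
    return perturbed

# Hypothetical reference chain for a simple word problem.
reference = ["Tom has 3 apples.", "He buys 2 more.", "So he has 5 apples."]

# One perturbed variant per error level, from 0 (unchanged) to 3 injected errors.
variants = {k: inject_repeats(reference, k) for k in range(4)}
```

Scoring each entry of `variants` with a metric and plotting the score against the number of injected errors gives a simple sensitivity curve for that metric.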
Conclusion
- Introduced ROSCOE, a suite of interpretable, unsupervised metrics for evaluating step-by-step reasoning generations of LMs