Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Logical reasoning of text is an important ability that requires understanding of the information present in the text and their interconnections.
Prior works on improving logical reasoning ability of language models require complex processing of training data.
We propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities.
We select a subset of Wikipedia for continued pretraining of a language model.
We use two self-supervised loss functions to teach the model to distinguish between entailment and contradiction types of sentences.
We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets.

Paper Content

Introduction

Logical reasoning is an important ability for humans and for text understanding
Logical reasoning is important for downstream tasks such as open-domain question answering and machine reading comprehension
Logical keywords are essential for understanding the context
Recently, there has been an increasing focus on evaluating the logical reasoning abilities of language models
Large pre-trained language models (PLMs) are increasingly being used across a wide variety of real-world tasks
There have been some recent works on improving the logical reasoning abilities of PLMs
These works typically generate a dataset containing symbolic structures and train the LM using custom loss objectives
These complex processing steps usually require task-specific design choices
APOLLO is a continued pretraining-based approach to inject logical reasoning abilities in language models
APOLLO uses a set of logical inference keywords to select sentences from a text corpus
APOLLO masks words based on their parts-of-speech tags
APOLLO adds a sentence-level classification loss to predict entailment or contradiction
APOLLO is task format and downstream dataset agnostic
APOLLO achieves state-of-the-art performance on LogiQA and comparable performance on ReClor

Method

Collect dataset of reasoning-related sentences using keyword-based selection strategy
Filter Wikipedia using logical keywords to create IMPLICATION dataset
Pretrain model using two loss objectives: S-MLM and E-CLS
Fine-tune model on task-specific training dataset

Dataset selection

PLMs are trained on internet data and finetuned on downstream datasets
Aim is to teach PLMs generalizable logical reasoning abilities
Training data should contain more logical sentences
IMPLICATION dataset created by selecting specific keywords with positive/negative implications
Keywords used to filter Wikipedia and increase probability of logically rich sentences

Loss function design

S-MLM is a modified version of MLM loss used in BERT
MLM masks tokens in a sentence and the model learns to predict them
Not all masked tokens require the same degree of reasoning to predict
APOLLO simplifies the problem by considering tokens with specific POS tags
E-CLS predicts whether a sentence contains reasoning aspects of entailment or contradiction
E-CLS labels are bootstrapped using a heuristic of checking the type of implication keyword present in the sentence

Continued pretraining

APOLLO uses a multi-task loss for continued pretraining
The multi-task loss consists of S-MLM and E-CLS losses, weighted equally
No need to add MLM loss to avoid catastrophic forgetting, as S-MLM is close to MLM objective

Finetuning

Loss functions are not specific to a task format
Follow Devlin et al. (2019) and add a MLP layer to the pretrained model
Finetune the combined model on the downstream dataset using cross-entropy loss

Experimental setup

Described details of datasets used to evaluate APOLLO
Described baselines compared to APOLLO
Described implementation details of training procedure

Datasets

Evaluated APOLLO on two logical reasoning datasets: ReClor and LogiQA
ReClor has 4,638/500/1,000 instances in train/dev/test split
LogiQA has 7,376/651/651 instances in train/dev/test split

Baselines

APOLLO is compared to three prominent baselines
The baselines are LRReasoner, FOCAL REASONER, and MERIt
All the models use additional data to improve logical reasoning abilities

Implementation details

Used Wikipedia version provided by HuggingFace Datasets as main corpus
Used list of keywords to filter sentences from Wikipedia (listed in Appendix A)
Experimented with RoBERTa-Large, DeBERTa-v3, and DeBERTa-v2-xxlarge as base models for APOLLO
Pretrained last two layers of Transformer for 3 epochs, batch size of 4096 (details in Appendix B)

Results

Overall results

APOLLO outperforms all baselines on LogiQA
MERIt + APOLLO achieves best performance on LogiQA and ReClor Test-H
APOLLO applicable across different LM architectures

Performance on glue benchmark

APOLLO pretraining does not include standard MLM loss
Finetuning APOLLO on GLUE benchmark and evaluating on Dev set
Omitting evaluation on WNLI set
APOLLO can slightly improve overall performance on GLUE benchmark
Ablation of various design choices in constructing IMPLICATION dataset and proposed method
IMPLICATION dataset leads to consistent improvements
Both S-MLM and E-CLS loss lead to improvements over MLM loss
IMPLICATION-Positive leads to better performance than IMPLICATION-Negative
Masking nouns and pronouns leads to significant performance drop
Using remaining categories for selective masking leads to some drop in performance
Training topmost two layers of model leads to best performance
Overall change in attribution scores for implication keywords increases significantly
MERIt uses Wikipedia to generate logical graphs
LRReasoner and FOCAL REASONER use data augmentation specific to task
Clark et al. (2020) use synthetically generated rulebases
APOLLO is a generalized solution to improving logical reasoning in language models
Pretraining steps are independent of dataset and downstream task format
Trade-off between completeness and noise in training data

Link to paper#

Abstract#

Paper Content#

Introduction#

Method#

Dataset selection#

Loss function design#

Continued pretraining#

Finetuning#

Experimental setup#

Datasets#

Baselines#

Implementation details#

Results#

Overall results#

Performance on glue benchmark#

Link to paper

Abstract

Paper Content

Introduction

Method

Dataset selection

Loss function design

Continued pretraining

Finetuning

Experimental setup

Datasets

Baselines

Implementation details

Results

Overall results

Performance on glue benchmark