Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Logical reasoning of text is an important ability that requires understanding of the information present in the text and their interconnections.
- Prior works on improving logical reasoning ability of language models require complex processing of training data.
- We propose APOLLO, an adaptively pretrained language model that has improved logical reasoning abilities.
- We select a subset of Wikipedia for continued pretraining of a language model.
- We use two self-supervised loss functions to teach the model to distinguish between entailment and contradiction types of sentences.
- We demonstrate the effectiveness of APOLLO by comparing it with prior baselines on two logical reasoning datasets.
Paper Content
Introduction
- Logical reasoning is an important ability for humans and for text understanding
- Logical reasoning is important for downstream tasks such as open-domain question answering and machine reading comprehension
- Logical keywords are essential for understanding the context
- Recently, there has been an increasing focus on evaluating the logical reasoning abilities of language models
- Large pre-trained language models (PLMs) are increasingly being used across a wide variety of real-world tasks
- There have been some recent works on improving the logical reasoning abilities of PLMs
- These works typically generate a dataset containing symbolic structures and train the LM using custom loss objectives
- These complex processing steps usually require task-specific design choices
- APOLLO is a continued pretraining-based approach to inject logical reasoning abilities in language models
- APOLLO uses a set of logical inference keywords to select sentences from a text corpus
- APOLLO masks words based on their parts-of-speech tags
- APOLLO adds a sentence-level classification loss to predict entailment or contradiction
- APOLLO is task format and downstream dataset agnostic
- APOLLO achieves state-of-the-art performance on LogiQA and comparable performance on ReClor
Method
- Collect dataset of reasoning-related sentences using keyword-based selection strategy
- Filter Wikipedia using logical keywords to create IMPLICATION dataset
- Pretrain model using two loss objectives: S-MLM and E-CLS
- Fine-tune model on task-specific training dataset
Dataset selection
- PLMs are trained on internet data and finetuned on downstream datasets
- Aim is to teach PLMs generalizable logical reasoning abilities
- Training data should contain more logical sentences
- IMPLICATION dataset created by selecting specific keywords with positive/negative implications
- Keywords used to filter Wikipedia and increase probability of logically rich sentences
Loss function design
- S-MLM is a modified version of MLM loss used in BERT
- MLM masks tokens in a sentence and the model learns to predict them
- Not all masked tokens require the same degree of reasoning to predict
- APOLLO simplifies the problem by considering tokens with specific POS tags
- E-CLS predicts whether a sentence contains reasoning aspects of entailment or contradiction
- E-CLS labels are bootstrapped using a heuristic of checking the type of implication keyword present in the sentence
Continued pretraining
- APOLLO uses a multi-task loss for continued pretraining
- The multi-task loss consists of S-MLM and E-CLS losses, weighted equally
- No need to add MLM loss to avoid catastrophic forgetting, as S-MLM is close to MLM objective
Finetuning
- Loss functions are not specific to a task format
- Follow Devlin et al. (2019) and add a MLP layer to the pretrained model
- Finetune the combined model on the downstream dataset using cross-entropy loss
Experimental setup
- Described details of datasets used to evaluate APOLLO
- Described baselines compared to APOLLO
- Described implementation details of training procedure
Datasets
- Evaluated APOLLO on two logical reasoning datasets: ReClor and LogiQA
- ReClor has 4,638/500/1,000 instances in train/dev/test split
- LogiQA has 7,376/651/651 instances in train/dev/test split
Baselines
- APOLLO is compared to three prominent baselines
- The baselines are LRReasoner, FOCAL REASONER, and MERIt
- All the models use additional data to improve logical reasoning abilities
Implementation details
- Used Wikipedia version provided by HuggingFace Datasets as main corpus
- Used list of keywords to filter sentences from Wikipedia (listed in Appendix A)
- Experimented with RoBERTa-Large, DeBERTa-v3, and DeBERTa-v2-xxlarge as base models for APOLLO
- Pretrained last two layers of Transformer for 3 epochs, batch size of 4096 (details in Appendix B)
Results
Overall results
- APOLLO outperforms all baselines on LogiQA
- MERIt + APOLLO achieves best performance on LogiQA and ReClor Test-H
- APOLLO applicable across different LM architectures
Performance on glue benchmark
- APOLLO pretraining does not include standard MLM loss
- Finetuning APOLLO on GLUE benchmark and evaluating on Dev set
- Omitting evaluation on WNLI set
- APOLLO can slightly improve overall performance on GLUE benchmark
- Ablation of various design choices in constructing IMPLICATION dataset and proposed method
- IMPLICATION dataset leads to consistent improvements
- Both S-MLM and E-CLS loss lead to improvements over MLM loss
- IMPLICATION-Positive leads to better performance than IMPLICATION-Negative
- Masking nouns and pronouns leads to significant performance drop
- Using remaining categories for selective masking leads to some drop in performance
- Training topmost two layers of model leads to best performance
- Overall change in attribution scores for implication keywords increases significantly
- MERIt uses Wikipedia to generate logical graphs
- LRReasoner and FOCAL REASONER use data augmentation specific to task
- Clark et al. (2020) use synthetically generated rulebases
- APOLLO is a generalized solution to improving logical reasoning in language models
- Pretraining steps are independent of dataset and downstream task format
- Trade-off between completeness and noise in training data