Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Legal Judgment Prediction (LJP) from text on European Court of Human Rights cases is cast as an entailment task.
- The case outcome is classified from a combined input of case facts and convention articles.
- Model is evaluated on its ability to generalize to zero-shot settings.
- Domain adaptation methods are applied to improve zero-shot transfer performance.
Paper Content
Introduction
- Legal Judgment Prediction (LJP) is a task to classify/predict the outcome of a case based on a textual description of case facts
- Legal practitioners determine relevant rules from legal sources to deduce the outcome of the case
- Most current LJP approaches tackle this as a classification problem with the textual descriptions of case facts as the sole input
- This work casts LJP into an entailment task to enable the model to learn more authentic reasoning between rules and case facts
- The task of LJP as entailment has been explored on Chinese criminal case corpora and US tax law
- This work develops and evaluates the model on a public dataset of cases by the European Court of Human Rights
- The model pairs case fact descriptions with candidate ECHR articles and assigns a binary target label
- Results show that the entailment model outperforms the traditional classification setup
- The work extends LJP as entailment to the zero-shot transfer setting
- Domain adaptation improves the model’s performance on unseen articles
- Domain specific pre-trained encoders have an impact on the zero shot transferability of LJP systems
Related work
- Legal Judgement Prediction (LJP) has been studied using corpora from different jurisdictions
- Early works used bag-of-words features
- Large pre-trained transformer models have become the dominant model family
- Legal-domain specific pre-trained variants have been employed
- Going beyond case fact classification, prior work on Chinese criminal case corpora treat LJP as an entailment problem
- This is the first work to adapt the similar approach of entailment to ECHR corpus
- Domain Adaptation (DA) is tackled under three different settings
- This work is the first to benchmark domain adaptation for LJP
- Methods proposed to deal with domain adaptation settings can be categorized into four types
- Loss based methods are employed to deal with domain adaptation settings of LJP
Dataset, tasks & settings
- ECHR dataset provided by LexGLUE consists of 11k case fact descriptions and target label information
- Chronologically split into training (2001-2016), validation (2016-2017) and test set (2017-2019)
- Label set includes 10 prominent ECHR articles
- Model predicts target from fact description alone
- Dataset augmented with texts of 10 articles in label set
- Formulate entailment variant for both tasks
- Binary outcome of whether article has been alleged/found to be violated
- Domain adaptation to determine outcomes based on case facts with regard to particular convention article
- Zero-shot transfer to determine violation/allegation of case facts with respect to unseen articles
- Two settings: UDA and ADA
- Dataset split into two non-overlapping groups of articles of various frequencies
- Evaluate UDA and ADA on split_0 as source and split_1 as target, and vice-versa
Method
- Employs hierarchical neural entailment model to take case fact description and article as input and output binary outcome
- Adapted to deal with long input sequences using hierarchical attention networks
- Experiments with two domain adaptation components based on adversarial training
Entailment model
- Model outputs a binary label
- Model contains an encoding layer, interaction layer, post-interaction encoding layer, and classification header
- LegalBERT used to encode case facts
- Token attention used to aggregate sentence level representations
- Dot product attention used to interact case facts and articles
- Article-dependent final representation of case facts obtained using two step procedure
- Sentence attention used to obtain article representation
- Article representation used to condition GRU layer for case facts
- Non-linear projection used to classify entailment outcome
Domain adaptation components
- Domain Adaptation seeks to make models generalize from one domain to another.
- The domains are mapped to a common latent space to reduce differences between their distributions.
- The model is trained to read two texts and interrelate them towards an outcome determination.
- A two layer feed forward network is used as a discriminator to predict the domain.
- A min-max game adversary objective optimization is used to maximize the model’s ability to capture information for the entailment outcome task.
Experiments & discussion
Models
- Employed entailment architecture with fact based encoding and 10 classes in output layer
- Binary cross entropy loss used to train model
- Weights in LegalBERT sentence encoder frozen to save resources and reduce susceptibility to shallow surface signals
Does entailment perform better than fact classification?
- Micro-F1 and macro-F1 scores for both tasks A and B are given in Table 1.
- Entailment performance is better than classification.
- Macro-F1 score shows greater improvement, indicating entailment approach helps with sparser articles.
- Task A saw greater improvement than Task B, as Task B can be understood as topic classification.
Does domain adaptation help to improve zero shot transferability ?
- Baseline model performs worse than domain adaptation counterparts on target data
- UDA Wasserstein distance performs better on target data than Domain Discriminator
- UDA Wasserstein distance performs worse on source data than baseline
- ADA Domain Discriminator and Wasserstein distance are comparable on target data
- ADA Wasserstein distance performs better on source data in Task A
- Zero-shot transfer entailment task is difficult and discrepancy between source and target data is still large
How does encoder pre-training influence zero-shot transferability ?
- Replacing LegalBERT embeddings with BERT base embeddings in an experiment on Task A resulted in worse performance on the target data.
- Domain specific pre-training is beneficial for generalizing to unseen target articles.
- LegalBERT may have injected domain-specific information about the target articles into the encoding.
How does article relatedness affect zero-shot transferability ?
- Experiment tested whether article relatedness affects performance
- Experiment used Article P1-1 as target domain
- Constructed one related and one unrelated source domain
- Related domain consists of Articles 6 and 8
- Unrelated domain consists of Articles 2, 3 and 5
- Related source domain performs better
- UDA achieves higher performance overall
- Wasserstein method outperforms Domain Discriminator for related source, vice versa for unrelated source
Conclusion
- LJP cast into an entailment task with non-finetuned encoders has benefit over a simple case fact classification model
- Created a zero-shot benchmark on the ECtHR corpus
- Task difficulty, absolute performance, and zero shot transferability depend on how case facts are drafted
- Major hurdle dealing with legal domain corpora is their lengthy nature
- Hierarchical models limited in that tokens across long distances cannot directly attend to one another
- Weights in LegalBERT sentence encoder frozen to save computational resources and reduce model’s susceptibility to shallow surface signals
- Experiment with publicly available datasets of ECtHR decisions
- Task of legal judgment prediction raises ethical, civil rights, and legal policy concerns
- Aim to make incremental technical progress to enable systems to acquire legal reasoning capability
- Models developed and trained on Google Colab
- Models incorporate pre-trained language models and do not train them from scratch
- Employ maximum sentence length of 256 and document length of 50