Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Legal Judgment Prediction (LJP) from text on European Court of Human Rights cases is cast as an entailment task.
  • The case outcome is classified from a combined input of case facts and convention articles.
  • The model is evaluated on its ability to generalize to zero-shot settings.
  • Domain adaptation methods are applied to improve zero-shot transfer performance.

Paper Content

Introduction

  • Legal Judgment Prediction (LJP) is a task to classify/predict the outcome of a case based on a textual description of case facts
  • Legal practitioners determine relevant rules from legal sources to deduce the outcome of the case
  • Most current LJP approaches tackle this as a classification problem with the textual descriptions of case facts as the sole input
  • This work casts LJP into an entailment task to enable the model to learn more authentic reasoning between rules and case facts
  • The task of LJP as entailment has been explored on Chinese criminal case corpora and US tax law
  • This work develops and evaluates the model on a public dataset of cases by the European Court of Human Rights
  • The model pairs case fact descriptions with candidate ECHR articles and assigns a binary target label
  • Results show that the entailment model outperforms the traditional classification setup
  • The work extends LJP as entailment to the zero-shot transfer setting
  • Domain adaptation improves the model’s performance on unseen articles
  • Domain-specific pre-trained encoders affect the zero-shot transferability of LJP systems
  • Legal Judgment Prediction (LJP) has been studied using corpora from different jurisdictions
  • Early works used bag-of-words features
  • Large pre-trained transformer models have become the dominant model family
  • Legal-domain specific pre-trained variants have been employed
  • Going beyond case fact classification, prior work on Chinese criminal case corpora treats LJP as an entailment problem
  • This is the first work to apply a similar entailment approach to the ECHR corpus
  • Domain Adaptation (DA) is tackled under three different settings
  • This work is the first to benchmark domain adaptation for LJP
  • Methods proposed to deal with domain adaptation settings can be categorized into four types
  • Loss based methods are employed to deal with domain adaptation settings of LJP

Dataset, tasks & settings

  • ECHR dataset provided by LexGLUE consists of 11k case fact descriptions and target label information
  • Chronologically split into training (2001-2016), validation (2016-2017) and test set (2017-2019)
  • Label set includes 10 prominent ECHR articles
  • Model predicts target from fact description alone
  • Dataset augmented with texts of 10 articles in label set
  • Formulate an entailment variant for both tasks, Task A (articles found to be violated) and Task B (articles alleged to be violated)
  • Binary outcome of whether article has been alleged/found to be violated
  • Domain adaptation to determine outcomes based on case facts with regard to particular convention article
  • Zero-shot transfer to determine violation/allegation of case facts with respect to unseen articles
  • Two settings: UDA and ADA
  • Dataset split into two non-overlapping groups of articles of various frequencies
  • Evaluate UDA and ADA on split_0 as source and split_1 as target, and vice-versa
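
The pairing and the article-level split described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `ARTICLES` dictionary, the article ids, the toy cases, and the helper names `make_pairs` and `split_by_articles` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical placeholder texts; not the label set or splits used in the paper.
ARTICLES: Dict[str, str] = {
    "2": "Right to life ...",
    "3": "Prohibition of torture ...",
    "6": "Right to a fair trial ...",
    "8": "Right to respect for private and family life ...",
}

Pair = Tuple[str, str, int]  # (case facts, article text, binary label)


def make_pairs(facts: str, positive: Set[str], candidates: List[str]) -> List[Pair]:
    """Pair one fact description with every candidate article."""
    return [(facts, ARTICLES[a], int(a in positive)) for a in candidates]


def split_by_articles(
    cases: List[Tuple[str, Set[str]]], source: List[str], target: List[str]
) -> Tuple[List[Pair], List[Pair]]:
    """Source pairs only use source articles; target pairs only use unseen ones."""
    src = [p for facts, pos in cases for p in make_pairs(facts, pos, source)]
    tgt = [p for facts, pos in cases for p in make_pairs(facts, pos, target)]
    return src, tgt


cases = [
    ("The applicant was detained ...", {"3", "6"}),
    ("The applicant's home was searched ...", {"8"}),
]
src_pairs, tgt_pairs = split_by_articles(cases, source=["2", "3"], target=["6", "8"])
```

Swapping the `source` and `target` arguments gives the reverse direction of the evaluation.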

Method

  • Employs hierarchical neural entailment model to take case fact description and article as input and output binary outcome
  • Adapted to deal with long input sequences using hierarchical attention networks
  • Experiments with two domain adaptation components based on adversarial training

Entailment model

  • Model outputs a binary label
  • Model contains an encoding layer, interaction layer, post-interaction encoding layer, and classification header
  • LegalBERT used to encode case facts
  • Token attention used to aggregate sentence level representations
  • Dot product attention used to interact case facts and articles
  • Article-dependent final representation of case facts obtained using two step procedure
  • Sentence attention used to obtain article representation
  • Article representation used to condition GRU layer for case facts
  • Non-linear projection used to classify entailment outcome

Domain adaptation components

  • Domain Adaptation seeks to make models generalize from one domain to another.
  • The domains are mapped to a common latent space to reduce differences between their distributions.
  • The model is trained to read two texts and interrelate them towards an outcome determination.
  • A two layer feed forward network is used as a discriminator to predict the domain.
  • A min-max adversarial objective is optimized to maximize the model’s ability to capture information for the entailment outcome task.
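
One common way to write such an adversarial objective is the domain-adversarial form below. It is given as an assumption about the general shape of the min-max game, not as the paper's exact loss. All symbols are introduced here for illustration: $\theta_f$ are the encoder parameters, $\theta_c$ the entailment classifier parameters, $\theta_d$ the discriminator parameters, $\mathcal{L}_{\text{task}}$ the entailment loss, $\mathcal{L}_{\text{dom}}$ the discriminator's domain classification loss, and $\lambda$ a trade-off weight.

```latex
\min_{\theta_f,\,\theta_c}\ \max_{\theta_d}\quad
\mathcal{L}_{\text{task}}(\theta_f,\theta_c)\;-\;\lambda\,\mathcal{L}_{\text{dom}}(\theta_f,\theta_d)
```

Under this form the discriminator minimizes its domain loss, while the encoder is pushed to make source and target representations hard to tell apart without giving up task performance.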

Experiments & discussion

Models

  • Employed the entailment architecture with fact-based encoding and 10 classes in the output layer
  • Binary cross entropy loss used to train model
  • Weights in LegalBERT sentence encoder frozen to save resources and reduce susceptibility to shallow surface signals
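
For reference, binary cross entropy over predicted probabilities can be computed as in the sketch below. The function name, the clipping constant `eps`, and the example probabilities are illustrative assumptions and do not come from the paper.

```python
import math
from typing import List


def binary_cross_entropy(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Three (facts, article) pairs with made-up predicted probabilities.
loss = binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1])
```

In the entailment setup each (case facts, article) pair contributes one such binary term.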

Does entailment perform better than fact classification?

  • Micro-F1 and macro-F1 scores for both tasks A and B are given in Table 1.
  • Entailment performance is better than classification.
  • Macro-F1 score shows greater improvement, indicating entailment approach helps with sparser articles.
  • Task A saw greater improvement than Task B, as Task B can be understood as topic classification.
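
The difference between the two averages can be seen in a small worked example. The counts below are invented for illustration (they are not results from Table 1); they show why macro-F1 reacts more strongly to performance on sparse articles.

```python
from typing import Dict, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """counts maps each label to (tp, fp, fn)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)  # pools all decisions, dominated by frequent labels
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # each label weighs equally
    return micro, macro


# Hypothetical counts: one frequent article, one rare article.
counts = {"frequent": (90, 10, 10), "rare": (1, 3, 3)}
micro, macro = micro_macro_f1(counts)  # micro = 0.875, macro = 0.575
```

Improving the rare label moves macro-F1 far more than micro-F1, which is consistent with reading a larger macro-F1 gain as a benefit on sparser articles.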

Does domain adaptation help to improve zero-shot transferability?

  • Baseline model performs worse than domain adaptation counterparts on target data
  • UDA Wasserstein distance performs better on target data than Domain Discriminator
  • UDA Wasserstein distance performs worse on source data than baseline
  • ADA Domain Discriminator and Wasserstein distance are comparable on target data
  • ADA Wasserstein distance performs better on source data in Task A
  • Zero-shot transfer entailment task is difficult and discrepancy between source and target data is still large
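
To give an intuition for the Wasserstein distance as a measure of source-target discrepancy, the sketch below computes it exactly for two equal-sized one-dimensional samples. This is only an illustration: measuring the distance between high-dimensional learned representations typically requires an estimator such as a trained critic network, and the feature values here are made up.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Wasserstein-1 distance between two equal-sized 1-D empirical samples.

    In one dimension the optimal transport plan matches sorted values.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("expects two non-empty samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Hypothetical scalar features from source and target domains.
source_feature = [0.1, 0.4, 0.5, 0.9]
target_feature = [1.1, 1.4, 1.5, 1.9]
gap = wasserstein_1d(source_feature, target_feature)
```

A domain adaptation objective of this kind tries to shrink such a gap between source and target representations while keeping the task loss low.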

How does encoder pre-training influence zero-shot transferability?

  • Replacing LegalBERT embeddings with BERT base embeddings in an experiment on Task A resulted in worse performance on the target data.
  • Domain specific pre-training is beneficial for generalizing to unseen target articles.
  • LegalBERT may have injected domain-specific information about the target articles into the encoding.

How does article relatedness affect zero-shot transferability?

  • Experiment tested whether article relatedness affects performance
  • Experiment used Article P1-1 as target domain
  • Constructed one related and one unrelated source domain
  • Related domain consists of Articles 6 and 8
  • Unrelated domain consists of Articles 2, 3 and 5
  • Related source domain performs better
  • UDA achieves higher performance overall
  • Wasserstein method outperforms Domain Discriminator for the related source, and vice versa for the unrelated source

Conclusion

  • LJP cast as an entailment task with non-fine-tuned encoders outperforms a simple case fact classification model
  • Created a zero-shot benchmark on the ECtHR corpus
  • Task difficulty, absolute performance, and zero-shot transferability depend on how case facts are drafted
  • Major hurdle dealing with legal domain corpora is their lengthy nature
  • Hierarchical models limited in that tokens across long distances cannot directly attend to one another
  • Weights in LegalBERT sentence encoder frozen to save computational resources and reduce model’s susceptibility to shallow surface signals
  • Experiment with publicly available datasets of ECtHR decisions
  • Task of legal judgment prediction raises ethical, civil rights, and legal policy concerns
  • Aim to make incremental technical progress to enable systems to acquire legal reasoning capability
  • Models developed and trained on Google Colab
  • Models incorporate pre-trained language models and do not train them from scratch
  • Employ a maximum sentence length of 256 tokens and a maximum document length of 50 sentences
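
A hierarchical input with these limits could be prepared roughly as below. The sketch assumes the limits count tokens per sentence and sentences per document, that truncation keeps the beginning of each sequence, and that short sentences are padded with a `[PAD]` token; none of these details are confirmed by the summary, and `clip_document` is a hypothetical helper.

```python
from typing import List

MAX_SENT_LEN = 256  # assumed unit: tokens per sentence
MAX_DOC_LEN = 50    # assumed unit: sentences per document


def clip_document(sentences: List[List[str]], pad: str = "[PAD]") -> List[List[str]]:
    """Keep at most MAX_DOC_LEN sentences, each cut or padded to MAX_SENT_LEN tokens."""
    clipped = [s[:MAX_SENT_LEN] for s in sentences[:MAX_DOC_LEN]]
    return [s + [pad] * (MAX_SENT_LEN - len(s)) for s in clipped]


# A toy document: 60 sentences of 300 tokens plus one short sentence at the front.
doc = [["short", "sentence"]] + [["tok"] * 300 for _ in range(60)]
batch = clip_document(doc)
```

Fixed shapes like this make it straightforward to encode sentences independently and then aggregate them at the document level.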