Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Development of event extraction systems hindered by lack of large-scale datasets
  • GLEN dataset created to make event extraction systems more accessible
  • GLEN covers 3,465 different event types, an ontology more than 20x larger than that of current datasets
  • GLEN created using DWD Overlay and PropBank annotation
  • New multi-stage event detection model proposed
  • Model exhibits 10% F1 gain compared to classification baselines and definition-based models
  • Label noise still largest challenge for improving performance

Paper Content

Introduction

  • ACE 2005 is the current standard benchmark for event extraction, but it has limited ontology and domain.
  • MAVEN is the largest effort towards event extraction annotation, but it also has limited domain diversity.
  • Current benchmarks focus on restricted ontologies in limited domains, which is harmful to developers and users.
  • Event extraction packages are scarce compared to NLP packages.
  • Annotation quality is a long-standing issue for event extraction.

Event type ranking

  • Perform event type ranking over entire event ontology for each sentence
  • Design decisions to improve efficiency: ranking done for whole sentence, sentence and event type definitions encoded separately
  • Use same model architecture as ColBERT
  • Encode sentence and event type definition separately
  • Compute similarity score between sentence and event type by sum of maximum similarity between sentence and event embeddings
  • Train model using distant supervision data and margin loss
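The scoring and loss described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the function names, the toy 2-d embeddings, and the choice to sum over definition tokens while taking the maximum over sentence tokens are assumptions, and the hinge form of the margin loss is assumed as well (the summary only states that a margin loss is used).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(type_vecs, sentence_vecs):
    """Late-interaction score: for each definition token, take its best
    match among the sentence tokens, then sum those maxima."""
    return sum(max(dot(t, s) for s in sentence_vecs) for t in type_vecs)


def margin_loss(pos_score, neg_score, tau=1.0):
    """Hinge loss (assumed form): zero once the positive type outscores
    the negative type by at least tau."""
    return max(0.0, tau - pos_score + neg_score)


def rank_types(sentence_vecs, ontology, k=10):
    """Return the k highest-scoring (type name, score) pairs."""
    scored = [(name, maxsim_score(vecs, sentence_vecs))
              for name, vecs in ontology.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d embeddings, made up for illustration only.
sentence = [[1.0, 0.0], [0.0, 1.0]]
ontology = {"attack": [[1.0, 0.0]], "travel": [[0.5, 0.5], [0.0, 0.2]]}
ranking = rank_types(sentence, ontology, k=2)
```

Because sentences and definitions are encoded separately, the definition embeddings can be computed once and reused for every sentence, which is what makes ranking over the whole ontology affordable.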

Event type classification

  • Task is to classify event triggers into event types
  • Formulated as Yes/No QA task
  • Model takes probability of predicting “yes” or “no”
  • Binary cross-entropy loss used to train model
  • Label noise due to many-to-one mapping
  • Incremental self-labeling procedure used to handle partial labels
  • GLEN benchmark created using ontology and distantly-supervised training data
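As a rough sketch of the Yes/No formulation and the self-labeling step, the snippet below turns a pair of "yes"/"no" logits into a probability, computes binary cross-entropy, and picks a single label for a partially labeled instance only when the model is confident. All names and the 0.9 threshold are hypothetical; the summary does not specify how the threshold is chosen.

```python
import math


def yes_probability(logit_yes, logit_no):
    """Softmax restricted to the two answer tokens "yes" and "no"."""
    return 1.0 / (1.0 + math.exp(logit_no - logit_yes))


def bce_loss(p_yes, label):
    """Binary cross-entropy for one (sentence, trigger, event type) query.
    label is 1 if the event type is correct, else 0."""
    eps = 1e-12
    p = min(max(p_yes, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def self_label(candidate_probs, threshold=0.9):
    """Given p(yes) for each candidate event type of a partially labeled
    instance, keep the top candidate if it clears the (assumed) threshold."""
    best_type, best_p = max(candidate_probs.items(), key=lambda kv: kv[1])
    return best_type if best_p >= threshold else None


# Hypothetical model outputs for one trigger with two candidate types.
confident = self_label({"attack": 0.95, "criticize": 0.30})
uncertain = self_label({"attack": 0.60, "criticize": 0.30})
```

Instances that stay unlabeled (`None`) could be revisited in a later round once the model has been retrained on the newly labeled ones, which is one way to read "incremental".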

Event ontology construction

  • DWD Overlay is an effort to align WikiData Qnodes to PropBank rolesets, argument structures, and LDC tagsets.
  • Each event type entry includes properties from WikiData, PropBank, and LDC.
  • Ontology is a superset of ACE and ERE.
  • Cognitive events without physical state change are removed.
  • Manual clean up of event types associated with rolesets.

Distant supervision data

  • Reuse existing PropBank annotations
  • Partial label challenge when using distantly-supervised data
  • Perform sentence de-duplication, remove sentences with fewer than 3 tokens, omit special tokens
  • Ensure every trigger is a continuous token span, remove events with overlapping triggers
  • Remove triggers with a part-of-speech tag of MD or TO
  • Manually inspect most popular rolesets, remove general/ambiguous ones
  • Remove rolesets with fewer than 3 event mentions across all datasets
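The automatic filters above can be sketched as a single cleaning pass. The record layout (`tokens`, `events`, `start`, `end`, `roleset`, `pos`) is a hypothetical one chosen for this example, spans are assumed to be half-open token offsets, and dropping every event involved in an overlap is one possible reading of "remove events with overlapping triggers". Special-token handling and the manual roleset inspection are not covered.

```python
from collections import Counter

EXCLUDED_POS = {"MD", "TO"}  # Penn Treebank tags: modal, infinitival "to"


def overlaps(a, b):
    """True if two half-open [start, end) token spans share a token."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def clean_corpus(sentences, min_tokens=3, min_mentions=3):
    seen, kept = set(), []
    for sent in sentences:
        key = tuple(sent["tokens"])
        # De-duplicate sentences and drop very short ones.
        if key in seen or len(sent["tokens"]) < min_tokens:
            continue
        seen.add(key)
        # Drop triggers tagged MD or TO.
        events = [e for e in sent["events"] if e["pos"] not in EXCLUDED_POS]
        # Drop events whose trigger overlaps another trigger.
        events = [e for e in events
                  if not any(overlaps(e, o) for o in events if o is not e)]
        kept.append({"tokens": sent["tokens"], "events": events})
    # Drop rolesets that are too rare across the remaining corpus.
    counts = Counter(e["roleset"] for s in kept for e in s["events"])
    for s in kept:
        s["events"] = [e for e in s["events"]
                       if counts[e["roleset"]] >= min_mentions]
    return kept


def ev(start, roleset, pos):
    """Helper for building single-token toy events."""
    return {"start": start, "end": start + 1, "roleset": roleset, "pos": pos}


corpus = [
    {"tokens": ["He", "will", "run", "home"],
     "events": [ev(1, "will.01", "MD"), ev(2, "run.02", "VB")]},
    {"tokens": ["He", "will", "run", "home"],          # duplicate sentence
     "events": [ev(2, "run.02", "VB")]},
    {"tokens": ["Go", "now"], "events": [ev(0, "go.01", "VB")]},  # too short
    {"tokens": ["She", "ran", "fast"], "events": [ev(1, "run.02", "VBD")]},
    {"tokens": ["They", "run", "daily"], "events": [ev(1, "run.02", "VBP")]},
    {"tokens": ["We", "ate", "lunch"], "events": [ev(1, "eat.01", "VBD")]},
]
cleaned = clean_corpus(corpus)
```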

Data split and annotation

  • Dataset split into train, dev, and test sets based on documents
  • Stratified sampling used for datasets with multiple genres
  • Annotation task formulated as multiple-choice question
  • Mechanical Turk workers screened for development set
  • Manual inspection for PropBank rolesets labeled as “None of the above”
  • Third pass of annotation for affected instances and instances with disagreement
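A document-level split with per-genre stratification might look like the sketch below. The 80/10/10 ratios, the fixed seed, and the `(doc_id, genre)` input format are assumptions for illustration; the summary does not give the actual proportions.

```python
import random
from collections import defaultdict


def split_documents(docs, ratios=(0.8, 0.1, 0.1), seed=0):
    """docs: list of (doc_id, genre) pairs.
    Splits whole documents, so no document is shared between splits,
    and samples each genre separately so every split sees every genre."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for doc_id, genre in docs:
        by_genre[genre].append(doc_id)

    splits = {"train": [], "dev": [], "test": []}
    for genre in sorted(by_genre):
        ids = sorted(by_genre[genre])
        rng.shuffle(ids)
        n_train = int(len(ids) * ratios[0])
        n_dev = int(len(ids) * ratios[1])
        splits["train"] += ids[:n_train]
        splits["dev"] += ids[n_train:n_train + n_dev]
        splits["test"] += ids[n_train + n_dev:]  # remainder goes to test
    return splits


# Twenty hypothetical documents across two genres.
docs = [(f"news_{i}", "news") for i in range(10)]
docs += [(f"web_{i}", "web") for i in range(10)]
splits = split_documents(docs)
```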

Data analysis

  • Examined event ontology by visualizing it as a hierarchy
  • Ontology offers wider range of diverse events
  • Event type distribution closely mirrors real-world event distributions
  • Compared type distribution to ACE and MAVEN
  • Part-of-speech distribution of trigger words similar to ACE and MAVEN

Method

  • Goal is to find trigger word offset and event type for every event
  • Challenge is large ontology size and partial labels from distant supervision
  • Separate trigger identification from event typing to learn with clean data
  • Break event typing into two stages to narrow down search space
  • Self-labeling procedure to mitigate effect of partial labels
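One way to picture how the stages fit together is the following skeleton, where each stage is passed in as a plain callable. The function name, the callables' signatures, and the default `k=10` are hypothetical; the stand-in lambdas only exist to make the example executable.

```python
def detect_events(tokens, identify_triggers, rank_types, classify_type, k=10):
    """Run the three stages on one sentence.

    identify_triggers(tokens) -> list of (start, end) spans
    rank_types(tokens)        -> event types sorted best-first
    classify_type(tokens, span, candidates) -> event type or None
    """
    spans = identify_triggers(tokens)
    if not spans:
        return []
    # Ranking is done once per sentence, then shared by all its triggers.
    candidates = rank_types(tokens)[:k]
    events = []
    for start, end in spans:
        event_type = classify_type(tokens, (start, end), candidates)
        if event_type is not None:
            events.append({"start": start, "end": end, "type": event_type})
    return events


# Stand-in components for illustration only.
tokens = ["Rebels", "attacked", "the", "base"]
events = detect_events(
    tokens,
    identify_triggers=lambda t: [(1, 1)],
    rank_types=lambda t: ["attack", "travel", "meet"],
    classify_type=lambda t, span, cands: cands[0] if cands else None,
    k=2,
)
```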

Trigger identification

  • Goal is to identify location of event trigger words in sentence
  • Obtain sentence token representations using pre-trained language model
  • Compute scores for each token being start, end, or part of event trigger
  • Compute probability of span being an event trigger
  • Model is trained using binary cross entropy loss on candidate spans
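The span scoring can be illustrated with a small sketch. How the start, end, and part scores are combined is not spelled out in this summary, so the combination below (start score plus end score plus the mean part score, passed through a sigmoid) is an assumption, as are the function names and the 0.5 decision threshold. The per-token scores would come from a pre-trained language model; here they are made-up numbers.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def span_probabilities(start, end, part, max_len=10):
    """Score every candidate span (i, j), inclusive, up to max_len tokens.
    start[i], end[j], part[k] are per-token scores."""
    probs = {}
    n = len(start)
    for i in range(n):
        for j in range(i, min(n, i + max_len)):
            inside = sum(part[i:j + 1]) / (j - i + 1)
            probs[(i, j)] = sigmoid(start[i] + end[j] + inside)
    return probs


def predict_triggers(probs, threshold=0.5):
    """Keep spans whose probability exceeds the threshold."""
    return sorted(span for span, p in probs.items() if p > threshold)


# Three tokens; only the middle one looks like a trigger.
start_scores = [-5.0, 4.0, -5.0]
end_scores = [-5.0, 4.0, -5.0]
part_scores = [-5.0, 4.0, -5.0]
probs = span_probabilities(start_scores, end_scores, part_scores)
triggers = predict_triggers(probs)
```

During training, each candidate span's probability would be compared against a 0/1 label with binary cross-entropy.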

Experiments

Experiment setting

  • Evaluation Metrics used: Trigger Identification F1, Trigger Classification F1, Hit@K
  • Baselines used: Token Classification, InstructGPT (Ouyang et al., 2022)
  • Implementation Details: Hyperparameters, maximum token span length of 10, designed loss function with τ = 1.0, single Tesla P100 GPU with 16GB DRAM
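For reference, the two F1 metrics can be computed as in the sketch below, assuming the common convention that trigger identification matches on offsets only while trigger classification additionally requires the correct event type. The tuple layout `(sentence_id, start, end, type)` and the example predictions are hypothetical.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 over two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def offsets_only(mentions):
    """Drop the event type so that only trigger location is compared."""
    return {(sent_id, start, end) for sent_id, start, end, _ in mentions}


gold = {(0, 2, 3, "attack"), (0, 5, 6, "travel")}
pred = {(0, 2, 3, "attack"), (0, 5, 6, "meet"), (0, 8, 9, "eat")}

ti_p, ti_r, ti_f1 = prf1(offsets_only(pred), offsets_only(gold))  # identification
tc_p, tc_r, tc_f1 = prf1(pred, gold)                              # classification
```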

Results

  • DMBERT has lower trigger identification score if more than 1 token is predicted
  • TokCls and SpanCls have similar performance on final trigger classification
  • ZED utilizes event type definitions but only achieves minor F1 boost
  • Ours, TokCls, and ZED have different predictions for examples
  • InstructGPT with in-context learning does not perform well
  • Decoupling trigger identification from classification improves TI performance
  • Joint encoding of definition and context improves TC performance
  • Self-labeling can help improve top-1 classification performance

Analysis

  • Investigating which component is the main bottleneck for performance
  • Investigating if model suffers from label imbalance between types
  • Investigating how self-labeling procedure contributes to performance

Per-stage performance

  • Model has 3 components: trigger identification, event type ranking, and event type classification
  • Evaluation of model used precision, recall, and F1 scores
  • Maximum input length was 2049 tokens
  • Hit@k metrics used to assess performance of event type ranking and classification
  • Primary bottlenecks are the precision of trigger identification and the Hit@1 score of type classification

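Hit@k can be computed as the fraction of instances whose gold event type appears among the top k ranked candidates. The sketch below assumes one gold type per instance; the function name and the toy rankings are illustrative.

```python
def hit_at_k(ranked_lists, gold_types, k):
    """ranked_lists[i] is the best-first list of predicted types for
    instance i; gold_types[i] is its gold event type."""
    if not gold_types:
        return 0.0
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_types)
               if gold in ranked[:k])
    return hits / len(gold_types)


ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
hit1 = hit_at_k(ranked, gold, 1)
hit2 = hit_at_k(ranked, gold, 2)
hit3 = hit_at_k(ranked, gold, 3)
```

A large gap between Hit@1 and Hit@k for larger k suggests the correct type is usually retrieved but not placed first, which points at the final classification step.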
Type imbalance

  • Figure 4 shows the long-tailed label distribution in the dataset.
  • Performance per group is calculated and shown in Figure 8.
  • Popular groups have highest F1 score, but remaining groups have comparable scores.
  • Two factors define dataset difficulty: ambiguity of event type and frequency of event type.
  • Table 6 shows that main performance gain comes from self-labeled data.

Remaining errors

  • Most errors come from noisy annotation
  • Extended Roleset refers to predicted event associated with same predicate as ground truth
  • 22.6% of cases predict event close to ground truth on XPO hierarchy
  • Uncategorized errors due to imperfect recall of event ranking module or no connections in hierarchy

Related work

  • ACE05 set the event extraction paradigm which consists of event detection and argument extraction
  • MAVEN dataset has a larger ontology selected from a subset of FrameNet
  • FewEvent dataset is a compilation of ACE, KBP, and Wikipedia data
  • DCFEE, ChFinAnn, RAMS, WikiEvents, DocEE datasets focus on argument extraction
  • Weak supervision used to increase amount of labeled data for event extraction
  • WordNet provides a larger source of possible event triggers

Conclusions and future work

  • Introduced GLEN dataset with 3k event types for all domains
  • Multistage model designed to handle large ontology size and partial labels
  • Model can be used as off-the-shelf event detection tool
  • Hyperparameters for baseline models listed in Table 9
  • GLEN offers broader coverage of events than ACE05
  • Label distribution of GLEN, ACE05 and MAVEN shown in Figures 4, 5 and 6
  • Relationship between threshold and accuracy of label selection shown in Figure 7
  • Relationship between number of instances of different event types and model performance shown in Figure 8
  • Categorization of remaining errors from system shown in Figure 9
  • Annotation interface built with Amazon Mechanical Turk shown in Figure 10
  • Truncated version of prompt to InstructGPT shown in Figure 11
  • Statistics of GLEN dataset compared to ACE05 and MAVEN shown in Figure 3
  • Training hyperparameters of model shown in Table 9
  • Comparison across different systems shown in Figure 8
  • Correct predictions shown in green, predictions that map to same PropBank roleset shown in orange
  • Hit@1 scores before and after self-labeling on different categories of PropBank rolesets shown
  • Examples of erroneous type predictions and removed Qnodes in XPO Overlay shown