Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Development of event extraction systems hindered by lack of large-scale datasets
GLEN dataset created to make event extraction systems more accessible
GLEN covers 3,465 different event types, 20x larger than current datasets
GLEN created using DWD Overlay and PropBank annotation
New multi-stage event detection model proposed
Model exhibits 10% F1 gain compared to classification baselines and definition-based models
Label noise still largest challenge for improving performance

Paper Content

Introduction

ACE 2005 is the current standard benchmark for event extraction, but it has limited ontology and domain.
MAVEN is the largest effort towards event extraction annotation, but it also has limited domain diversity.
Current benchmarks focus on restricted ontologies in limited domains, which is harmful to developers and users.
Event extraction packages are scarce compared to NLP packages.
Annotation quality is a long-standing issue for event extraction.

Event type ranking

Perform event type ranking over entire event ontology for each sentence
Design decisions to improve efficiency: ranking done for whole sentence, sentence and event type definitions encoded separately
Use same model architecture as Col-BERT
Encode sentence and event type definition separately
Compute similarity score between sentence and event type by sum of maximum similarity between sentence and event embeddings
Train model using distant supervision data and margin loss

Event type classification

Task is to classify event triggers into event types
Formulated as Yes/No QA task
Model takes probability of predicting “yes” or “no”
Binary cross-entropy loss used to train model
Label noise due to many-to-one mapping
Incremental self-labeling procedure used to handle partial labels
GLEN benchmark created using ontology and distantly-supervised training data

Event ontology construction

DWD Overlay is an effort to align Wiki-Data Qnodes to PropBank rolesets, argument structures, and LDC tagsets.
Each event type entry includes properties from WikiData, PropBank, and LDC.
Ontology is a superset of ACE and ERE.
Cognitive events without physical state change are removed.
Manual clean up of event types associated with rolesets.

Distant supervision data

Reuse existing PropBank annotations
Partial label challenge when using distantly-supervised data
Perform sentence de-duplication, remove sentences with less than 3 tokens, omit special tokens
Ensure every trigger is a continuous token span, remove events with overlapping triggers
Remove triggers with a part-of-speech tag of MD or TO
Manually inspect most popular rolesets, remove general/ambiguous ones
Remove rolesets with less than 3 event mentions across all datasets

Data split and annotation

Dataset split into train, dev, and test sets based on documents
Stratified sampling used for datasets with multiple genres
Annotation task formulated as multiple-choice question
Mechanical Turk workers screened for development set
Manual inspection for PropBank rolesets labeled as “None of the above”
Third pass of annotation for affected instances and instances with disagreement

Data analysis

Examined event ontology by visualizing it as a hierarchy
Ontology offers wider range of diverse events
Event type distribution closely mirrors real-world event distributions
Compared type distribution to ACE and MAVEN
Part-of-speech distribution of trigger words similar to ACE and MAVEN

Method

Goal is to find trigger word offset and event type for every event
Challenge is large ontology size and partial labels from distant supervision
Separate trigger identification from event typing to learn with clean data
Break event typing into two stages to narrow down search space
Self-labeling procedure to mitigate effect of partial labels

Trigger identification

Goal is to identify location of event trigger words in sentence
Obtain sentence token representations using pre-trained language model
Compute scores for each token being start, end, or part of event trigger
Compute probability of span being an event trigger
Model is trained using binary cross entropy loss on candidate spans

Experiments

Experiment setting

Evaluation Metrics used: Trigger Identification F1, Trigger Classification F1, Hit@K
Baselines used: Token Classification, InstructGPT (Ouyang et al., 2022)
Implementation Details: Hyperparameters, maximum token span length of 10, designed loss function with τ = 1.0, single Tesla P100 GPU with 16GB DRAM

Results

DMBERT has lower trigger identification score if more than 1 token is predicted
TokCls and SpanCls have similar performance on final trigger classification
ZED utilizes event type definitions but only achieves minor F1 boost
Ours, TokCls, and ZED have different predictions for examples
InstructGPT with in-context learning does not perform well
Decoupling trigger identification from classification improves TI performance
Joint encoding of definition and context improves TC performance
Self-labeling can help improve top-1 classification performance

Analysis

Investigating which component is the main bottleneck for performance
Investigating if model suffers from label imbalance between types
Investigating how self-labeling procedure contributes to performance

Per-stage performance

Model has 3 components: trigger identification, event type ranking, and event type classification
Evaluation of model used precision, recall, and F1 scores
Maximum input length was 2049 tokens
Hit@k metrics used to assess performance of event type ranking and classification
Primary bottleneck in precision of trigger identification and Hit@1 score of type classification

Type imbalance

Figure 4 shows the long-tailed label distribution in the dataset.
Performance per group is calculated and shown in Figure 8.
Popular groups have highest F1 score, but remaining groups have comparable scores.
Two factors define dataset difficulty: ambiguity of event type and frequency of event type.
Table 6 shows that main performance gain comes from self-labeled data.

Remaining errors

Most errors come from noisy annotation
Extended Roleset refers to predicted event associated with same predicate as ground truth
22.6% of cases predict event close to ground truth on XPO hierarchy
Uncatagorized errors due to imperfect recall of event ranking module or no connections in hierarchy

ACE05 set the event extraction paradigm which consists of event detection and argument extraction
MAVEN dataset has a larger ontology selected from a subset of FrameNet
Few-Event dataset is a compilation of ACE, KBP, and Wikipedia data
DCFEE, ChFinAnn, RAMS, WikiEvents, Do-cEE datasets focus on argument extraction
Weak supervision used to increase amount of labeled data for event extraction
WordNet provides a larger source of possible event triggers

Conclusions and future work

Introduced GLEN dataset with 3k event types for all domains
Multistage model designed to handle large ontology size and partial labels
Model can be used as off-the-shelf event detection tool
Hyperparameters for baseline models listed in Table 9
GLEN offers broader coverage of events than ACE05
Label distribution of GLEN, ACE05 and MAVEN shown in Figures 4, 5 and 6
Relationship between threshold and accuracy of label selection shown in Figure 7
Relationship between number of instances of different event types and model performance shown in Figure 8
Categorization of remaining errors from system shown in Figure 9
Annotation interface built with Amazon Mechanical Turk shown in Figure 10
Truncated version of prompt to InstructGPT shown in Figure 11
Statistics of GLEN dataset compared to ACE05 and MAVEN shown in Figure 3
Training hyperparameters of model shown in Table 9
Comparison across different systems shown in Figure 8
Correct predictions shown in green, predictions that map to same PropBank roleset shown in orange
Hit@1 scores before and after self-labeling on different categories of PropBank rolesets shown
Examples of erroneous type predictions and removed Qnodes in XPO Overlay shown

Link to paper#

Abstract#

Paper Content#

Introduction#

Event type ranking#

Event type classification#

Event ontology construction#

Distant supervision data#

Data split and annotation#

Data analysis#

Method#

Trigger identification#

Experiments#

Experiment setting#

Results#

Analysis#

Per-stage performance#

Type imbalance#

Remaining errors#

Related work#

Conclusions and future work#

Link to paper

Abstract

Paper Content

Introduction

Event type ranking

Event type classification

Event ontology construction

Distant supervision data

Data split and annotation

Data analysis

Method

Trigger identification

Experiments

Experiment setting

Results

Analysis

Per-stage performance

Type imbalance

Remaining errors

Related work

Conclusions and future work