Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Development of event extraction systems hindered by lack of large-scale datasets
- GLEN dataset created to make event extraction systems more accessible
- GLEN covers 3,465 different event types, 20x larger than current datasets
- GLEN created using DWD Overlay and PropBank annotation
- New multi-stage event detection model proposed
- Model exhibits 10% F1 gain compared to classification baselines and definition-based models
- Label noise still largest challenge for improving performance
Paper Content
Introduction
- ACE 2005 is the current standard benchmark for event extraction, but it has limited ontology and domain.
- MAVEN is the largest effort towards event extraction annotation, but it also has limited domain diversity.
- Current benchmarks focus on restricted ontologies in limited domains, which is harmful to developers and users.
- Event extraction packages are scarce compared to NLP packages.
- Annotation quality is a long-standing issue for event extraction.
Event type ranking
- Perform event type ranking over entire event ontology for each sentence
- Design decisions to improve efficiency: ranking done for whole sentence, sentence and event type definitions encoded separately
- Use same model architecture as Col-BERT
- Encode sentence and event type definition separately
- Compute similarity score between sentence and event type by sum of maximum similarity between sentence and event embeddings
- Train model using distant supervision data and margin loss
Event type classification
- Task is to classify event triggers into event types
- Formulated as Yes/No QA task
- Model takes probability of predicting “yes” or “no”
- Binary cross-entropy loss used to train model
- Label noise due to many-to-one mapping
- Incremental self-labeling procedure used to handle partial labels
- GLEN benchmark created using ontology and distantly-supervised training data
Event ontology construction
- DWD Overlay is an effort to align Wiki-Data Qnodes to PropBank rolesets, argument structures, and LDC tagsets.
- Each event type entry includes properties from WikiData, PropBank, and LDC.
- Ontology is a superset of ACE and ERE.
- Cognitive events without physical state change are removed.
- Manual clean up of event types associated with rolesets.
Distant supervision data
- Reuse existing PropBank annotations
- Partial label challenge when using distantly-supervised data
- Perform sentence de-duplication, remove sentences with less than 3 tokens, omit special tokens
- Ensure every trigger is a continuous token span, remove events with overlapping triggers
- Remove triggers with a part-of-speech tag of MD or TO
- Manually inspect most popular rolesets, remove general/ambiguous ones
- Remove rolesets with less than 3 event mentions across all datasets
Data split and annotation
- Dataset split into train, dev, and test sets based on documents
- Stratified sampling used for datasets with multiple genres
- Annotation task formulated as multiple-choice question
- Mechanical Turk workers screened for development set
- Manual inspection for PropBank rolesets labeled as “None of the above”
- Third pass of annotation for affected instances and instances with disagreement
Data analysis
- Examined event ontology by visualizing it as a hierarchy
- Ontology offers wider range of diverse events
- Event type distribution closely mirrors real-world event distributions
- Compared type distribution to ACE and MAVEN
- Part-of-speech distribution of trigger words similar to ACE and MAVEN
Method
- Goal is to find trigger word offset and event type for every event
- Challenge is large ontology size and partial labels from distant supervision
- Separate trigger identification from event typing to learn with clean data
- Break event typing into two stages to narrow down search space
- Self-labeling procedure to mitigate effect of partial labels
Trigger identification
- Goal is to identify location of event trigger words in sentence
- Obtain sentence token representations using pre-trained language model
- Compute scores for each token being start, end, or part of event trigger
- Compute probability of span being an event trigger
- Model is trained using binary cross entropy loss on candidate spans
Experiments
Experiment setting
- Evaluation Metrics used: Trigger Identification F1, Trigger Classification F1, Hit@K
- Baselines used: Token Classification, InstructGPT (Ouyang et al., 2022)
- Implementation Details: Hyperparameters, maximum token span length of 10, designed loss function with τ = 1.0, single Tesla P100 GPU with 16GB DRAM
Results
- DMBERT has lower trigger identification score if more than 1 token is predicted
- TokCls and SpanCls have similar performance on final trigger classification
- ZED utilizes event type definitions but only achieves minor F1 boost
- Ours, TokCls, and ZED have different predictions for examples
- InstructGPT with in-context learning does not perform well
- Decoupling trigger identification from classification improves TI performance
- Joint encoding of definition and context improves TC performance
- Self-labeling can help improve top-1 classification performance
Analysis
- Investigating which component is the main bottleneck for performance
- Investigating if model suffers from label imbalance between types
- Investigating how self-labeling procedure contributes to performance
Per-stage performance
- Model has 3 components: trigger identification, event type ranking, and event type classification
- Evaluation of model used precision, recall, and F1 scores
- Maximum input length was 2049 tokens
- Hit@k metrics used to assess performance of event type ranking and classification
- Primary bottleneck in precision of trigger identification and Hit@1 score of type classification
Type imbalance
- Figure 4 shows the long-tailed label distribution in the dataset.
- Performance per group is calculated and shown in Figure 8.
- Popular groups have highest F1 score, but remaining groups have comparable scores.
- Two factors define dataset difficulty: ambiguity of event type and frequency of event type.
- Table 6 shows that main performance gain comes from self-labeled data.
Remaining errors
- Most errors come from noisy annotation
- Extended Roleset refers to predicted event associated with same predicate as ground truth
- 22.6% of cases predict event close to ground truth on XPO hierarchy
- Uncatagorized errors due to imperfect recall of event ranking module or no connections in hierarchy
Related work
- ACE05 set the event extraction paradigm which consists of event detection and argument extraction
- MAVEN dataset has a larger ontology selected from a subset of FrameNet
- Few-Event dataset is a compilation of ACE, KBP, and Wikipedia data
- DCFEE, ChFinAnn, RAMS, WikiEvents, Do-cEE datasets focus on argument extraction
- Weak supervision used to increase amount of labeled data for event extraction
- WordNet provides a larger source of possible event triggers
Conclusions and future work
- Introduced GLEN dataset with 3k event types for all domains
- Multistage model designed to handle large ontology size and partial labels
- Model can be used as off-the-shelf event detection tool
- Hyperparameters for baseline models listed in Table 9
- GLEN offers broader coverage of events than ACE05
- Label distribution of GLEN, ACE05 and MAVEN shown in Figures 4, 5 and 6
- Relationship between threshold and accuracy of label selection shown in Figure 7
- Relationship between number of instances of different event types and model performance shown in Figure 8
- Categorization of remaining errors from system shown in Figure 9
- Annotation interface built with Amazon Mechanical Turk shown in Figure 10
- Truncated version of prompt to InstructGPT shown in Figure 11
- Statistics of GLEN dataset compared to ACE05 and MAVEN shown in Figure 3
- Training hyperparameters of model shown in Table 9
- Comparison across different systems shown in Figure 8
- Correct predictions shown in green, predictions that map to same PropBank roleset shown in orange
- Hit@1 scores before and after self-labeling on different categories of PropBank rolesets shown
- Examples of erroneous type predictions and removed Qnodes in XPO Overlay shown