Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Mining large corpora is time-consuming for humans
  • Formulated a new task, D5, to automatically discover differences between two large corpora
  • Task input is a research goal and a corpus pair
  • Output is a language description of how the corpora differ
  • Built a D5 system and contributed a meta-dataset and proposed unified evaluation metrics
  • Confirmed language models can use goals to propose relevant, novel, and significant discoveries
  • System produces discoveries previously unknown to the authors on a wide range of applications

Paper Content

Introduction

  • Processes of generating discoveries from large corpora are ad hoc and laborious
  • Machine learning can potentially accelerate these discovery processes
  • Formulated one family of these processes as an ML task with unified metrics and input-output space
  • Task is goal driven discovery of differences between text distributions via language descriptions
  • Input is a “problem” comprising a description of the research goal and a corpus pair
  • Output is a “discovery” represented as a natural language predicate
  • Evaluate a discovery using two categories of criteria: validity and meaningfulness
  • Curate OPEND5, a meta-dataset with 675 open-ended D5 problems
  • Built a D5 system to tackle problems in OPEND5
  • System produces valid and meaningful discoveries in natural language as outputs
  • Evaluated system and found it produces relevant hypotheses more often than baseline
  • Used OPEND5 to automate discoveries, train better D5 systems, and analyze the limitations of the evaluation
  • OPEND5 allows benchmarking, automation, analysis, and learning of D5 task

Evaluation

  • Evaluate a system-generated discovery by determining whether more samples from Corpus A than from Corpus B satisfy the predicate
  • Subjective judgement is needed to determine whether a discovery is meaningful for the research goal (in the paper's example, understanding a drug's side effects)

Validity

  • Requires output discovery h to be a truth predicate on a text sample
  • Define T(h, x) as the certainty that h is true on x
  • Approximate T(h, x) by asking three Turkers and averaging their responses
  • Define validity V as the difference between the mean of T(h, x) on a subset of Corpus A and the mean on a subset of Corpus B
  • Compute a p-value for the null hypothesis that V ≤ 0 by conducting a t-test
  • Ideal discovery should have large V value and small p-value
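
The definitions above can be written compactly. This is a sketch, assuming validity is the difference between the mean certainty on the two corpora, which is what makes the null hypothesis V ≤ 0 meaningful. The sample sets S_A and S_B, their sizes n_A and n_B, and the sample variances s_A^2 and s_B^2 are notation introduced here. The Welch form of the t-statistic is also an assumption, since the summary only says "a t-test".

```latex
V(h) = \frac{1}{n_A} \sum_{x \in S_A} T(h, x) - \frac{1}{n_B} \sum_{x \in S_B} T(h, x),
\qquad
t = \frac{V(h)}{\sqrt{s_A^2 / n_A + s_B^2 / n_B}}
```

A large positive t corresponds to a small p-value, matching the description of an ideal discovery.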

Meaningfulness

  • Valid discoveries may not be meaningful
  • Relevance, novelty and significance can be used to rate how meaningful a discovery is
  • Relevance is based on how related the discovery is to the research goal
  • Novelty is based on how difficult it is to generate the discovery
  • Significance is based on how beneficial it is to learn the discovery
  • An ideal discovery should have high ratings for all three submetrics

Method

  • System maps from corpus pair and research goal to set of natural language predicates
  • Inspired by two-stage model of how humans discover patterns in data
  • Propose hypotheses conditioned on research goal and subset of samples from corpus pair
  • Validate each hypothesis to see if it is more often true on one corpus than the other
  • Leverage research goal to propose more meaningful hypotheses
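
The two-stage structure above can be sketched as a small driver function. This is a minimal illustration, not the authors' implementation: the `propose` and `validate` callables, their signatures, and the `n_prompt_samples` parameter are hypothetical, and taking the first few samples stands in for whatever sampling the real system uses. The 0.001 threshold is the one reported for the validator.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: a proposer returns candidate predicates; a validator
# returns (validity, p_value) for one predicate on the two corpora.
Proposer = Callable[[str, Sequence[str], Sequence[str]], List[str]]
Validator = Callable[[str, Sequence[str], Sequence[str]], Tuple[float, float]]


def discover(
    goal: str,
    corpus_a: Sequence[str],
    corpus_b: Sequence[str],
    propose: Proposer,
    validate: Validator,
    alpha: float = 0.001,
    n_prompt_samples: int = 5,
) -> List[Tuple[str, float]]:
    """Return (hypothesis, validity) pairs that pass validation, best first."""
    # Stage 1: propose hypotheses from the goal and a small subset of samples.
    candidates = propose(
        goal, corpus_a[:n_prompt_samples], corpus_b[:n_prompt_samples]
    )
    # Stage 2: keep hypotheses that are more often true on Corpus A.
    kept = []
    for hypothesis in candidates:
        validity, p_value = validate(hypothesis, corpus_a, corpus_b)
        if validity > 0 and p_value <= alpha:
            kept.append((hypothesis, validity))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Swapping the two corpora in the call gives hypotheses that are more often true on Corpus B.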

Hypothesis proposer

  • Prompted GPT-3 to propose hypotheses
  • Included research goal in prompt to elicit meaningful hypotheses
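
To make the role of the research goal concrete, here is a sketch of how a proposer prompt might be assembled. The template wording, the function name `build_proposer_prompt`, and the "Group A"/"Group B" labels are all assumptions for illustration; the summary does not give the actual prompt text.

```python
from typing import Optional, Sequence


def build_proposer_prompt(
    samples_a: Sequence[str],
    samples_b: Sequence[str],
    goal: Optional[str] = None,
) -> str:
    """Assemble a hypothetical prompt; the goal line is included only if given."""
    lines = [f"Group A: {s}" for s in samples_a]
    lines.append("")
    lines.extend(f"Group B: {s}" for s in samples_b)
    lines.append("")
    if goal:
        # Stating the goal is what steers the model toward relevant hypotheses.
        lines.append(f"Research goal: {goal}")
    # Open-ended stem for the language model to complete with a predicate.
    lines.append("Compared to Group B, each sample from Group A")
    return "\n".join(lines)
```

Calling the function with `goal=None` gives the goal-free variant, which is useful when comparing hypotheses generated with and without the research goal.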

Hypothesis validator

  • Hypotheses in H_init are often invalid
  • Use a language model to simulate the Turkers' judgement T(h, x) and approximate the validity score V of hypothesis h
  • Use FLAN-T5 to ask whether x satisfies h
  • Collect additional Turker annotations to fine-tune FLAN-T5
  • Perform a t-test to compare the mean of the simulated T(h, x) on the research split of Corpus A with its mean on Corpus B
  • Rule out hypotheses with p-value greater than 0.001
  • Repeat process to propose and validate hypotheses about Corpus B
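
The filtering step can be sketched with the standard library alone. Several things here are assumptions: `judge` is a hypothetical stand-in for the fine-tuned FLAN-T5 scorer and returns a certainty in [0, 1]; the statistic is the Welch form; and the p-value uses a normal approximation to the t-distribution, which is reasonable only for large samples. A real implementation would likely use a proper t-test from a statistics library.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Callable, List, Sequence


def one_sided_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Approximate p-value for H0: mean(scores_a) <= mean(scores_b).

    Uses a Welch-style statistic with a normal approximation (an assumption).
    Each input needs at least two scores so the sample variance is defined.
    """
    diff = mean(scores_a) - mean(scores_b)
    std_err = math.sqrt(
        variance(scores_a) / len(scores_a) + variance(scores_b) / len(scores_b)
    )
    if std_err == 0.0:
        # No variation at all: the difference is either clear-cut or absent.
        return 0.0 if diff > 0 else 1.0
    return 1.0 - NormalDist().cdf(diff / std_err)


def filter_hypotheses(
    hypotheses: Sequence[str],
    corpus_a: Sequence[str],
    corpus_b: Sequence[str],
    judge: Callable[[str, str], float],
    alpha: float = 0.001,
) -> List[str]:
    """Keep hypotheses whose p-value does not exceed alpha."""
    kept = []
    for hypothesis in hypotheses:
        scores_a = [judge(hypothesis, x) for x in corpus_a]
        scores_b = [judge(hypothesis, x) for x in corpus_b]
        if one_sided_p_value(scores_a, scores_b) <= alpha:
            kept.append(hypothesis)
    return kept
```

Passing the corpora in the opposite order applies the same filter to hypotheses about Corpus B.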

Goal leads to more meaningful hypotheses

  • Added research goal to prompt when generating hypotheses
  • Sampled 100 problems from OPEND5 with distinct research goals
  • Randomly sampled 2 hypotheses from GPT-3 with and without research goal
  • Three authors rated meaningfulness based on three metrics
  • GPT-3 proposed more relevant, novel, and significant hypotheses with research goal
  • Inter-annotator agreement rate was moderate (0.37-0.56)
  • P-values for the null hypothesis that adding the goal makes no difference were highly significant and robust across evaluators

Application

  • System used to automatically generate discoveries on OPEND5
  • 402 problems in total, 3296 discoveries
  • 21 discoveries manually selected
  • Estimated validity of discoveries based on procedure described in Section 3.1
  • 15 discoveries have a V that is significantly nonzero, with p-values below 7%
  • Examples of discoveries given
  • Future work suggested: collect more open problems

Self-supervised learning

  • Designed a self-supervised learning algorithm to improve a language model’s ability to propose more valid hypotheses
  • Used a set of problems for training and an initial language model
  • Generated a set of prompt-completion pairs to fine-tune the language model
  • Used 33 corpora to create 4503 text clusters
  • Sampled 30,000 mini-problems for training and 200/1500 for evaluation
  • Automated validity score improved from 0.22 to 0.37, and true validity score improved from 0.07 to 0.10
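
One plausible reading of the data-generation step is sketched below: for each training mini-problem, propose several hypotheses with the initial model, score them with the automatic validator, and keep the best one as the fine-tuning target. The selection rule (best hypothesis per problem, above a minimum score) and all names (`propose`, `score`, `make_prompt`, `min_validity`) are assumptions, not details confirmed by the summary.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

Problem = TypeVar("Problem")


def build_finetuning_pairs(
    problems: Sequence[Problem],
    propose: Callable[[Problem], List[str]],
    score: Callable[[Problem, str], float],
    make_prompt: Callable[[Problem], str],
    min_validity: float = 0.0,
) -> List[Dict[str, str]]:
    """Create prompt-completion pairs from automatically judged hypotheses."""
    pairs = []
    for problem in problems:
        candidates = propose(problem)
        if not candidates:
            continue
        # Score each candidate once with the automatic validator.
        scored = [(score(problem, h), h) for h in candidates]
        best_score, best_hypothesis = max(scored, key=lambda item: item[0])
        if best_score > min_validity:
            pairs.append(
                {"prompt": make_prompt(problem), "completion": best_hypothesis}
            )
    return pairs
```

The resulting list has the prompt-completion shape commonly used for fine-tuning, so it can be written out as JSON lines with the standard `json` module.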

Analysis

  • OPEND5 is used to analyze limitations of metrics
  • Hypotheses about corpora might not be appropriate predicates on individual samples
  • GPT-3 is used to detect and remove comparatives from hypotheses
  • Metrics do not evaluate diversity
  • Interpreting discoveries requires domain experts
  • Metrics do not evaluate causality of discoveries
  • Language models can be used to improve accuracy of zero/few-shot tasks
  • ML models can perform inductive reasoning in other modalities, such as vision
  • Automated discovery is not new; prior approaches include linear regression, n-gram models, decision trees, and entity embedding models
  • D5 produces discoveries in the form of natural language predicates
  • Evaluating knowledge discoveries is not well-understood
  • Meaningfulness of a hypothesis is dependent on implicit community norms

Conclusion

  • Formalized the task of D5 to discover corpus-level differences
  • Defined evaluation metrics for D5
  • Collected meta-dataset OPEND5 to evaluate D5 systems
  • Presented 10 use cases on D5
  • Proposed a self-supervised learning algorithm
  • Analyzed the limitations of current evaluation metrics
  • Automated, benchmarked, learned, and analyzed D5 like any other traditional machine learning task
  • Selected samples to prompt GPT-3
  • Rewrote hypotheses with GPT-3
  • Optionally plugged in example hypotheses
  • Collected data by producing a list of hypotheses and automatically judging them
  • Tested cross problem generalization capability
  • Described technical and organizational challenges
  • Extended to more flexible expressions
  • Validated discoveries by comparing how often they are true on each corpus
  • Clarified ambiguous discoveries
  • Collected ABC headlines
  • Compared GPT-3 generated hypotheses to human annotators