Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Mining large corpora is time-consuming for humans
- Formulated a new task, D5, to automatically discover differences between two large corpora
- Task input is a research goal and a corpus pair
- Output is a language description of how the corpora differ
- Built a D5 system and contributed a meta-dataset and proposed unified evaluation metrics
- Confirmed language models can use goals to propose relevant, novel, and significant discoveries
- System produces discoveries previously unknown to the authors on a wide range of applications
Paper Content
Introduction
- Processes of generating discoveries from large corpora are ad hoc and laborious
- Machine learning can potentially accelerate these discovery processes
- Formulated one family of these processes as an ML task with unified metrics and input-output space
- Task is goal driven discovery of differences between text distributions via language descriptions
- Input is a “problem” comprising a description of the research goal and a corpus pair
- Output is a “discovery” represented as a natural language predicate
- Evaluate a discovery using two categories of criteria: validity and meaningfulness
- Curate OPEND5, a meta-dataset with 675 open-ended D5 problems
- Built a D5 system to tackle problems in OPEND5
- System produces valid and meaningful discoveries in natural language as outputs
- Evaluated system and found it produces relevant hypotheses more often than baseline
- Automate discoveries, train better D5 systems, and analyze limitations of evaluation
- OPEND5 allows benchmarking, automation, analysis, and learning of D5 task
Evaluation
- Evaluate system-generated discovery by determining if more samples from Corpus A satisfy the predicate
- Subjective judgement needed to determine if discovery is meaningful to research goal of understanding side effects
Validity
- Requires output discovery h to be a truth predicate on a text sample
- Define T (h, x) as certainty that h is true on x
- Approximate T (h, x) by asking three Turkers and averaging responses
- Define validity V as mean of T (h, x) on subset from Corpus A and B
- Compute p-value for null hypothesis that V ≤ 0 by conducting t-test
- Ideal discovery should have large V value and small p-value
Meaningfulness
- Valid discoveries may not be meaningful
- Relevance, novelty and significance can be used to rate how meaningful a discovery is
- Relevance is based on how related the discovery is to the research goal
- Novelty is based on how difficult it is to generate the discovery
- Significance is based on how beneficial it is to learn the discovery
- An ideal discovery should have high ratings for all three submetrics
Method
- System maps from corpus pair and research goal to set of natural language predicates
- Inspired by two-stage model of how humans discover patterns in data
- Propose hypotheses conditioned on research goal and subset of samples from corpus pair
- Validate each hypothesis to see if it is more often true on one corpus than the other
- Leverage research goal to propose more meaningful hypotheses
Hypothesis proposer
- Prompted GPT-3 to propose hypotheses
- Included research goal in prompt to elicit meaningful hypotheses
Hypothesis validator
- Hypotheses in H init are often invalid
- Use language model T to simulate Turkers’ judgement and approximate validity score V of hypothesis h
- Use FLAN-T5 to ask whether x satisfies h
- Collect additional Turker annotations to fine-tune FLAN-T5
- Perform t-test to compare mean value of V (h, x) on research split of Corpus A and mean value on Corpus B
- Rule out hypotheses with p-value greater than 0.001
- Repeat process to propose and validate hypotheses about Corpus B
Goal leads to more meaningful hypotheses
- Added research goal to prompt when generating hypotheses
- Sampled 100 problems from OPEND5 with distinct research goals
- Randomly sampled 2 hypotheses from GPT-3 with and without research goal
- Three authors rated meaningfulness based on three metrics
- GPT-3 proposed more relevant, novel, and significant hypotheses with research goal
- Inter-annotator agreement rate was moderate (0.37-0.56)
- P-values for null hypothesis were highly significant and robust across evaluators
Application
- System used to automatically generate discoveries on OPEND5
- 402 problems in total, 3296 discoveries
- 21 discoveries manually selected
- Estimated validity of discoveries based on procedure described in Section 3.1
- 15 discoveries with V that are significantly nonzero with p-value below 7%
- Examples of discoveries given
- Future works suggested to collect more open problems
Self-supervised learning
- Designed a self-supervised learning algorithm to improve a language model’s ability to propose more valid hypotheses
- Used a set of problems for training and an initial language model
- Generated a set of prompt-completion pairs to fine-tune the language model
- Used 33 corpora to create 4503 text clusters
- Sampled 30,000 mini-problems for training and 200/1500 for evaluation
- Automated validity score improved from 0.22 to 0.37, and true validity score improved from 0.07 to 0.10
Analysis
- OPEND5 is used to analyze limitations of metrics
- Hypotheses about corpora might not be appropriate predicates on individual samples
- GPT-3 is used to detect and remove comparatives from hypotheses
- Metrics do not evaluate diversity
- Interpreting discoveries requires domain experts
- Metrics do not evaluate causality of discoveries
Related work
- Language models can be used to improve accuracy of zero/few-shot tasks
- ML models can perform inductive reasoning in other modalities, such as vision
- Automated Discovery is not new and can be done with linear regression, n-gram models, decision trees, and entity embedding models
- D5 produces discoveries in the form of natural language predicates
- Evaluating knowledge discoveries is not well-understood
- Meaningfulness of a hypothesis is dependent on implicit community norms
Conclusion
- Formalized the task of D5 to discover corpus-level differences
- Defined evaluation metrics for D5
- Collected meta-dataset OPEND5 to evaluate D5 systems
- Presented 10 use cases on D5
- Proposed a self-supervised learning algorithm
- Analyzed the limitation of current evaluation metrics
- Automated, benchmarked, learned, and analyzed D5 like any other traditional machine learning task
- Selected samples to prompt GPT-3
- Rewrote hypotheses with GPT-3
- Optionally plugged in example hypotheses
- Collected data by producing a list of hypotheses and automatically judging them
- Tested cross problem generalization capability
- Described technical and organizational challenges
- Extended to more flexible expressions
- Validated discoveries by comparing how often they are true on each corpus
- Clarified ambiguous discoveries
- Collected ABC headlines
- Compared GPT-3 generated hypotheses to human annotators