Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Progress in machine learning is driven by large datasets
- LAION datasets are largely uncurated
- SemDeDup uses embeddings from pre-trained models to identify and remove semantic duplicates
- SemDeDup can remove 50% of the data with minimal performance loss, halving training time
- Performance increases out of distribution
- SemDeDup improves over prior approaches and provides efficiency gains
Paper Content
Related work
- Work in language and vision has focused on removing exact duplicates
- C4 text corpus was deduplicated by discarding repeated 3-sentence spans
- MinHash technique used to further deduplicate dataset without loss of performance
- Deduplication prevents memorization in LLMs and mitigates privacy concerns
- Model-based feature extraction used to improve similarity metric for deduplication
SemDeDup
- Identifying semantic duplicates is more difficult than identifying perceptual duplicates.
- Leverage embedding space of pre-trained foundation model to provide semantically meaningful distance metric.
- Use SemDeDup algorithm to detect and remove semantically similar images.
- Utilize pre-trained foundation models to embed data examples.
Clustering to reduce computation
- Naive de-duplication has a time complexity of O(n^2)
- This approach is not practical for large web-scale data
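To make the scaling argument concrete, the following sketch counts pairwise comparisons with and without clustering. It assumes k equally sized clusters, which is an idealization not stated in the summary, and it ignores the cost of running k-means itself. The function names are hypothetical; the values of n and k are the LAION figures quoted elsewhere in this summary.

```python
def naive_comparisons(n: int) -> int:
    """Number of unordered pairs among n examples: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def clustered_comparisons(n: int, k: int) -> int:
    """Pairs compared when only examples in the same cluster are compared.

    Assumes k equally sized clusters of n // k examples each (an idealization).
    """
    per_cluster = n // k
    return k * naive_comparisons(per_cluster)


n = 440_000_000  # roughly the size of LAION-440M
k = 50_000       # number of k-means clusters reported for LAION
reduction = naive_comparisons(n) / clustered_comparisons(n, k)

print(f"naive pairs:      {naive_comparisons(n):.3e}")
print(f"clustered pairs:  {clustered_comparisons(n, k):.3e}")
print(f"reduction factor: {reduction:.3e}")
```

Under this assumption the reduction factor is close to k, so a larger k makes the search cheaper but also raises the chance that two duplicates land in different clusters and are never compared.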
SemDeDup on LAION
- Cosine similarity is used to determine semantic duplicates
- As the deduplication dissimilarity threshold increases, the matched pairs shift from perceptual duplicates to semantic duplicates and, at large thresholds, to semantically redundant pairs
- LAION dataset contains extreme amounts of semantic redundancy
- SemDeDup can discover semantic redundancy in multi-modal data
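The thresholding step can be illustrated with a small greedy sketch over a single cluster. It assumes the convention that two examples count as semantic duplicates when their cosine similarity exceeds 1 - eps, where eps is the dissimilarity threshold. The function names and the toy two-dimensional embeddings are hypothetical, and this greedy loop is a simplification rather than the paper's exact procedure.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dedup_cluster(embeddings: List[Sequence[float]], eps: float) -> List[int]:
    """Return indices of examples kept after greedy deduplication.

    An example is dropped if its cosine similarity to any already kept
    example exceeds 1 - eps (assumed duplicate criterion).
    """
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        is_duplicate = any(
            cosine_similarity(emb, embeddings[j]) > 1.0 - eps for j in kept
        )
        if not is_duplicate:
            kept.append(i)
    return kept


cluster = [
    [1.0, 0.0],
    [0.999, 0.04],  # nearly the same direction as item 0
    [0.0, 1.0],     # orthogonal to item 0
]
loose = dedup_cluster(cluster, eps=0.01)     # larger eps removes more
strict = dedup_cluster(cluster, eps=0.0001)  # smaller eps removes less
print(loose, strict)
```

Raising eps treats progressively less similar pairs as duplicates, which is how the threshold controls the size of the deduplicated dataset.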
Datasets and training
- Used LAION dataset containing up to 5 billion image-text pairs
- Filtered using pre-trained CLIP model and removed short captions and small images
- A simple de-duplication method based on image URL was also applied
- Majority of experiments on LAION-440M filtered subset
- CAT filtering criteria used
- Used reduced version of LAION-400M subset containing 233 million data points
- CLIP training with OpenCLIP implementation
- CLIP evaluation on 30 different datasets
Extreme semantic redundancy at web-scale
- LAION-440M contains a lot of semantic duplicates
- 30% of images in LAION-440M have a semantic duplicate at a highly stringent distance threshold
- 50% of images in LAION-440M have a semantic duplicate at a tight threshold
- Histogram of pairwise cosine similarity in LAION-440M reveals high density of pairs at high cosine similarity, including a large contribution at 1
What do semantic duplicates look like?
- Semantic duplicates are images with distortions that evade exact de-duplication approaches
- Semantic duplicates often contain different, but highly similar captions
- Semantic redundancy is when the same concept is present, but not derived from the same image source
Training on semantically deduplicated data improves efficiency
Methods
- Train language models on deduplicated versions of C4 dataset
- Train on subsets of data that are compute optimal
- Use OPT model and training configurations
- Adjust learning rate schedule to anneal to 0
- Evaluate models on two independent validation sets
- Pass documents through pre-trained 125M OPT model
- Cluster embeddings with K = 11000
- Compare to random pruning and NearDup method
Results on language modeling
- SemDeDup outperforms random pruning in terms of perplexity and average opt_valid performance
- SemDeDup beats random pruning on every single validation set in opt_valid
- Training on smaller pruned datasets for multiple epochs can match the performance of a baseline model trained on a larger dataset
- More aggressive pruning can yield more efficiency gains
What is being pruned in language data?
- Semantic duplicates can be found in the form of templated text
- Semantically redundant duplicates can be found in the form of advertisements
- Exact string duplicates are rare
- 96.1% of data can be kept with NearDup method
- SemDeDup can keep 80% of data while matching NearDup performance
Number of k-means clusters for SemDeDup
- Changing the number of clusters (k) in k-means clustering affects performance
- In experiments, k was set to 50,000 for LAION dataset and 11,000 for C4 dataset
- Different values of k were tested when deduplicating LAION-440M
- Choice of k affects probability of recovering all semantic duplicates and computational complexity
Pre-trained models for extracting embeddings
- SemDeDup clusters example embeddings from a pre-trained foundation model for deduplication.
- An OpenAI CLIP model, pre-trained on a dataset other than LAION, was used to deduplicate the LAION-440M dataset to 40% of its size.
- Using the OpenAI CLIP model for extracting embeddings had a negligible impact on performance.
Different strategies for choosing which semantic duplicates to keep
- SemDeDup is used to deduplicate a dataset
- The example with the lowest cosine similarity to the cluster centroid is kept from each group of duplicates
- Three CLIP models are trained on the dataset deduplicated to 40% of its original size
- Three options are tested for choosing the examples to keep: low similarity to centroids, random examples, and high similarity to cluster centroids
- The difference between the three methods in zero-shot accuracy on ImageNet is negligible
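The three selection strategies compared above can be sketched as a single function. The function name, the strategy labels, and the toy embeddings are hypothetical and chosen only for illustration; non-zero embeddings are assumed.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def choose_example(
    group: List[Sequence[float]],
    centroid: Sequence[float],
    strategy: str,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the index of the one example to keep from a group of duplicates."""
    sims = [cosine_similarity(emb, centroid) for emb in group]
    indices = range(len(group))
    if strategy == "low_similarity":   # farthest from the centroid
        return min(indices, key=lambda i: sims[i])
    if strategy == "high_similarity":  # closest to the centroid
        return max(indices, key=lambda i: sims[i])
    if strategy == "random":
        return (rng or random.Random()).randrange(len(group))
    raise ValueError(f"unknown strategy: {strategy}")


group = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
centroid = [1.0, 0.0]
low = choose_example(group, centroid, "low_similarity")
high = choose_example(group, centroid, "high_similarity")
rand = choose_example(group, centroid, "random", random.Random(0))
print(low, high, rand)
```

Since the summary reports a negligible difference in zero-shot ImageNet accuracy between the strategies, the choice mostly affects which near-identical example survives, not how many.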
Training on deduplicated data for more iterations improves performance
- Training on deduplicated data requires fewer iterations.
- Good trade-off between performance and training speed.
- Training on 50% of LAION-440M for the same number of epochs as the baseline model requires 50% of the training iterations.
- Tuning deduplication threshold manually to get desired deduplicated dataset size.
- Computational cost of deduplication can be amortized across efficiency gains.
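The relationship between dataset size and iteration count is simple arithmetic, shown here as a sketch. The epoch count and batch size are hypothetical placeholders, not values taken from the paper; only the ratio between the two results matters.

```python
import math


def training_iterations(num_examples: int, epochs: int, batch_size: int) -> int:
    """Optimizer steps needed to see every example `epochs` times."""
    return epochs * math.ceil(num_examples / batch_size)


EPOCHS = 10        # hypothetical
BATCH_SIZE = 1000  # hypothetical

full = training_iterations(440_000_000, EPOCHS, BATCH_SIZE)
half = training_iterations(220_000_000, EPOCHS, BATCH_SIZE)
print(full, half, half / full)
```

With the epoch count held fixed, halving the dataset halves the iterations; training the smaller dataset for more epochs spends part of that saving to recover performance.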
Discussion
- Introduced SemDeDup to remove semantic duplicates
- Improves learning speed and out-of-distribution performance
- Efficiency gains of up to 50% on LAION and 15% on C4
- Requires access to pre-trained embedding model
Additional analysis
- To assess the impact of changing the value of k, the intersection between datasets deduplicated by SemDeDup using different values for k was measured.
- The percentage of intersection I between two datasets of the same size was defined as the percentage of data points that appear in both datasets relative to the dataset size.
- Deduplicating the LAION-440M dataset to 72% of its size with any of the tested values of k (10000, 25000, 50000, 70000) yields almost the same dataset, with only 3% of the examples replaced when k changes.
- SemDeDup searches for duplicates within clusters, reducing the FLOPs required for deduplication by five orders of magnitude.
- The deduplication efficiency η was defined as the fraction of duplicates detected by SemDeDup out of the total number of duplicates in the dataset at a specific value of the deduplication threshold ε.
- SemDeDup can effectively detect more than 94% of the duplicates when keeping 63% of LAION-440M dataset and 89% of the duplicates when keeping 40%.
- Models trained on dataset de-duplicated using SemDeDup outperform the baseline model in many tasks.
- SemDeDup outperforms the baseline model in 19 out of the 30 tasks when using only 63% of LAION-440M.
- SemDeDup can match the baseline performance while keeping only 80% of the data.
- SemDeDup allows compute efficiency gains by training on much smaller datasets for slightly longer.
- The performance of SemDeDup and random pruning is compared for different amounts of retained data.
- SemDeDup is robust to the choice of k, and the impact on zero-shot accuracy on ImageNet is small.
- Different strategies for choosing which example to keep from each group of duplicates are compared.
- Training on only 50% of LAION-440M, deduplicated using SemDeDup, gives better performance than training on the whole of LAION-440M.
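The two metrics defined in this section, the intersection percentage I and the deduplication efficiency η, can be written down directly from their definitions. This is a minimal sketch; the function names and the toy ID lists are hypothetical, and the numbers are chosen only to mirror the 3% and 94% figures quoted above.

```python
from typing import Hashable, Iterable


def intersection_percentage(
    dataset_a: Iterable[Hashable], dataset_b: Iterable[Hashable]
) -> float:
    """Percentage of data points shared by two equally sized datasets."""
    a, b = set(dataset_a), set(dataset_b)
    if not a or len(a) != len(b):
        raise ValueError("datasets must be non-empty and of equal size")
    return 100.0 * len(a & b) / len(a)


def dedup_efficiency(detected_duplicates: int, total_duplicates: int) -> float:
    """Fraction of all duplicates that the method detected (eta)."""
    if total_duplicates <= 0:
        raise ValueError("total_duplicates must be positive")
    return detected_duplicates / total_duplicates


# Toy example: IDs kept under two different values of k.
kept_k1 = list(range(100))
kept_k2 = list(range(3, 100)) + [100, 101, 102]  # 3 of 100 examples differ

overlap = intersection_percentage(kept_k1, kept_k2)
eta = dedup_efficiency(detected_duplicates=94, total_duplicates=100)
print(overlap, eta)
```

Computing η on a real dataset requires the full set of duplicates as a reference, which is the expensive quantity that clustering is meant to avoid, so it is only practical as an offline diagnostic.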