Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Progress in machine learning is driven by large datasets
  • LAION datasets are largely uncurated
  • SemDeDup uses embeddings from pre-trained models to identify and remove semantic duplicates
  • SemDeDup can remove 50% of the data with minimal performance loss, halving training time
  • Performance increases out of distribution
  • SemDeDup improves over prior approaches and provides efficiency gains

Paper Content

  • Work in language and vision has focused on removing exact duplicates
  • C4 text corpus was deduplicated by discarding repeated 3-sentence spans
  • MinHash technique used to further deduplicate dataset without loss of performance
  • Deduplication prevents memorization in LLMs and mitigates privacy concerns
  • Model-based feature extraction used to improve similarity metric for deduplication

Semdedup

  • Identifying semantic duplicates is more difficult than perceptual duplicates.
  • Leverage embedding space of pre-trained foundation model to provide semantically meaningful distance metric.
  • Use SemDeDup algorithm to detect and remove semantically similar images.
  • Utilize pre-trained foundation models to embed data examples.

Clustering to reduce computation

  • Naive de-duplication has a time complexity of O(n2)
  • This approach is not practical for large web-scale data

Semdedup on laion

  • Cosine similarity is used to determine semantic duplicates
  • Increasing the deduplication dissimilarity threshold allows semantically redundant data pairs
  • LAION dataset contains extreme amounts of semantic redundancy
  • SemDeDup can discover semantic redundancy in multi-modal data

Datasets and training

  • Used LAION dataset containing up to 5 billion image-text pairs
  • Filtered using pre-trained CLIP model and removed short captions and small images
  • De-duplication method based on image url
  • Majority of experiments on LAION-440M filtered subset
  • CAT filtering criteria used
  • Used reduced version of LAION-400M subset containing 233 million data points
  • CLIP training with OpenCLIP implementation
  • CLIP evaluation on 30 different datasets

Extreme semantic redundancy at web-scale

  • LAION-440M contains a lot of semantic duplicates
  • 30% of images in LAION-440M have a semantic duplicate at a highly stringent distance threshold
  • 50% of images in LAION-440M have a semantic duplicate at a tight threshold
  • Histogram of pairwise cosine similarity in LAION-440M reveals high density of pairs at high cosine similarity, including a large contribution at 1

What do semantic duplicates look like?

  • Semantic duplicates are images with distortions that evade exact de-duplication approaches
  • Semantic duplicates often contain different, but highly similar captions
  • Semantic redundancy is when the same concept is present, but not derived from the same image source

Training on semantically deduplicated data improves efficiency

Methods

  • Train language models on deduplicated versions of C4 dataset
  • Train on subsets of data that are compute optimal
  • Use OPT model and training configurations
  • Adjust learning rate schedule to anneal to 0
  • Evaluate models on two independent validation sets
  • Pass documents through pre-trained 125M OPT model
  • Cluster embeddings with K = 11000
  • Compare to random pruning and NearDup method

Results on language modeling

  • SemDeDup outperforms random pruning in terms of perplexity and average opt_valid performance
  • SemDeDup beats random pruning on every single validation set in opt_valid
  • Training on smaller pruned datasets for multiple epochs can match the performance of a baseline model trained on a larger dataset
  • More aggressive pruning can yield more efficiency gains

What is being pruned in language data?

  • Semantic duplicates can be found in the form of templated text
  • Semantically redundant duplicates can be found in the form of advertisements
  • Exact string duplicates are rare
  • 96.1% of data can be kept with NearDup method
  • SemDeDup can keep 80% of data while matching NearDup performance

Number of k-means clusters for semdedup

  • Changing the number of clusters (k) in k-means clustering affects performance
  • In experiments, k was set to 50,000 for LAION dataset and 11,000 for C4 dataset
  • Different values of k were tested when deduplicating LAION440M
  • Choice of k affects probability of recovering all semantic duplicates and computational complexity

Pre-trained models for extracting embeddings

  • SemDeDup clusters example embeddings from a pre-trained foundation model for deduplication.
  • OpenAI CLIP model pre-trained on a different dataset than LAION was used to deduplicate LAION440M dataset to 40% of its size.
  • Using Open AI CLIP model for extracting embeddings had a negligible impact on the performance.

Different strategies for choosing which semantic duplicates to keep

  • SemDeDup is used to deduplicate a dataset
  • The example with the lowest cosine similarity to the cluster centroid is kept from each group of duplicates
  • Three CLIP models are trained on 40% of the deduplicated dataset
  • Three options are tested for choosing the examples to keep: low similarity to centroids, random examples, and high similarity to cluster centroids
  • The difference between the three methods in zero-shot accuracy on ImageNet is negligible

Training on deduplicated data for more iterations improves performance

  • Training on deduplicated data requires fewer iterations.
  • Good trade-off between performance and training speed.
  • Training on 50% of LAION440M for same number of epochs as baseline model results in 50% of the number of training iterations.
  • Tuning deduplication threshold manually to get desired deduplicated dataset size.
  • Computational cost of deduplication can be amortized across efficiency gains.

Discussion

  • Introduced SemDeDup to remove semantic duplicates
  • Improves learning speed and out-of-distribution performance
  • Efficiency gains of up to 50% on LAION and 15% on C4
  • Requires access to pre-trained embedding model

A additional analysis

  • To assess the impact of changing the value of k, the intersection between datasets deduplicated by SemDeDup using different values for k was measured.
  • The percentage of intersection I between two datasets of the same size was defined as the percentage of data points that appear in both datasets relative to the dataset size.
  • Deduplicating LAION440M dataset to 72% of its size using any value of k values (10000, 25000, 50000, 70000) results in almost the same dataset with only 3% of the examples replaced when changing k.
  • SemDeDup searches for duplicates within clusters, reducing the FLOPs required for deduplication by five orders of magnitude.
  • The deduplication efficiency η was defined as the fraction of duplicates detected by SemDeDup from the total number of duplicates in the datasets at a specific value of .
  • SemDeDup can effectively detect more than 94% of the duplicates when keeping 63% of LAION-440M dataset and 89% of the duplicates when keeping 40%.
  • Models trained on dataset de-duplicated using SemDeDup outperform the baseline model in many tasks.
  • SemDeDup outperforms the baseline model in 19 out of the 30 tasks when using only 63% of LAION-440M.
  • SemDeDup can match the baseline performance while keeping only 80% of the data.
  • SemDeDup allows compute efficiency gains by training on much smaller datasets for slightly longer.
  • Performance of SemDeDup and random pruning for different amounts of retained data.
  • SemDeDup is robust to the choice of k and the impact on the zeroshot accuracy on ImageNet is small.
  • Different strategies to choose the example to keep from each group of duplicates.
  • By training on only 50% of LAION440M, deduplicated using SemDeDup, better performance than training on whole LAION440M.