Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Generative models have implications beyond computer science
  • LAION-2B is a large image database with two billion images
  • Manual inspection and automated analysis of LAION-2B is difficult
  • Duplicated images in LAION-2B pose copyright problems
  • Algorithmic chain proposed to detect duplicates in LAION-2B
  • 30% of LAION-2B images likely duplicated
  • Histograms of duplication used to reveal more examples of verbatim copies
  • De-duplicated set will be distributed online

Paper Content

Introduction

  • AI models have a societal impact beyond computer science
  • Large image databases have improved computer vision
  • LAION-5B is a publicly available dataset with billions of image-caption pairs
  • Datasets are collected via automated web scrapers
  • Duplicate images can cause problems
  • Retrieval systems are used to find duplicates in large datasets
  • CLIP network has achieved SOTA performance on zero-shot and transfer tasks
  • Employs contrastive loss to align image and text feature representations
  • Used to condition text-to-image models
  • Open source repository OpenCLIP has reproduced results of original CLIP paper
  • Carlini et al. used CLIP features to filter duplicated images
  • Many methods proposed for image de-duplication, including perceptual hashes and end-to-end representations
  • LAION has released a set of CLIP features and a nearest neighbor index for L2B dataset

Clip feature compression

  • Mean squared error is used as a standard technique for feature compression
  • Compression rate is controlled by the output dimension of the encoder
  • Compression can be done on either image or text CLIP descriptors
  • Contrastive loss function proposed in [18] is used to preserve text and image feature alignment
  • Nearest neighbor search is done exhaustively by measuring L2 distances between all pairs in a chunk
  • Approximate nearest neighbor search techniques are used to search through CLIP descriptors at billions scale
  • Two-level quantizer is used to compress database vectors
  • Generated a synthetic ground truth by computing a small set of k-nearest neighbors using brute force
  • Compared index retrieval results with ground truth
  • Approach used to construct ground truth similar to one described in [11]
  • Ablations over latent space dimensions
  • Used IVFPQ index for every index built atop of descriptors
  • Compared to vanilla IVFPQ on raw features
  • Used 2 16 centroids for IVF and collections of indices with different values of M for PQ
  • Used AutoFaiss to construct and efficient PQ index
  • Used OPQ and HNSW for AutoFaiss indices
  • CLIP network demonstrated impressive performance on zero-shot ImageNet classification task
  • Explored how well descriptors preserve multi-modal information
  • Compared to AutoFaiss hosted by LAION
  • Created Pareto fronts by varying number of chunks for product quantization and n probe parameter
  • MSE and SNIP performed similarly on image similarity search
  • CLIP networks performed better on multi-modal tasks
  • SNIP seen as best of both worlds descriptor for these types of indices

De-duplicating laion 2b

  • Investigated several index creation pipelines to de-duplicate LAION-2B
  • Goal was to find many duplicates with decent precision quickly
  • Applied an adaptive threshold based on the d ADC between a query vector and its quantized version
  • Best performing indices were selected to perform the deduplication of LAION2B
  • De-duplication took only several days

Investigating verbatim copies in stable diffusion

  • Recently, a paper demonstrated that stable diffusion can copy training images on certain prompts.
  • The paper used duplication as a weak filter to select candidates for copied images and synthesized nearly 175 million images.
  • The authors concentrated on a subset of one hundred images from the most duplicated images on L2B.
  • They discovered numerous images that were verbatim copies of training data, as well as images that were highly duplicated but not copied verbatim.

Conclusions

  • Algorithmic chain de-duplicated LAION-2B-en with modest resources and decent precision
  • Code released to download deduplicated datasets, representative sets, duplication histograms and SNIP indices
  • Dataset usability improved, generative models have fewer copyright issues
  • Novel pipeline does not require pre-training or recomputation of billions of clip features
  • Investigated effects of compression on tasks requiring multi-modality of image features
  • Image feature only retrieval pareto fronts for ViT-H-14 indices on L400M
  • MSE based losses perform similarly to contrastive ones
  • Found several new verbatim copied images by Stable Diffusion