Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Generative models have implications beyond computer science
- LAION-2B is a large image database with two billion images
- Manual inspection and automated analysis of LAION-2B is difficult
- Duplicated images in LAION-2B pose copyright problems
- Algorithmic chain proposed to detect duplicates in LAION-2B
- 30% of LAION-2B images likely duplicated
- Histograms of duplication used to reveal more examples of verbatim copies
- De-duplicated set will be distributed online
Paper Content
Introduction
- AI models have a societal impact beyond computer science
- Large image databases have improved computer vision
- LAION-5B is a publicly available dataset with billions of image-caption pairs
- Datasets are collected via automated web scrapers
- Duplicate images can cause problems
- Retrieval systems are used to find duplicates in large datasets
Related work
- CLIP network has achieved SOTA performance on zero-shot and transfer tasks
- Employs contrastive loss to align image and text feature representations
- Used to condition text-to-image models
- Open source repository OpenCLIP has reproduced results of original CLIP paper
- Carlini et al. used CLIP features to filter duplicated images
- Many methods proposed for image de-duplication, including perceptual hashes and end-to-end representations
- LAION has released a set of CLIP features and a nearest neighbor index for L2B dataset
Clip feature compression
- Mean squared error is used as a standard technique for feature compression
- Compression rate is controlled by the output dimension of the encoder
- Compression can be done on either image or text CLIP descriptors
- Contrastive loss function proposed in [18] is used to preserve text and image feature alignment
- Nearest neighbor search is done exhaustively by measuring L2 distances between all pairs in a chunk
- Approximate nearest neighbor search techniques are used to search through CLIP descriptors at billions scale
- Two-level quantizer is used to compress database vectors
Image similarity search
- Generated a synthetic ground truth by computing a small set of k-nearest neighbors using brute force
- Compared index retrieval results with ground truth
- Approach used to construct ground truth similar to one described in [11]
- Ablations over latent space dimensions
- Used IVFPQ index for every index built atop of descriptors
- Compared to vanilla IVFPQ on raw features
- Used 2 16 centroids for IVF and collections of indices with different values of M for PQ
- Used AutoFaiss to construct and efficient PQ index
- Used OPQ and HNSW for AutoFaiss indices
- CLIP network demonstrated impressive performance on zero-shot ImageNet classification task
- Explored how well descriptors preserve multi-modal information
- Compared to AutoFaiss hosted by LAION
- Created Pareto fronts by varying number of chunks for product quantization and n probe parameter
- MSE and SNIP performed similarly on image similarity search
- CLIP networks performed better on multi-modal tasks
- SNIP seen as best of both worlds descriptor for these types of indices
De-duplicating laion 2b
- Investigated several index creation pipelines to de-duplicate LAION-2B
- Goal was to find many duplicates with decent precision quickly
- Applied an adaptive threshold based on the d ADC between a query vector and its quantized version
- Best performing indices were selected to perform the deduplication of LAION2B
- De-duplication took only several days
Investigating verbatim copies in stable diffusion
- Recently, a paper demonstrated that stable diffusion can copy training images on certain prompts.
- The paper used duplication as a weak filter to select candidates for copied images and synthesized nearly 175 million images.
- The authors concentrated on a subset of one hundred images from the most duplicated images on L2B.
- They discovered numerous images that were verbatim copies of training data, as well as images that were highly duplicated but not copied verbatim.
Conclusions
- Algorithmic chain de-duplicated LAION-2B-en with modest resources and decent precision
- Code released to download deduplicated datasets, representative sets, duplication histograms and SNIP indices
- Dataset usability improved, generative models have fewer copyright issues
- Novel pipeline does not require pre-training or recomputation of billions of clip features
- Investigated effects of compression on tasks requiring multi-modality of image features
- Image feature only retrieval pareto fronts for ViT-H-14 indices on L400M
- MSE based losses perform similarly to contrastive ones
- Found several new verbatim copied images by Stable Diffusion