Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

The CLIP network has achieved state-of-the-art (SOTA) performance on zero-shot and transfer tasks
Employs contrastive loss to align image and text feature representations
Used to condition text-to-image models
Open source repository OpenCLIP has reproduced results of original CLIP paper
Carlini et al. used CLIP features to filter duplicated images
Many methods proposed for image de-duplication, including perceptual hashes and end-to-end representations
LAION has released a set of CLIP features and a nearest neighbor index for the LAION-2B (L2B) dataset

Mean squared error is used as a standard technique for feature compression
Compression rate is controlled by the output dimension of the encoder
Compression can be done on either image or text CLIP descriptors
Contrastive loss function proposed in [18] is used to preserve text and image feature alignment
Nearest neighbor search is done exhaustively by measuring L2 distances between all pairs in a chunk
Approximate nearest neighbor search techniques are used to search through CLIP descriptors at billion scale
Two-level quantizer is used to compress database vectors
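
The two training objectives mentioned above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the contrastive loss of [18] is the CLIP-style symmetric cross-entropy over cosine similarities, and the function names and the temperature value of 0.07 are assumptions made for the example.

```python
import math

def mse_loss(original, reconstructed):
    """Mean squared error between two equal-sized batches of vectors."""
    total, count = 0.0, 0
    for x, y in zip(original, reconstructed):
        for a, b in zip(x, y):
            total += (a - b) ** 2
            count += 1
    return total / count

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]

def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target class of row i is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        loss += log_sum_exp - row[i]
    return loss / len(logits)

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over matching image/text pairs (assumed form)."""
    imgs = [_normalize(v) for v in image_feats]
    txts = [_normalize(v) for v in text_feats]
    # logits[i][j] = scaled cosine similarity of image i and text j
    logits = [[_dot(i, t) / temperature for t in txts] for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(logits_t))
```

The MSE term only asks the compressed descriptor to reconstruct the original one, while the contrastive term is low only when each image descriptor is closer to its own text descriptor than to the others in the batch.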

Generated a synthetic ground truth by computing a small set of k-nearest neighbors using brute force
Compared index retrieval results with ground truth
Approach used to construct ground truth similar to one described in [11]
Ablations over latent space dimensions
Used an IVFPQ index for every index built on top of the descriptors
Compared to vanilla IVFPQ on raw features
Used 2^16 centroids for IVF and collections of indices with different values of M for PQ
Used AutoFaiss to construct an efficient PQ index
Used OPQ and HNSW for AutoFaiss indices
CLIP network demonstrated impressive performance on zero-shot ImageNet classification task
Explored how well descriptors preserve multi-modal information
Compared to AutoFaiss hosted by LAION
Created Pareto fronts by varying the number of chunks for product quantization and the nprobe parameter
MSE and SNIP performed similarly on image similarity search
CLIP networks performed better on multi-modal tasks
SNIP seen as a best-of-both-worlds descriptor for these types of indices
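
The evaluation protocol described above (brute-force ground truth, then comparison with index results) can be sketched like this. It is an illustrative outline only: the function names are hypothetical, and recall is used here as one plausible comparison metric, which the notes above do not specify.

```python
import heapq

def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def brute_force_knn(queries, database, k):
    """Exact k nearest database ids for each query, by exhaustive L2 search."""
    results = []
    for q in queries:
        nearest = heapq.nsmallest(
            k, range(len(database)), key=lambda i: squared_l2(q, database[i])
        )
        results.append(nearest)
    return results

def recall_at_k(ground_truth, retrieved):
    """Fraction of ground-truth neighbors that the index also returned."""
    hits = sum(len(set(gt) & set(ret)) for gt, ret in zip(ground_truth, retrieved))
    total = sum(len(gt) for gt in ground_truth)
    return hits / total

# Toy data: `retrieved` stands in for the output of an approximate index.
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
queries = [[0.1, 0.0], [4.0, 4.0]]
ground_truth = brute_force_knn(queries, database, k=2)
retrieved = [[0, 2], [3, 1]]
score = recall_at_k(ground_truth, retrieved)
```

Because the exhaustive search is quadratic in the number of vectors, it is only practical on a small query set, which matches the "small set of k-nearest neighbors" mentioned above.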

Investigated several index creation pipelines to de-duplicate LAION-2B
Goal was to find many duplicates with decent precision quickly
Applied an adaptive threshold based on d_ADC between a query vector and its quantized version
Best performing indices were selected to perform the deduplication of LAION-2B
De-duplication took only several days
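
One possible reading of the adaptive threshold is sketched below. The exact rule is not given in these notes, so everything here is an assumption: the threshold is taken to be a multiple (`scale`, hypothetical) of the query's own quantization error, and a single nearest-centroid codebook stands in for the real IVFPQ quantizer.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def quantize(vector, codebook):
    """Return the codebook entry closest to `vector` (stand-in for IVFPQ decoding)."""
    return min(codebook, key=lambda c: squared_l2(vector, c))

def adc_distance(query, db_vector, codebook):
    """Asymmetric distance: exact query against the quantized database vector."""
    return squared_l2(query, quantize(db_vector, codebook))

def find_duplicates(query, database, codebook, scale=1.0):
    """Ids whose ADC distance is within `scale` times the query's own quantization error."""
    # Distance between the query and its own quantized version.
    threshold = scale * adc_distance(query, query, codebook)
    return [
        i for i, v in enumerate(database)
        if adc_distance(query, v, codebook) <= threshold
    ]

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
database = [[0.9, 1.1], [0.1, 0.0], [4.2, 3.9], [0.9, 1.1]]
duplicates = find_duplicates([0.9, 1.1], database, codebook)
```

The motivation for tying the threshold to the query is that an exact copy of the query is stored at the same quantized location, so its ADC distance equals the query's quantization error; a fixed global threshold would be too strict for poorly quantized vectors and too loose for well quantized ones.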

Recently, a paper demonstrated that stable diffusion can copy training images on certain prompts.
The paper used duplication as a weak filter to select candidates for copied images and synthesized nearly 175 million images.
The authors concentrated on a subset of one hundred images from the most duplicated images on L2B.
They discovered numerous images that were verbatim copies of training data, as well as images that were highly duplicated but not copied verbatim.

Algorithmic chain de-duplicated LAION-2B-en with modest resources and decent precision
Code released to download deduplicated datasets, representative sets, duplication histograms and SNIP indices
Dataset usability improved, generative models have fewer copyright issues
Novel pipeline does not require pre-training or recomputation of billions of CLIP features
Investigated effects of compression on tasks requiring multi-modality of image features
Image-feature-only retrieval Pareto fronts for ViT-H-14 indices on L400M
MSE based losses perform similarly to contrastive ones
Found several new verbatim copied images by Stable Diffusion