Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Provides an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg)
- Demonstrates how information-theoretic quantities can be obtained for deterministic networks
- Relates VICReg objective to mutual information maximization
- Derives a generalization bound for VICReg
- Presents new self-supervised learning methods derived from a mutual information maximization objective
Paper Content
Introduction
- Self-Supervised Learning (SSL) methods learn representations by optimizing a surrogate objective between inputs and self-defined signals.
- Information-theoretic methods have been used for deep learning applications and theoretical investigations.
- Some works have attempted to use information theory for SSL, such as the InfoMax principle.
- These works often present objective functions without rigorous justification and make implicit assumptions.
Background & preliminaries
- Continuous Piecewise Affine (CPA) Mappings are a rich class of functions
- A spline of order k is a mapping defined by a polynomial of order k on each region
- We focus on affine splines (k = 1)
- Spline operators have been widely used in various fields
- A Deep Neural Network (DNN) is a nonlinear operator with parameters that maps an input to a prediction
- Nonlinearities in DNNs are CPA mappings
- Self-Supervised Learning (SSL) is used to learn DNN parameters without supervision
- Contrastive methods use positive and negative examples to learn representations
- Non-contrastive methods use regularization to prevent collapsing of the representation
- Information-theoretic methods have been used to advance deep learning
- Information-theoretic objectives typically assume that DNN mappings are stochastic, which is problematic for deterministic DNNs
- Stochastic DNNs with variational bounds can be used to avoid this problem
- Goldfeld et al. (2018) introduced an auxiliary (noisy) DNN by injecting additive noise into the model
An information-theoretic perspective on SSL in deterministic DNNs
- Assumptions about information-theoretic challenges in SSL
- Assumptions about data distribution
- Training sample x comes from single Gaussian distribution
- Output of DNN corresponds to mixture of truncated Gaussian distributions
- Approximation to VICReg objective can be recovered from information-theoretic principles
Self-supervised learning as an information-theoretic problem
- Formulate general SSL goal from information-theoretical perspective
- Analyze and compare different SSL methods based on ability to maximize mutual information
- MultiView InfoMax principle aims to maximize mutual information between two different views
- Maximize I(Z; X') and I(Z'; X) using a lower bound in which H(Z) is the entropy of Z
- In supervised learning, maximize I(Z; Y) and minimize the log-loss E_x[log q(z|x)]
- The entropies are not constant and can be optimized throughout the learning process
- Different methods use different approaches to implicitly regularize information
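The lower bound mentioned in this list is the standard variational bound on mutual information, written out here as a sketch. The prime notation X' for the second view and the variational density q are notational assumptions, since the summary does not define them. The inequality holds because replacing the true conditional p(z|x') with any density q(z|x') can only lower the expected log-likelihood.

```latex
\begin{aligned}
I(Z; X') &= H(Z) - H(Z \mid X') \\
         &= H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \mid x'}\big[\log p(z \mid x')\big] \\
         &\ge H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \mid x'}\big[\log q(z \mid x')\big]
\end{aligned}
```

The same bound applies to I(Z'; X) with the roles of the two views exchanged.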
Data distribution hypothesis
- Examines the way output random variables of a network are represented
- Assumes a distribution over the data
- Points can be seen as Gaussian random variables with a low-rank covariance matrix
- Considers the conditioning of a latent representation with respect to the mean of the observation
- Dataset is a collection of points
- Full data distribution is a sum of low-rank covariance Gaussian densities
- Assumes dataset is a mixture of Gaussians with non-overlapping support
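The data hypothesis above can be illustrated with a small sampling sketch. It assumes an equal-weight mixture, one random orthonormal basis per prototype, and a noise scale much smaller than the distance between prototypes so that the effective supports do not overlap. The function name `sample_low_rank_mixture` and all parameter values are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_low_rank_mixture(prototypes, rank, scale, n_samples, rng):
    """Draw samples from an equal-weight mixture of low-rank Gaussians.

    Each component is centred on one prototype and has covariance
    scale**2 * U U^T, where U has `rank` orthonormal columns.
    """
    n_proto, dim = prototypes.shape
    # One random orthonormal basis per prototype (illustrative choice).
    bases = [np.linalg.qr(rng.standard_normal((dim, rank)))[0]
             for _ in range(n_proto)]
    idx = rng.integers(0, n_proto, size=n_samples)    # component of each sample
    coeffs = rng.standard_normal((n_samples, rank))   # low-dimensional noise
    noise = np.stack([bases[i] @ c for i, c in zip(idx, coeffs)])
    return prototypes[idx] + scale * noise, idx

rng = np.random.default_rng(0)
prototypes = 10.0 * rng.standard_normal((5, 16))      # well-separated centres
samples, idx = sample_low_rank_mixture(
    prototypes, rank=3, scale=0.01, n_samples=200, rng=rng)
```

Subtracting a prototype from the samples assigned to it leaves deviations that span at most `rank` directions, which is what "low-rank covariance" means here.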
Data distribution post the deep neural network transformation
- Affine spline operator f maps from a space of dimension D to a space of dimension K with K ≥ D.
- On each region ω, f acts as an affine transformation with per-region parameters A_ω, b_ω.
- The density of f(X) is intractable in general, but its approximation can be made more accurate by increasing the number of prototypes N.
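The per-region affine view can be checked numerically on a toy network. The sketch below assumes a two-layer ReLU network with random weights. The names `f` and `region_params` and the dimensions are illustrative only. Within the region containing x, the ReLU activation pattern is fixed, so the network reduces to a single affine map.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, K = 4, 8, 6                      # input, hidden, output dims (K >= D)
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((K, H)), rng.standard_normal(K)

def f(x):
    """Two-layer ReLU network: a continuous piecewise affine map."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_params(x):
    """Return (A_w, b_w) for the region w that contains x."""
    mask = (W1 @ x + b1 > 0).astype(float)   # ReLU activation pattern at x
    A = W2 @ (mask[:, None] * W1)            # zero out rows of inactive units
    b = W2 @ (mask * b1) + b2
    return A, b

x = rng.standard_normal(D)
A, b = region_params(x)
# On this region the network output equals A @ x + b.
```

Any input that shares the activation pattern of x is mapped by the same pair (A, b), which is why a Gaussian input confined to one region produces a Gaussian output.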
Information optimization and optimality
- SSL algorithms for deterministic networks can be derived.
- Maximize I(Z; X ) and I(Z ; X).
- Variational approximation using expected loss.
- Conditional output density is reduced to a single Gaussian.
An information-theoretic perspective on VICReg
- Derived an objective function based on information theoretical first principles
- Estimating entropy is a classic problem in information theory
- Plots show l2 distances between raw images for different datasets
- Approximations exist to estimate entropy
- Maximizing the sum of the log eigenvalues implies maximizing the log-determinant of the covariance of Z
- The eigenvalues of the covariance of Z can also be maximized using more sophisticated methods
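The link between log eigenvalues and the log-determinant can be verified directly. This sketch assumes the relevant matrix is the sample covariance of the representations, and uses the closed-form entropy of a Gaussian as the quantity being maximized. The helper name `gaussian_entropy_logdet` is hypothetical.

```python
import numpy as np

def gaussian_entropy_logdet(z, eps=1e-6):
    """Entropy (in nats) of a Gaussian fitted to the rows of z."""
    n, d = z.shape
    cov = np.cov(z, rowvar=False) + eps * np.eye(d)   # eps keeps cov invertible
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 8)) @ rng.standard_normal((8, 8))
cov = np.cov(z, rowvar=False)

sum_log_eig = np.sum(np.log(np.linalg.eigvalsh(cov)))  # sum of log eigenvalues
logdet = np.linalg.slogdet(cov)[1]                     # log-determinant
entropy = gaussian_entropy_logdet(z)
```

Because the two quantities coincide, pushing every eigenvalue of the covariance up increases the Gaussian entropy estimate.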
Validation of assumptions
- Conditional output density can be reduced to a single Gaussian
- ResNet-18 model trained with SimCLR or VICReg objectives on CIFAR-10 and CIFAR-100 datasets
- 512 Gaussian samples for each image from test dataset
- Used D’Agostino and Pearson’s test to determine validity of assumption
- P-value decreases as input noise increases
- Calculated distribution of pairwise l2 distances between images for seven datasets
- Pairwise distances far from zero, effective support of datasets is non-overlapping
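The normality check described above can be sketched with SciPy, whose `scipy.stats.normaltest` implements D'Agostino and Pearson's test. The toy setup below is an assumption made for illustration: a single random tanh layer stands in for the trained ResNet-18, and one random vector stands in for a test image. Only the sample count of 512 comes from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D, K, n_samples = 8, 4, 512
W = rng.standard_normal((K, D)) / np.sqrt(D)   # toy stand-in for a trained network
x = rng.standard_normal(D)                     # one hypothetical input

def output_pvalues(noise_std):
    """Per-coordinate p-values of the normality test on the network outputs."""
    noisy = x + noise_std * rng.standard_normal((n_samples, D))
    z = np.tanh(noisy @ W.T)
    return stats.normaltest(z, axis=0).pvalue

p_small = output_pvalues(1e-3)   # noise stays within a nearly linear regime
p_large = output_pvalues(5.0)    # noise saturates tanh, outputs far from Gaussian
```

With large input noise the outputs leave the locally affine regime and the test rejects normality, which matches the reported trend of p-values decreasing as the noise grows.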
Self-supervised learning via mutual information maximization
- VICReg uses an approximation of the entropy based on assumptions
- Comparing VICReg to other methods such as SimCLR, BYOL and SimSiam
- Suggesting new objective functions to improve performance of SSL and understand underlying learning mechanisms
VICReg vs. SimCLR
- SimCLR and VICReg have different conditional distributions (von Mises-Fisher vs Gaussian)
- SimCLR approximates entropy based on finite sum of input samples, VICReg estimates entropy based on second moment
- ResNet-18 trained on CIFAR-10 for VICReg, SimCLR and BYOL
- Entropy estimator based on distances of individual mixture component used for comparison
- SimCLR has lowest entropy, VICReg has highest entropy
Alternative entropy estimators
- Existing methods use an approximation to the entropy
- Suggest combining the invariance term of these methods with plug-in methods for optimizing the entropy
- LogDet Entropy Estimator provides a tighter upper bound
- Lower bound estimator based on pairwise distances of the individual mixture component
- Experiments conducted on CIFAR-10 using ResNet-18 architecture
- Smart selection of entropy estimators leads to better results
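One known estimator of the pairwise-distance kind is the mixture-entropy bound of Kolchinsky and Tracey. Whether the paper uses exactly this form is an assumption. The sketch below further assumes an equal-weight mixture of isotropic Gaussians with a shared variance, in which case the Bhattacharyya distance gives a lower bound and the KL divergence gives an upper bound. The function names are hypothetical.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def pairwise_entropy_bounds(means, sigma):
    """Lower and upper entropy bounds for a mixture of N(mu_i, sigma^2 I)."""
    n, d = means.shape
    sq = np.sum((means[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    h_component = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)

    def estimate(divergence):
        # H_hat = H(X|C) - mean_i log( mean_j exp(-D(p_i || p_j)) )
        return h_component - np.mean(_logsumexp(-divergence, axis=1) - np.log(n))

    lower = estimate(sq / (8 * sigma ** 2))   # Bhattacharyya distance
    upper = estimate(sq / (2 * sigma ** 2))   # KL divergence
    return lower, upper

rng = np.random.default_rng(0)
lower, upper = pairwise_entropy_bounds(rng.standard_normal((32, 3)), sigma=0.5)
```

When all components coincide, both bounds reduce to the entropy of a single Gaussian. When the components are far apart, both approach that entropy plus log n.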
Information maximization for VICReg and downstream generalization
- Connection between information-theoretic principles and VICReg objective
- Deriving a downstream generalization bound for VICReg
- Relating generalization in VICReg to information maximization and implicit regularization
- Defining labeled loss, representation matrices, projection matrices, label matrix, and unknown label matrix
- Defining normalized Rademacher complexity
A generalization bound for variance-invariance-covariance regularization
- Theorem 2 shows that VICReg improves generalization on supervised downstream tasks.
- Minimizing the unlabeled invariance loss while controlling the covariance of the representations and their complexity R_m(F) minimizes the expected labeled loss.
- VICReg does this by maximizing the diagonal entries of the representation covariance while minimizing its off-diagonal entries.
- VICReg can be improved in the partially observable setting.
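The three terms discussed in this section can be sketched in NumPy. This follows the commonly published form of the VICReg loss (a hinge on the per-dimension standard deviation, a mean squared error between views, and a penalty on off-diagonal covariance entries). The default weights, the `eps` value, and the normalization constants are assumptions, and different write-ups scale the terms differently. The function name `vicreg_loss` is illustrative.

```python
import numpy as np

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg-style loss for two batches of embeddings of shape (n, d)."""
    n, d = z_a.shape

    # Invariance: the two views of each sample should map close together.
    invariance = np.mean((z_a - z_b) ** 2)

    def variance_term(z):
        # Hinge that pushes each dimension's standard deviation up to 1.
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def covariance_term(z):
        # Penalise the off-diagonal entries of the covariance matrix.
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / d

    variance = variance_term(z_a) + variance_term(z_b)
    covariance = covariance_term(z_a) + covariance_term(z_b)
    return sim_w * invariance + var_w * variance + cov_w * covariance

rng = np.random.default_rng(0)
z = rng.standard_normal((4096, 8))
loss_spread = vicreg_loss(z, z)                                  # decorrelated, unit variance
loss_collapsed = vicreg_loss(np.zeros((64, 8)), np.zeros((64, 8)))  # collapsed embeddings
```

The collapsed batch has zero invariance and covariance penalties but a large variance penalty, which is how the variance term prevents collapse.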
Comparison of generalization bounds
- SimCLR requires label classes to go to infinity to close the generalization gap
- VICReg does not require label classes to approach infinity for the generalization gap to go to zero
- VICReg does not use negative pairs
- VICReg bound improves as n increases
- SimCLR bound does not depend on n
Understanding Theorem 2 via mutual information maximization
- Maximize mutual information I(Z; X) in SSL
- Control complexity of representations f θ
- Approximate ln |F| by 2 I(Z;X)
- Maximize predictive information I(Z; X')
- Minimize superfluous information I(Z; X|X')
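The split into predictive and superfluous information follows from the chain rule of mutual information. The sketch below assumes that Z is computed from X alone, so that Z is conditionally independent of the other view X' given X. The prime notation is an assumption about the paper's symbols.

```latex
\begin{aligned}
I(Z; X, X') &= I(Z; X') + I(Z; X \mid X') \\
            &= I(Z; X) + I(Z; X' \mid X), \qquad I(Z; X' \mid X) = 0 \\
\Rightarrow\quad I(Z; X) &= \underbrace{I(Z; X')}_{\text{predictive}}
                          + \underbrace{I(Z; X \mid X')}_{\text{superfluous}}
\end{aligned}
```

Under this decomposition, keeping I(Z; X') high while shrinking I(Z; X|X') lowers the total information I(Z; X) that the representation carries about its input.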
Conclusions
- Examined Variance-Invariance-Covariance Regularization for self-supervised learning from an information-theoretic perspective
- Transferred stochasticity to the input distribution
- Derived VICReg objective from information-theoretic principles
- Highlighted assumptions implicit in the VICReg objective
- Derived a VICReg generalization bound for downstream tasks
- Related it to information maximization
- Proposed a new VICReg-style SSL objective
- Suggested that VICReg can be improved for partial label information
- Opened up avenues for improved estimators and investigations into SSL methods
- Lower bound on E_{x'}[log q(z|x')]
- Theorem 3: DNN output density approximates a mixture of affinely transformed distributions
- Theorem 4: Generalization bound
- Used Hoeffding’s inequality and Lemma 2 to analyze the bound
- Factor c measures the difference between labeled and unlabeled training data
- Rademacher complexity of the hypothesis space W
- Conditional expectations to take care of dependent random variables
- Minimum-norm solutions for the labeled training data and for the unlabeled training data
- Analyzed the term E_{x'}[log q(z|x')]
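For reference, Hoeffding's inequality, used in the proof steps listed above, is given here in its standard form. The symbols are generic and are not tied to the paper's notation: X_1, ..., X_n are independent random variables with X_i taking values in [a_i, b_i].

```latex
\Pr\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} X_i
      - \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] \right| \ge t \right)
\;\le\; 2 \exp\!\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)
```

When every variable lies in an interval of width c, the right-hand side becomes 2 exp(-2 n t^2 / c^2), so the deviation shrinks at a rate of order 1/sqrt(n).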