Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Provides an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg)
Demonstrates how information-theoretic quantities can be obtained for deterministic networks
Relates VICReg objective to mutual information maximization
Derives a generalization bound for VICReg
Presents new self-supervised learning methods derived from a mutual information maximization objective

Paper Content

Introduction

Self-Supervised Learning (SSL) methods learn representations by optimizing a surrogate objective between inputs and self-defined signals.
Information-theoretic methods have been used for deep learning applications and theoretical investigations.
Some works have attempted to use information theory for SSL, such as the InfoMax principle.
These works often present objective functions without rigorous justification and make implicit assumptions.

Background & preliminaries

Continuous Piecewise Affine (CPA) Mappings are a rich class of functions
A spline of order k is a mapping defined by a polynomial of order k on each region
We focus on affine splines (k = 1)
Spline operators have been widely used in various fields
A Deep Neural Network (DNN) is a nonlinear operator whose parameters map an input to a prediction
Nonlinearities in DNNs are CPA mappings
Self-Supervised Learning (SSL) is used to learn DNN parameters without supervision
Contrastive methods use positive and negative examples to learn representations
Non-contrastive methods use regularization to prevent collapsing of the representation
Information-theoretic methods have been used to advance deep learning
Most information-theoretic objectives assume that DNN mappings are stochastic, which does not hold for deterministic DNNs
Stochastic DNNs with variational bounds can be used to avoid this problem
Goldfeld et al. (2018) introduced an auxiliary (noisy) DNN by injecting additive noise into the model

An information-theoretic perspective on SSL in deterministic DNNs

States the assumptions made about the information-theoretic challenges in SSL
States the assumptions made about the data distribution
Any training sample x can be seen as coming from a single Gaussian distribution
The output of the DNN then corresponds to a mixture of truncated Gaussian distributions
Approximation to VICReg objective can be recovered from information-theoretic principles

Self-supervised learning as an information-theoretic problem

Formulate general SSL goal from information-theoretical perspective
Analyze and compare different SSL methods based on ability to maximize mutual information
MultiView InfoMax principle aims to maximize mutual information between two different views
Maximize I(Z; X') and I(Z'; X) using the lower bound I(Z; X') ≥ H(Z) + E_x'[log q(z|x')], where H(Z) is the entropy of Z
In supervised learning, maximizing I(Z; Y) only requires optimizing the log-loss term, since the label entropy H(Y) is fixed
In SSL, the entropies are not constant and can be optimized throughout the learning process
Different methods use different approaches to implicitly regularize the information
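
The lower bound used above follows from a standard variational argument. The sketch below assumes q(z|x') is any variational approximation to the true conditional p(z|x'); the KL step is the usual one and is not specific to this paper.

```latex
\begin{align}
I(Z; X') &= H(Z) - H(Z \mid X') \\
         &= H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \mid x'}\big[\log p(z \mid x')\big] \\
         &= H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \mid x'}\big[\log q(z \mid x')\big]
            + \mathbb{E}_{x'}\big[\mathrm{KL}\big(p(\cdot \mid x') \,\|\, q(\cdot \mid x')\big)\big] \\
         &\ge H(Z) + \mathbb{E}_{x'}\,\mathbb{E}_{z \mid x'}\big[\log q(z \mid x')\big].
\end{align}
```

The inequality holds because the KL divergence is non-negative, and it is tight when q matches p. The same argument applies to I(Z'; X) with the roles of the two views swapped.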

Data distribution hypothesis

Examines the way output random variables of a network are represented
Assumes a distribution over the data
Points can be seen as Gaussian random variables with a low-rank covariance matrix
Considers the conditioning of a latent representation with respect to the mean of the observation
Dataset is a collection of points
Full data distribution is a sum of low-rank covariance Gaussian densities
Assumes the dataset is a mixture of Gaussians whose effective supports do not overlap
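
The data hypothesis above can be illustrated with a small sampler. This is a minimal sketch, not the paper's code: the function name `sample_low_rank_mixture`, the prototype scale, the rank, and the noise level `sigma` are all illustrative assumptions.

```python
import numpy as np

def sample_low_rank_mixture(prototypes, rank, sigma, n_per_proto, rng):
    """Draw samples around each prototype from a Gaussian with a rank-`rank` covariance."""
    n_proto, dim = prototypes.shape
    samples, labels = [], []
    for n in range(n_proto):
        # Random orthonormal basis spanning the low-rank subspace of prototype n.
        basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
        coeffs = sigma * rng.standard_normal((n_per_proto, rank))
        samples.append(prototypes[n] + coeffs @ basis.T)
        labels.append(np.full(n_per_proto, n))
    return np.concatenate(samples), np.concatenate(labels)

rng = np.random.default_rng(0)
prototypes = 10.0 * rng.standard_normal((5, 32))  # hypothetical, well-separated means
x, idx = sample_low_rank_mixture(prototypes, rank=3, sigma=0.05, n_per_proto=200, rng=rng)

# With small sigma, every sample is closest to the prototype that generated it,
# which is what "non-overlapping effective support" means in practice.
dist_to_protos = np.linalg.norm(x[:, None, :] - prototypes[None, :, :], axis=-1)
nearest = dist_to_protos.argmin(axis=1)
print("fraction assigned to own prototype:", np.mean(nearest == idx))
```

Increasing `sigma` until the fraction drops below one gives a feel for when the non-overlap assumption starts to break.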

Data distribution post the deep neural network transformation

Affine spline operator f maps from a space of dimension D to a space of dimension K with K ≥ D.
Each region ω is transformed affinely by the per-region parameters A_ω, b_ω.
The exact density of f(X) is intractable, but the approximation becomes more accurate as the number of prototypes N increases.
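
To make the per-region affine parameters concrete, the sketch below recovers A_ω and b_ω for the region containing a given input of a small ReLU network. The network sizes and the helper names `relu_mlp` and `region_affine_params` are assumptions made for illustration; only the piecewise-affine property itself comes from the text.

```python
import numpy as np

def relu_mlp(x, weights, biases):
    """Forward pass of a ReLU MLP whose last layer is linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

def region_affine_params(x, weights, biases):
    """Return (A, b) with f(y) = A @ y + b for every y in the region containing x."""
    A = np.eye(x.shape[0])
    c = np.zeros(x.shape[0])
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        pre = W @ h + b
        mask = (pre > 0).astype(float)   # the activation pattern identifies the region
        A = mask[:, None] * (W @ A)      # compose the slope with the masked layer
        c = mask * (W @ c + b)           # compose the offset with the masked layer
        h = np.maximum(pre, 0.0)
    return weights[-1] @ A, weights[-1] @ c + biases[-1]

rng = np.random.default_rng(0)
D, H, K = 4, 16, 8                       # input, hidden, output dimensions (K >= D)
weights = [rng.standard_normal((H, D)),
           rng.standard_normal((H, H)),
           rng.standard_normal((K, H))]
biases = [rng.standard_normal(H), rng.standard_normal(H), rng.standard_normal(K)]

x = rng.standard_normal(D)
A, b = region_affine_params(x, weights, biases)
x_near = x + 1e-6 * rng.standard_normal(D)   # tiny perturbation around x

print("max error at x:     ", np.abs(relu_mlp(x, weights, biases) - (A @ x + b)).max())
print("max error near x:   ", np.abs(relu_mlp(x_near, weights, biases) - (A @ x_near + b)).max())
```

If the input density is concentrated inside one region, the output density is simply the input Gaussian pushed through this single affine map, which is the intuition behind the mixture-of-affinely-transformed-distributions view.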

Information optimization and optimality

SSL algorithms for deterministic networks can be derived.
Maximize I(Z; X') and I(Z'; X).
Variational approximation using expected loss.
Conditional output density is reduced to a single Gaussian.
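
A short worked step shows why a single-Gaussian conditional turns the expected log-likelihood into a squared-error term. This is a sketch that assumes an isotropic Gaussian with identity covariance centered at z' = f(x'), with K the output dimension; the paper's exact covariance choice may differ.

```latex
\begin{align}
q(z \mid x') &= \mathcal{N}\big(z;\, z',\, I_K\big), \qquad z' = f(x'), \\
\log q(z \mid x') &= -\tfrac{1}{2}\,\lVert z - z' \rVert_2^2 - \tfrac{K}{2}\log(2\pi), \\
H(Z) + \mathbb{E}\big[\log q(z \mid x')\big]
  &= H(Z) - \tfrac{1}{2}\,\mathbb{E}\big[\lVert z - z' \rVert_2^2\big] + \text{const}.
\end{align}
```

Under this assumption, maximizing the bound amounts to maximizing the entropy H(Z) while minimizing the mean squared distance between the two views' representations.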

An information-theoretic perspective on VICReg

Derived an objective function based on information theoretical first principles
Estimating entropy is a classic problem in information theory
Plots show l2 distances between raw images for different datasets
Approximations exist to estimate entropy
Maximizing the sum of the log eigenvalues of the covariance of Z is equivalent to maximizing its log determinant
The eigenvalues of the covariance of Z can also be maximized using more sophisticated methods
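
For reference, the three VICReg terms can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' implementation: the weights (25, 25, 1), the target standard deviation `gamma`, and the stabilizer `eps` are assumed defaults, and the function names are hypothetical.

```python
import numpy as np

def _variance_term(z, gamma, eps):
    # Hinge on the per-dimension standard deviation: penalize dimensions with std < gamma.
    std = np.sqrt(z.var(axis=0, ddof=1) + eps)
    return np.mean(np.maximum(gamma - std, 0.0))

def _covariance_term(z):
    # Sum of squared off-diagonal covariance entries, scaled by the dimension.
    n, d = z.shape
    zc = z - z.mean(axis=0)
    cov = zc.T @ zc / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    return np.sum(off_diag ** 2) / d

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """VICReg-style loss for two batches of embeddings of shape (n, d)."""
    invariance = np.mean((z_a - z_b) ** 2)
    variance = _variance_term(z_a, gamma, eps) + _variance_term(z_b, gamma, eps)
    covariance = _covariance_term(z_a) + _covariance_term(z_b)
    return sim_w * invariance + var_w * variance + cov_w * covariance

rng = np.random.default_rng(0)
z_spread = rng.standard_normal((512, 8))   # roughly unit variance, decorrelated
z_collapsed = np.zeros((64, 8))            # fully collapsed representation

loss_spread = vicreg_loss(z_spread, z_spread)
loss_collapsed = vicreg_loss(z_collapsed, z_collapsed)
print("spread:", loss_spread, "collapsed:", loss_collapsed)
```

The variance and covariance terms act on the diagonal and off-diagonal entries of the embedding covariance, which is how they stand in for the log-determinant entropy term discussed above.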

Validation of assumptions

Conditional output density can be reduced to a single Gaussian
ResNet-18 model trained with SimCLR or VICReg objectives on CIFAR-10 and CIFAR-100 datasets
512 Gaussian samples for each image from test dataset
Used D’Agostino and Pearson’s test to determine validity of assumption
P-value decreases as input noise increases
Calculated distribution of pairwise l2 distances between images for seven datasets
Pairwise distances are far from zero, which indicates that the effective supports within each dataset do not overlap
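
The pairwise-distance check can be reproduced along the following lines. This sketch covers only the distance computation, not the normality test, and it uses uniformly random arrays as a stand-in for flattened images; the array shape and the name `pairwise_l2` are assumptions.

```python
import numpy as np

def pairwise_l2(x):
    """Euclidean distances between all distinct pairs of rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    d2 = np.maximum(d2, 0.0)               # guard against small negative round-off
    iu = np.triu_indices(x.shape[0], k=1)  # each unordered pair once
    return np.sqrt(d2[iu])

rng = np.random.default_rng(0)
images = rng.random((100, 3 * 32 * 32))    # stand-in for 100 flattened 32x32 RGB images
dists = pairwise_l2(images)

print("number of pairs:", dists.size)
print("min / median distance:", dists.min(), np.median(dists))
```

On real data, a histogram of `dists` whose mass sits well away from zero is the kind of evidence the bullet above refers to.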

Self-supervised learning via mutual information maximization

VICReg uses an approximation of the entropy based on assumptions
Comparing VICReg to other methods such as SimCLR, BYOL and SimSiam
Suggesting new objective functions to improve performance of SSL and understand underlying learning mechanisms

VICReg vs. SimCLR

SimCLR and VICReg have different conditional distributions (von Mises-Fisher vs Gaussian)
SimCLR approximates entropy based on finite sum of input samples, VICReg estimates entropy based on second moment
ResNet-18 trained on CIFAR-10 for VICReg, SimCLR and BYOL
Entropy estimator based on distances of individual mixture component used for comparison
SimCLR has lowest entropy, VICReg has highest entropy

Alternative entropy estimators

Existing methods use an approximation to the entropy
Suggest combining the invariance term of these methods with plug-in methods for optimizing the entropy
LogDet Entropy Estimator provides a tighter upper bound
Lower bound estimator based on pairwise distances of the individual mixture component
Experiments conducted on CIFAR-10 using ResNet-18 architecture
Smart selection of entropy estimators leads to better results
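
The two kinds of estimators named above can be sketched as follows. Both functions are hedged illustrations under stated assumptions, not the paper's exact estimators: the first is the Gaussian log-determinant entropy of the embedding covariance, and the second is a pairwise-distance lower bound for a uniform mixture of isotropic Gaussians with a shared, assumed standard deviation `sigma`, using the Bhattacharyya distance between components.

```python
import numpy as np

def gaussian_logdet_entropy(z, eps=1e-6):
    """Entropy of a Gaussian with the sample covariance of z (an upper bound
    on the entropy of any distribution with that covariance)."""
    n, d = z.shape
    cov = np.cov(z, rowvar=False) + eps * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def pairwise_lower_bound_entropy(means, sigma):
    """Pairwise-distance lower bound on the entropy of a uniform mixture of
    isotropic Gaussians N(mean_i, sigma^2 I)."""
    n, d = means.shape
    sq = np.sum((means[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    bhattacharyya = sq / (8.0 * sigma ** 2)
    cond_entropy = 0.5 * d * np.log(2.0 * np.pi * np.e * sigma ** 2)
    log_mix = np.log(np.mean(np.exp(-bhattacharyya), axis=1))
    return cond_entropy - np.mean(log_mix)

rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 4))
h_logdet = gaussian_logdet_entropy(z)

# The log determinant equals the sum of the log eigenvalues of the covariance.
eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False) + 1e-6 * np.eye(4))
h_from_eigs = 0.5 * (4 * np.log(2.0 * np.pi * np.e) + np.sum(np.log(eigvals)))

far_means = np.array([[0.0, 0.0], [1e3, 0.0], [0.0, 1e3]])
h_far = pairwise_lower_bound_entropy(far_means, sigma=1.0)
h_same = pairwise_lower_bound_entropy(np.zeros((3, 2)), sigma=1.0)
print(h_logdet, h_from_eigs, h_far, h_same)
```

When the components coincide, the lower bound reduces to the entropy of a single component; when they are far apart, it grows by the log of the number of components, which matches the intuition that spreading the representations increases entropy.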

Information maximization for VICReg and downstream generalization

Connection between information-theoretic principles and VICReg objective
Deriving a downstream generalization bound for VICReg
Relating generalization in VICReg to information maximization and implicit regularization
Defining labeled loss, representation matrices, projection matrices, label matrix, and unknown label matrix
Defining normalized Rademacher complexity

A generalization bound for variance-invariance-covariance regularization

Theorem 2 shows that VICReg improves generalization on supervised downstream tasks.
Minimizing the unlabeled invariance loss while controlling the covariance Z_S Z_S^T and the complexity of representations R_m(F) minimizes the expected labeled loss.
Maximizing the diagonal entries of Z_S Z_S^T while minimizing its off-diagonal entries is what VICReg does.
VICReg can be improved in the partially observable setting.

Comparison of generalization bounds

SimCLR requires label classes to go to infinity to close the generalization gap
VICReg does not require label classes to approach infinity for the generalization gap to go to zero
VICReg does not use negative pairs
VICReg bound improves as n increases
SimCLR bound does not depend on n

Understanding Theorem 2 via mutual information maximization

Maximize mutual information I(Z; X) in SSL
Control the complexity of the representations f_θ
Approximate ln |F| by 2 I(Z;X)
Maximize predictive information I(Z; X')
Minimize superfluous information I(Z; X | X')
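
The split into predictive and superfluous information is an instance of the chain rule for mutual information. The sketch below assumes that Z depends on X' only through X (a Markov chain Z - X - X'), which holds when Z is computed from X alone, and it treats all quantities as finite.

```latex
\begin{align}
I(Z; X, X') &= I(Z; X) + I(Z; X' \mid X) = I(Z; X), \\
I(Z; X, X') &= I(Z; X') + I(Z; X \mid X'), \\
\Rightarrow\quad I(Z; X) &= \underbrace{I(Z; X')}_{\text{predictive}}
   + \underbrace{I(Z; X \mid X')}_{\text{superfluous}}.
\end{align}
```

For a fixed I(Z; X), raising the predictive term necessarily lowers the superfluous one, which is why the two goals listed above are compatible.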

Conclusions

Examined Variance-Invariance-Covariance Regularization for self-supervised learning from an information-theoretic perspective
Transferred stochasticity to the input distribution
Derived VICReg objective from information-theoretic principles
Highlighted assumptions implicit in the VICReg objective
Derived a VICReg generalization bound for downstream tasks
Related it to information maximization
Proposed a new VICReg-style SSL objective
Suggested that VICReg can be improved for partial label information
Opened up avenues for improved estimators and investigations into SSL methods
Lower bound on E_x'[log q(z|x')]
Theorem 3: DNN output density approximates a mixture of affinely transformed distributions
Theorem 4: Generalization bound
Used Hoeffding’s inequality and Lemma 2 to analyze the bound
Factor c measures the difference between labeled and unlabeled training data
Rademacher complexity of the hypothesis space W
Used conditional expectations to handle dependent random variables
Minimum-norm solution W_S for the labeled training data and its counterpart for the unlabeled training data
Analyzed the term E_x'[log q(z|x')]
