Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Provides an information-theoretic perspective on Variance-Invariance-Covariance Regularization (VICReg)
  • Demonstrates how information-theoretic quantities can be obtained for deterministic networks
  • Relates VICReg objective to mutual information maximization
  • Derives a generalization bound for VICReg
  • Presents new self-supervised learning methods derived from a mutual information maximization objective

Paper Content

Introduction

  • Self-Supervised Learning (SSL) methods learn representations by optimizing a surrogate objective between inputs and self-defined signals.
  • Information-theoretic methods have been used for deep learning applications and theoretical investigations.
  • Some works have attempted to use information theory for SSL, such as the InfoMax principle.
  • These works often present objective functions without rigorous justification and make implicit assumptions.

Background & preliminaries

  • Continuous Piecewise Affine (CPA) Mappings are a rich class of functions
  • A spline of order k is a mapping defined by a polynomial of order k on each region
  • We focus on affine splines (k = 1)
  • Spline operators have been widely used in various fields
  • A Deep Neural Network (DNN) is a nonlinear operator with parameters that maps an input to a prediction
  • Nonlinearities in DNNs are CPA mappings
  • Self-Supervised Learning (SSL) is used to learn DNN parameters without supervision
  • Contrastive methods use positive and negative examples to learn representations
  • Non-contrastive methods use regularization to prevent collapsing of the representation
  • Information-theoretic methods have been used to advance deep learning
  • Information-theoretic objectives typically assume that DNN mappings are stochastic, which does not hold for deterministic DNNs
  • Stochastic DNNs with variational bounds can be used to avoid this problem
  • Goldfeld et al. (2018) introduced an auxiliary (noisy) DNN by injecting additive noise into the model

An information-theoretic perspective on SSL in deterministic DNNs

  • Assumptions about information-theoretic challenges in SSL
  • Assumptions about data distribution
  • Each training sample x is assumed to come from a single Gaussian distribution
  • The output of the DNN then corresponds to a mixture of truncated Gaussian distributions
  • Approximation to VICReg objective can be recovered from information-theoretic principles

Self-supervised learning as an information-theoretic problem

  • Formulate general SSL goal from information-theoretical perspective
  • Analyze and compare different SSL methods based on ability to maximize mutual information
  • MultiView InfoMax principle aims to maximize mutual information between two different views
  • Maximize I(Z; X') and I(Z'; X) using a lower bound that involves H(Z), the entropy of Z
  • In supervised learning, maximize I(Z; Y) and minimize the log-loss E_x[log q(z|x)]
  • Entropies not constant and can be optimized throughout learning process
  • Different methods use different approaches to implicitly regularize information
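The lower bound mentioned in this list can be written out explicitly. The block below is a sketch under assumed notation: X' denotes the second view, and q(z|x') is any variational approximation to the true conditional p(z|x'). The inequality holds because the cross-entropy under q is never smaller than the conditional entropy.

```latex
\begin{aligned}
I(Z; X') &= H(Z) - H(Z \mid X') \\
         &\geq H(Z) + \mathbb{E}_{x'}\!\left[ \mathbb{E}_{z \mid x'} \big[ \log q(z \mid x') \big] \right]
\end{aligned}
```

The first term is the entropy that SSL methods must keep from collapsing; the second is the term that rewards agreement between views.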

Data distribution hypothesis

  • Examines the way output random variables of a network are represented
  • Assumes a distribution over the data
  • Points can be seen as Gaussian random variables with a low-rank covariance matrix
  • Considers the conditioning of a latent representation with respect to the mean of the observation
  • Dataset is a collection of points
  • Full data distribution is a sum of low-rank covariance Gaussian densities
  • Assumes dataset is a mixture of Gaussians with non-overlapping support

Data distribution post the deep neural network transformation

  • Affine spline operator f maps from a space of dimension D to a space of dimension K with K ≥ D.
  • On each region ω, f acts as an affine transformation with per-region parameters A_ω, b_ω.
  • Density of f(X) is intractable, but can be increased by increasing the number of prototypes N.
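To make the per-region affine view concrete, the sketch below recovers the affine parameters of a one-hidden-layer ReLU network at a given input and checks that they reproduce the network output. The network, its sizes, and the helper names are illustrative assumptions; the only fact used is that a ReLU network is affine once its activation pattern is fixed.

```python
import numpy as np

def relu_mlp(x, W1, b1, W2, b2):
    """One-hidden-layer ReLU network, a continuous piecewise affine map."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def region_affine_params(x, W1, b1, W2, b2):
    """Return (A, b) such that the network equals A @ x + b on the region containing x."""
    active = (W1 @ x + b1 > 0).astype(x.dtype)   # activation pattern identifies the region
    A = W2 @ (active[:, None] * W1)              # W2 diag(active) W1
    b = W2 @ (active * b1) + b2
    return A, b

rng = np.random.default_rng(0)
D, H, K = 3, 16, 5                               # input, hidden, output sizes (K >= D)
W1, b1 = rng.standard_normal((H, D)), rng.standard_normal(H)
W2, b2 = rng.standard_normal((K, H)), rng.standard_normal(K)

x = rng.standard_normal(D)
A, b = region_affine_params(x, W1, b1, W2, b2)
same = np.allclose(relu_mlp(x, W1, b1, W2, b2), A @ x + b)
```

Because the map is affine within a region, a Gaussian whose effective support stays inside that region is mapped to another Gaussian with mean A x + b and covariance A Σ Aᵀ.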

Information optimization and optimality

  • SSL algorithms for deterministic networks can be derived.
  • Maximize I(Z; X') and I(Z'; X).
  • Variational approximation using expected loss.
  • Conditional output density is reduced to a single Gaussian.
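Once the conditional density is a single Gaussian, the expected log-likelihood term becomes a squared-error term. The block below is a sketch under two assumptions that are not stated in the list above: the covariance is the identity, and μ(x') stands for whatever mean the model assigns given the second view.

```latex
q(z \mid x') = \mathcal{N}\!\big(z;\, \mu(x'),\, I_K\big)
\quad\Longrightarrow\quad
\log q(z \mid x') = -\tfrac{K}{2}\log(2\pi) - \tfrac{1}{2}\,\lVert z - \mu(x') \rVert_2^2
```

Maximizing the expectation of this quantity is therefore the same as minimizing the mean squared distance between z and μ(x'), which has the form of an invariance loss.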

An information-theoretic perspective on VICReg

  • Derived an objective function based on information theoretical first principles
  • Estimating entropy is a classic problem in information theory
  • Plots show l2 distances between raw images for different datasets
  • Approximations exist to estimate entropy
  • Maximizing the sum of the log eigenvalues is equivalent to maximizing the log-determinant of the covariance matrix of Z
  • Maximize the eigenvalues of Z using more sophisticated methods
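The link between log eigenvalues and the log-determinant can be checked numerically. The sketch below fits a Gaussian to a batch of representations and computes its entropy, 0.5 · log det(2πe Σ). The function name and the small ridge `eps` are assumptions added for numerical stability; since a Gaussian has the largest entropy among distributions with a given covariance, this value is an upper bound on the true entropy.

```python
import numpy as np

def logdet_gaussian_entropy(z, eps=1e-6):
    """Entropy in nats of a Gaussian fitted to the rows of z.

    Returns (entropy, sum of log eigenvalues of the covariance, log-determinant).
    """
    k = z.shape[1]
    cov = np.cov(z, rowvar=False) + eps * np.eye(k)   # ridge keeps the matrix positive definite
    sum_log_eig = np.sum(np.log(np.linalg.eigvalsh(cov)))
    _, logdet = np.linalg.slogdet(cov)
    entropy = 0.5 * (k * np.log(2 * np.pi * np.e) + sum_log_eig)
    return entropy, sum_log_eig, logdet

rng = np.random.default_rng(0)
z = rng.standard_normal((5000, 4))
h_full, sum_log_eig, logdet = logdet_gaussian_entropy(z)
h_collapsed, _, _ = logdet_gaussian_entropy(0.1 * z)  # shrunken representation
```

Shrinking the representation toward a point lowers every eigenvalue and therefore the entropy estimate, which is the collapse that the regularization is meant to prevent.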

Validation of assumptions

  • Conditional output density can be reduced to a single Gaussian
  • ResNet-18 model trained with SimCLR or VICReg objectives on CIFAR-10 and CIFAR-100 datasets
  • 512 Gaussian samples for each image from test dataset
  • Used D’Agostino and Pearson’s test to determine validity of assumption
  • P-value decreases as input noise increases
  • Calculated distribution of pairwise l2 distances between images for seven datasets
  • Pairwise distances are far from zero, which suggests that the effective support of the datasets is non-overlapping
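The pairwise-distance check described above is simple to reproduce. The sketch below computes all pairwise ℓ2 distances for a batch of flattened images; random arrays stand in for a real dataset, which is an assumption made only so the example is self-contained.

```python
import numpy as np

def pairwise_l2(x):
    """Euclidean distances between all distinct pairs of rows of x."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    d2 = np.maximum(d2, 0.0)                 # guard against tiny negative round-off
    upper = np.triu_indices(len(x), k=1)     # each pair once, no self-distances
    return np.sqrt(d2[upper])

rng = np.random.default_rng(0)
images = rng.random((200, 3 * 32 * 32))      # placeholder for flattened 32x32 RGB images
dists = pairwise_l2(images)
min_dist, mean_dist = dists.min(), dists.mean()
```

On real data, a histogram of `dists` whose mass sits well away from zero is the kind of evidence the list refers to.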

Self-supervised learning via mutual information maximization

  • VICReg uses an approximation of the entropy based on assumptions
  • Comparing VICReg to other methods such as SimCLR, BYOL and SimSiam
  • Suggesting new objective functions to improve performance of SSL and understand underlying learning mechanisms

VICReg vs. SimCLR

  • SimCLR and VICReg have different conditional distributions (von Mises-Fisher vs Gaussian)
  • SimCLR approximates entropy based on finite sum of input samples, VICReg estimates entropy based on second moment
  • ResNet-18 trained on CIFAR-10 for VICReg, SimCLR and BYOL
  • Entropy estimator based on distances of individual mixture component used for comparison
  • SimCLR has lowest entropy, VICReg has highest entropy

Alternative entropy estimators

  • Existing methods use an approximation to the entropy
  • Suggest combining the invariance term of these methods with plug-in methods for optimizing the entropy
  • LogDet Entropy Estimator provides a tighter upper bound
  • Lower bound estimator based on pairwise distances of the individual mixture component
  • Experiments conducted on CIFAR-10 using ResNet-18 architecture
  • Smart selection of entropy estimators leads to better results
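One well-known family of estimators based on pairwise distances between mixture components is due to Kolchinsky and Tracey. Whether it matches the paper's exact estimator is an assumption; the sketch below only shows the general form for a uniform mixture of isotropic Gaussians with a shared variance. Using the Bhattacharyya distance between components gives a lower bound on the mixture entropy, and using the KL divergence gives an upper bound.

```python
import numpy as np

def mixture_entropy_bounds(means, sigma):
    """Pairwise-distance bounds (nats) on the entropy of a uniform mixture
    of isotropic Gaussians N(mu_i, sigma^2 I). Returns (lower, upper)."""
    n, d = means.shape
    sq = np.sum((means[:, None, :] - means[None, :, :]) ** 2, axis=-1)
    h_component = 0.5 * d * np.log(2 * np.pi * np.e * sigma ** 2)

    def estimate(divergence):
        # H(X|C) - sum_i c_i log sum_j c_j exp(-D_ij), with c_i = 1/n
        inner = np.mean(np.exp(-divergence), axis=1)
        return h_component - np.mean(np.log(inner))

    lower = estimate(sq / (8 * sigma ** 2))   # Bhattacharyya distance between components
    upper = estimate(sq / (2 * sigma ** 2))   # KL divergence between components
    return lower, upper

# Well-separated components: both bounds approach H(component) + log(n).
lower_sep, upper_sep = mixture_entropy_bounds(20.0 * np.eye(4), sigma=1.0)
# Fully collapsed components: both bounds equal the entropy of one component.
lower_col, upper_col = mixture_entropy_bounds(np.zeros((4, 4)), sigma=1.0)
```

The two extreme cases show why such an estimator is useful as a training signal: spreading the components apart raises the estimate by up to log(n), while collapse removes that gain entirely.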

Information maximization for VICReg and downstream generalization

  • Connection between information-theoretic principles and VICReg objective
  • Deriving a downstream generalization bound for VICReg
  • Relating generalization in VICReg to information maximization and implicit regularization
  • Defining labeled loss, representation matrices, projection matrices, label matrix, and unknown label matrix
  • Defining normalized Rademacher complexity
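For intuition about the complexity term, the sketch below estimates the standard empirical Rademacher complexity of a norm-bounded linear class by Monte Carlo. This is the textbook quantity, not the paper's normalized variant, and the class, the norm bound, and the sample sizes are all assumptions chosen for illustration.

```python
import numpy as np

def linear_rademacher(x, norm_bound, n_draws, rng):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= norm_bound} on the sample x."""
    m = x.shape[0]
    signs = rng.choice([-1.0, 1.0], size=(n_draws, m))   # Rademacher variables
    # The supremum over w has a closed form: (B / m) * || sum_i sigma_i x_i ||_2
    sups = norm_bound / m * np.linalg.norm(signs @ x, axis=1)
    return sups.mean()

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 2))
estimate = linear_rademacher(x, norm_bound=1.0, n_draws=2000, rng=rng)
# Classical upper bound obtained from Jensen's inequality.
bound = 1.0 * np.sqrt(np.sum(x ** 2)) / x.shape[0]
```

Both the estimate and the bound shrink at a rate of roughly 1/sqrt(m), which is why a complexity term of this kind becomes less important as the number of samples grows.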

A generalization bound for variance-invariance-covariance regularization

  • Theorem 2 shows that VICReg improves generalization on supervised downstream tasks.
  • Minimizing the unlabeled invariance loss while controlling the covariance Z_S Z_S^T and the complexity of representations R_m(F) minimizes the expected labeled loss.
  • VICReg does this by maximizing the diagonal entries of Z_S Z_S^T while minimizing its off-diagonal entries.
  • VICReg can be improved in the partially observable setting.
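The three quantities discussed above map onto the three terms of the VICReg loss. The sketch below follows the form published with the original VICReg method (invariance as mean squared error, a hinge on the per-dimension standard deviation, and a penalty on squared off-diagonal covariance entries, with weights 25, 25, and 1). It is written in NumPy for illustration only, and the function name and return format are assumptions.

```python
import numpy as np

def vicreg_terms(z_a, z_b, inv_w=25.0, var_w=25.0, cov_w=1.0, gamma=1.0, eps=1e-4):
    """VICReg-style loss terms for two batches of embeddings of shape (n, k)."""
    n, k = z_a.shape
    invariance = np.mean((z_a - z_b) ** 2)                 # agreement between views

    def variance_term(z):
        std = np.sqrt(z.var(axis=0, ddof=1) + eps)
        return np.mean(np.maximum(0.0, gamma - std))       # push each std above gamma

    def covariance_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return np.sum(off_diag ** 2) / k                   # decorrelate dimensions

    variance = 0.5 * (variance_term(z_a) + variance_term(z_b))
    covariance = covariance_term(z_a) + covariance_term(z_b)
    total = inv_w * invariance + var_w * variance + cov_w * covariance
    return {"invariance": invariance, "variance": variance,
            "covariance": covariance, "total": total}

rng = np.random.default_rng(0)
z_a = rng.standard_normal((256, 32))
z_b = z_a + 0.1 * rng.standard_normal((256, 32))           # a slightly perturbed second view
terms = vicreg_terms(z_a, z_b)
```

A collapsed representation (all embeddings identical) has zero invariance and covariance loss but a large variance penalty, which is how the objective rules out the trivial solution.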

Comparison of generalization bounds

  • SimCLR requires label classes to go to infinity to close the generalization gap
  • VICReg does not require label classes to approach infinity for the generalization gap to go to zero
  • VICReg does not use negative pairs
  • VICReg bound improves as n increases
  • SimCLR bound does not depend on n

Understanding theorem 2 via mutual information maximization

  • Maximize mutual information I(Z; X') in SSL
  • Control complexity of representations f_θ
  • Approximate ln |F| by 2^{I(Z; X)}
  • Maximize predictive information I(Z; X')
  • Minimize superfluous information I(Z; X|X')
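The split into predictive and superfluous information follows from the chain rule of mutual information. The block below is a sketch under an assumption not spelled out in the list: the representation Z depends on the second view X' only through X, so that I(Z; X' | X) = 0.

```latex
\begin{aligned}
I(Z; X, X') &= I(Z; X') + I(Z; X \mid X') = I(Z; X) + I(Z; X' \mid X) \\
\Longrightarrow \quad
I(Z; X) &= \underbrace{I(Z; X')}_{\text{predictive}} \;+\; \underbrace{I(Z; X \mid X')}_{\text{superfluous}}
\end{aligned}
```

Under this decomposition, keeping the predictive term high while shrinking the superfluous term reduces the total information I(Z; X), which is the quantity tied to the complexity of the representation.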

Conclusions

  • Examined Variance-Invariance-Covariance Regularization for self-supervised learning from an information-theoretic perspective
  • Transferred stochasticity to the input distribution
  • Derived VICReg objective from information-theoretic principles
  • Highlighted assumptions implicit in the VICReg objective
  • Derived a VICReg generalization bound for downstream tasks
  • Related it to information maximization
  • Proposed a new VICReg-style SSL objective
  • Suggested that VICReg can be improved for partial label information
  • Opened up avenues for improved estimators and investigations into SSL methods
  • Lower bound on E_x'[log q(z|x')]
  • Theorem 3: DNN output density approximates a mixture of affinely transformed distributions
  • Theorem 4: Generalization bound
  • Used Hoeffding’s inequality and Lemma 2 to analyze the bound
  • Factor c measures the difference between labeled and unlabeled training data
  • Rademacher complexity of the hypothesis space W
  • Conditional expectations to take care of dependent random variables
  • Minimum-norm solution W_S of the labeled training data and the corresponding minimum-norm solution of the unlabeled training data
  • Analyzed the term E_x'[log q(z|x')]