Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Neural Representations can be used to reconstruct a wide range of signals.
  • NeRN is a Neural Representation for Neural Networks.
  • Coordinates are assigned to each convolutional kernel in the network.
  • Smoothness constraint is used to aid NeRN.
  • Knowledge distillation is used to stabilize the learning process.
  • NeRN is demonstrated on CIFAR-10, CIFAR-100, and ImageNet.
  • Two applications are presented to demonstrate the capabilities of NeRN.

Paper Content

Introduction

  • Neural networks have been effective at learning representations in the last decade
  • NeRF demonstrated that a neural network can learn to represent a 3D scene
  • Task is modeled as a prediction problem from a coordinate system to an output
  • SIREN showed neural representations can model images
  • NeRV used neural representations for video encoding
  • This paper explores learning neural representations for pre-trained neural networks
  • NeRN is a predictor neural network that maps coordinates to original kernel weights
  • NeRN can be used to study importance of different weights
  • Framework for NeRN built using PyTorch
  • Neural representations are powerful tools for representing signals
  • Previous representations such as grids or meshes have been surpassed by neural representations
  • Neural representations have been used for image compression, video encoding, camera pose estimation, etc.
  • Weight prediction refers to generating a neural network’s weights using an additional predictor network
  • Knowledge distillation is used to improve the performance of a compressed network

Method

  • Representing convolutional classification networks
  • Overall pipeline presented in Figure 1
  • Design choices and training of NeRN

Designing nerns

  • NeRN is composed of a simple neural network
  • NeRN is trained using 3 losses: reconstruction, knowledge distillation, and feature map reconstruction

I/o modeling

  • Propose to learn a mapping between 3-tuple (l, f, c) and kernel size
  • Sample from middle when predicting smaller kernels
  • Model convolutional layers only, not others
  • Use high dimensional vector space to represent high-frequency variations
  • NeRN predictor is a 5-layer MLP

Training nerns

  • NeRNs require a set of loss functions to be trained
  • Reconstruction loss is the most basic loss
  • Two additional losses are introduced: KD and FMD
  • These losses improve accuracy, promote faster convergence and stabilize training process
  • No direct task loss or labeled data is required
  • Objective function is comprised of reconstruction, KD and FMD losses
  • Weight reconstruction loss is defined as the difference between original and reconstructed network weights
  • FMD loss is defined by normalized feature maps
  • KD loss is defined by the difference between original and reconstructed network outputs
  • Stochastic sampling is used to support large neural networks

Promoting smoothness

  • Videos, images, and 3D objects have inherent smoothness, but this is not the case with neural network weights.
  • Adjacent frames in a video are likely to be similar, but there is no reason for adjacent kernels on a pre-trained network to have similar values.
  • Introducing smoothness between kernels can simplify the task for NeRN.
  • Regularization-based smoothness adds a loss term to the training process of the original network to encourage smoothness.
  • Permutation-based smoothness applies permutations over the pre-trained model’s weights to achieve kernel smoothness.
  • Permutation-based smoothness is solved using graph theory and a greedy algorithm.

Smoothness in positional embeddings

  • NeRN learns to represent kernels by mapping from positional embedding to kernel
  • Embeddings chosen play a part in learning smooth neural networks
  • Adjacent convolution kernels should be similar to one another
  • Positional embeddings should be slowly changing with respect to adjacent kernel coordinates and highly separable with respect to distant coordinates
  • This allows for networks with slowly changing kernels within each layer

Experiments

  • Evaluated proposed method on 3 standard vision classification benchmarks
  • Used NeRN to predict weights of ResNet architectures
  • ResNet chosen for popularity, non-trivial design, and high accuracy
  • Examined various NeRN hidden layer sizes and showed effectiveness of promoting smoothness
  • Adopted Ranger optimizer, learning rate of 5 • 10 −3, and cosine learning rate decay
  • Ran experiments using PyTorch on Nvidia RTX3090

Cifar-10

  • Trained ResNet20/56 to fit CIFAR input size
  • Results in Table 1 show increasing predictor size results in performance gains
  • Regularization term balances original network accuracy and NeRN’s ability to reconstruct the network
  • Optimal regularization factor is 5 • 10−6

Cifar-100

  • Trained ResNet56 to be used as original model
  • Promoting smoothness and using larger predictors resulted in better reconstruction

Imagenet

  • NeRN can learn to represent a network that was trained on a large-scale dataset without access to the training scheme of the original model.
  • NeRN was trained for 160k iterations, using a task input batch size of 32 (4 epochs).
  • Results showed that a NeRN of ∼ 54% the size of the original model was sufficient.

Data-free training

  • NeRN can be trained without using the original task data
  • Distilling knowledge using out-of-domain data is difficult
  • Beyer et al. (2022) achieved worse results with out-of-domain data
  • NeRN can be trained with noise as input and performs slightly worse than with task data

Reconstructing non-smooth networks

  • Method for permutation smoothness offers small overhead
  • Examine proposed positional embeddings
  • Blue plot = smooth embeddings, red plot = nonsmooth embeddings
  • Predictor not affected by smooth embeddings, worsens ability to handle high frequencies
  • Non-smooth embeddings allow for better reconstruction in non-smooth networks
  • Results inferior to those gained by smooth networks

Ablation experiments

  • Results emphasize importance of distillation and reconstruction losses
  • Examined weight sampling methods for gradient computation

Additional applications

  • NeRNs offer a new viewpoint on neural networks by encoding the network weights in another network.
  • NeRN prioritizes the reconstruction of weights based on their influence on the activations and logits, allowing for the visualization of important filters.

Meta-compression

  • NeRN is a compact representation of a neural network
  • Naive magnitude-based pruning can be used to compress the NeRN predictor
  • Structured pruning and quantization can be used as further extensions
  • NeRN can be used post-training to reduce disk size without access to task data

Conclusion

  • Propose a technique to learn a neural representation for neural networks (NeRN)
  • Reconstructs the weights of a pretrained CNN
  • Uses multiple losses and a unique learning scheme
  • Demonstrates importance of weight smoothness
  • Two possible applications: weight importance analysis and meta-compression
  • Provide code and instructions in supplementary materials
  • Visualize reconstructed kernels
  • Initialize NeRN to preserve activation’s variance
  • Train NeRN using three losses
  • Results on popular architectures and benchmarks
  • Size overhead for weight permutations on standard ResNet architectures