Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Neural Representations can be used to reconstruct a wide range of signals.
NeRN is a Neural Representation for Neural Networks.
Coordinates are assigned to each convolutional kernel in the network.
A smoothness constraint is used to aid NeRN.
Knowledge distillation is used to stabilize the learning process.
NeRN is demonstrated on CIFAR-10, CIFAR-100, and ImageNet.
Two applications are presented to demonstrate the capabilities of NeRN.

Paper Content

Introduction

Neural networks have been effective at learning representations in the last decade
NeRF demonstrated that a neural network can learn to represent a 3D scene
Task is modeled as a prediction problem from a coordinate system to an output
SIREN showed neural representations can model images
NeRV used neural representations for video encoding
This paper explores learning neural representations for pre-trained neural networks
NeRN is a predictor neural network that maps coordinates to original kernel weights
NeRN can be used to study importance of different weights
Framework for NeRN built using PyTorch

Related work

Neural representations are powerful tools for representing signals
Previous representations such as grids or meshes have been surpassed by neural representations
Neural representations have been used for image compression, video encoding, camera pose estimation, etc.
Weight prediction refers to generating a neural network’s weights using an additional predictor network
Knowledge distillation is used to improve the performance of a compressed network

Method

Representing convolutional classification networks
Overall pipeline presented in Figure 1
Design choices and training of NeRN

Designing NeRNs

NeRN is composed of a simple neural network
NeRN is trained using three losses: weight reconstruction, knowledge distillation, and feature map distillation

I/O modeling

Propose to learn a mapping from a 3-tuple (l, f, c), the layer, filter, and channel indices, to the corresponding k × k kernel
Sample from middle when predicting smaller kernels
Model convolutional layers only, not others
Use positional embeddings that map the coordinates to a high-dimensional vector space, so the predictor can represent high-frequency variations
NeRN predictor is a 5-layer MLP
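The input/output scheme above can be sketched in plain Python. This is only an illustration under stated assumptions: the hidden width of 16, the ReLU activation, the sin/cos embedding with four frequencies, and the random (untrained) weights are all placeholders, and `embed`, `make_mlp`, `mlp_forward`, and `predict_kernel` are hypothetical names. The part taken from the notes is the shape of the mapping: an (l, f, c) coordinate goes in, a kernel of the largest size comes out, and smaller kernels are read from its middle.

```python
import math
import random

K_MAX = 3  # assumed largest kernel size in the original network


def embed(coord, num_freqs=4, base=1.25):
    """Sin/cos embedding of an (l, f, c) coordinate (illustrative choice)."""
    feats = []
    for p in coord:
        for i in range(num_freqs):
            angle = (base ** i) * p
            feats += [math.sin(angle), math.cos(angle)]
    return feats


def make_mlp(sizes, seed=0):
    """Random weight matrices for a plain MLP; sizes = [in, hidden..., out]."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
         for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp_forward(layers, x):
    """Apply each linear layer, with ReLU on all but the last (assumed)."""
    for idx, weight in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) for row in weight]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def predict_kernel(layers, coord, k):
    """Predict a K_MAX x K_MAX kernel, then keep the central k x k window."""
    flat = mlp_forward(layers, embed(coord))
    full = [flat[r * K_MAX:(r + 1) * K_MAX] for r in range(K_MAX)]
    start = (K_MAX - k) // 2
    return [row[start:start + k] for row in full[start:start + k]]


in_dim = 3 * 4 * 2  # three coordinates, four frequencies, sin and cos
layers = make_mlp([in_dim, 16, 16, 16, 16, K_MAX * K_MAX])  # five linear layers
kernel_3x3 = predict_kernel(layers, (2, 5, 1), k=3)
kernel_1x1 = predict_kernel(layers, (2, 5, 1), k=1)
```

Here `kernel_1x1` is just the center entry of `kernel_3x3`, which is what sampling from the middle amounts to for a 1 × 1 convolution.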

Training NeRNs

NeRNs require a set of loss functions to be trained
Reconstruction loss is the most basic loss
Two additional losses are introduced: knowledge distillation (KD) and feature map distillation (FMD)
These losses improve accuracy, promote faster convergence and stabilize training process
No direct task loss or labeled data is required
Objective function is comprised of reconstruction, KD and FMD losses
Weight reconstruction loss is defined as the difference between original and reconstructed network weights
FMD loss is defined by normalized feature maps
KD loss is defined by the difference between original and reconstructed network outputs
Stochastic sampling is used to support large neural networks
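A minimal sketch of how the three terms could be combined is given below. The exact forms are assumptions: mean squared error for weight reconstruction, KL divergence between softmax outputs for KD, and squared distance between L2-normalized feature maps for FMD. The weighting factors `alpha` and `beta` and all function names are hypothetical. Note that no labels appear anywhere, which matches the point that no direct task loss is needed.

```python
import math


def reconstruction_loss(original, predicted):
    """Mean squared difference between flattened weight lists."""
    return sum((w - p) ** 2 for w, p in zip(original, predicted)) / len(original)


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(original_logits, reconstructed_logits):
    """KL divergence from the original to the reconstructed output distribution."""
    p = softmax(original_logits)
    q = softmax(reconstructed_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def l2_normalize(values, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in values))
    return [v / (norm + eps) for v in values]


def fmd_loss(original_maps, reconstructed_maps):
    """Squared distance between L2-normalized feature maps, summed over layers."""
    total = 0.0
    for a, b in zip(original_maps, reconstructed_maps):
        na, nb = l2_normalize(a), l2_normalize(b)
        total += sum((x - y) ** 2 for x, y in zip(na, nb))
    return total


def nern_objective(weights, pred_weights, logits, pred_logits,
                   maps, pred_maps, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are assumed weights."""
    return (reconstruction_loss(weights, pred_weights)
            + alpha * kd_loss(logits, pred_logits)
            + beta * fmd_loss(maps, pred_maps))


# Toy values: two weights, two output logits, one flattened feature map.
loss = nern_objective(
    [0.1, -0.2], [0.0, 0.0],
    [1.0, 2.0], [2.0, 1.0],
    [[1.0, 2.0]], [[2.0, 1.0]],
)
```

Because the feature maps are normalized before comparison, the FMD term in this sketch ignores a uniform rescaling of a layer's activations and only penalizes changes in their direction.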

Promoting smoothness

Videos, images, and 3D objects have inherent smoothness, but this is not the case with neural network weights.
Adjacent frames in a video are likely to be similar, but there is no reason for adjacent kernels on a pre-trained network to have similar values.
Introducing smoothness between kernels can simplify the task for NeRN.
Regularization-based smoothness adds a loss term to the training process of the original network to encourage smoothness.
Permutation-based smoothness applies permutations over the pre-trained model’s weights to achieve kernel smoothness.
Permutation-based smoothness is solved using graph theory and a greedy algorithm.
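The greedy idea behind permutation-based smoothness can be illustrated with a nearest-neighbor ordering: start from one kernel and repeatedly append the closest kernel not yet placed. This is a sketch of the general technique, not necessarily the procedure used in the paper; the squared-distance measure, the fixed starting index, and the names `distance`, `total_variation`, and `greedy_order` are assumptions.

```python
def distance(a, b):
    """Squared Euclidean distance between two flattened kernels."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def total_variation(kernels):
    """Sum of distances between consecutive kernels; lower means smoother."""
    return sum(distance(a, b) for a, b in zip(kernels, kernels[1:]))


def greedy_order(kernels, start=0):
    """Nearest-neighbor ordering: always append the closest unplaced kernel."""
    remaining = set(range(len(kernels)))
    remaining.remove(start)
    order = [start]
    while remaining:
        last = kernels[order[-1]]
        # Ties are broken by index so the result is deterministic.
        nxt = min(remaining, key=lambda i: (distance(last, kernels[i]), i))
        order.append(nxt)
        remaining.remove(nxt)
    return order


kernels = [[0.0, 0.1], [5.0, 5.2], [0.2, 0.0], [5.1, 5.0], [0.1, 0.1]]
order = greedy_order(kernels)
smoothed = [kernels[i] for i in order]
```

In this toy case the three kernels near zero end up adjacent, followed by the two kernels near five, so `total_variation(smoothed)` is lower than `total_variation(kernels)`. The `order` list has to be kept so the original arrangement can be restored, which is the kind of storage cost a permutation-based scheme adds.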

Smoothness in positional embeddings

NeRN learns to represent kernels by mapping from positional embedding to kernel
Embeddings chosen play a part in learning smooth neural networks
Adjacent convolution kernels should be similar to one another
Positional embeddings should be slowly changing with respect to adjacent kernel coordinates and highly separable with respect to distant coordinates
This allows for networks with slowly changing kernels within each layer
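One way to get embeddings that change slowly between neighbors yet still separate distant coordinates is to lower the base of a sin/cos frequency ladder. The sketch below compares a base of 2 with a base of 1.25. Both values, the eight frequencies, and the normalization of coordinates to the range [0, 1) are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def positional_embedding(p, num_freqs=8, base=2.0):
    """Sin/cos features of a scalar coordinate at frequencies base**i * pi."""
    feats = []
    for i in range(num_freqs):
        angle = (base ** i) * math.pi * p
        feats += [math.sin(angle), math.cos(angle)]
    return feats


def embedding_distance(p, q, **kwargs):
    """Euclidean distance between the embeddings of two coordinates."""
    a = positional_embedding(p, **kwargs)
    b = positional_embedding(q, **kwargs)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Adjacent kernels: coordinates 0.10 and 0.11; distant kernels: 0.10 and 0.60.
adjacent_smooth = embedding_distance(0.10, 0.11, base=1.25)
adjacent_sharp = embedding_distance(0.10, 0.11, base=2.0)
distant_smooth = embedding_distance(0.10, 0.60, base=1.25)
```

With the smaller base, neighboring coordinates land much closer together in embedding space than with base 2, while the distant pair remains far apart.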

Experiments

Evaluated proposed method on 3 standard vision classification benchmarks
Used NeRN to predict weights of ResNet architectures
ResNet chosen for popularity, non-trivial design, and high accuracy
Examined various NeRN hidden layer sizes and showed effectiveness of promoting smoothness
Adopted Ranger optimizer, learning rate of 5 · 10⁻³, and cosine learning rate decay
Ran experiments using PyTorch on Nvidia RTX3090
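For reference, cosine learning rate decay is commonly written as half a cosine period from the initial rate down to a floor. The sketch below uses the 5 · 10⁻³ starting rate from the notes. The floor of zero, the absence of warmup, and the step count in the example are assumptions, and `cosine_lr` is a hypothetical helper rather than an optimizer API.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-3, min_lr=0.0):
    """Learning rate after `step` updates under cosine decay (no warmup)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Rate at the start, midpoint, and end of a hypothetical 1000-step run.
schedule = [cosine_lr(s, 1000) for s in (0, 500, 1000)]
```

The rate starts at the full value, reaches half of it at the midpoint, and decays to the floor at the final step.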

CIFAR-10

Trained ResNet20/56 variants, adapted to the CIFAR input size, to serve as the original models
Results in Table 1 show increasing predictor size results in performance gains
Regularization term balances original network accuracy and NeRN’s ability to reconstruct the network
Optimal regularization factor is 5 · 10⁻⁶

CIFAR-100

Trained ResNet56 to be used as original model
Promoting smoothness and using larger predictors resulted in better reconstruction

ImageNet

NeRN can learn to represent a network that was trained on a large-scale dataset without access to the training scheme of the original model.
NeRN was trained for 160k iterations, using a task input batch size of 32 (4 epochs).
Results showed that a NeRN of about 54% of the size of the original model was sufficient.
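The "4 epochs" figure follows from the iteration count and batch size, assuming ImageNet here refers to the ILSVRC-2012 training set of 1,281,167 images. The variable names are illustrative.

```python
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC-2012 training set size (assumed dataset)

iterations = 160_000
batch_size = 32

images_seen = iterations * batch_size          # 5,120,000 images
epochs = images_seen / IMAGENET_TRAIN_IMAGES   # just under 4 passes over the data
```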

Data-free training

NeRN can be trained without using the original task data
Distilling knowledge using out-of-domain data is difficult
Beyer et al. (2022) achieved worse results with out-of-domain data
NeRN can be trained with noise as input and performs slightly worse than with task data

Reconstructing non-smooth networks

The permutation-based smoothness method incurs a small overhead
Examine proposed positional embeddings
In the accompanying plot, blue denotes smooth embeddings and red denotes non-smooth embeddings
For non-smooth networks, the predictor does not benefit from smooth embeddings, which worsen its ability to handle high frequencies
Non-smooth embeddings allow for better reconstruction in non-smooth networks
Results inferior to those gained by smooth networks

Ablation experiments

Results emphasize importance of distillation and reconstruction losses
Examined weight sampling methods for gradient computation
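To make the idea of weight sampling concrete, the sketch below draws a batch of kernel indices in two ways: uniformly, and with probability proportional to kernel magnitude. These two strategies are illustrative assumptions and are not claimed to be the ones evaluated in the ablation; the function names are hypothetical.

```python
import random


def uniform_batch(num_kernels, batch_size, rng):
    """Sample distinct kernel indices uniformly at random."""
    return rng.sample(range(num_kernels), batch_size)


def magnitude_weighted_batch(kernels, batch_size, rng):
    """Sample indices (with replacement) in proportion to kernel L1 norm."""
    weights = [sum(abs(v) for v in kernel) for kernel in kernels]
    return rng.choices(range(len(kernels)), weights=weights, k=batch_size)


rng = random.Random(0)
kernels = [[0.0, 0.0], [1.0, -1.0], [2.0, 2.0], [0.5, 0.5]]
uniform_idx = uniform_batch(len(kernels), 2, rng)
weighted_idx = magnitude_weighted_batch(kernels, 8, rng)
```

Only the sampled kernels would contribute to the reconstruction gradient in a given step, which keeps the cost per iteration bounded for large networks. In the weighted variant, the all-zero kernel at index 0 is never selected.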

Additional applications

NeRNs offer a new viewpoint on neural networks by encoding the network weights in another network.
NeRN prioritizes the reconstruction of weights based on their influence on the activations and logits, allowing for the visualization of important filters.

Meta-compression

NeRN is a compact representation of a neural network
Naive magnitude-based pruning can be used to compress the NeRN predictor
Structured pruning and quantization can be used as further extensions
NeRN can be used post-training to reduce disk size without access to task data
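Naive magnitude-based pruning can be sketched as zeroing the fraction of predictor weights with the smallest absolute value. This is a generic illustration over a flat list of weights, not the paper's implementation; `magnitude_prune` and the 50% sparsity in the example are assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude fraction zeroed."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_pruned = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned


predictor_weights = [0.5, -0.01, 0.2, -0.9, 0.03, 0.0]
pruned_weights = magnitude_prune(predictor_weights, sparsity=0.5)
```

The disk-size saving only materializes once the zeroed weights are stored in a sparse or otherwise compressed format.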

Conclusion

Propose a technique to learn a neural representation for neural networks (NeRN)
Reconstructs the weights of a pretrained CNN
Uses multiple losses and a unique learning scheme
Demonstrates importance of weight smoothness
Two possible applications: weight importance analysis and meta-compression
Provide code and instructions in supplementary materials
Visualize reconstructed kernels
Initialize NeRN to preserve activation’s variance
Train NeRN using three losses
Results on popular architectures and benchmarks
Size overhead for weight permutations on standard ResNet architectures
