Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Neural Representations can be used to reconstruct a wide range of signals.
- NeRN is a Neural Representation for Neural Networks.
- Coordinates are assigned to each convolutional kernel in the network.
- Smoothness constraint is used to aid NeRN.
- Knowledge distillation is used to stabilize the learning process.
- NeRN is demonstrated on CIFAR-10, CIFAR-100, and ImageNet.
- Two applications are presented to demonstrate the capabilities of NeRN.
Paper Content
Introduction
- Neural networks have been effective at learning representations in the last decade
- NeRF demonstrated that a neural network can learn to represent a 3D scene
- Task is modeled as a prediction problem from a coordinate system to an output
- SIREN showed neural representations can model images
- NeRV used neural representations for video encoding
- This paper explores learning neural representations for pre-trained neural networks
- NeRN is a predictor neural network that maps coordinates to original kernel weights
- NeRN can be used to study importance of different weights
- Framework for NeRN built using PyTorch
Related work
- Neural representations are powerful tools for representing signals
- Previous representations such as grids or meshes have been surpassed by neural representations
- Neural representations have been used for image compression, video encoding, camera pose estimation, etc.
- Weight prediction refers to generating a neural network’s weights using an additional predictor network
- Knowledge distillation is used to improve the performance of a compressed network
Method
- Representing convolutional classification networks
- Overall pipeline presented in Figure 1
- Design choices and training of NeRN
Designing nerns
- NeRN is composed of a simple neural network
- NeRN is trained using 3 losses: reconstruction, knowledge distillation, and feature map reconstruction
I/o modeling
- Propose to learn a mapping between 3-tuple (l, f, c) and kernel size
- Sample from middle when predicting smaller kernels
- Model convolutional layers only, not others
- Use high dimensional vector space to represent high-frequency variations
- NeRN predictor is a 5-layer MLP
Training nerns
- NeRNs require a set of loss functions to be trained
- Reconstruction loss is the most basic loss
- Two additional losses are introduced: KD and FMD
- These losses improve accuracy, promote faster convergence and stabilize training process
- No direct task loss or labeled data is required
- Objective function is comprised of reconstruction, KD and FMD losses
- Weight reconstruction loss is defined as the difference between original and reconstructed network weights
- FMD loss is defined by normalized feature maps
- KD loss is defined by the difference between original and reconstructed network outputs
- Stochastic sampling is used to support large neural networks
Promoting smoothness
- Videos, images, and 3D objects have inherent smoothness, but this is not the case with neural network weights.
- Adjacent frames in a video are likely to be similar, but there is no reason for adjacent kernels on a pre-trained network to have similar values.
- Introducing smoothness between kernels can simplify the task for NeRN.
- Regularization-based smoothness adds a loss term to the training process of the original network to encourage smoothness.
- Permutation-based smoothness applies permutations over the pre-trained model’s weights to achieve kernel smoothness.
- Permutation-based smoothness is solved using graph theory and a greedy algorithm.
Smoothness in positional embeddings
- NeRN learns to represent kernels by mapping from positional embedding to kernel
- Embeddings chosen play a part in learning smooth neural networks
- Adjacent convolution kernels should be similar to one another
- Positional embeddings should be slowly changing with respect to adjacent kernel coordinates and highly separable with respect to distant coordinates
- This allows for networks with slowly changing kernels within each layer
Experiments
- Evaluated proposed method on 3 standard vision classification benchmarks
- Used NeRN to predict weights of ResNet architectures
- ResNet chosen for popularity, non-trivial design, and high accuracy
- Examined various NeRN hidden layer sizes and showed effectiveness of promoting smoothness
- Adopted Ranger optimizer, learning rate of 5 • 10 −3, and cosine learning rate decay
- Ran experiments using PyTorch on Nvidia RTX3090
Cifar-10
- Trained ResNet20/56 to fit CIFAR input size
- Results in Table 1 show increasing predictor size results in performance gains
- Regularization term balances original network accuracy and NeRN’s ability to reconstruct the network
- Optimal regularization factor is 5 • 10−6
Cifar-100
- Trained ResNet56 to be used as original model
- Promoting smoothness and using larger predictors resulted in better reconstruction
Imagenet
- NeRN can learn to represent a network that was trained on a large-scale dataset without access to the training scheme of the original model.
- NeRN was trained for 160k iterations, using a task input batch size of 32 (4 epochs).
- Results showed that a NeRN of ∼ 54% the size of the original model was sufficient.
Data-free training
- NeRN can be trained without using the original task data
- Distilling knowledge using out-of-domain data is difficult
- Beyer et al. (2022) achieved worse results with out-of-domain data
- NeRN can be trained with noise as input and performs slightly worse than with task data
Reconstructing non-smooth networks
- Method for permutation smoothness offers small overhead
- Examine proposed positional embeddings
- Blue plot = smooth embeddings, red plot = nonsmooth embeddings
- Predictor not affected by smooth embeddings, worsens ability to handle high frequencies
- Non-smooth embeddings allow for better reconstruction in non-smooth networks
- Results inferior to those gained by smooth networks
Ablation experiments
- Results emphasize importance of distillation and reconstruction losses
- Examined weight sampling methods for gradient computation
Additional applications
- NeRNs offer a new viewpoint on neural networks by encoding the network weights in another network.
- NeRN prioritizes the reconstruction of weights based on their influence on the activations and logits, allowing for the visualization of important filters.
Meta-compression
- NeRN is a compact representation of a neural network
- Naive magnitude-based pruning can be used to compress the NeRN predictor
- Structured pruning and quantization can be used as further extensions
- NeRN can be used post-training to reduce disk size without access to task data
Conclusion
- Propose a technique to learn a neural representation for neural networks (NeRN)
- Reconstructs the weights of a pretrained CNN
- Uses multiple losses and a unique learning scheme
- Demonstrates importance of weight smoothness
- Two possible applications: weight importance analysis and meta-compression
- Provide code and instructions in supplementary materials
- Visualize reconstructed kernels
- Initialize NeRN to preserve activation’s variance
- Train NeRN using three losses
- Results on popular architectures and benchmarks
- Size overhead for weight permutations on standard ResNet architectures