Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Public model zoos contain powerful pretrained models
The open question is how to assemble these models to obtain accuracy-efficiency trade-offs
SN-Net framework for model deployment produces networks with different complexity and performance
SN-Net splits pretrained networks (anchors) and stitches them together
SN-Net can adapt to dynamic resource constraints
A single SN-Net can compete with hundreds of individually trained models

Computational resources and data have enabled researchers to build powerful deep neural networks
There are thousands of models available to download and execute
Existing scalable deep learning frameworks are limited to a single model design space
Stitchable Neural Network (SN-Net) is a novel scalable deep learning framework for efficient model design and deployment
SN-Net stitches an off-the-shelf pretrained model family with much less training effort
SN-Net covers a fine-grained level of model complexity/performance for a wide range of deployment scenarios
SN-Net breaks the limit of a single pretrained model or supernet design
Training SN-Net is as easy as training individual models
SN-Net performance is almost predictable, because stitches interpolate between the performance of their anchors
SN-Net is a new universal paradigm with a “many-to-many” pipeline
SN-Net is a general approach for utilising the pretrained model families in the large-scale model zoo

Model parameters of a pretrained neural network are indicated by θ
A feed-forward neural network can be defined as a composition of functions
Model stitching involves splitting a neural network into two portions of functions at a layer index l
A stitching layer is used to implement a transformation between the activation space of two different networks
Model stitching can produce a sequence of stitched networks
Different architectures can be stitched together without significant performance drop
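
The definition above can be sketched in a few lines of Python. This is a toy illustration, not the paper's code: the scalar "layers", the `compose` and `stitch` helpers, and the identity stitching layer are all assumptions made for the example. A network is a list of functions, it is split at layer index `l`, and a stitching layer maps the output of the front portion into the input space of the back portion.

```python
from functools import reduce
from typing import Callable, Sequence

Layer = Callable[[float], float]


def compose(layers: Sequence[Layer]) -> Layer:
    """Apply layers left to right: compose([f1, f2])(x) == f2(f1(x))."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)


def stitch(front_net: Sequence[Layer], back_net: Sequence[Layer],
           l: int, stitching_layer: Layer) -> Layer:
    """First l layers of front_net, then the stitching layer,
    then the layers of back_net that follow index l."""
    head = compose(front_net[:l])
    tail = compose(back_net[l:])
    return lambda x: tail(stitching_layer(head(x)))


# Two hypothetical three-layer networks acting on scalars.
net_a = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
net_b = [lambda x: x * 10.0, lambda x: x + 5.0, lambda x: x / 2.0]
identity = lambda x: x  # placeholder stitching layer

stitched = stitch(net_a, net_b, l=2, stitching_layer=identity)
print(stitched(3.0))  # (3 + 1) * 2 = 8, then 8 / 2 = 4.0

# Varying l yields a sequence of stitched networks.
sequence = [stitch(net_a, net_b, l, identity) for l in range(1, len(net_a))]
print([f(3.0) for f in sequence])  # [4.5, 4.0]
```

In a real setting the two networks have different activation widths, so the stitching layer cannot be the identity. It has to be a learned map between the two spaces.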

SN-Net is a new “many-to-many” elastic model paradigm
It is motivated by the increasing number of pretrained models in the publicly available model zoo
SN-Net inserts a few stitching layers to connect a family of pretrained models
Anchors should be consistent in terms of the pretrained domain
Stitching layers are 1x1 convolutional layers
Least-squares solution is used as the default initialization approach
Fast-to-Slow is the default stitching direction
Nearest stitching strategy is used
Stitching is done as sliding windows
Training strategy uses knowledge distillation
Experiments are conducted on ImageNet-1K
Models studied are DeiT, Swin Transformer and ResNet, plus stitching a CNN with a ViT
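
The least-squares initialization mentioned above can be illustrated with NumPy. A 1x1 convolution applies the same linear map at every spatial position, so initializing it amounts to a linear regression from the activations of one anchor to those of the other. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `least_squares_init`, the bias column, and the synthetic activation sizes are all made up for the example.

```python
import numpy as np


def least_squares_init(x_src: np.ndarray, x_dst: np.ndarray):
    """Fit x_dst ~= x_src @ weight + bias in the least-squares sense.

    x_src: (n, d_src) activations from the anchor that provides the front part.
    x_dst: (n, d_dst) activations from the anchor that provides the back part.
    """
    ones = np.ones((x_src.shape[0], 1))
    augmented = np.hstack([x_src, ones])  # extra column of ones for the bias
    solution, *_ = np.linalg.lstsq(augmented, x_dst, rcond=None)
    weight, bias = solution[:-1], solution[-1]
    return weight, bias


# Synthetic activations (hypothetical sizes): 100 positions, 4 -> 6 channels.
rng = np.random.default_rng(0)
x_src = rng.normal(size=(100, 4))
true_weight = rng.normal(size=(4, 6))
true_bias = rng.normal(size=6)
x_dst = x_src @ true_weight + true_bias

weight, bias = least_squares_init(x_src, x_dst)
print(np.allclose(weight, true_weight), np.allclose(bias, true_bias))
```

Because the synthetic data here is noise-free and exactly linear, the fit recovers the generating map. With real activations the solution is only a starting point that training then refines.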

Generate stitching configuration set by assembling ImageNet-1K pretrained DeiT-Ti/S/B
Jointly train the stitches in the DeiT-based SN-Net on ImageNet for 50 epochs
Visualize performance of all 71 stitches, including 3 anchors
Performance increases when stitching more blocks from larger anchor
Model-level interpolation between two anchors
SN-Net achieves better performance than individual models trained from scratch
SN-Net reduces training cost and disk storage compared to training and saving all individual networks
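
As a rough picture of what a stitching configuration set looks like, the sketch below enumerates Fast-to-Slow stitches between neighbouring anchors only (the nearest stitching idea). It is a deliberately simplified assumption: it leaves out the sliding-window scheme, so its count is not expected to match the 71 stitches reported above. The anchor labels and the `enumerate_stitches` helper are hypothetical. The only fact relied on is that DeiT-Ti/S/B each have 12 transformer blocks.

```python
from typing import List, Tuple

# (front_anchor, back_anchor, split): the first `split` blocks come from
# front_anchor and the remaining blocks come from back_anchor.
Stitch = Tuple[str, str, int]


def enumerate_stitches(anchors: List[Tuple[str, int]]) -> List[Stitch]:
    """List Fast-to-Slow stitches between neighbouring anchors.

    `anchors` holds (name, depth) pairs sorted from smallest to largest model.
    Equal depths are assumed so that one split index applies to both anchors.
    """
    configs: List[Stitch] = []
    # zip pairs each anchor with its next-larger neighbour only.
    for (small, depth_s), (large, depth_l) in zip(anchors, anchors[1:]):
        if depth_s != depth_l:
            raise ValueError("this sketch only handles anchors of equal depth")
        for split in range(1, depth_s):
            configs.append((small, large, split))
    return configs


anchors = [("deit_tiny", 12), ("deit_small", 12), ("deit_base", 12)]
stitches = enumerate_stitches(anchors)
print(len(stitches))  # 22 in this simplified scheme (2 pairs x 11 splits)
print(stitches[0])    # ('deit_tiny', 'deit_small', 1)
```

A larger `split` keeps more blocks from the smaller anchor, so moving `split` towards 1 shifts the stitch towards the larger, slower model.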

Ablations use the default training strategy of 50 epochs on ImageNet
LS Init serves as a good starting point for learning stitching layers compared to Kaiming Init
Fast-to-Slow helps to ensure better performance for most stitches
Nearest stitching strategy limits a stitch to connect with a pair of anchors with nearest model complexity/performance
Tuning only the stitching layers is promising for only some stitches
Tuning the full model improves the performance of the stitches

Introduced Stitchable Neural Networks (SN-Net)
Framework for developing elastic neural networks
Inherit knowledge from pretrained model families
Deliver fast and flexible accuracy-efficiency trade-offs
Low cost for massive deployment of deep models
Extendable to natural language processing, dense prediction and transfer learning
Limitations: large stitching space requires more training epochs
Nearest stitching strategy limits stitches to two types
Different settings of sliding windows can produce different numbers of stitches
Default setting of 50 training epochs
15 epochs still produces good performance
Simple training strategy of randomly sampling a stitch
Sandwich sampling rule and inplace distillation explored
Pretrained weights of anchors necessary for convergence
Default of 100 training images to initialize stitching layers
More samples do not bring more performance gain
SN-Net is able to switch network topology at runtime
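
The simple training strategy noted above, randomly sampling a stitch, can be sketched as a loop that picks one stitch per iteration and runs an update for that stitch alone. Everything below is a toy under stated assumptions: `train_by_random_sampling` and `toy_step` are hypothetical names, uniform sampling is assumed, and each stitch owns an independent scalar weight. In a real SN-Net the stitches share the anchors' weights, so an update to one stitch also affects others.

```python
import random
from typing import Callable, Dict, List


def train_by_random_sampling(stitch_ids: List[str],
                             train_step: Callable[[str], None],
                             num_iters: int,
                             seed: int = 0) -> Dict[str, int]:
    """Each iteration samples one stitch uniformly and trains only that one."""
    rng = random.Random(seed)
    visits = {name: 0 for name in stitch_ids}
    for _ in range(num_iters):
        stitch = rng.choice(stitch_ids)
        train_step(stitch)
        visits[stitch] += 1
    return visits


# Toy stand-in for a real optimisation step: each stitch owns one scalar
# weight that is pulled towards its own target by gradient descent.
targets = {f"stitch_{i}": float(i + 1) for i in range(5)}
weights = {name: 0.0 for name in targets}


def toy_step(name: str, lr: float = 0.1) -> None:
    grad = 2.0 * (weights[name] - targets[name])  # d/dw of (w - target)^2
    weights[name] -= lr * grad


visits = train_by_random_sampling(list(targets), toy_step, num_iters=2000)
print(visits)
```

Sampling a single stitch per iteration keeps the cost of one step close to that of training one ordinary network, which is in line with the note that training SN-Net is as easy as training individual models.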