Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

ControlNet is a neural network structure used to control pretrained large diffusion models.
ControlNet can be trained on small datasets (< 50k) and is as fast as fine-tuning a diffusion model.
ControlNet can be trained on personal devices or powerful computation clusters.
ControlNet can be used to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.

Paper Content

Introduction

Large text-to-image models can generate visually appealing images with a short descriptive prompt.
Data scale in task-specific domains is often much smaller than in the general image-text domain.
Fast training methods are important for optimizing large models to specific tasks within an acceptable amount of time and memory space.
Various image processing problems have diverse forms of problem definitions, user controls, or image annotations.
ControlNet is an end-to-end neural network architecture that controls large image diffusion models.
ControlNet clones the weights of a large diffusion model into a “trainable copy” and a “locked copy”.
ControlNet can be trained on small datasets (fewer than 50k or even 1k samples) and large datasets (millions of samples).

Hypernetwork and neural network structure

HyperNetwork originated as a language-processing method in which a small recurrent network is trained to influence the weights of a larger network
HyperNetwork has been used in image generation and other machine learning tasks
ControlNet and HyperNetwork are similar in the way they influence neural networks
Early neural network studies discussed the initialization of network weights
Recent studies have discussed methods to scale the initial weight of convolution layers to improve training

Diffusion probabilistic model

Diffusion probabilistic model proposed in [52]
Successful results of image generation reported at small and large scale
Improved by training and sampling methods like DDPM, DDIM, and score-based diffusion
Strategies to save computation power when handling high-resolution images
Pyramid-based or multiple-stage methods used
U-net used as neural network architecture
Latent Diffusion Model (LDM) proposed to reduce computation power required for training diffusion model

Text-to-image diffusion

Diffusion models can be used to generate images from text.
Pretrained language models like CLIP are used to encode text inputs into latent vectors.
Glide, Disco Diffusion, Stable Diffusion and Imagen are examples of text-to-image generating models.

Personalization, customization, and control of pretrained diffusion model

State-of-the-art image diffusion models are dominated by text-to-image methods
Control over a diffusion model can be achieved by manipulating CLIP features
Image diffusion process can provide color-level detail variations
Image diffusion algorithms support inpainting as a way to control results
Textual Inversion and DreamBooth can customize/personalize generated results using a small set of images

Image-to-image translation

Image-to-image translation is targeted to learn a mapping between images in different domains
ControlNet is targeted to control a diffusion model with task-specific conditions
Several methods have been developed for image-to-image translation, including conditional generative neural networks, autoregressive methods, multi-model methods, Taming Transformer, Palette, PITI, and optimization-based methods
Experiments are conducted to test these methods

Method

ControlNet is a neural network architecture that can enhance pretrained image diffusion models with task-specific conditions.
Sections 3.1-3.5 describe the structure, application, learning objective, training methods, and implementations of ControlNet.

ControlNet

ControlNet manipulates the input conditions of neural network blocks to control the behavior of an entire neural network.
A “network block” is a set of neural layers used to build neural networks.
A neural network block transforms an input feature map into another feature map.
Parameters of the neural network block are cloned into a trainable copy.
The trainable copy is trained with an external condition vector.
A unique type of convolution layer called “zero convolution” is used to connect the neural network blocks.
The zero convolution layer is initialized with zeros and is optimized into non-zero parameters.
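The points above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: a "network block" is reduced to a 1x1 convolution plus ReLU at a single pixel, and the names `conv1x1`, `make_layer`, `block`, and `controlled_block` are hypothetical. It assumes the block output is combined as y = F(x; locked) + Z_out(F(x + Z_in(c); trainable)), where Z_in and Z_out are zero convolutions. Because both start at zero, the controlled block initially reproduces the locked block exactly.

```python
import random


def conv1x1(layer, x):
    """Apply a 1x1 convolution at a single pixel: a linear map over channels."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(layer["w"], layer["b"])
    ]


def make_layer(channels, rng=None):
    """Random weights if rng is given; otherwise an all-zero 'zero convolution'."""
    draw = (lambda: rng.uniform(-1.0, 1.0)) if rng else (lambda: 0.0)
    return {
        "w": [[draw() for _ in range(channels)] for _ in range(channels)],
        "b": [draw() for _ in range(channels)],
    }


def block(layer, x):
    """Toy stand-in for a network block: 1x1 convolution followed by ReLU."""
    return [max(0.0, v) for v in conv1x1(layer, x)]


def controlled_block(locked, trainable, zero_in, zero_out, x, c):
    """Locked output plus the zero-convolved output of the trainable copy."""
    base = block(locked, x)
    # Inject the condition c into the trainable copy's input via a zero convolution.
    conditioned = [xi + zi for xi, zi in zip(x, conv1x1(zero_in, c))]
    # Pass the trainable copy's output through a second zero convolution.
    control = conv1x1(zero_out, block(trainable, conditioned))
    return [b + k for b, k in zip(base, control)]


rng = random.Random(0)
locked = make_layer(4, rng)
# The trainable copy starts as a clone of the locked weights.
trainable = {"w": [row[:] for row in locked["w"]], "b": locked["b"][:]}
zero_in, zero_out = make_layer(4), make_layer(4)

x = [0.5, -1.0, 2.0, 0.1]   # input feature (one pixel, four channels)
c = [1.0, 1.0, 0.0, -1.0]   # external condition vector
y = controlled_block(locked, trainable, zero_in, zero_out, x, c)
print(y == block(locked, x))
```

Once training moves the zero convolutions away from zero, the second term starts contributing and the condition begins to steer the output.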

ControlNet in image diffusion model

Stable Diffusion is used as an example to introduce a method to control a large diffusion model.
Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert 512x512 images into 64x64 latent images, so image conditions are likewise converted to the 64x64 feature space.

Training

Image diffusion models learn to denoise images to generate samples.
Denoising can happen in pixel space or a “latent” space encoded from training data.
Stable Diffusion uses latent images as the training domain.
Diffusion algorithms progressively add noise to an image, producing a noisy image.
A network is trained to predict the noise that was added to the noisy image.
During training, 50% of text prompts are replaced with empty strings.
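The training steps listed above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's code: the noising rule is the standard DDPM form z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps, the loss is a mean squared error between true and predicted noise, and the helper names (`add_noise`, `noise_prediction_loss`, `drop_prompts`) are hypothetical.

```python
import math
import random


def add_noise(z0, alpha_bar_t, rng):
    """Forward-noise a latent z0; alpha_bar_t in [0, 1] is the cumulative signal level."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    zt = [signal * z + noise * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the network's prediction."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


def drop_prompts(prompts, rng, drop_prob=0.5):
    """Replace each prompt with an empty string with probability drop_prob."""
    return ["" if rng.random() < drop_prob else p for p in prompts]


rng = random.Random(0)
zt, eps = add_noise([1.0, 2.0, 3.0], alpha_bar_t=0.5, rng=rng)
batch = drop_prompts(["a house", "a dog", "a tree", "a cup"], rng)
print(len(zt), batch)
```

Dropping prompts forces the model to rely on the conditioning image alone for part of the training data, rather than always leaning on the text.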

Improved training

Strategies to improve the training of ControlNets are discussed for two extremes: very limited computation devices and very powerful ones.
Disconnecting the link to decoder 1,2,3,4 and only connecting the middle block can improve the training speed.
When powerful computation clusters and large datasets are available, the entire model can be trained as a whole.
Experiments were conducted on the DIODE, Normal Maps (extended) and Cartoon Line Drawing datasets.

Experimental settings

CFG-scale set to 9.0
Sampler used is DDIM
20 steps used by default
4 types of prompts tested: no prompt, default prompt, automatic prompt, user prompt
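For readers unfamiliar with the CFG-scale setting, the sketch below shows how a guidance scale is commonly applied at each sampling step. It assumes the standard classifier-free guidance combination, eps = eps_uncond + scale * (eps_cond - eps_uncond); the summary does not spell this formula out, and `SETTINGS` and `guided_noise` are hypothetical names used only for illustration.

```python
# Settings listed above, collected in one place (the dict layout is illustrative).
SETTINGS = {"cfg_scale": 9.0, "sampler": "DDIM", "steps": 20}


def guided_noise(eps_uncond, eps_cond, scale):
    """Combine unconditional and conditional noise predictions with a guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-element latent.
eps_uncond = [0.0, 0.0]
eps_cond = [1.0, 2.0]
print(guided_noise(eps_uncond, eps_cond, SETTINGS["cfg_scale"]))
```

A scale of 1.0 returns the conditional prediction unchanged; larger values push the sample further toward the prompt.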

Qualitative results

Qualitative results are presented in Figures 4 and 5

Ablation study

Model trained without using ControlNet
A sudden convergence phenomenon can occur during training, between 5,000 and 10,000 steps with a learning rate of 1e-5
Canny-edge-based ControlNets trained with different dataset scales

Comparison to previous methods

Comparison to Stability’s Depth-to-Image model shown in Fig. 14
Comparison to PITI [59] shown in Fig. 17
Comparison to sketch-guided diffusion [58] shown in Fig. 18
Comparison to Taming transformer [11] shown in Fig. 19

Comparison of pre-trained models

Pre-trained models are compared in Figs. 23, 24, and 25

More applications

Diffusion process can be masked for pen-based image editing
Model can achieve accurate control of details for simple objects
ControlNet can be applied to only 50% of the diffusion iterations to get results that do not follow the input shapes
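Applying control to only part of the diffusion iterations can be sketched as a per-step on/off schedule. The summary does not say which half of the iterations receives control, so the sketch below assumes the earliest steps; that choice, and the name `control_schedule`, are assumptions made for illustration only.

```python
def control_schedule(num_steps, fraction=0.5):
    """Return one flag per sampling step: True where the control signal is applied.

    Assumption: control is applied to the earliest steps and dropped afterwards.
    """
    cutoff = round(num_steps * fraction)
    return [step < cutoff for step in range(num_steps)]


schedule = control_schedule(20)
for step, use_control in enumerate(schedule):
    # A real sampler would add the control outputs only when use_control is True.
    pass
print(sum(schedule), "of", len(schedule), "steps use control")
```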

Limitation

Model may have difficulty generating correct contents when semantic interpretation is wrong
Figures 28-29 show source images for edge detection, pose extraction, etc.
ControlNet structure compared to standard method used by Stable Diffusion
ControlNet applied to arbitrary neural network block
ControlNet applied to Stable Diffusion
Automatic prompts generated by BLIP
Semantic consistency of “wall”, “paper”, and “cup” is difficult to handle
ControlNet used to create trainable copy of 12 encoding blocks and 1 middle block of Stable Diffusion
ControlNet architecture likely to be usable in other diffusion models
Results achieved with default prompt
Most figures in paper are high-resolution images
Comparison to Sketch-guided diffusion and Taming Transformers
Sudden convergence phenomenon
Comparison of six detection types and corresponding results
Masked diffusion supported by all diffusion models
Example of simple object
Coarse-level control
Limitation when the semantics of the input image are mistakenly recognized
Appendix: all original source images for edge detection, semantic segmentation, pose extraction, etc.
