Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- ControlNet is a neural network structure used to control pretrained large diffusion models.
- ControlNet can be trained on small datasets (< 50k) and is as fast as fine-tuning a diffusion model.
- ControlNet can be trained on personal devices or powerful computation clusters.
- ControlNet can be used to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.
Paper Content
Introduction
- Large text-to-image models can generate visually appealing images with a short descriptive prompt.
- Data scale in task-specific domains is often much smaller than in general image-text domain.
- Fast training methods are important for optimizing large models to specific tasks within an acceptable amount of time and memory space.
- Various image processing problems have diverse forms of problem definitions, user controls, or image annotations.
- ControlNet is an end-to-end neural network architecture that controls large image diffusion models.
- ControlNet clones the weights of a large diffusion model into a “trainable copy” and a “locked copy”.
- ControlNet can be trained on small datasets (less than 50k or even 1k) and large datasets (millions of samples).
Hypernetwork and neural network structure
- HyperNetwork is a neural language processing method used to influence the weights of a larger network
- HyperNetwork has been used in image generation and other machine learning tasks
- ControlNet and HyperNetwork are similar in the way they influence neural networks
- Early neural network studies discussed the initialization of network weights
- Recent studies have discussed methods to scale the initial weight of convolution layers to improve training
Diffusion probabilistic model
- Diffusion probabilistic model proposed in [52]
- Successful results of image generation reported at small and large scale
- Improved by training and sampling methods like DDPM, DDIM, and score-based diffusion
- Strategies to save computation power when handling high-resolution images
- Pyramid-based or multiple-stage methods used
- U-net used as neural network architecture
- Latent Diffusion Model (LDM) proposed to reduce computation power required for training diffusion model
Text-to-image diffusion
- Diffusion models can be used to generate images from text.
- Pretrained language models like CLIP are used to encode text inputs into latent vectors.
- Glide, Disco Diffusion, Stable Diffusion and Imagen are examples of text-to-image generating models.
Personalization,customization, and control of pretrained diffusion model
- State-of-the-art image diffusion models are dominated by text-to-image methods
- Control over a diffusion model can be achieved by manipulating CLIP features
- Image diffusion process can provide color-level detail variations
- Image diffusion algorithms support inpainting as a way to control results
- Textual Inversion and DreamBooth can customize/personalize generated results using a small set of images
Image-to-image translation
- Image-to-image translation is targeted to learn a mapping between images in different domains
- ControlNet is targeted to control a diffusion model with task-specific conditions
- Several methods have been developed for image-to-image translation, including conditional generative neural networks, autoregressive methods, multi-model methods, Taming Transformer, Palette, PITI, and optimization-based methods
- Experiments are conducted to test these methods
Method
- ControlNet is a neural network architecture that can improve pretrained image diffusion models.
- Sections 3.1-3.5 describe the structure, application, learning objective, training methods, and implementations of ControlNet.
Controlnet
- ControlNet manipulates the input conditions of neural network blocks to control the behavior of an entire neural network.
- A “network block” is a set of neural layers used to build neural networks.
- A neural network block transforms an input feature map into another feature map.
- Parameters of the neural network block are cloned into a trainable copy.
- The trainable copy is trained with an external condition vector.
- A unique type of convolution layer called “zero convolution” is used to connect the neural network blocks.
- The zero convolution layer is initialized with zeros and is optimized into non-zero parameters.
Controlnet in image diffusion model
- Stable Diffusion is used as an example to introduce a method to control a large diffusion model.
- Pre-processing method similar to VQ-GAN is used, which converts 512x512 image conditions to 64x64 feature maps.
Training
- Image diffusion models learn to denoise images to generate samples.
- Denoising can happen in pixel space or a “latent” space encoded from training data.
- Stable Diffusion uses latent images as the training domain.
- Diffusion algorithms add noise to the image and produces a noisy image.
- Network is learned to predict the noise added to the noisy image.
- During training, 50% of text prompts are replaced with empty strings.
Improved training
- Strategies to improve the training of ControlNets are discussed, especially when the computation device is limited or powerful.
- Disconnecting the link to decoder 1,2,3,4 and only connecting the middle block can improve the training speed.
- When powerful computation clusters and large datasets are available, the entire model can be trained as a whole.
- Experiments were conducted on the DIODE, Normal Maps (extended) and Cartoon Line Drawing datasets.
Experimental settings
- CFG-scale set to 9.0
- Sampler used is DDIM
- 20 steps used by default
- 4 types of prompts tested: no prompt, default prompt, automatic prompt, user prompt
Qualitative results
- Results are presented in Figures 4 and 5
- Qualitative results are shown
Ablation study
- Model trained without using ControlNet
- Sudden convergence phenomenon during training process from 5000 to 10000 steps with 1e-5 learning rate
- Canny-edge-based ControlNets trained with different dataset scales
Comparison to previous methods
- Comparison to Stability’s Depth-to-Image model shown in Fig. 14
- Comparison to PITI [59] shown in Fig. 17
- Comparison to sketch-guided diffusion [58] shown in Fig. 18
- Comparison to Taming transformer [11] shown in Fig. 19
Comparison of pre-trained models
- Comparison of pre-trained models in Fig. 23, 24, 25
- Results of comparison shown in figures
More applications
- Diffusion process can be masked for pen-based image editing
- Model can achieve accurate control of details for simple objects
- ControlNet can be applied to 50% diffusion iterations to get results that do not follow input shapes
Limitation
- Model may have difficulty generating correct contents when semantic interpretation is wrong
- Figures 28-29 show source images for edge detection, pose extraction, etc.
- ControlNet structure compared to standard method used by Stable Diffusion
- ControlNet applied to arbitrary neural network block
- ControlNet applied to Stable Diffusion
- Automatic prompts generated by BLIP
- Semantic consistency of “wall”, “paper”, and “cup” is difficult to handle
- ControlNet used to create trainable copy of 12 encoding blocks and 1 middle block of Stable Diffusion
- ControlNet architecture likely to be usable in other diffusion models
- Results achieved with default prompt
- Most figures in paper are high-resolution images
- Comparison to Sketch-guided diffusion and Taming Transformers
- Sudden converge phenomenon
- Comparison of six detection types and corresponding results
- Masked diffusion supported by all diffusion models
- Comparison to Sketch-guided diffusion and Taming Transformers
- Example of simple object
- Coarse-level control
- Limitation when semantic of input image is mistakenly recognized
- Appendix: all original source images for edge detection, semantic segmentation, pose extraction, etc.