Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • ControlNet is a neural network structure used to control pretrained large diffusion models.
  • ControlNet can be trained on small datasets (fewer than 50k samples), and training is as fast as fine-tuning a diffusion model.
  • ControlNet can be trained on personal devices or powerful computation clusters.
  • ControlNet can be used to enable conditional inputs like edge maps, segmentation maps, keypoints, etc.

Paper Content

Introduction

  • Large text-to-image models can generate visually appealing images with a short descriptive prompt.
  • Data scale in task-specific domains is often much smaller than in the general image-text domain.
  • Fast training methods are important for optimizing large models to specific tasks within an acceptable amount of time and memory space.
  • Various image processing problems have diverse forms of problem definitions, user controls, or image annotations.
  • ControlNet is an end-to-end neural network architecture that controls large image diffusion models.
  • ControlNet clones the weights of a large diffusion model into a “trainable copy” and a “locked copy”.
  • ControlNet can be trained on small datasets (fewer than 50k or even 1k samples) and large datasets (millions of samples).

Hypernetwork and neural network structure

  • HyperNetwork originates from a language-processing method in which a small network is trained to influence the weights of a larger network
  • HyperNetwork has been used in image generation and other machine learning tasks
  • ControlNet and HyperNetwork are similar in the way they influence neural networks
  • Early neural network studies discussed the initialization of network weights
  • Recent studies have discussed methods to scale the initial weight of convolution layers to improve training

Diffusion probabilistic model

  • Diffusion probabilistic model proposed in [52]
  • Successful results of image generation reported at small and large scale
  • Improved by training and sampling methods like DDPM, DDIM, and score-based diffusion
  • Methods that work directly on pixels often adopt strategies to save computation power when handling high-resolution images, or use pyramid-based or multiple-stage methods
  • These methods typically use U-Net as the neural network architecture
  • Latent Diffusion Model (LDM) proposed to reduce the computation power required for training a diffusion model

Text-to-image diffusion

  • Diffusion models can be used to generate images from text.
  • Pretrained language models like CLIP are used to encode text inputs into latent vectors.
  • Glide, Disco Diffusion, Stable Diffusion and Imagen are examples of text-to-image generating models.

Personalization, customization, and control of pretrained diffusion model

  • State-of-the-art image diffusion models are dominated by text-to-image methods
  • Control over a diffusion model can be achieved by manipulating CLIP features
  • Image diffusion process can provide color-level detail variations
  • Image diffusion algorithms support inpainting as a way to control results
  • Textual Inversion and DreamBooth can customize/personalize generated results using a small set of images

Image-to-image translation

  • Image-to-image translation aims to learn a mapping between images in different domains
  • ControlNet instead aims to control a diffusion model with task-specific conditions
  • Several methods have been developed for image-to-image translation, including conditional generative neural networks, autoregressive methods, multi-model methods, Taming Transformer, Palette, PITI, and optimization-based methods
  • Several of these methods are tested in the paper's experiments

Method

  • ControlNet is a neural network architecture that can enhance pretrained image diffusion models with task-specific conditions.
  • Sections 3.1-3.5 describe the structure, application, learning objective, training methods, and implementations of ControlNet.

ControlNet

  • ControlNet manipulates the input conditions of neural network blocks to control the behavior of an entire neural network.
  • A “network block” is a set of neural layers used to build neural networks.
  • A neural network block transforms an input feature map into another feature map.
  • Parameters of the neural network block are cloned into a trainable copy, while the original parameters are locked.
  • The trainable copy is trained with an external condition vector.
  • A unique type of convolution layer called “zero convolution” (a 1x1 convolution with both weight and bias initialized to zero) connects the trainable copy to the locked block.
  • The zero convolution parameters start at zero and are progressively optimized into non-zero values during training.
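
The block structure above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: each feature map is reduced to a single pixel's channel vector (so a 1x1 convolution becomes a plain matrix-vector product), and the names `ZeroConv1x1`, `toy_block`, and `controlnet_block` are hypothetical. It shows that, with zero-initialized connections, the combined output initially equals the output of the locked block.

```python
import copy

class ZeroConv1x1:
    """A 1x1 convolution at one pixel: a linear map over channels, zero-initialized."""
    def __init__(self, channels):
        self.weight = [[0.0] * channels for _ in range(channels)]
        self.bias = [0.0] * channels

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

def toy_block(params, x):
    """Stand-in for a network block: scale, shift, then ReLU per channel."""
    return [max(0.0, p["scale"] * v + p["shift"]) for p, v in zip(params, x)]

def controlnet_block(x, c, locked, trainable, zero_in, zero_out):
    # Locked copy: keeps the original behaviour and is never updated.
    y = toy_block(locked, x)
    # Trainable copy: sees the input plus the condition passed through a zero conv.
    h = [a + b for a, b in zip(x, zero_in(c))]
    control = zero_out(toy_block(trainable, h))
    # The control signal is added to the locked output.
    return [a + b for a, b in zip(y, control)]

locked = [{"scale": 1.5, "shift": -0.2}, {"scale": -0.7, "shift": 0.4}]
trainable = copy.deepcopy(locked)  # clone of the locked parameters
zero_in, zero_out = ZeroConv1x1(2), ZeroConv1x1(2)

x, c = [1.0, 2.0], [0.3, -0.8]
y_plain = toy_block(locked, x)
y_ctrl = controlnet_block(x, c, locked, trainable, zero_in, zero_out)
print(y_plain == y_ctrl)  # True at initialization
```

Once training moves the zero-convolution parameters away from zero, the added term starts to contribute, while the locked copy still preserves the behaviour of the original model.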

ControlNet in image diffusion model

  • Stable Diffusion is used as an example to introduce a method to control a large diffusion model.
  • Stable Diffusion uses a pre-processing method similar to VQ-GAN to convert 512x512 images into 64x64 latent images, so ControlNet likewise converts 512x512 image conditions into 64x64 feature maps.
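
Going from 512x512 to 64x64 corresponds to an 8x spatial downsampling. As a worked example, the standard convolution output-size formula shows that three stride-2 convolutions are enough to get there. The kernel size, stride, and padding below are illustrative assumptions, not the layer configuration from the paper, and `conv_out_size` is a hypothetical helper.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output width/height of a convolution: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 512
sizes = [size]
for _ in range(3):  # assumed: three stride-2 layers (kernel 3, padding 1)
    size = conv_out_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [512, 256, 128, 64]
```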

Training

  • Image diffusion models learn to denoise images to generate samples.
  • Denoising can happen in pixel space or a “latent” space encoded from training data.
  • Stable Diffusion uses latent images as the training domain.
  • Diffusion algorithms progressively add noise to the image and produce a noisy image.
  • A network is trained to predict the noise added to the noisy image.
  • During training, 50% of text prompts are replaced with empty strings.
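
The training step described above can be sketched as follows. This is a minimal, framework-free illustration: the noising rule is the standard DDPM form z_t = sqrt(a_t) * z_0 + sqrt(1 - a_t) * eps, which is assumed here rather than taken from this summary, and `maybe_drop_prompt`, `add_noise`, and `noise_prediction_loss` are hypothetical helper names.

```python
import math
import random

def maybe_drop_prompt(prompt, rng, drop_prob=0.5):
    """Replace the text prompt with an empty string with probability drop_prob."""
    return "" if rng.random() < drop_prob else prompt

def add_noise(z0, alpha_bar_t, rng):
    """Return (noisy latent, noise) using the assumed DDPM forward rule."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(alpha_bar_t) * z + math.sqrt(1.0 - alpha_bar_t) * e
          for z, e in zip(z0, eps)]
    return zt, eps

def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)

rng = random.Random(0)
prompts = ["a photo of a house"] * 10000
kept = sum(1 for p in prompts if maybe_drop_prompt(p, rng) != "")
print(kept / len(prompts))  # close to 0.5

zt, eps = add_noise([0.2, -1.0, 0.5], alpha_bar_t=0.7, rng=rng)
print(noise_prediction_loss(eps, eps))  # 0.0 for a perfect prediction
```

When the prompt is empty, the model has to infer semantics from the conditioning image instead, which is the motivation the paper gives for dropping prompts.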

Improved training

  • Strategies to improve the training of ControlNets are discussed, especially for the extreme cases where the computation device is either very limited or very powerful.
  • Disconnecting the links to decoder blocks 1, 2, 3, and 4 and connecting only the middle block can improve the training speed.
  • When powerful computation clusters and large datasets are available, the entire model can be trained as a whole.
  • Experiments were conducted on the DIODE, Normal Maps (extended) and Cartoon Line Drawing datasets.

Experimental settings

  • CFG-scale set to 9.0
  • Sampler used is DDIM
  • 20 steps used by default
  • 4 types of prompts tested: no prompt, default prompt, automatic prompt, user prompt
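
For context, the CFG scale controls how classifier-free guidance combines the conditional and unconditional noise predictions. The sketch below uses the commonly used formulation eps = eps_uncond + s * (eps_cond - eps_uncond); the function name and toy vectors are illustrative assumptions, and the constant simply mirrors the setting listed above.

```python
CFG_SCALE = 9.0  # guidance scale from the settings above

def cfg_combine(eps_uncond, eps_cond, scale=CFG_SCALE):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
print(cfg_combine(eps_uncond, eps_cond))             # [9.0, 1.0]
print(cfg_combine(eps_uncond, eps_cond, scale=1.0))  # [1.0, 1.0]
```

A scale of 1.0 reproduces the conditional prediction; larger values push the sample further toward the prompt.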

Qualitative results

  • Qualitative results are presented in Figures 4 and 5

Ablation study

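  • Comparison to a model trained without using ControlNet
  • A “sudden convergence phenomenon” is observed, in which the model abruptly starts following the input condition; it can occur between 5000 and 10000 training steps with a learning rate of 1e-5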
  • Canny-edge-based ControlNets trained with different dataset scales

Comparison to previous methods

  • Comparison to Stability’s Depth-to-Image model shown in Fig. 14
  • Comparison to PITI [59] shown in Fig. 17
  • Comparison to sketch-guided diffusion [58] shown in Fig. 18
  • Comparison to Taming transformer [11] shown in Fig. 19

Comparison of pre-trained models

  • Comparison of pre-trained models shown in Fig. 23, 24, 25

More applications

  • Diffusion process can be masked for pen-based image editing
  • Model can achieve accurate control of details for simple objects
  • ControlNet can be applied to only 50% of the diffusion iterations to get results that do not strictly follow the input shapes
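
One way to picture applying ControlNet to only part of the sampling process is a per-step switch on the control signal. The sketch below is an illustration under stated assumptions: this summary does not say which half of the iterations receives control, so enabling it for the earliest steps is a guess, and `control_schedule` is a hypothetical helper.

```python
def control_schedule(num_steps, fraction=0.5):
    """Return one flag per sampling step; True means the ControlNet output
    is added at that step. Assumes control is applied to the earliest steps."""
    cutoff = int(num_steps * fraction)
    return [step < cutoff for step in range(num_steps)]

flags = control_schedule(20, fraction=0.5)
print(sum(flags), len(flags))  # 10 20
```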

Limitation

  • Model may have difficulty generating correct contents when the semantics of the input image are interpreted wrongly
  • Figures 28-29 (appendix) show all original source images used for edge detection, semantic segmentation, pose extraction, etc.
  • ControlNet structure compared to the standard method used by Stable Diffusion
  • ControlNet applied to an arbitrary neural network block
  • ControlNet applied to Stable Diffusion
  • Automatic prompts are generated by BLIP
  • Semantic consistency of “wall”, “paper”, and “cup” is difficult to handle
  • ControlNet creates a trainable copy of the 12 encoding blocks and 1 middle block of Stable Diffusion
  • ControlNet architecture is likely to be usable in other diffusion models
  • Results achieved with the default prompt
  • Most figures in the paper are high-resolution images
  • Comparison to sketch-guided diffusion and Taming Transformers
  • Comparison of six detection types and corresponding results
  • Masked diffusion supported by all diffusion models
  • Example of a simple object
  • Coarse-level control