Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent advancements in text-to-image diffusion have motivated the study of erasing specific concepts from model weights.
- We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher.
- We benchmark our method against previous approaches that remove sexually explicit content and demonstrate its effectiveness.
- We conduct experiments erasing five modern artists from the network and conduct a user study to assess the human perception of the removed styles.
- Our approach can remove concepts from a diffusion model permanently rather than modifying the output at the inference time.
Paper Content
Introduction
- Recent text-to-image generative models have high image quality and seemingly infinite generation capabilities
- Some concepts learned by the model are undesirable, including copyrighted content and pornography
- We propose an approach for selectively removing a single concept from a text-conditional model’s weights after pretraining
- Our method does not require retraining, which is prohibitive for large models
- Our approach directly removes the concept from the model’s parameters, making it safe to distribute its weights
- We conduct a user study to test the impact of erasure on user perception of the removed artist’s style in output images
- We also test our method on erasure of complete object classes
Related works
- Previous work to avoid undesirable image output in generative models has taken two main approaches: dataset removal and post-hoc modification
- Image cloaking is another approach to protecting images from imitation by large models
- Model editing is a lightweight method to alter the behavior of large-scale generative models
- Memorization and unlearning aim to modify a model to behave as if particular training data had not been present
- Energy-based composition can be used to reduce undesirable output of language models and vision generators
- Our work introduces score composition as a source of unsupervised training data to teach a fine-tuned model to erase an undesired concept
Background
Denoising diffusion models
- Diffusion models are a type of generative model that learn the distribution space by gradually removing noise.
- The model predicts noise at each time step which is used to generate an intermediate denoised image.
Latent diffusion models
- Latent diffusion models (LDM) operate in a lower dimensional latent space of a pretrained variational autoencoder.
- Noise is added to the encoded latent during training.
- Classifier-free guidance is used to regulate image generation.
Method
- Goal of method is to erase concepts from text-to-image diffusion models using its own knowledge and no additional data
- Approach involves editing pre-trained diffusion U-Net model weights to remove a specific style or concept
- Draws inspiration from classifier-free guidance method and score-based composition
- Objective is to learn the score of conditional model
- Exploits model’s knowledge of concept to synthesize training samples, eliminating need for data collection
- Training uses several instances of diffusion model, with one set of parameters frozen while training the other set of parameters to erase the concept
Importance of parameter choice
- Applying the erasure objective depends on the subset of parameters that are fine-tuned.
- Cross-attention parameters serve as a gateway to the prompt and directly depend on the text of the prompt.
- Non-cross-attention parameters contribute to a visual concept even if the concept is not mentioned in the prompt.
Experiments
- Trained models for 1000 gradient update steps with batch size of 1 and learning rate 1e-5 using Adam optimizer
- ESD-x fine-tunes cross-attention, ESD-u fine-tunes unconditional weights of U-Net module
- Baseline methods: SD, SLD, SD-Neg-Prompt
Artistic style removal
- Conducted a user study to measure human perception of effectiveness of removed style
- Collected 40 images of art from each artist using Google Image Search
- Composed 40 generic text prompts to invoke artist’s style
- Evaluated images from edited diffusion models, baseline models, and similar human artist
- Dataset of 1000 images
- Participants asked to estimate confidence level of experimental image being created by same artist
- 13 total participants, average of 170 responses per participant
- Evaluated effectiveness of ESD-x method for removing style of 5 modern artists
- Assessed amount of interference introduced by ESD-x compared to other baseline methods
- Findings indicate AI-duplicates rated higher than similar genuine artwork
- ESD-x, SLD, and SD-Neg-Prompt all decrease perceived artistic style
- Users most likely to consider images generated using ESD-x to be genuine artwork
Explicit content removal
- Recent works have addressed the challenge of NSFW content restriction.
- Retraining the models on filtered data can be expensive and still capable of generating nudity.
- ESD-u is used to erase “nudity” and has a more significant effect in erasing nudity.
- Image fidelity and CLIP score are used to measure the quality and specificity of the model.
Object removal
- Investigated extent to which method can erase object classes from model
- Prepared 10 ESD-u models, each removing one class from subset of ImageNet classes
- Measured effect of removing targeted and untargeted classes
- Generated 500 images of each class using base Stable Diffusion and fine-tuned models
- Evaluated results by examining top-1 predictions of pretrained Resnet-50 Imagenet classifier
- Results show approach effectively removes targeted classes, but some classes are more difficult to remove
- Accuracy of untargeted classes remains high, but some interference
Limitations
- Our method is more effective than baseline approaches for erasing targeted visual concepts
- Erasing large concepts can create a trade-off between complete erasure and interference with other visual concepts
- Erasing entire object classes can fail, erasing only particular distinctive attributes
- Erasing entire object classes can create interference with other classes
Conclusion
- Proposes an approach for eliminating concepts from text-to-image generation models
- Does not require manipulating large datasets or expensive training
- Removes concept directly from model weights
- Efficacy demonstrated in 3 applications
- Successfully removes explicit content
- Can be used to remove artistic styles
- Human study conducted to measure perception of artistic removal effect
- Versatile, can be applied to concrete object classes
- Good image fidelity performance compared to other methods
- Can cleanly erase many object concepts from a model