Simply put, Stable Diffusion is a deep generative model, released by Stability AI, that enables you to generate fantastical images. In this article, I aim to help you gain a deeper technical understanding of Stable Diffusion, without the flood of fantastical images that often clouds SD tutorials.
Latent Diffusion Models (LDMs)
Stable Diffusion is a type of probabilistic Latent Diffusion Model (LDM). In this context, "latent" means that it operates on the compressed output of an autoencoder - as opposed to operating in the pixel space. Operating in the latent space is critical for computational tractability and parameter efficiency: models that operate in the pixel space tend to be bogged down by the large amount of minute, imperceptible detail within individual pixels. The key breakthrough of latent diffusion models is that they strike the right balance between complexity reduction and spatial downsampling, while largely retaining visual fidelity. Their main advantages are:
- Flexibility - the model can be conditioned in a number of ways; inputs can be other images or text prompts.
- Computational tractability - by operating in the lower-dimensional latent space, the model performs far fewer computations than it would in the pixel space.
- Parameter efficiency - the compressed latent representation means fewer parameters are needed to model it.
- Visual fidelity - none of this would matter if the model's results were sub-par. LDMs generate images on par with state-of-the-art GANs.
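The computational-tractability point can be made concrete with a quick element count. The figures below use the downsampling factor and latent channel count of Stable Diffusion v1 (a 512x512 RGB image mapped to a 64x64 latent with 4 channels); treat them as illustrative:

```python
# Rough element-count comparison: pixel space vs. latent space.
# Sizes follow Stable Diffusion v1 (8x spatial downsampling, 4 latent
# channels) and are meant as an illustration, not a benchmark.

def num_elements(height, width, channels):
    """Number of scalar values the diffusion model must process."""
    return height * width * channels

pixel = num_elements(512, 512, 3)   # a 512x512 RGB image
latent = num_elements(64, 64, 4)    # its latent after the VAE encoder

print(pixel, latent, pixel / latent)  # the latent is 48x smaller
```

Every denoising step touches each of these values many times, so a 48x reduction in input size compounds into a large saving over the full diffusion chain.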
What is Diffusion?
The central concept of "diffusion" originates from non-equilibrium statistical physics. The idea is to systematically and slowly destroy the structure in a data distribution through an iterative forward diffusion process. The model then learns a reverse diffusion process that restores the structure of the data, resulting in a flexible and tractable generative model of the underlying distribution. The key problem diffusion aims to solve is modeling complex datasets - like images - using highly flexible families of probability distributions in which learning, sampling, and inference remain tractable.
By "destroying" the structure, we simply mean introducing random Gaussian noise into a data element, followed by attempting to restore, or denoise, that element. The reconstruction is learned as a sequence of denoising autoencoders.
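The forward "destroying" process can be sketched in a few lines: repeatedly blend the data with fresh Gaussian noise until the original structure is gone. The schedule value below is illustrative, not the one Stable Diffusion actually uses:

```python
import numpy as np

# Minimal sketch of forward diffusion: each step mixes the current
# state with Gaussian noise. The variance schedule (a constant beta)
# is a simplification for illustration.

rng = np.random.default_rng(0)

def forward_diffuse(x0, T=1000, beta=0.01):
    """Run T noising steps: x_t = sqrt(1-beta)*x_{t-1} + sqrt(beta)*eps."""
    x = x0.copy()
    for _ in range(T):
        eps = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps
    return x

x0 = np.ones((8, 8))       # a toy "image" with obvious structure
xT = forward_diffuse(x0)   # after many steps, close to pure noise
print(xT.mean(), xT.std()) # mean near 0, std near 1
```

After enough steps the result is statistically indistinguishable from a standard Gaussian, which is exactly the "structure destroyed" endpoint the reverse process learns to undo.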
Image diffusion algorithms take an initial image z0 and apply a step-by-step process of adding noise, resulting in a progressively noisier image z_t. The number of times noise is added, denoted by t, determines the level of noise in the image; as t becomes sufficiently large, the image approaches a state of pure noise. Given a set of conditions - including the time step t, a text prompt c_t, and task-specific conditions c_f - image diffusion algorithms can train a network ε_θ to predict the amount of noise added to a given noisy image z_t.
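The training objective implied above can be sketched directly: noise a clean latent in closed form, then score a noise prediction against the true noise with mean-squared error. The alpha_bar value and the zero-prediction "model" below are placeholders for illustration, not Stable Diffusion's actual schedule or network:

```python
import numpy as np

# Sketch of the noise-prediction objective: jump to step t in closed
# form, then measure how well a prediction recovers the injected noise.

rng = np.random.default_rng(1)

def noised_latent(z0, alpha_bar, eps):
    # closed-form noising: z_t = sqrt(a)*z0 + sqrt(1-a)*eps
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

z0 = rng.standard_normal((4, 4))   # a toy clean latent
eps = rng.standard_normal((4, 4))  # the noise actually injected
zt = noised_latent(z0, alpha_bar=0.5, eps=eps)

# A perfect predictor would recover eps exactly and score zero;
# training drives a network eps_theta(zt, t, conditions) toward that.
assert mse(eps, eps) == 0.0
print(mse(np.zeros_like(eps), eps))  # baseline loss of predicting zeros
```

In the real model, the prediction comes from the conditioned U-Net rather than a constant, but the loss being minimized has this shape.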
Components of a Stable Diffusion Model
Stable Diffusion models comprise a number of sub-models/modules. Depending on the input, the first stage is a text understander - in Stable Diffusion this is CLIP's text encoder, while earlier latent diffusion work used a BERT-style transformer:
- TEXT PROMPT → Language Model → Image Generator → RESULT
The text understanding model - usually a large language model - transforms the text into a numerical representation called vector embeddings.
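The text-to-embedding step can be illustrated with a toy lookup table. Real Stable Diffusion uses CLIP's tokenizer and text transformer; the vocabulary, embedding dimension, and random matrix below are all made up for demonstration:

```python
import numpy as np

# Toy illustration of "text -> vector embeddings": map each token to a
# learned vector. The vocabulary and embedding matrix are invented
# stand-ins for CLIP's tokenizer and text encoder.

vocab = {"a": 0, "photo": 1, "of": 2, "an": 3, "astronaut": 4}
rng = np.random.default_rng(42)
embedding_matrix = rng.standard_normal((len(vocab), 8))  # 8-dim toy vectors

def embed(prompt):
    """Look up one embedding row per token in the prompt."""
    token_ids = [vocab[w] for w in prompt.lower().split()]
    return embedding_matrix[token_ids]

vectors = embed("a photo of an astronaut")
print(vectors.shape)  # (5, 8): one vector per token
```

The image generator never sees raw text; it is conditioned on this per-token matrix of vectors (in CLIP's case, context-aware vectors produced by a transformer rather than a plain lookup).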
Stable Diffusion Architecture
The model architecture follows that of a U-Net, consisting of an encoder, a middle block, and a skip-connected decoder. The encoder and decoder each contain 12 blocks, and the complete model comprises 25 blocks including the middle block. Among these blocks, 8 are down-sampling or up-sampling convolution layers, while 17 are main blocks that each contain four ResNet layers and two Vision Transformers (ViTs). The ViTs in these main blocks incorporate multiple cross-attention and self-attention mechanisms. OpenAI's CLIP ViT-L/14 text encoder is used to encode the text prompts, while positional encoding is employed to encode the diffusion time steps.
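The skip-connection pattern that defines a U-Net can be shown structurally with stand-in operations. This sketch replaces SD's convolution/ResNet/ViT blocks with trivial downsampling and upsampling, and adds skip features where the real model concatenates them - it illustrates only the wiring, not the actual blocks:

```python
import numpy as np

# Structural sketch of the U-Net wiring: the encoder halves spatial
# resolution at each level, the decoder doubles it back and merges in
# the matching encoder feature map via a skip connection.

def downsample(x):   # stand-in for a stride-2 convolution block
    return x[::2, ::2]

def upsample(x):     # stand-in for an up-convolution block
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def tiny_unet(x, depth=2):
    skips = []
    for _ in range(depth):              # encoder path
        skips.append(x)
        x = downsample(x)
    # (the middle block would sit here)
    for _ in range(depth):              # decoder path
        x = upsample(x) + skips.pop()   # skip connection (real SD concatenates)
    return x

out = tiny_unet(np.ones((16, 16)))
print(out.shape)  # the input resolution is restored: (16, 16)
```

The skip connections are what let the decoder recover fine spatial detail that the downsampling path would otherwise discard.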
Stable Diffusion operates in the latent space - rather than the pixel space - converting images into smaller, compressed feature maps or "latent images" via an encoder. Image diffusion models learn to progressively denoise these representations to generate samples; in the case of SD, diffusion takes place in latent space, greatly speeding up the training process.