扩散与去噪：解释文本到图像的生成式人工智能

扩散的概念

去噪扩散模型经过训练，可以从噪声中提取模式，生成理想的图像。训练过程包括向模型展示根据噪声调度算法确定的不同噪声水平的图像（或其他数据）示例，目的是预测数据中哪些部分是噪声。如果预测成功，噪声预测模型就能从纯噪声中逐步建立起逼真的图像，每一步都能从图像中减去一定量的噪声。

与图1不同，现代扩散模型不会从添加了噪声的图像中预测噪声，至少不会直接预测噪声。相反，它们预测的是图像潜在空间表示中的噪声。潜空间表示图像的压缩数字特征集，即变分自编码（或 VAE）编码模块的输出。这个技巧将“潜在”置于潜在扩散中，大大减少了生成图像的时间和计算要求。根据论文作者的报告，与直接扩散法相比，潜在扩散法的推理速度至少提高了 2.7 倍，训练速度提高了约三倍。

使用潜在扩散技术的人通常会提到使用 “扩散模型”，但事实上，扩散过程包含多个模块。如上图所示，文本到图像工作流程的扩散管道通常包括文本嵌入模型（及其标记器）、去噪预测/扩散模型和图像解码器。潜在扩散的另一个重要部分是调度器，它决定了如何在一系列 “时间步长”（一系列迭代更新，逐渐从潜在空间中去除噪声）中对噪声进行缩放和更新。

潜在扩散代码示例

我们将在大部分示例中使用 CompVis/latent-diffusion-v1-4。文本嵌入由 CLIPTextModel 和 CLIPTokenizer 处理。噪声预测使用 “U-Net”，这是一种图像到图像模型，最初作为生物医学图像（尤其是分割）中的应用模型而备受关注。为了从去噪潜在阵列生成图像，管道使用 VAE 进行图像解码，将这些阵列转化为图像。

我们将从 HuggingFace 组件开始构建这一管道版本。

# local setup
virtualenv diff_env –python=python3.8
source diff_env/bin/activate
pip install diffusers transformers huggingface-hub
pip install torch --index-url https://download.pytorch.org/whl/cu118

如果在本地工作，请务必查看 pytorch.org，确保系统版本正确。我们的导入相对简单，下面的代码片段足以满足以下所有演示的需要。

import os
import numpy as np
import torch
from diffusers import StableDiffusionPipeline, AutoPipelineForImage2Image
from diffusers.pipelines.pipeline_utils import numpy_to_pil
from transformers import CLIPTokenizer, CLIPTextModel
from diffusers import AutoencoderKL, UNet2DConditionModel, \
       PNDMScheduler, LMSDiscreteScheduler

from PIL import Image
import matplotlib.pyplot as plt

首先定义图像和扩散参数以及提示。

prompt = [" "]

# image settings
height, width = 512, 512

# diffusion settings
number_inference_steps = 64
guidance_scale = 9.0
batch_size = 1

使用您选择的 seed 初始化伪随机数生成器，以重现您的结果。

def seed_all(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)

seed_all(193)

初始化文本嵌入模型、自动编码器、U-Net 和时间步调度器。

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", \
        subfolder="vae")
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4",\
        subfolder="unet")
scheduler = PNDMScheduler()
scheduler.set_timesteps(number_inference_steps)

my_device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
vae = vae.to(my_device)
text_encoder = text_encoder.to(my_device)
unet = unet.to(my_device)

将文本提示编码为嵌入式文本需要首先对输入字符串进行标记化处理。标记化是通过字节对编码（BPE）等方式，将字符替换为与语义单元词汇相对应的整数代码。我们的管道在图像文本提示的同时嵌入了一个空提示（无文本）。这就平衡了所提供的描述和一般自然图像之间的扩散过程。我们将在本文稍后部分了解如何改变这些组件的相对权重。

prompt = prompt * batch_size
tokens = tokenizer(prompt, padding="max_length",\
max_length=tokenizer.model_max_length, truncation=True,\
        return_tensors="pt")

empty_tokens = tokenizer([""] * batch_size, padding="max_length",\
max_length=tokenizer.model_max_length, truncation=True,\
        return_tensors="pt")


with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to(my_device))[0]
    max_length = tokens.input_ids.shape[-1]
    notext_embeddings = text_encoder(empty_tokens.input_ids.to(my_device))[0]
    text_embeddings = torch.cat([notext_embeddings, text_embeddings])

将潜在空间初始化为随机正态噪声，并根据扩散时间步长调度程序对其进行缩放。

latents = torch.randn(batch_size, unet.config.in_channels, \
        height//8, width//8)
latents = (latents * scheduler.init_noise_sigma).to(my_device)

一切准备就绪，我们可以深入研究扩散循环本身。我们可以通过定期采样来跟踪图像，这样就能看到噪音是如何逐渐减少的。

images = []
display_every = number_inference_steps // 8

# diffusion loop
for step_idx, timestep in enumerate(scheduler.timesteps):
    with torch.no_grad():
        # concatenate latents, to run null/text prompt in parallel.
        model_in = torch.cat([latents] * 2)
        model_in = scheduler.scale_model_input(model_in,\
                timestep).to(my_device)
        predicted_noise = unet(model_in, timestep, \
                encoder_hidden_states=text_embeddings).sample
        # pnu - empty prompt unconditioned noise prediction
        # pnc - text prompt conditioned noise prediction
        pnu, pnc = predicted_noise.chunk(2)
        # weight noise predictions according to guidance scale
        predicted_noise = pnu + guidance_scale * (pnc - pnu)
        # update the latents
        latents = scheduler.step(predicted_noise, \
                timestep, latents).prev_sample
        # Periodically log images and print progress during diffusion
        if step_idx % display_every == 0\
                or step_idx + 1 == len(scheduler.timesteps):
           image = vae.decode(latents / 0.18215).sample[0]
           image = ((image / 2.) + 0.5).cpu().permute(1,2,0).numpy()
           image = np.clip(image, 0, 1.0)
           images.extend(numpy_to_pil(image))
           print(f"step {step_idx}/{number_inference_steps}: {timestep:.4f}")

扩散过程结束后，我们就可以得到想要生成的效果图了。接下来，我们将学习更多的控制技术。由于我们已经制作了扩散管道，因此我们可以使用 HuggingFace 提供的简化扩散管道来处理其余示例。

控制扩散管道

在本节中，我们将使用一组辅助函数：

def seed_all(seed):
    torch.manual_seed(seed)
    np.random.seed(seed)

def grid_show(images, rows=3):
    number_images = len(images)
    height, width = images[0].size
    columns = int(np.ceil(number_images / rows))
    grid = np.zeros((height*rows,width*columns,3))
    for ii, image in enumerate(images):
        grid[ii//columns*height:ii//columns*height+height, \
                ii%columns*width:ii%columns*width+width] = image
        fig, ax = plt.subplots(1,1, figsize=(3*columns, 3*rows))
        ax.imshow(grid / grid.max())
    return grid, fig, ax

def callback_stash_latents(ii, tt, latents):
    # adapted from fastai/diffusion-nbs/stable_diffusion.ipynb
    latents = 1.0 / 0.18215 * latents
    image = pipe.vae.decode(latents).sample[0]
    image = (image / 2. + 0.5).cpu().permute(1,2,0).numpy()
    image = np.clip(image, 0, 1.0)
    images.extend(pipe.numpy_to_pil(image))

my_seed = 193

我们将从扩散模型最著名、最直接的应用开始：根据文本提示生成图像，即文本到图像生成。我们将使用的模型是Hugging Face Hub。Hugging Face 通过便捷的管道应用程序接口（pipeline API）协调潜在扩散等工作流。我们希望根据我们是否拥有 GPU 来定义要计算的设备和浮点数。

if (1):
    #Run CompVis/stable-diffusion-v1-4 on GPU
    pipe_name = "CompVis/stable-diffusion-v1-4"
    my_dtype = torch.float16
    my_device = torch.device("cuda")
    my_variant = "fp16"
    pipe = StableDiffusionPipeline.from_pretrained(pipe_name,\
    safety_checker=None, variant=my_variant,\
        torch_dtype=my_dtype).to(my_device)
else:
    #Run CompVis/stable-diffusion-v1-4 on CPU
    pipe_name = "CompVis/stable-diffusion-v1-4"
    my_dtype = torch.float32
    my_device = torch.device("cpu")
    pipe = StableDiffusionPipeline.from_pretrained(pipe_name, \
            torch_dtype=my_dtype).to(my_device)

Guidance Scale

如果您使用的文本提示不寻常（与数据集中的提示非常不同），则可能会出现在潜在空间中较少访问的部分。空提示嵌入提供了一种平衡，并根据guiding_scale将两者结合起来，允许您在提示的特殊性与常见图像特征之间进行权衡。

guidance_images = []
for guidance in [0.25, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 20.0]:
    seed_all(my_seed)
    my_output = pipe(my_prompt, num_inference_steps=50, \
    num_images_per_prompt=1, guidance_scale=guidance)
    guidance_images.append(my_output.images[0])
    for ii, img in enumerate(my_output.images):
        img.save(f"prompt_{my_seed}_g{int(guidance*2)}_{ii}.jpg")

temp = grid_show(guidance_images, rows=3)
plt.savefig("prompt_guidance.jpg")
plt.show()

Negative Prompts（反向提示）

有时潜在扩散确实“想要”产生与您的意图不符的图像。在这些情况下，您可以使用反向提示来推动扩散过程远离不需要的输出。

my_prompt = " "
my_negative_prompt = " "

output_x = pipe(my_prompt, num_inference_steps=50, num_images_per_prompt=9, \
        negative_prompt=my_negative_prompt)

temp = grid_show(output_x)
plt.show()

您应该收到符合提示的输出，同时避免输出否定提示中描述的内容。

图像变化

从零开始生成文本到图像并不是扩散管道的唯一应用。事实上，扩散非常适合从初始图像开始的图像修改。我们将使用稍有不同的管道和预训练模型来调整图像到图像的扩散。

pipe_img2img = AutoPipelineForImage2Image.from_pretrained(\

        "runwayml/stable-diffusion-v1-5", safety_checker=None,\

torch_dtype=my_dtype, use_safetensors=True).to(my_device)

这种方法的一种应用是生成主题的变体。概念艺术家可能会使用这种技术，根据最新的研究成果，快速迭代出不同的想法，来说明一颗系外行星。

我们将首先下载 TRAPPIST 系统中行星 1e 的公共领域艺术家概念图（图片来源：NASA/JPL-Caltech）。然后，在缩小比例以去除细节之后，我们将使用扩散管道制作出几个不同版本的系外行星 TRAPPIST-1e。

url = \
"https://upload.wikimedia.org/wikipedia/commons/thumb/3/38/TRAPPIST-1e_artist_impression_2018.png/600px-TRAPPIST-1e_artist_impression_2018.png"
img_path = url.split("/")[-1]
if not (os.path.exists("600px-TRAPPIST-1e_artist_impression_2018.png")):
    os.system(f"wget \     '{url}'")
    init_image = Image.open(img_path)

seed_all(my_seed)

trappist_prompt = "Artist's impression of TRAPPIST-1e"\
                  "large Earth-like water-world exoplanet with oceans,"\
                  "NASA, artist concept, realistic, detailed, intricate"

my_negative_prompt = "cartoon, sketch, orbiting moon"

my_output_trappist1e = pipe_img2img(prompt=trappist_prompt, num_images_per_prompt=9, \
     image=init_image, negative_prompt=my_negative_prompt, guidance_scale=6.0)

grid_show(my_output_trappist1e.images)
plt.show()

通过向模型提供初始图像示例，我们可以生成类似的图像。您还可以使用文本引导的图像到图像流水线，通过增加引导、添加反向提示以及 “非写实”、”水彩 “或 “纸质素描 “等方式来改变图像的风格。您的里程可能会有所不同，调整提示将是找到你想要创作的正确图像的最简单方法。

结论

尽管扩散系统和模仿人类生成的艺术背后有很多论述，但扩散模型还有其他更有影响力的用途。它已被应用于蛋白质折叠预测，用于蛋白质设计和药物开发。文本到视频也是一个活跃的研究领域，有几家公司（如 Stability AI、谷歌）提供了这种技术。扩散也是文本到语音应用的一种新兴方法。

很明显，扩散过程在人工智能的进化以及技术与全球人类环境的互动中发挥着核心作用。版权法和其他知识产权法的复杂性及其对人类艺术和科学的影响在积极和消极方面都是显而易见的。但真正积极的是人工智能具有前所未有的理解语言和生成图像的能力。正是 AlexNet 让计算机分析图像并输出文本，直到现在计算机才可以分析文本提示并输出连贯的图像。