Paper Reading: Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors


Skirrey

2 years ago


Confliction

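The paper opens by listing the main shortcomings of current text-to-image generation:

Limited controllability
No account of human perception
Low resolution

The authors therefore redefine the task by adding a segmentation map: the model takes text and a segmentation map as input and outputs an image.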

Model Design

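This is a Transformer-based method, and once the segmentation map is included the network design is fairly trivial. The interesting idea is the authors' observation that however strong the Transformer is, the final image is still produced by a VQ-VAE, so the VQ-VAE is the current bottleneck for image quality. Besides handling the new segmentation-map input, they therefore add losses to the final VQ-VAE that specifically target faces and objects. The VQ-VAE for segmentation maps is called VQ-SEG, and the VQ-VAE for images is called VQ-IMG.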

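VQ-SEG encodes and reconstructs the segmentation map. Its number of input and output channels = number of panoptic segmentation classes + number of human segmentation classes + number of face segmentation classes + 1, where the extra 1 is an edge map that separates different classes and instances.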
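The channel count above can be illustrated with a small sketch. The class counts below are hypothetical placeholders, not values from the paper, and `edge_map` is a simplified stand-in for however the edge channel is actually produced: it only marks pixels whose right or bottom neighbour carries a different label.

```python
# Hypothetical class counts, chosen only for illustration.
NUM_PANOPTIC = 4
NUM_HUMAN = 3
NUM_FACE = 2


def scene_channels(n_panoptic, n_human, n_face):
    """Channels = panoptic + human + face classes, plus one edge channel."""
    return n_panoptic + n_human + n_face + 1


def edge_map(labels):
    """Mark a pixel as edge (1) if its right or bottom neighbour has a different label."""
    h, w = len(labels), len(labels[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            right = x + 1 < w and labels[y][x + 1] != labels[y][x]
            below = y + 1 < h and labels[y + 1][x] != labels[y][x]
            edges[y][x] = 1 if (right or below) else 0
    return edges


# A tiny 3x3 label map with three segments (ids 0, 1, 2).
labels = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 2],
]
channels = scene_channels(NUM_PANOPTIC, NUM_HUMAN, NUM_FACE)
edges = edge_map(labels)
```

With these placeholder counts the scene tensor would have 10 channels, the last of which holds the edge map.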
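Using the segmentation information produced by VQ-SEG, VQ-IMG is trained with the following face loss, where $c_{f}^{k}$ denotes a face crop taken from the dataset image, $\hat{c}_{f}^{k}$ the corresponding crop of the reconstruction, and $\mathrm{FE}^{l}$ the $l$-th layer of a face-embedding network.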
$$\mathcal{L}_{\text {Face }}=\sum_{k} \sum_{l} \alpha_{f}^{l}\left\|\mathrm{FE}^{l}\left(\hat{c}_{f}^{k}\right)-\mathrm{FE}^{l}\left(c_{f}^{k}\right)\right\|$$
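When training VQ-SEG, face regions tend to come out blurred, so they are reinforced with an extra supervision signal, a weighted binary cross-entropy: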
$$\mathcal{L}_{\mathrm{WBCE}}=\alpha_{\text {cat }} \operatorname{BCE}(s, \hat{s})$$
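As a rough sketch of the weighted BCE above: each segmentation channel gets its own weight, so that face-related channels can count more. The per-channel weighting, the mean reduction, and the function name `weighted_bce` are assumptions made for illustration; the post only gives the formula.

```python
import math


def weighted_bce(target, pred, weights, eps=1e-7):
    """Mean of weights[c] * BCE(target[c][i], pred[c][i]) over all channels c and pixels i.

    target, pred: lists of channels, each a flat list of pixel values in [0, 1].
    weights: one weight per channel (the alpha_cat of the formula).
    """
    total = 0.0
    count = 0
    for t_ch, p_ch, w in zip(target, pred, weights):
        for t, p in zip(t_ch, p_ch):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += w * -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


# Two channels of two pixels each; the second channel (say, a face part) is weighted 3x.
target = [[1.0, 0.0], [1.0, 0.0]]
pred = [[0.5, 0.5], [0.5, 0.5]]
loss = weighted_bce(target, pred, weights=[1.0, 3.0])
```

Raising the weight of a channel makes errors in that channel contribute proportionally more to the loss, which is the intended effect for face parts.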
VQ-IMG also receives a supervision signal on cropped objects:
\begin{equation}
\mathcal{L}_{\mathrm{Obj}}=\sum_{k} \sum_{l} \alpha_{o}^{l}\left\|\operatorname{VGG}^{l}\left(\hat{c}_{o}^{k}\right)-\operatorname{VGG}^{l}\left(c_{o}^{k}\right)\right\|
\end{equation}
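The face loss and the object loss share the same shape: sum over crops $k$ and layers $l$ of a weighted feature distance. A minimal sketch follows, assuming an L2 norm (the post does not say which norm is used) and using `toy_features` as a hypothetical stand-in for the pretrained network (FE or VGG).

```python
import math


def toy_features(crop):
    """Hypothetical stand-in for a pretrained network: one feature vector per 'layer'."""
    layer1 = list(crop)               # identity features
    layer2 = [sum(crop) / len(crop)]  # global average
    return [layer1, layer2]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def feature_matching_loss(recon_crops, real_crops, layer_weights, extract=toy_features):
    """Sum over crops k and layers l of alpha^l * ||F^l(c_hat^k) - F^l(c^k)||."""
    loss = 0.0
    for c_hat, c in zip(recon_crops, real_crops):  # sum over crops k
        feats_hat, feats = extract(c_hat), extract(c)
        for alpha, f_hat, f in zip(layer_weights, feats_hat, feats):  # sum over layers l
            loss += alpha * l2(f_hat, f)
    return loss


real = [[1.0, 2.0]]
recon = [[4.0, 6.0]]
loss = feature_matching_loss(recon, real, layer_weights=[1.0, 2.0])
```

Swapping `extract` for a face-embedding network gives the face variant, and swapping it for VGG activations on object crops gives the object variant.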
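The tokens are then fed into the Transformer, which learns the joint distribution of the three modalities. This step uses classifier-free guidance: during training, text tokens are randomly dropped, and at inference, when predicting the next segmentation-map token or image token, the logits are computed from both a conditional and an unconditional pass of the model and combined by the formula below, where T denotes the Transformer.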
\begin{equation}
\begin{gathered}
\operatorname{logits}_{\text {cond }}=T\left(t_{y}, t_{z} \mid t_{x}\right) \\
\text { logits }_{\text {uncond }}=T\left(t_{y}, t_{z} \mid \emptyset\right) \\
\text { logits }_{c f}=\text { logits }_{\text {uncond }}+\alpha_{c} \cdot\left(\text { logits }_{\text {cond }}-\text { logits }_{\text {uncond }}\right)
\end{gathered}
\end{equation}
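The guidance formula can be sketched in a few lines. This is only an illustration under stated assumptions: `PAD_ID`, `maybe_drop_text`, and the choice to blank out the whole text prompt with probability `p_drop` are hypothetical, and the logits are plain lists rather than real Transformer outputs.

```python
import random

PAD_ID = 0  # hypothetical id of the padding / "empty text" token


def maybe_drop_text(text_tokens, p_drop, rng=random):
    """Training time: with probability p_drop, blank out the text condition."""
    if rng.random() < p_drop:
        return [PAD_ID] * len(text_tokens)
    return list(text_tokens)


def cf_guided_logits(logits_cond, logits_uncond, alpha_c):
    """logits_cf = logits_uncond + alpha_c * (logits_cond - logits_uncond)."""
    return [u + alpha_c * (c - u) for c, u in zip(logits_cond, logits_uncond)]


def argmax(values):
    """Index of the largest value (greedy choice of the next token)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy logits over a 3-token vocabulary for the next scene or image token.
cond = [2.0, 0.0, 1.0]    # computed with the text tokens
uncond = [1.0, 1.0, 1.0]  # computed with the text replaced by padding
guided = cf_guided_logits(cond, uncond, alpha_c=3.0)
next_token = argmax(guided)
```

With `alpha_c = 1` the result reduces to the conditional logits, and with `alpha_c = 0` to the unconditional ones; larger values push the prediction further in the direction the text favours.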

Experiments

FID comparison on a 30k-image subset of MS-COCO:

\begin{equation}
\begin{array}{l|cc|ccc}
\hline \text{Model} & \text{FID} \downarrow & \begin{array}{c}
\text{FID} \downarrow \\
\text{(filt.)}
\end{array} & \begin{array}{c}
\text{Image} \\
\text{quality}
\end{array} & \begin{array}{c}
\text{Photo-} \\
\text{realism}
\end{array} & \begin{array}{c}
\text{Text} \\
\text{alignment}
\end{array} \\
\hline \text{AttnGAN} & 35.49 & - & - & - & - \\
\text{DM-GAN} & 32.64 & - & - & - & - \\
\text{DF-GAN} & 21.42 & - & - & - & - \\
\text{DM-GAN+CL} & 20.79 & - & - & - & - \\
\text{XMC-GAN} & 9.33 & - & - & - & - \\
\text{DALL-E} & - & 34.60 & 81.8 \% & 81.0 \% & 65.9 \% \\
\text{CogView 256} & - & 32.20 & 92.2 \% & 94.2 \% & 92.2 \% \\
\text{CogView 512} & - & 36.53 & 91.1 \% & 88.2 \% & 87.8 \% \\
\text{LAFITE} & 8.12 & 26.94 & - & - & - \\
\text{GLIDE} & - & 12.24 & - & - & - \\
\text{Ours 256} & \mathbf{7.55} & \mathbf{11.84} & & & \\
\hline \text{Ground-truth} & 2.47 & - & - & - & - \\
\hline
\end{array}
\end{equation}

And the ablation study:

\begin{equation}
\begin{array}{lc|ccc}
\hline \text{Model} & \text{FID} \downarrow & \begin{array}{c}
\text{Image} \\
\text{quality}
\end{array} & \begin{array}{c}
\text{Photo-} \\
\text{realism}
\end{array} & \begin{array}{c}
\text{Text} \\
\text{alignment}
\end{array} \\
\hline \text{Base} & 18.01 & - & - & - \\
\text{+ Scene tokens} & 19.16 & 57.3 \% & 65.3 \% & 58.3 \% \\
\text{+ Face-aware} & 14.45 & 63.6 \% & 59.8 \% & 57.4 \% \\
\text{+ CF} & \mathbf{7.55} & 76.8 \% & 66.8 \% & 66.8 \% \\
\text{+ Obj-aware} & 8.70 & 62.0 \% & 53.5 \% & 52.2 \% \\
\hline \text{+ CF with scene input} & 4.69 & - & - & - \\
\hline
\end{array}
\end{equation}

