
DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation (cs.CV)

2020-12-07 19:16:54 Ling Qian

Most existing text-to-image generation methods adopt a multi-stage, modular architecture, which leads to three main problems: (1) training multiple networks increases the running time and affects the convergence and stability of the generative model; (2) these approaches ignore the quality of images produced by the early-stage generators; (3) many discriminators have to be trained. To address this, we propose the Dual Attention Generative Adversarial Network (DTGAN), which can synthesize high-quality, visually realistic images using only a single generator/discriminator pair. The model introduces channel-aware and pixel-aware attention modules that guide the generator to focus on text-relevant channels and pixels based on the global sentence vector, and fine-tunes the original feature maps with the resulting attention weights. In addition, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is proposed to help the attention modules flexibly control the amount of change in shape and texture according to the input natural-language description. Furthermore, a new type of visual loss is used to improve image quality by ensuring vivid shapes and perceptually uniform color distributions in the generated images. Experimental results on benchmark datasets demonstrate the superiority of the proposed method over state-of-the-art models that use a multi-stage framework. Visualization of the attention maps shows that the channel-aware attention module can localize discriminative regions, while the pixel-aware attention module can capture globally visual content for image generation.
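The conditioning mechanisms described above can be pictured with a short PyTorch-style sketch. This is only a minimal illustration under assumed layer sizes and wiring, not the authors' released implementation: the module names (ChannelAwareAttention, PixelAwareAttention, CAdaILN) and the specific linear/convolution projections are illustrative choices. The idea it shows is that channel attention re-weights feature channels from the global sentence vector, pixel attention re-weights spatial positions, and CAdaILN blends instance and layer normalization with gamma/beta predicted from the sentence vector.

```python
# Minimal sketch of sentence-conditioned channel/pixel attention and CAdaILN.
# Layer sizes and projection choices are assumptions for illustration only.
import torch
import torch.nn as nn


class ChannelAwareAttention(nn.Module):
    """Re-weights feature channels using the global sentence vector."""
    def __init__(self, sent_dim, channels):
        super().__init__()
        self.fc = nn.Linear(sent_dim, channels)

    def forward(self, feat, sent):           # feat: (B, C, H, W), sent: (B, D)
        w = torch.sigmoid(self.fc(sent))     # (B, C) weight per channel
        return feat * w.unsqueeze(-1).unsqueeze(-1)


class PixelAwareAttention(nn.Module):
    """Re-weights spatial positions using the sentence vector and the features."""
    def __init__(self, sent_dim, channels):
        super().__init__()
        self.proj = nn.Linear(sent_dim, channels)
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat, sent):
        s = self.proj(sent).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        w = torch.sigmoid(self.conv(feat * s))            # (B, 1, H, W) weight per pixel
        return feat * w


class CAdaILN(nn.Module):
    """Blends instance norm and layer norm with a learnable ratio, then applies
    gamma/beta predicted from the sentence vector (conditional modulation)."""
    def __init__(self, sent_dim, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))
        self.gamma = nn.Linear(sent_dim, channels)
        self.beta = nn.Linear(sent_dim, channels)

    def forward(self, x, sent):
        in_mean = x.mean(dim=[2, 3], keepdim=True)
        in_var = x.var(dim=[2, 3], keepdim=True, unbiased=False)
        ln_mean = x.mean(dim=[1, 2, 3], keepdim=True)
        ln_var = x.var(dim=[1, 2, 3], keepdim=True, unbiased=False)
        x_in = (x - in_mean) / torch.sqrt(in_var + self.eps)
        x_ln = (x - ln_mean) / torch.sqrt(ln_var + self.eps)
        out = self.rho * x_in + (1 - self.rho) * x_ln
        gamma = self.gamma(sent).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(sent).unsqueeze(-1).unsqueeze(-1)
        return out * gamma + beta


if __name__ == "__main__":
    feat, sent = torch.randn(2, 64, 16, 16), torch.randn(2, 256)
    feat = ChannelAwareAttention(256, 64)(feat, sent)
    feat = PixelAwareAttention(256, 64)(feat, sent)
    feat = CAdaILN(256, 64)(feat, sent)
    print(feat.shape)   # torch.Size([2, 64, 16, 16])
```

In this sketch the attention weights only rescale the existing feature maps, which matches the paper's description of "fine-tuning the original feature maps with attention weights" rather than replacing them.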

Original title: DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation

Original text: Most existing text-to-image generation methods adopt a multi-stage modular architecture which has three significant problems: (1) Training multiple networks can increase the run time and affect the convergence and stability of the generative model; (2) These approaches ignore the quality of early-stage generator images; (3) Many discriminators need to be trained. To this end, we propose the Dual Attention Generative Adversarial Network (DTGAN) which can synthesize high quality and visually realistic images only employing a single generator/discriminator pair. The proposed model introduces channel-aware and pixel-aware attention modules that can guide the generator to focus on text-relevant channels and pixels based on the global sentence vector and to fine-tune original feature maps using attention weights. Also, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is presented to help our attention modules flexibly control the amount of change in shape and texture by the input natural-language description. Furthermore, a new type of visual loss is utilized to enhance the image quality by ensuring the vivid shape and the perceptually uniform color distributions of generated images. Experimental results on benchmark datasets demonstrate the superiority of our proposed method compared to the state-of-the-art models with a multi-stage framework. Visualization of the attention maps shows that the channel-aware attention module is able to localize the discriminative regions, while the pixel-aware attention module can capture globally visual contents for the generation of an image.

Original authors: Zhenxing Zhang, Lambert Schomaker

Original address: https://arxiv.org/abs/2011.02709

Original statement: This article is published with the author's authorization through the Cloud+ Community and may not be reproduced without permission.

In case of infringement, please contact yunjia_community@tencent.com to request removal.

Copyright notice
This article was written by [Ling Qian]. Please include a link to the original when reproducing it. Thank you.
https://chowdera.com/2020/11/202011121825454437.html