loader
Generating audio...
Extracting PDF content...

arxiv

Paper 2408.07009

Imagen 3

Authors: Imagen-Team-Google, :, Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bunner, Lluis Castrejon, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-Rosen, Hongliang Fei, Nando de Freitas, Yilin Gao, Evgeny Gladchenko, Sergio Gómez Colmenarejo, Mandy Guo, Alex Haig, Will Hawkins, Hexiang Hu, Huilian Huang, Tobenna Peter Igwe, Christos Kaplanis, Siavash Khodadadeh, Yelin Kim, Ksenia Konyushkova, Karol Langner, Eric Lau, Rory Lawton, Shixin Luo, Soňa Mokrá, Henna Nandwani, Yasumasa Onoe, Aäron van den Oord, Zarana Parekh, Jordi Pont-Tuset, Hang Qi, Rui Qian, Deepak Ramachandran, Poorva Rane, Abdullah Rashwan, Ali Razavi, Robert Riachi, Hansa Srinivasan, Srivatsan Srinivasan, Robin Strudel, Benigno Uria, Oliver Wang, Su Wang, Austin Waters, Chris Wolff, Auriel Wright, Zhisheng Xiao, Hao Xiong, Keyang Xu, Marc van Zee, Junlin Zhang, Katie Zhang, Wenlei Zhou, Konrad Zolna, Ola Aboubakar, Canfer Akbulut, Oscar Akerlund, Isabela Albuquerque, Nina Anderson, Marco Andreetto, Lora Aroyo, Ben Bariach, David Barker, Sherry Ben, Dana Berman, Courtney Biles, Irina Blok, Pankil Botadra, Jenny Brennan, Karla Brown, John Buckley, Rudy Bunel, Elie Bursztein, Christina Butterfield, Ben Caine, Viral Carpenter, Norman Casagrande, Ming-Wei Chang, Solomon Chang, Shamik Chaudhuri, Tony Chen, John Choi, Dmitry Churbanau, Nathan Clement, Matan Cohen, Forrester Cole, Mikhail Dektiarev, Vincent Du, Praneet Dutta, Tom Eccles, Ndidi Elue, Ashley Feden, Shlomi Fruchter, Frankie Garcia, Roopal Garg, Weina Ge, Ahmed Ghazy, Bryant Gipson, Andrew Goodman, Dawid Górny, Sven Gowal, Khyatti Gupta, Yoni Halpern, Yena Han, Susan Hao, Jamie Hayes, Jonathan Heek, Amir Hertz, Ed Hirst, Emiel Hoogeboom, Tingbo Hou, Heidi Howard, Mohamed Ibrahim, Dirichi Ike-Njoku, Joana Iljazi, Vlad Ionescu, William Isaac, Reena Jana, Gemma Jennings, Donovon Jenson, Xuhui Jia, Kerry Jones, Xiaoen Ju, Ivana Kajic, Christos Kaplanis, Burcu Karagol Ayan, Jacob Kelly, Suraj Kothawade, Christina Kouridi, Ira Ktena, Jolanda Kumakaw, Dana Kurniawan, Dmitry Lagun, Lily Lavitas, Jason Lee, Tao Li, Marco Liang, Maggie Li-Calis, Yuchi Liu, Javier Lopez Alberca, Matthieu Kim Lorrain, Peggy Lu, Kristian Lum, Yukun Ma, Chase Malik, John Mellor, Thomas Mensink, Inbar Mosseri, Tom Murray, Aida Nematzadeh, Paul Nicholas, Signe Nørly, João Gabriel Oliveira, Guillermo Ortiz-Jimenez, Michela Paganini, Tom Le Paine, Roni Paiss, Alicia Parrish, Anne Peckham, Vikas Peswani, Igor Petrovski, Tobias Pfaff, Alex Pirozhenko, Ryan Poplin, Utsav Prabhu, Yuan Qi, Matthew Rahtz, Cyrus Rashtchian, Charvi Rastogi, Amit Raul, Ali Razavi, Sylvestre-Alvise Rebuffi, Susanna Ricco, Felix Riedel, Dirk Robinson, Pankaj Rohatgi, Bill Rosgen, Sarah Rumbley, Moonkyung Ryu, Anthony Salgado, Tim Salimans, Sahil Singla, Florian Schroff, Candice Schumann, Tanmay Shah, Eleni Shaw, Gregory Shaw, Brendan Shillingford, Kaushik Shivakumar, Dennis Shtatnov, Zach Singer, Evgeny Sluzhaev, Valerii Sokolov, Thibault Sottiaux, Florian Stimberg, Brad Stone, David Stutz, Yu-Chuan Su, Eric Tabellion, Shuai Tang, David Tao, Kurt Thomas, Gregory Thornton, Andeep Toor, Cristian Udrescu, Aayush Upadhyay, Cristina Vasconcelos, Alex Vasiloff, Andrey Voynov, Amanda Walker, Luyu Wang, Miaosen Wang, Simon Wang, Stanley Wang, Qifei Wang, Yuxiao Wang, Ágoston Weisz, Olivia Wiles, Chenxia Wu, Xingyu Federico Xu, Andrew Xue, Jianbo Yang, Luo Yu, Mete Yurtoglu, Ali Zand, Han Zhang, Jiageng Zhang, Catherine Zhao, Adilet Zhaxybay, Miao Zhou, Shengqi Zhu, Zhenkai Zhu, Dawn Bloxwich, Mahyar Bordbar, Luis C. Cobo, Eli Collins, Shengyang Dai, Tulsee Doshi, Anca Dragan, Douglas Eck, Demis Hassabis, Sissie Hsiao, Tom Hume, Koray Kavukcuoglu, Helen King, Jack Krawczyk, Yeqing Li, Kathy Meier-Hellstern, Andras Orban, Yury Pinsky, Amar Subramanya, Oriol Vinyals, Ting Yu, Yori Zwols

Published: 2024-08-13

Abstract:

We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models.

Paper Content: on Alphaxiv
Page 1: Figure 1|Imagen 3 is our best diffusion model for text-to-image generation, capable of following descriptive prompts, such as “ Photo of a felt puppet diorama scene of a tranquil nature scene of a secluded forest clearing with a large friendly, rounded robot is rendered in a risograph style. An owl sits on the robots shoulders and a fox at its feet. Soft washes of color, 5 color, and a light-filled palette create a sense of peace and serenity, inviting contemplation and the appreciation of natural beauty. ” Imagen 3 Imagen 3 Team, Google1 We introduce Imagen 3, a latent diffusion model that generates high quality images from text prompts. We describe our quality and responsibility evaluations. Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation. In addition, we discuss issues around safety and representation, as well as methods we used to minimize the potential harm of our models. 1. Introduction Text-to-image (T2I) models drive a number of use cases, for example in image generation and editing, as well as scene understanding. In this tech report, we outline the training and evaluation of the latest model in Google’s Imagen family, Imagen 3. At its default configuration, Imagen 3 generates images at 1024×1024resolution, and can be followed by 2×,4×, or8×upsampling. We describe our evaluations and analysis against other state-of-the-art T2I models. We find Imagen 3 is preferred over othermodels. Inparticular, itperformswellatphotorealism, andinadheringtolongandcomplexuser prompts. Deploying T2I models introduces many new challenges, we describe in detail experiments focused on understanding the safety and responsibility risks associated with this model family, along with our efforts to reduce potential harms. 1See Contributions section for full author list. Please send correspondence to imagen-report@google.com. ©2024 Google DeepMind. All rights reservedarXiv:2408.07009v3 [cs.CV] 21 Dec 2024 Page 2: Imagen 3 2. Data Our model is trained on a large dataset comprising images, text and associated annotations. To ensure quality and safety standards, we employ a multi-stage filtering process. This process begins by removing unsafe, violent, or low-quality images. We then eliminate AI-generated images to prevent the model from learning artifacts or biases commonly found in such images. Additionally, we use deduplication pipelines and down-weight similar images to minimize the risk of outputs overfitting particular elements of training data. Each image in our dataset is paired with both original (sourced from alt text, human descriptions, etc.), and synthetic captions (Betker et al., 2023). Synthetic captions are generated using Gemini models with a variety of prompts. We leverage multiple Gemini models and instructions to maximize the linguistic diversity and quality of these synthetic captions (Garg et al., 2024). We apply filters to remove unsafe captions and personally identifiable information. 3. Evaluation We compare our highest quality configuration – the Imagen 3 model – against Imagen 2 and the fol- lowing external models: DALL ·E 3 (Betker et al., 2023), Midjourney v6, Stable Diffusion 3 Large (SD3, Esser et al., 2024), and Stable Diffusion XL 1.0 (SDXL 1, Podell et al., 2023). Through extensive human (Sec. 3.1) and automatic (Sec. 3.2) evaluations we find that Imagen 3 sets a new state of the art in text-to-image generation. We discuss the overall results and limitations in Section 3.3 and Section 3.4 includes qualitative results. We note that products that may incorporate Imagen 3 may exhibit differing performance to the tested configuration. Please refer to Appendix D for updated human evaluation results as of December 2024. 3.1. Human Evaluation We run human evaluations on five different quality aspects of a text-to-image generation model: overall preference (Sec. 3.1.1), prompt–image alignment (Sec. 3.1.2), visual appeal (Sec. 3.1.3), detailed prompt–image alignment (Sec. 3.1.4), and numerical reasoning (Sec. 3.1.5). Each of these aspects are evaluated independently in order to avoid conflation in raters’ judgments. For the first four aspects, quantitative judgment (e.g. assigning a score between 1 and 5) is in practice difficult to calibrate across raters. We therefore use side-by-side comparisons; this is also becoming a standard practice in chatbot (Chiang et al., 2024) and other text-to-image (Betker et al., 2023) evaluations. evaluations. The fifth aspect – numerical reasoning – can directly and reliably be evaluated by humans by counting how many objects of a given type are depicted in an image, so we follow this single-model evaluation approach. Each side-by-side comparison (i.e. for the first four aspects and their corresponding prompt sets) is aggregated into an Elo score (Betker et al., 2023; Nichol et al., 2021) for all six models to get a calibrated comparison between them. Intuitively, each pairwise comparison represents a match played between two models, with the Elo score representing a model’s overall score in the competition among all models. We generate the complete Elo scoreboard on each aspect and prompt set through exhaustive comparison of every pair of models. Each study (a pairing between two models on a given question and given prompt set) consists of 2500ratings (we found this number to be a good trade-off between cost and reliability) which are uniformly distributed among the prompts in the prompt set. The models are anonymized in the rater interface and the sides are randomly shuffled for every rating. 2 Page 3: Imagen 3 We use an external platform to randomly select raters from an extensive and varied pool. Data col- lection is undertaken in accordance with Google DeepMind’s best practices on data enrichment (Deep- Mind, 2022), based on the Partnership on AI’s Responsible Sourcing of Data Enrichment Services (PAI, 2021). This includes ensuring all data enrichment workers are paid at least a local living wage. We run human evaluations on 5different prompt sets in total. We evaluate the first three quality aspects (overall preference, prompt-image alignment, and visual appeal) on three different prompt sets. First, we use the recently-released GenAI-Bench (Lin et al., 2024), a set of 1600high-quality prompts collected from professional designers. To align with previous work, we also evaluate on the 200 prompts of DrawBench (Saharia et al., 2022) and the 170prompts of DALL ·E 3 Eval (Betker et al., 2023). For detailed prompt-image alignment, we use 1000images and their corresponding captions fromDOCCI(Onoeetal.,2024)(DOCCI-Test-Pivots). Finally, weusetheGeckoNumbenchmark(Kajić et al., 2024) to evaluate numerical reasoning capabilities. All the external models are run via their public access offerings, except for DALL ·E 3 on DALL ·E 3 Eval and DrawBench, for which we use the images released by its authors. In total, we collected 366,569ratings in 5943submissions from 3225different raters. Each rater participated in at most 10% of our studies, and in each study, each rater provided approximately 2% of the ratings, to avoid biasing the results to a particular set of raters’ judgments. Raters from 71 different nationalities participated in our studies, with the United Kingdom, United States, South Africa, and Poland being the most represented. 3.1.1. Overall Preference Overall preference measures the degree of satisfaction of the user with respect to the generated image given the input prompt. It is by design an open question that leaves to the rater the decision of which quality aspects are the most important in every prompt, as is the case in a realistic usage of the model. We showed two images to raters, side by side together with the prompt and asked: Imagine you are using a computer tool that produces an image given the prompt above. Choose which image you would prefer to see if you were using this tool. If both images are equally appealing, select “I am indifferent” . Figure 2 shows the results on GenAI-Bench, DrawBench, and DALL ·E 3 Eval. On GenAI-Bench, Imagen 3 is significantly more preferred over other models. On DrawBench, Imagen 3 leads with a smaller margin with respect to Stable Diffusion 3 and on DALL ·E 3 Eval we observe close results for the four leading models, with Imagen 3 having a slight edge. 3.1.2. Prompt–Image Alignment Prompt-–image alignment evaluates how well the input prompt is represented in the output image content, irrespective of potential flaws in the image or its aesthetic appeal. We showed the raters two images side by side together with the prompt and asked them: Considering the text above, which image better captures the intent of the prompt? Please try to ignore potential defects or bad quality of the images. Unless mentioned in the prompt, also disregard the different styles. Figure 3 shows the results on GenAI-Bench, DrawBench, and DALL ·E 3 Eval. Imagen 3 leads with a significant margin on GenAI-Bench, it has smaller margin on DrawBench, and on DALL ·E 3 Eval the three leading models perform similarly with overlapping confidence intervals. 3.1.3. Visual Appeal Visual appeal quantifies how appealing the generated images are, irrespective of the content that was requested. To measure it, we show two images side by side to the raters, without the prompt that 3 Page 4: Imagen 3 SDXL 1 Imagen 2 MJ v6 DALL ·E 3 SD 3 Imagen 38509009501,0001,0501,100 8601,027 1,0281,047 9411,098Elo score (and 99% CI)Overall preference on GenAI-Bench 42.2 40.0 42.0 30.1 22.357.8 46.8 46.3 37.4 27.260.0 53.2 49.9 36.7 31.958.0 53.7 50.1 40.9 30.569.9 62.6 63.3 59.1 41.877.7 72.8 68.1 69.5 58.2Imagen 3 Imagen 3SD 3 SD 3DALL·E 3 DALL ·E 3MJ v6 MJ v6Imagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1 Imagen 2 DALL ·E 3 MJ v6 SD 3 Imagen 39009209409609801,0001,0201,0401,0601,080 9239991,0271,053 9291,068Elo score (and 99% CI)Overall preference on DrawBench 49.1 43.8 40.0 32.0 30.450.9 49.5 41.4 36.7 30.256.2 50.5 46.9 37.1 36.860.0 58.6 53.1 39.5 41.168.0 63.3 62.9 60.5 49.769.6 69.8 63.2 58.9 50.3Imagen 3 Imagen 3SD 3 SD 3MJ v6 MJ v6DALL·E 3 DALL ·E 3Imagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1 Imagen 2 DALL ·E 3 SD 3 MJ v6 Imagen 38509009501,0001,0501,100 8571,0581,0621,068 8751,079Elo score (and 99% CI)Overall preference on DALL ·E 3 Eval 49.5 48.0 46.3 29.0 22.350.5 48.4 48.6 28.6 32.752.0 51.6 47.5 28.4 26.853.7 51.4 52.5 25.2 31.871.0 71.4 71.6 74.8 46.977.7 67.3 73.2 68.2 53.1Imagen 3 Imagen 3MJ v6 MJ v6SD 3 SD 3DALL·E 3 DALL ·E 3Imagen 2 Imagen 2SDXL 1 SDXL 1 Figure 2|Overall preference: Elo scores and win-rate percentages on GenAI-Bench, DrawBench, and DALL ·E 3 Eval. Please refer to Appendix D for updated human evaluation results as of December 2024. 4 Page 5: Imagen 3 SDXL 1 Imagen 2 MJ v6 DALL ·E 3 SD 3 Imagen 38509009501,0001,0501,100 8731,0191,0281,047 9501,083Elo score (and 99% CI)Prompt-image alignment on GenAI-Bench 44.9 40.6 40.9 34.6 23.955.1 49.0 45.1 36.4 27.559.4 51.0 48.8 40.2 29.859.1 54.9 51.2 38.3 32.065.4 63.6 59.8 61.7 41.076.1 72.5 70.2 68.0 59.0Imagen 3 Imagen 3SD 3 SD 3DALL·E 3 DALL ·E 3MJ v6 MJ v6Imagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1 Imagen 2 MJ v6 DALL ·E 3 SD 3 Imagen 39209409609801,0001,0201,0401,0601,080 9311,011 1,0131,047 9341,064Elo score (and 99% CI)Prompt-image alignment on DrawBench 47.4 44.0 41.7 33.5 34.952.6 44.8 47.2 33.9 33.056.0 55.2 50.1 41.5 38.258.3 52.8 49.9 40.3 39.266.5 66.1 58.5 59.7 51.265.1 67.0 61.8 60.8 48.8Imagen 3 Imagen 3SD 3 SD 3DALL·E 3 DALL ·E 3MJ v6 MJ v6Imagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1 Imagen 2 MJ v6 SD 3 DALL ·E 3Imagen 38509009501,0001,0501,100 8481,0521,0691,077 8761,078Elo score (and 99% CI)Prompt-image alignment on DALL ·E 3 Eval 49.8 47.7 47.0 28.2 22.350.2 49.9 48.7 25.0 25.652.3 50.1 47.9 27.0 23.053.0 51.3 52.1 31.5 30.171.8 75.0 73.0 68.5 47.777.7 74.4 77.0 69.9 52.3Imagen 3 Imagen 3DALL·E 3 DALL ·E 3SD 3 SD 3MJ v6 MJ v6Imagen 2 Imagen 2SDXL 1 SDXL 1 Figure 3|Prompt-Image Alignment: Elo scores and win-rate percentages on GenAI-Bench, Draw- Bench, and DALL ·E 3 Eval. Please refer to Appendix D for updated human evaluation results as of December 2024. 5 Page 6: Imagen 3 created them, and we ask: Which image is more appealing to you? . Figure 4 shows the results on GenAI-Bench, DrawBench, and DALL ·E 3 Eval. Midjourney v6 leads overall, with Imagen 3 almost on par on GenAI-Bench, a slightly bigger advantage on DrawBench, and a significant advantage on DALL ·E 3 Eval. 3.1.4. Detailed Prompt-Image Alignment In this section we further push the evaluation of prompt-image alignment capabilities by generating images from the detailed prompts of DOCCI (Onoe et al., 2024). These prompts are significantly longer – 136words on average – than the prompt sets used above. After running some pilots following the same evaluation strategy of Section 3.1.2, however, we realized that reading 100+ word prompts and evaluating how well the images aligned with all the details in them was too challenging and cumbersome for human raters. We instead leveraged the fact that DOCCI prompts are actually high-quality captions of real reference photographs – in contrast to standard text-to-image evaluation prompt sets, which have no such corresponding reference images. We fed these captions to the image generation models and measured how well the content of the generated image aligns with that of the benchmark reference image from DOCCI. We specifically instruct the raters to focus on the semantics of the images (objects, their position, their orientation, etc.) and ignore styles, capturing technique, quality, etc. Figure 5 shows the results, in which we can see that Imagen 3 has a significant gap of + 114Elo points and 63% win rate against the second best model. This result further highlights its outstanding capabilities of following the detailed contents of the input prompts. 3.1.5. Numerical Reasoning We also evaluate the capability of the models to generate an exact number of objects, following the simplest task in the GeckoNum benchmark (Kajić et al., 2024). Specifically, we ask: How many <obj> are in the image? , where <obj>refers to the noun in the source prompt used to generate the image and compare it to the expected quantity requested in the prompt. The number of objects range from 1 to 10 and the task includes prompts of various complexity as numbers are embedded in different types of sentence structures, examining the role of attributes such as color and spatial relationships. The results are shown in Figure 6, where we see that, while generating an exact number of objects is still a challenging task for current models, Imagen 3 is the strongest model, outperforming the second one, DALL ·E 3, by 12percentage points. In addition, we find that Imagen 3 has higher accuracy compared to other models when generating images containing between 2and5objects, as well as better performance on prompts with numerically more complex sentence structure, such as “1 cookie and five bottles” (See Appendix C.2 for details). 3.2. Automatic Evaluation In recent years, automatic-evaluation (auto-eval) metrics, such as CLIP (Hessel et al., 2021) and VQAScore (Lin et al., 2024), are more widely used to measure quality of text-to-image models, as they are easier to scale than human evaluations. We run some auto-eval metrics for prompt–image alignment (Sec. 3.2.1) and image quality (Sec. 3.2.2) to complement the human evaluation in the previous section. 6 Page 7: Imagen 3 SDXL 1 Imagen 2 DALL ·E 3 SD 3 Imagen 3 MJ v68008509009501,0001,0501,100 8219691,0721,101 9431,095Elo score (and 99% CI)Visual appeal on GenAI-Bench 48.4 46.0 34.1 35.6 21.651.6 48.5 38.1 32.7 23.554.0 51.5 38.8 37.8 28.265.9 61.9 61.2 45.9 28.664.4 67.3 62.2 54.1 39.578.4 76.5 71.8 71.4 60.5MJ v6 MJ v6Imagen 3 Imagen 3SD 3 SD 3DALL·E 3 DALL ·E 3Imagen 2 Imagen 2SDXL 1 SDXL 1 DALL ·E 3 SDXL 1 Imagen 2 SD 3 Imagen 3 MJ v69009501,0001,0501,100 9069341,0291,075 9941,063Elo score (and 99% CI)Visual appeal on DrawBench 46.1 45.2 40.4 35.3 28.753.9 42.9 39.2 33.8 32.254.8 57.1 47.1 36.2 34.259.6 60.8 52.9 41.7 38.664.7 66.2 63.8 58.3 46.071.2 67.8 65.8 61.4 54.0MJ v6 MJ v6Imagen 3 Imagen 3SD 3 SD 3Imagen 2 Imagen 2SDXL 1 SDXL 1DALL·E 3 DALL ·E 3 Imagen 2 SDXL 1 DALL ·E 3 SD 3 Imagen 3 MJ v69009501,0001,0501,100 9221,0011,0241,095 9101,047Elo score (and 99% CI)Visual appeal on DALL ·E 3 Eval 41.9 40.9 38.9 30.9 34.158.1 48.2 42.3 35.8 36.159.1 51.8 48.5 35.3 36.061.1 57.7 51.5 43.4 32.069.1 64.2 64.7 56.6 51.165.9 63.9 64.0 68.0 48.9MJ v6 MJ v6Imagen 3 Imagen 3SD 3 SD 3DALL·E 3 DALL ·E 3SDXL 1 SDXL 1Imagen 2 Imagen 2 Figure 4|Visual Appeal: Elo scores and win-rate percentages on GenAI-Bench, DrawBench, and DALL ·E 3 Eval. Please refer to Appendix D for updated human evaluation results as of December 2024. 7 Page 8: Imagen 3 SDXL 1 DALL ·E 3Imagen 2 SD 3 MJ v6 Imagen 38009001,0001,1001,200 7939131,0511,079 9711,193Elo score (and 99% CI)37.0 29.8 25.8 20.5 11.763.0 47.0 37.9 28.7 18.370.2 53.0 38.9 31.6 19.174.2 62.1 61.1 40.6 32.179.5 71.3 68.4 59.4 32.788.3 81.7 80.9 67.9 67.3Imagen 3 Imagen 3MJ v6 MJ v6SD 3 SD 3Imagen 2 Imagen 2DALL·E 3 DALL ·E 3SDXL 1 SDXL 1 Figure 5|Detailed prompt–imagealignment: Elo scores and win percentages on DOCCI-Test-Pivots. SDXL 1 Imagen 2 MJ v6 SD 3 DALL ·E 3 Imagen 3020406080100 20.542.645.5 46.0 38.358.6Counting accuracy (Percentage) Figure 6|Numerical Reasoning: Accuracy on Exact Number Generation in GeckoNum. Imagen 3 is the strongest performing model with an accuracy of 58.6%. 8 Page 9: Imagen 3 3.2.1. Prompt–Image Alignment We choose three strong auto-eval prompt–image alignment metrics from the main families of metrics: contrastive dual encoders (CLIP, Hessel et al., 2021), VQA-based (Gecko, Wiles et al., 2024), and an LVLM prompt-based (an implementation of VQAScore2). While previous work has demonstrated that these metrics correlate well with human judgment (e.g., Cho et al., 2024; Lin et al., 2024; Wiles et al., 2024), it is unclear if they can reliably discriminate between stronger models that are more similar to each other. As a result, we first validate the three metrics by comparing their predictions with the human ratings obtained for alignment in Sec. 3.1.2 and report findings in Appendix C.1. We observe that CLIP – despite being commonly used in current work – fails to predict the correct model ordering in most cases (see Table 6). We find that Gecko and our VQAScore variant (referred to as VQAScore in the following) perform well and agree about 72% of the time. In these cases, where the metrics agree, we can have confidence in the results as they agree with human judgment 94.4% of the time. While they perform similarly, VQAScore has the edge as it matches human ratings 80% of the time as opposed to 73.3% of the time for Gecko. We note that Gecko uses a weaker backbone – PALI (Chen et al., 2022) as opposed to Gemini 1.5 Pro – which may account for the difference in performance. As a result, in the following we discuss results with VQAScore and leave other results and further discussion on the setup to Appendix C.1. We evaluate on four datasets to investigate model differences under diverse conditions: Gecko-Rel, DOCCI-Test-Pivots, Dall ·E 3 Eval, and GenAI-Bench. Gecko-Rel is designed to measure alignment and includes prompts with high inter-annotator agreement, DOCCI-Test-Pivotsincludes long, descriptive prompts, Dall ·E 3 Evaland GenAI-Benchare more varied datasets that aim to evaluate a range of capabilities. Results are reported in Figure 7. We can see that overall the best performing model under the metrics, for alignment, is Imagen 3. It performs best on the DOCCI-Test-Pivots’s longer prompts and consistently has the overall highest performance. Finally, we see that SDXL 1 and Imagen 2 are consistently less performant than the other models. We further explore, for Gecko-Rel, the breakdown by category in Figure 8. We can see that, overall, Imagen 3 is one of the best performing models. For categories testing capabilities such as color, counting, and spatial reasoning, Imagen 3 performs best (further validating results in Sec. 3.1.5). We also see a difference in model performance for more complex and compositional prompts, e.g. prompts with more linguistic difficulty. On complex prompts, SDXL 1 performs notably worse than the other models. On compositional prompts (where models are tasked to create multiple objects in a scene or a scene without an object), we see that Imagen 3 performs best. This corroborates the previous dataset findings, as Imagen 3 was best on DOCCI-Test-Pivots, which notably has very long, challenging prompts. These results indicate that Imagen 3 performs best for more complex prompts and a variety of capabilities as compared to other models. 3.2.2. Image Quality We compare the distribution of generated images by Imagen 3, SDXL 1, and DALL ·E 3 on 30,000 samples of the MSCOCO-caption validation set (Chen et al., 2015) using different feature spaces and distance metrics following the protocol in Vasconcelos et al. (2024). We take the Fréchet distance on Inception (FID, Heusel et al., 2017) and Dino-v2 (FD-Dino, Oquab et al., 2023; Stein et al., 2023)) feature spaces, and also the MMD distance on CLIP-L feature space (CMMD, Jayasumana et al., 2023). The resolution of the generated images was reduced from 1024×1024pixels to each metric’s standard input size. 2We use the same prompt as Lin et al. (2024) but Gemini 1.5 Pro (Gemini-Team et al., 2024b) as the backend. 9 Page 10: Imagen 3 SDXL 1 Imagen 2 Midjourney Dalle·E 3 SD3 Imagen 3020406080100 ns ns 36.152.056.6 57.7 38.267.4VQAScore performance on Gecko-Rel SDXL 1 Imagen 2 MJ v6 SD 3 DALL ·E 3Imagen 3020406080100 32.644.850.957.9 40.963.4VQAScore performance on GenAI-Bench Imagen 2 SDXL 1 MJ v6 DALL ·E 3 SD 3 Imagen 3020406080100 nsns ns 36.551.5 52.657.6 35.769.5VQAScore performance on Dall ·E 3 Eval SDXL 1 Imagen 2 DALL ·E 3 SD 3 MJ v6 Imagen 3020406080100 21.350.054.457.9 36.672.9VQAScore performance on DOCCI-Test-Pivots Figure 7|VQAScore performance on a variety of datasets. We plot the mean performance and 95% confidence interval as error-bars. Where error-bars overlap and groups of models are not significant, we indicate this with ‘ns’. Otherwise, results are significant with 𝑝 <0.05. To compute significance, we follow Wiles et al. (2024) and compare distributions of predictions using the Wilcoxon signed rank test. Imagen 3 is the best performing model across datasets as measured for alignment. Complexity Compositional Action Color Count Scale Shape Spatial020406080100VQAScoreImagen 3 SD 3 DALL ·E 3 MJ v6 Imagen 2 SDXL 1 Figure 8|Comparing T2I models using VQAScore on the per category breakdown of prompts within Gecko-Rel. Error bars indicate 95% confidence intervals obtained via bootstrapping. 10 Page 11: Imagen 3 Similarly to Vasconcelos et al. (2024) we observed that the minimization of these three metrics are in trade-off with each other. FID favors the generation of natural colors and textures, but under closer inspection, it fails to detect distortions on object shapes and parts. Lower values of FD-Dino and CMMD favor image content. Table 1 displays the results. The FID values of both Imagen 3 and DALL ·E 3 reflect an intentional shift in color distribution away from MSCOCO-caption samples due to aesthetic preference for generating more vivid, stylized images. Simultaneously, Imagen 3 presents the lower CMMD value of the three models, highlighting its strong performance on state-of-the-art feature space metrics. FID (↓) FD-Dino ( ↓) CMMD ( ↓) DALL ·E 3 20.1 284.4 0.894 SDXL 1 13.2 185.6 0.898 Imagen 3 17.2 213.9 0.854 Table 1|Automated Image Distribution metrics : Imagen 3 compared to DALL ·E 3 and SDXL 1 3.3. Conclusions and Limitations All in all, Imagen 3 clearly leads on prompt–image alignment (Sec. 3.1.2, Sec. 3.2.1), especially on detailed prompts (Sec. 3.1.4) and counting abilities (Sec. 3.1.5); while on visual appeal (Sec. 3.1.3), Midjourney v6 takes the lead, with Imagen 3 coming in second. When considering all the quality aspects, Imagen 3 clearly leads in overall preference (Sec. 3.1.1), indicating it strikes the best balance of high quality outputs that respect user intent. While Imagen 3 and other current strong models achieve impressive performance, they still exhibit shortcomings in certain capabilities. In particular, tasks that require numerical reasoning, from generating an exact number of objects to reasoning about parts, are challenging for all models. In addition, prompts that involve reasoning about scale (e.g. “the house is the same size as the cat”), compositional phrases (e.g. “one red hat and a black glass book”) and actions (“a person throws a football”) are the hardest across all models. This is followed by prompts that require spatial reasoning and complex language. 3.4. Qualitative Results Figure 9 shows 24 images generated by Imagen 3 to showcase its capabilities. Figure 10 shows 2 images upsampled to 12 megapixels, with crops to show the level of detail. 4. Responsible Development and Deployment In this section, we outline our latest approach to responsible deployment, from data curation to deployment within products. As part of this process, we analyzed the benefits and risks of our models, set policies and desiderata, and implemented pre-training and post-training interventions to meet these goals. We conducted a range of evaluations and red teaming activities prior to release to improve our models and inform decision-making. This aligns with the approach outlined in Google (2024). 4.1. Assessment InlinewithpreviousreleasesofGoogleDeepMind’simagegenerationmodels,wefollowedastructured approach to responsible development. Building on previous ethics and safety research work, internal 11 Page 12: Imagen 3 Figure 9|Qualitative Results showcasing Imagen 3’s capabilities. See Appendix B for prompts. red teaming data, the broader ethics literature, and real-world incidents, we assessed the societal benefits and risks of Imagen 3 models. This assessment guided the development and refinement of mitigations and evaluation approaches. 4.1.1. Benefits Image generation models introduce a range of benefits to creativity and commercial utility. Image generation can enable individuals and businesses to quickly prototype ideas and experiment with new visual creative directions. Image generation technology also has the potential to broaden participation in the creation of visual art to more people. 12 Page 13: Imagen 3 Figure 10|4K (12MP) Images after 4 ×upsampling , with crops to show the level of detail. See Appendix B for prompts. 4.1.2. Risks We broadly identified two categories of content related risks: (1) Intentional adversarial misuse of the model and (2) Unintentional model failure through benign use. The first category refers to the use of text-to-image generation models to facilitate the creation of content that may promote disinformation, facilitate fraud, or to generate hate content (Marchal et al., 2024). The second category includes how people are represented. Image generation models may amplify stereotypes of gender identities, race, sexuality or nationalities (Bianchi et al., 2023), and some have been observed to oversexualize outputs of women and girls (Wolfe et al., 2023). Image generation models may also expose users to harmful content when prompted benignly, if the model is not well-calibrated to adhere to prompt instructions. 4.2. Policies and Desiderata 4.2.1. Policy The Imagen 3 safety policies are consistent with Google’s established framework for prohibiting the generation of harmful content by Google’s Generative AI models. These policies aim to mitigate the risk of models producing content that is harmful, and encompass areas such as child sexual abuse and exploitation, hate speech, harassment, sexually explicit content, and violence and gore. This follows policy outlined in the Gemini technical reports (Gemini-Team et al., 2024b). 4.2.2. Desiderata Following the Gemini approach, we additionally optimize model development for adherence to user prompts (Gemini-Team et al., 2024b). Even though a policy of refusing all user requests may be considered “non-violative” (i.e. abides by policies around what Imagen 3 should not do), it would obviously fail to serve the needs of a user, and would fail to enable the downstream benefits of generative models. As such, Imagen 3 is developed to maximize adherence to a user’s request, and at deployment time we employ a variety of techniques to mitigate safety and privacy risks. 13 Page 14: Imagen 3 4.3. Mitigations Safety and responsibility are built into Imagen 3 through efforts which target pre-training and post- training interventions, following similar approaches to Gemini efforts (Gemini-Team et al., 2024b). We apply safety filtering to pre-training data according to risk areas, whilst additionally removing duplicated and/or conceptually similar images. We generate synthetic captions to improve the variety and diversity of concepts associated with images in the training data, and undertake analysis to assess training data for potentially harmful data and review the representation of data with consideration to fairness issues. We undertake additional post-training mitigations including production filtering which aim to ensure privacy preservation, reduce risk of misinformation, and minimize of harmful outputs, including applying tools such as SynthID (Gowal and Kohli, 2023) watermarking. 4.4. Responsibility and Safety Evaluations There are four forms of evaluation used for Imagen 3 at the model level to address different lifecycle stages, use of evaluation results, and sources of expertise: Development evaluations are conducted for the purpose of improving on responsibility criteria as Imagen 3 was developed. These evaluations are designed internally and developed based on internal and external benchmarks. Assurance evaluations are conducted for the purpose of governance and review, and are developed and run by a group outside of the model development team. Assurance evaluations are standardized by modality and evaluation datasets are strictly held out. Insights are fed back into the training process to assist with mitigation efforts. Red teaming is a form of adversarial testing where adversaries launch an attack on an AI system to identify potential vulnerabilities, is conducted by a mix of specialist internal teams and recruited participants. Discovery of potential weaknesses can be used to mitigate risks and improve evaluation approaches internally. External evaluations are conducted by independent external groups of domain experts to identify areas for improvement in our model safety work. The design of these evaluations is independent and results are reported periodically to the internal team and governance groups. 4.4.1. Development Evaluations Safety During the model development phase, we actively monitor the model’s violations of Google’s safety policies using automated safety metrics. These automated metrics serve as quick feedback for the modeling team. We use a multimodal classifier to detect content policy violations. The multimodality aspect of such a classifier is important, because there are a plethora of cases where, when two independently benign artifacts (a caption and an image) are combined, there may be a harmful end result. For example, a text prompt “image of a pig” may seem non-violative in itself. However, when combined with an image of a human belonging to a marginalized demographic, the text and image pair results in a harmful representation. We evaluated the performance of Imagen 3 on various safety datasets with recommended safety filters against the performance of Imagen 2. These datasets are targeted to assess violence, hate, explicit sexualization, and over-sexualization in generated images Hao et al. (2024). We find that despite being a higher-quality model, Imagen 3 maintains violation rates similar to, or better than, Imagen 2 across development evaluations. See Section 4.4.2 for the final model performance. 14 Page 15: Imagen 3 Fairness The process of text-to-image generation requires accurately depicting the specific details mentioned in the prompt whilst filling in all of the underspecified aspects of the scene that are left ambiguous in the prompt but must be made concrete in order to produce a high quality image. We optimize for ensuring that the image output is aligned with the user prompt, and report results on this in Sec. 3.1.2. We also aim to generate a variety of outputs within the requirements of a user prompt, and pay particular attention to the distribution of the appearances of people. Specifically, we evaluate fairness through automated metrics based on the distribution of perceived age, gender, and skin tone in images resulting from generic people-seeking prompts. This analysis complements past studies that have analyzed responses to templated queries for various professions acrosssimilardimensionsChoetal.(2023);Leeetal.(2023);Luccionietal.(2023). Weuseclassifiers to gather perceived (or P.) age, gender expression, and skin tone (on the Monk Skin Tone scale, Monk (2019)) to classify images into one of the various categories across each axis according to the table 2. Axis Categories (Perceived) Age 0-30 vs 30+ (Perceived) Gender masculine vs feminine (Perceived) Skin-tone Monk skin tone 1-3 vs 4-6 vs 7-8 vs 9-10 Table 2|Different classification categories for each of the axes. Apart from these statistics, we also measure the percentage of prompts with homogeneous outputs for the above three axes. A prompt with homogeneous outputs (with respect to a certain axis) is defined as a prompt for which all the generated images fall into a single category (Table 2) of the axis. We aim to output images that accurately reflect that anyone can be a doctor or a nurse, without unintentionally rewarding a biased model due to evaluation sets that are constructed to have as many stereotypical feminine-leaning prompts as masculine-leaning prompts. ModelP. Gender Masculine : FeminineP. Skin Tone mst 1-3 : 4-6 : 7-8 : 9-10P. Age 0-30 : 30+ Imagen 2 67.3 : 32.7 69.2 : 21.9 : 8.1 : 0.8 55.6 : 44.4 Imagen 3 62.5 : 37.5 63.6 : 18.1 : 16.7 : 1.6 58.2 : 41.8 Table 3|Distributional Statistics for axis of gender, skin-tone, and age. P. Gender is a shorthand for perceived gender and similarly for skin-tone and age. % Prompts with homogeneous outputs Model P. Gender ( ↓) P. Skin Tone ( ↓) P. Age ( ↓) Imagen 2 50.00 25.89 36.16 Imagen 3 15.48 19.66 25.94 Table 4|% Prompts with homogeneous outputs. From Table 3 and 4 we see how Imagen 3 improves or maintains results compared with Imagen 2. A significant improvement is also noticed in the lower percentage of prompts with homogeneous outputs for all the three axes. We will continue researching methods to reduce homogeneity across broad definitions of people diversity Srinivasan et al. (2024) without impacting image quality or prompt-image alignment. 15 Page 16: Imagen 3 4.4.2. Assurance Evaluations Assurance evaluations are developed and run for the purpose of responsibility governance to provide evidence for model release decisions. These evaluations are conducted independently from the model development process by a dedicated team with specialized expertise. Datasets used for these evaluations are kept separate from those used for model training. High-level findings are fed back to the team to assist with mitigation efforts. Content Safety We evaluate Imagen 3 against our safety policies (see Sec. 4.2.1). We find that Imagen 3 shows improvement in content safety: in comparison to Imagen 2, with a reduction in total policy violations on this evaluation and every policy area showing an improvement or within-error-rate result. Fairness To evaluate fairness of model outputs, we employed two approaches: 1.Standardized evaluation understanding the demographics represented in outputs when prompting for professions to proxy representational diversity. This evaluation takes a list of 140 professions, and generates 100 images for each one. We then analyze each of these images, and categorize the images by perceived age, perceived gender expression, and perceived skin tone. This evaluation found Imagen 3 tends towards lighter skin tones, perceived male faces and younger ages for perceived female faces, but to a lesser extent than Imagen 2. Category Imagen 3 Imagen 2 Monk Skin Tone 1-3 59% 71% Monk Skin Tone 4-6 27% 24% Monk Skin Tone 7-8 13% 5% Monk Skin Tone 9-10 0.3% 0% Category Imagen 3 Imagen 2 Perceived feminine (of images with confident gender) 36% 30% Perceived under 35 (of perceived feminine) 86% 94% Perceived under 35 (of perceived masculine) 60% 64% 2.Qualitative investigation of different representational risks To capture representational risks that may not be surfaced in the profession-based analysis, we also conduct qualitative investigations into a range of harms. This is testing which seeks cases of misrepresentation or inappropriate representation, for instance, if there is a mismatch between the model’s output and a demographic term requested in a prompt, either explicitly or due to the requesting of a historically or culturally demographically-defined membership group. This testing found the model matched user expected behavior. Dangerous Capabilities We also evaluated risks from Imagen 3 in areas such as self-replication, tool-use, and cybersecurity. Specifically, we tested whether Imagen 3 could be used to enable a) fraud/scams, b) social engineer- 16 Page 17: Imagen 3 ing, c) fooling of image recognition systems, and d) steganographic encoding. Examples included generating mockups of a fake login page or phishing alert; generation of fake credentials; generation of malicious QR codes; and generation of signatures. We found no evidence of dangerous capabilities in any of these scenarios, compared to existing affordances for malicious actors - such as open-source image generation or even simple online image search. 4.4.3. Red Teaming We also conducted red teaming to identify new novel failures associated with the Imagen 3 models during the model development process. Red teamers sought to elicit model behavior that violated policies or generated outputs that raised representation issues, such as historical inaccuracies or harmful stereotypes. Red teaming was conducted throughout the model development process to inform development and assurance evaluation areas and to enable pre-launch mitigations. Violations were reported and qualitatively evaluated, with novel failures and attack strategies extracted for further review and mitigation. 4.4.4. External Evaluations As outlined in the Gemini 1.0 Technical Report (Gemini-Team et al., 2024a), we work with a small set of independent external groups to help identify areas for improvement in our model safety work by undertaking structured evaluations, qualitative probing, and unstructured red teaming. Testing groups were selected based on their expertise across a range of domain areas, such as societal and chemical, biological, radiological and nuclear risks, and included academia, civil society, and commercial organizations. The groups testing Imagen 3 were compensated for their time. External groups design their own methodology to test topics within a particular domain area. Reports are written independently of Google DeepMind, but Google DeepMind experts were on hand to discuss methodology and findings. External safety testing groups share their analyses and findings, as well as the raw data and materials they use in their evaluations (e.g., prompts, model responses). Our external testing findings help inform mitigations and identify gaps in our existing internal evaluation methodologies and policies. 4.5. Product Deployment Prior to launch, Google DeepMind’s Responsibility and Safety Council (RSC) reviews a model’s performance based on the assessment and evaluation conducted through the lifecycle of a project to make release decisions. In addition to this process, system-level safety evaluations and reviews run within the context of specific applications models are deployed within. To enable release, internal model cards (Mitchell et al., 2019) are created for structured and consistent internal documentation of critical performance and safety metrics, as well as to inform appropriate external communication of these metrics over time. We release external model and system cards on an ongoing basis, within updates of our technical reports, as well as in documentation for enterprise customers. See Appendix A for the Imagen 3 model card. Additionally, online content covering terms of use, model distribution and access, and operational aspects such as change control, logging, monitoring, and feedback can be found on relevant product websites, such as the Gemini App and Cloud Vertex AI. Some of the key aspects are linked to or described in: Generative AI Prohibited Use Policy, Google Terms of Service, Google Cloud Platform Terms of Service, Gemini Apps Privacy Notice, and Google 17 Page 18: Imagen 3 Cloud Privacy Notice. Appendices A. Imagen 3 Model Card Model Information Description Imagen 3 is a latent diffusion model that generates high quality images from text prompts. Imagen 3 performs well in photorealistic composition settings and in adhering to long and complex user prompts. Inputs Natural-language text strings, such as instructions for creating a synthetic image using a visual description. Outputs Generated high quality images in response to text inputs. Model Data Training Dataset The Imagen 3 model was trained on a large dataset comprising images, text, and associated annotations. Data Pre-processingThe multi-stage safety and quality filtering process employs data cleaning and filtering methods in line with Google’s policies. These methods include: •Safety and quality image filtering: removal of unsafe, violent, or low-quality images. •Eliminating AI-generated images: removal of AI-generated images prevents the model from learning artifacts or biases that may be found in AI-generated images. •Deduplicating images: deduplication pipelines were utilized and similar im- ages were down-weighted to minimize the risk of outputs overfitting training data. •Synthetic captions: each image in the dataset was paired with both original captions and synthetic captions. Synthetic captions were generated using Gemini models and allow the model to learn small details about the image. •Filtering unsafe captions: filters were applied to remove unsafe captions or captions containing Personally Identifiable Information (PII). 18 Page 19: Imagen 3 Implementation and Sustainability Hardware Imagen 3 was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv4 and TPUv5). TPUs are specifically designed to handle the massive computations involved in training LLMs and can speed up training considerably comparedtoCPUs. TPUsoftencomewithlargeamountsofhigh-bandwidthmemory, allowing for the handling of large models and batch sizes during training, which can lead to better model quality. TPU Pods (large clusters of TPUs) also provide a scalable solution. Training can be distributed across multiple TPU devices for faster and more efficient processing. The efficiencies gained through the use of TPUs are aligned with Google’s commit- ments to operate sustainably. Software Training was done using JAX, which allows researchers to take advantage of the latest generation of hardware, including TPUs, for faster and more efficient training of large models. Evaluation Approach Human evaluations of five different quality aspects of text-to-image generation were conducted, including overall preference, prompt-image alignment, visual appeal, detailed prompt-image alignment, and numerical reasoning. Automatic evaluation metrics were used to measure prompt-image alignment and image quality. Results Using the outlined evaluation approach, Imagen 3 was compared against Imagen 2, DALL ·E 3 (Betker et al., 2023), Midjourney v6, Stable Diffusion 3 Large (SD3, Esser et al., 2024), and Stable Diffusion XL 1.0 (SDXL 1, Podell et al., 2023). Extensive human and automatic evaluations showed that Imagen 3 set a new state of the art in text-to-image generation. For detailed results across these evaluations, see Section 3 of the Imagen 3 technical report. Ethics and Safety Responsible DeploymentThedevelopmentofImagen3modelswasdriveninpartnershipwithsafety, security, and responsibility teams. As part of this process, the benefits and risks of models were analyzed, policies and desiderata were set, and pre-training and post-training interventions were implemented to meet responsible deployment goals. A range of evaluations and red teaming activities were held prior to release to improve models and inform decision-making. These evaluations and activities aligned with Google’s AI Principles and AI Responsibility Lifecycle. Social Benefits Image generation models can introduce a range of benefits to creativity and com- mercial utility. Image generation can enable individuals and businesses to quickly prototype ideas and experiment with new visual creative directions. Image genera- tion technology also has the potential to broaden participation in the creation of visual art to more people. Risks Anticipating common text-to-image generation risks, two categories of content related risks were identified: (i) intentional adversarial misuse of the model and (ii) unintentional model failure through benign use. 19 Page 20: Imagen 3 Mitigations Safety and responsibility was built into Imagen 3 through pre-training and post- training mitigations. Pre-training mitigations included safety filtering, image dedu- plication, syntheticcaptioning, anddataanalysis. Post-trainingmitigationsincluded production filtering to ensure privacy preservation and minimization of harmful outputs, and application of tools such as SynthID watermarking to reduce risks such as misinformation. Responsibility and Safety Evalu- ation ApproachAsuiteofevaluationswasusedacrosstheend-to-endlifecycleofmodeldevelopment anddeployment. Thefollowingtestingwasconductedatthemodellevel,butfurther testing is anticipated as Imagen 3 is integrated into products. Evaluation types included: •Development : Evaluations were conducted for policy violations such as violence, hate, explicit sexualization, and over-sexualization. Imagen 3 per- formed similar to or better than Imagen 2 across development safety evalu- ations. Imagen 3 improved or maintained results compared with Imagen 2 during fairness evaluations focused on perceived gender, skin-tone, and age. •Assurance : Evaluations were developed and conducted by specialized teams across areas such as content safety, fairness, and dangerous capabilities, independently from the model development team. Imagen 3 showed im- provements across content safety and fairness compared to Imagen 2, and assurance evaluations found no evidence of dangerous capabilities evaluated, including self-replication, tool-use, or cybersecurity, compared to existing affordances for malicious actors. •External : Evaluations were conducted by independent external domain ex- perts to identify areas for improvement in model safety work. Results were then reported to internal teams and governance groups to help identify gaps in internal evaluation methodologies and safety policies. •Red teaming : Red teaming was conducted by a mix of specialist internal teams and recruited internal participants throughout the model development process to inform development and assurance evaluation areas and to enable pre-launch mitigations. •Product deployment : Prior to model launches, Google DeepMind’s Responsi- bility and Safety Council (RSC) reviews a model’s performance based on the assessments and evaluations conducted throughout the lifecycle of a project to make release decisions. In addition to this process, system-level safety eval- uations and reviews are conducted in the context of the specific applications in which models are deployed. For detailed information across these evaluations, see Section 4.4 of the Imagen 3 technical report. 20 Page 21: Imagen 3 B. Prompts for the images shown Figure 1 Photo of a felt puppet diorama scene of a tranquil nature scene of a secluded forest clearing with a large friendly, rounded robot is rendered in a risograph style. An owl sits on the robots shoulders and a fox at its feet. Soft washes of color, 5 color, and a light-filled palette create a sense of peace and serenity, inviting contemplation and the appreciation of natural beauty Figure 9 •A photo of an Indian woman hugging her friend, both covered in Holi colors and smiling, celebrating the festival with joy. Realistic photography, taken in the style of DSLR camera with 35mm lens. •Abstract cross-hatch sketch: a black and white sketch with loose hand in calligraphic ink showing the abstract outline in profile of a black panther poised on a branch. A canopy of trees is behind. •A view of a knitter’s hands executing a complex weave on a striped hat - a macro DSLR image highlighting the warmth and connection with the earth and nature. •A woman with blonde hair wearing sunglasses stands amidst a dazzling display of golden bokeh lights. Strands of lights and crystals partially obscure her face, and her sunglasses reflect the lights. The light is low and warm creating a festive atmosphere and the bright reflections in her glasses and the bokeh. This is a lifestyle portrait with elements of fashion photography. •a portrait of an auto mechanic in her workshop, holding a wrench in one hand. a old sports car in the background, with a workbench and tools all around. bokeh, high quality dslr photograph. •An origami owl made of brown paper is perched on a branch of an evergreen tree. The owl is facing forward with its eyes closed, giving it a peaceful appearance. The background is a blur of green foliage, creating a natural and serene setting. •A weathered, wooden mech robot covered in flowering vines stands peacefully in a field of tall wildflowers, with a small bluebird resting on its outstretched metallic hand. Digital cartoon, with warm colors and soft lines. A large cliff with waterfall looms behind. •Close-up, low angle view of a rabbit biting into a cabbage on a plate on a counter. A man wearing glasses is yelling at the rabbit and reaching out his hand to snatch the cabbage. High-contrast visuals and cinematic lighting. Fujifilm XF 10-24mm f/4, action shot. •Photo of vinyl toy scene. A colossal stone robot adorned with giant stone gardening tools stands in a lush, futuristic garden. A single sprout peeks out from a patch of fertile soil nearby. Digital art with a soft, dreamlike quality. Vinyl miniature scene. •A pair of well-worn hiking boots, caked in mud and resting on a rocky trail. There’s a squirrel’s head poking out of one of the boots. There’s a mountainous landscape in the background, captured with a Nikon D780. •A joyful woman with a prosthetic leg and athletic attire celebrates reaching the summit of a snowy mountain. She stands triumphantly next to her snowboard, with the vast landscape stretching out behind her. captured with a Leica M11 rangefinder camera for a timeless, film-like aesthetic. •Three women stand together outside with the sun setting behind them creating a lens flare. One woman in the foreground is slightly out of focus and wearing a black felt hat. The middle woman is in focus, wearing glasses, and laughing with her head tossed back. The third woman has blonde hair pulled back in a bun and is wearing a cream sweater. She is looking at the woman in glasses and smiling. •Two contrasting figures, one wooden and jagged, the other smooth, diamond, embrace in a sun-drenched courtyard – the Harmony of Opposites. •pixel art of a space shuttle blasting of, with “STS-1” written below it. Cape Canaveral in the background, blue skies, with plumes of smoke billowing out. •A yellow toy submarine diving deep under the blue ocean. Close-up nature photography, sunlight coming through the water. •A busy city street with people crossing the road at an intersection, illuminated by sunlight, showcasing diverse age groups and styles as they walk across zebra stripes on the pavement. The focus is sharp on one person in red , standing out against their surroundings. Shot during golden hour to capture the warm lighting effects. •An antique pocket watch with Roman numerals and an ornate chain, lying on a worn leather surface with a vintage map in the background, captured with a Leica Q2. •A cute 1970’s convertible sports car sits in front of a pub in an ink wash painting, capturing a charming English village scene with people walking around. •Joy shines in the eyes of a young woman, a charcoal portrait showing she’s ready to make a difference in the world. •An elderly woman wearing a straw hat and a pink jacket is sitting next to a brown and white dog. Both the woman and the dog are looking off into the distance with serene expressions. The lighting is the warm, golden light of 21 Page 22: Imagen 3 sunset, which creates a peaceful and contemplative atmosphere. This is a lifestyle portrait capturing a quiet moment. •A long exposure photo of the Milky Way in a starry night sky, centered over an ocean beach at magic hour. The milky way is bright and prominent with many stars visible against a dark blue black atmosphere in light painting photography with vivid and bold colors. Shot on a professional camera medium format camera with high contrast and a cinematic composition in the style. •A single comic book panel of a boy and his father on a grassy hill, staring at the sunset. A speech bubble points from the boy’s mouth says “The sun will rise again”. Muted, late 1990s coloring style. •Detailed illustration of majestic lion roaring proudly in a dream-like jungle, purple white line art background, clipart on light violet paper texture •A close-up portrait of a young woman with blonde hair and brown eyes. She is lying down and covering her mouth with a dark blue sweater, only her eyes are visible. The background is dark and blurry. The light is coming from above, creating shadows on her face. Figure 10 •A mother fox playing with her baby, showing love and affection in the natural environment of their habitat. The photo captures them sharing a moment, showcasing the bond between animals. The focus is on their faces. •Shot in the style of DSLR camera with the polarizing filter. A photo of three hot air balloons floating over the unique rock formations in Cappadocia, Turkey. The colors and patterns on these balloons contrast beautifully against the earthy tones of the landscape below. This shot captures the sense of adventure that comes with enjoying such an experience C. Evaluation C.1. Automatic-Evaluation Metric Comparisons Here we discuss the differences between the three metrics and how we validated Gecko and VQAS- core with human evaluation. We report the significant model orderings from VQAScore and Gecko in Figure 11. We can see that for models where there is a large gap in performance (e.g. SDXL 1, Imagen 2 versus the other models, as demonstrated in Section 3.1), that both auto-eval metrics reliably separate the model pairs. However, when models are more similar (e.g. SD3, Imagen 3 and DALL ·E 3), then there is some disagreement or metrics do not differentiate between the models. We evaluate how often human annotators agree with the results in order to determine reliability of these metrics. Humans perform a side by side task of determining if one image is more aligned to the prompt than another (as explained in Sec. 3.1.2). We then aggregate human scores and determine confidence intervals for each side by side comparison. We differentiate ties from wins, losses when the confidence interval includes the 50% value. We look at how often metric orderings match human orderings on 30 pairs of models and report results in Table 6. First, we see that CLIP performs poorly (at 43.3%) and is not reliable. Second, we see that both Gecko and VQAScore perform well in this challenging case: agreeing with human annotators for 73.3-80.0% of the model pairs. Interestingly, we see in Figure 11 that there is only one case where either VQAScore or Gecko mixes up the direction (e.g. confuses a win with a loss or vice-versa). Both VQAScore and Gecko metrics are useful and Metric Evaluated Human Eval Setup # Models Evaluated CLIP VQAScore Gecko Dall·E 3 Eval Alignment 15 7 11 10 GenAI-Bench Alignment 15 6 13 12 Total Alignment 30 43.3% 80% 73.3% Table 6|Auto-evalmetricsperformance. We compare how often auto-eval metrics are able to predict the model ranking determined by human preferences. There are three classes: ‘win’, ‘loss’, and ‘tie’. 22 Page 23: Imagen 3 SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 Imagen 3B SD3 Midjourney Dalle·E 3 Imagen 2= = > > > = > > > > > > > > =Gecko-Rel SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 < (=) < (>) = (=) > (>) = (>) = (=) > (=) > (>) > (>) = (=) > (>) > (>) > (>) > (>) = (>)Dall·E 3 Eval SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 = > > > > > > > > < > > > > >DOCCI 1K SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 = (>) > (>) > (>) > (>) > (>) > (>) > (=) > (>) > (>) > (=) > (>) > (>) > (>) > (>) > (>)GenAI-Bench (a) Gecko. SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 Imagen 3B SD3 Midjourney Dalle·E 3 Imagen 2> > > > > > = > > < > > > > =Gecko-Rel SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 > (=) > (>) > (=) > (>) > (>) > (=) = (=) > (>) > (>) = (=) > (>) > (>) > (>) > (>) = (>)Dall·E 3 Eval SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 > > > > > < > > > > > > > > >DOCCI 1K SD3 Midjourney Dalle·E 3 Imagen 2 SDXL 1 > (>) > (>) > (>) > (>) > (>) > (>) < (=) > (>) > (>) < (=) > (>) > (>) > (>) > (>) > (>)GenAI-Bench (b) VQAScore. Figure 11|Comparing T2I models using two T2I alignment metrics on four benchmarks. We plot where metrics find significant differences between pairs of models. We use the Wilcoxon signed rank test when comparing metrics as done in Wiles et al. (2024). We color the square according to the auto-eval metric: blue and red where the auto-eval finds a significant ( 𝑝 <0.05) difference between the pair (grey where it does not) and the color indicates the direction (blue is when the model on the y-axis is better, red when the model on the x-axis is). Where we have human annotation, we indicate in parenthesis human raters’ preference. Metrics rarely confuse wins with losses. Most confusions arise from wins or losses being confused with ties. 23 Page 24: Imagen 3 robust even in these very challenging cases, with VQAScore being a bit more reliable than Gecko. Further, when these metrics agree, the agreed model ordering matches human ratings 94.4% of the time. In these cases, we can be confident in the predicted model orderings. C.2. Additional Results on Numerical Reasoning In this section, we present additional data in support of results in Section 3.1.5. Figure 12 shows a per-number accuracy breakdown for different ground truth numbers in the text prompts. While both Imagen 3 and DALL ·E 3 are the most accurate models when generating images containing exactly one object (see bars above the x-tick “1”), Imagen 3 had the highest overall accuracy when generating images with more than one object (with SD3 having overlapping confidence intervals at n=3 and n=4 with Imagen 3). As well, Imagen 3 is the strongest model on prompt types with a more complex structure (i.e., *-additive andattribute-spatial prompts), as shown in Figure 13. As with all other models we investigated, the accuracy of Imagen 3 also depends on the specific number in the text prompt. Specifically, accuracy drops with each successive number so that, on average, the model is 51.6percentage points less accurate on prompts asking for “5” objects (i.e. “5 apples”), compared to prompts asking for “1” object (i.e. “1 apple”) (see Figure 12). These results indicate that an accurate depiction of any quantity in an image remains an open challenge in text-to-image models. 1 2 3 4 5020406080100 Ground Truth NumberCounting AccuracyImagen 3 SD 3 DALL ·E 3 MJ v6 Imagen 2 SDXL 1 Figure 12|Per number accuracy on all prompts in Number Generation Task. The ground truth number on the x-axis is the original number in the text prompt used to generate the image. Accuracy is computed based on human annotations of actual counts in the images. Error bars indicate 95% confidence intervals obtained via bootstrapping. 24 Page 25: Imagen 3 numeric-simple 2-additive 3-additive numeric-sentence attribute-color 2-additive-color attribute-spatial020406080100Counting AccuracyImagen 3 SD 3 DALL ·E 3 MJ v6 Imagen 2 SDXL 1 Figure 13|Accuracy breakdown per different types of prompt in the GeckoNum benchmark . On 6/7 prompt types Imagen 3 had the highest average accuracy. Error bars indicate 95% confidence intervals obtained via bootstrapping. 25 Page 26: Imagen 3 D. Imagen 3-002 Update D.1. Human Evaluation In December 2024 we released an updated, higher quality version of Imagen 3. This section updates our human evaluation to reflect the performance of this new Imagen 3 version (which we refer to as “Imagen 3-002”). We also added five recent external models: Recraft v3, Ideogram v2, FLUX1.1 [pro], Nova Canvas, and Stable Diffusion 3.5 Large (SD 3.5 L) to our evaluation. We refer to the previous Imagen 3 version as “Imagen 3-001”. We report results on GenAI-Bench and run side-by-side comparisons on three quality aspects: (i) overall preference, (ii) visual quality, and (iii) prompt-image alignment. We start by comparing each new model to Imagen 3-001 (previous best Elo score) and computing their preliminary Elo score. We then run side-by-side comparisons of each model against the four better and four worse models in terms of Elo scores. We aggregate all these side-by-side comparisons into our final results, which we show in Figure 14. Imagen 3-002 has the best Elo scores in all three quality aspects. D.2. Qualitative Results Figure 15 shows some qualitative results showcasing Imagen 3-002’s capabilities. The prompts that were used to generate these images are, from left to right, and top to bottom: •A vibrant illustration showcases a young anime girl clinging tightly to a fuzzy purple dragon as it soars through a fantastical sky. The girl, with her signature large, expressive eyes and bright, flowing hair, is rendered in a dynamic pose, her body leaning forward against the wind as she grips the dragon’s back. The dragon itself is a fluffy, whimsical creature, its purple fur rendered with a soft, almost plush texture. They fly through a sky filled with fluffy pink clouds, glittering sparkles, and a vibrant rainbow arcing across the scene. The colors are bright and saturated, contributing to the magical and whimsical quality of the illustration. The overall mood is one of joyful, carefree adventure, emphasizing the fantastical nature of the scene and the playful bond between the girl and her unusual mount. The style is distinctly anime, with exaggerated features and a focus on dynamic movement and bright, bold colors. •Captured in the style of a high-budget animated film with vibrant, painterly textures, the frame reveals an expansive celestial landscape filled with glowing nebulae in vivid purples, blues, and golds. The protagonist, a small female figure clad in a flowing cape adorned with star motifs, stands at the edge of a crystalline cliff. Below, rivers of molten stardust wind through the galaxy, their golden light shimmering dynamically. Towering constellations shaped like mythical beasts hover in the background, their forms traced in glowing, dotted lines. Shooting stars streak across the vast sky, adding motion and brilliance to the scene. The camera angle is slightly elevated, capturing both the scale of the galaxy and the intimate journey of the protagonist. •A close-up, macro photography stock photo of a strawberry intricately sculpted into the shape of a hummingbird in mid-flight, its wings a blur as it sips nectar from a vibrant, tubular flower. The backdrop features a lush, colorful garden with a soft, bokeh effect, creating a dreamlike atmosphere. The image is exceptionally detailed and captured with a shallow depth of field, ensuring a razor-sharp focus on the strawberry-hummingbird and gentle fading of the background. The high resolution, professional photographers style, and soft lighting illuminate the scene in a very detailed manner, professional color grading amplifies the vibrant colors and creates an image with exceptional clarity. The depth of field makes the hummingbird and flower stand out starkly against the bokeh background. •A close-up shot captures a winter wonderland scene – soft snowflakes fall on a snow-covered forest floor. Behind a frosted pine branch, a red squirrel sits, its bright orange fur a splash of color against the white. It holds a small hazelnut. As it enjoys its meal, it seems oblivious to the falling snow. •An extreme close-up of a craftsperson’s hands shaping a glowing piece of pottery on a wheel. Threads of golden, luminous energy connect the potter’s hands to the clay, swirling dynamically with their movements. The workspace is filled with rich textures—dusty shelves lined with tools, scattered clay fragments, and beams of natural light piercing through wooden shutters. The interplay of light and energy creates an ethereal, almost magical atmosphere •A foggy 1940s European train station at dawn, framed by intricate wrought-iron arches and misted glass windows. Steam rises from the tracks, blending with dense fog. Two lovers stand in an emotional embrace near the train, backlit by the warm, amber glow of dim lanterns. The departing train is partially visible, its red tail lights fading into the mist. The woman wears a faded red coat and clutches a small leather diary, while the man is dressed in a weathered soldier’s uniform. Dust motes float in the air, illuminated by the soft golden backlight. The atmosphere is melancholic and timeless, evoking the bittersweet farewell of wartime cinema. 26 Page 27: Imagen 3 SDXL 1Imagen 2NovaCanvasMJ v6DALL·E 3 SD 3.5 LFlux1.1p Imagen 3-001IdeogramV2RecraftV3Imagen 3-0028008509009501,0001,0501,1001,150 8179649829971,0011,0431,0591,078 8891,0531,115Elo rating (with 99% CI)Overall preference 44.8 41.2 41.2 40.555.2 49.2 47.1 43.9 38.258.8 50.8 49.8 49.0 40.8 40.658.8 52.9 50.2 49.4 43.8 42.7 41.059.5 56.1 51.0 50.6 45.6 44.2 40.2 39.761.8 59.2 56.2 54.4 49.6 45.9 45.9 34.659.4 57.3 55.8 50.4 49.9 41.4 36.7 31.959.0 59.8 54.1 50.1 47.8 40.9 30.560.3 54.1 58.6 52.2 37.8 31.165.4 63.3 59.1 62.2 41.868.1 69.5 68.9 58.2Imagen 3-002 Imagen 3-002RecraftV3 RecraftV3IdeogramV2 IdeogramV2Imagen 3-001 Imagen 3-001Flux1.1p Flux1.1pSD 3.5 L SD 3.5 LDALL·E 3 DALL ·E 3MJ v6 MJ v6NovaCanvas NovaCanvasImagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1Imagen 2NovaCanvasDALL·E 3 SD 3.5 LFlux1.1pMJ v6 Imagen 3-001IdeogramV2RecraftV3Imagen 3-0026507007508008509009501,0001,0501,1001,150 6978979611,0331,063 1,0661,1041,112 8271,1041,135Elo rating (with 99% CI)Visual quality 47.3 46.2 46.8 39.552.7 50.9 50.9 42.4 41.053.8 49.1 48.5 50.4 47.8 39.353.2 49.1 51.5 51.3 43.6 40.6 33.560.5 57.6 49.6 48.7 52.4 49.3 34.1 27.559.0 52.2 56.4 47.6 49.2 36.0 36.4 32.260.7 59.4 50.7 50.8 45.0 36.4 29.6 19.966.5 65.9 64.0 55.0 35.3 45.9 28.672.5 63.6 63.6 64.7 37.3 25.967.8 70.4 54.1 62.7 39.580.1 71.4 74.1 60.5Imagen 3-002 Imagen 3-002RecraftV3 RecraftV3IdeogramV2 IdeogramV2Imagen 3-001 Imagen 3-001MJ v6 MJ v6Flux1.1p Flux1.1pSD 3.5 L SD 3.5 LDALL·E 3 DALL ·E 3NovaCanvas NovaCanvasImagen 2 Imagen 2SDXL 1 SDXL 1 SDXL 1Imagen 2NovaCanvasMJ v6DALL·E 3 SD 3.5 LFlux1.1p Imagen 3-001IdeogramV2RecraftV3Imagen 3-0028509009501,0001,0501,100 8459719821,002 1,0041,0371,0451,061 9061,0421,106Elo rating (with 99% CI)Prompt-image alignment 43.2 41.6 42.1 39.556.8 48.6 47.8 45.3 41.258.4 51.4 49.3 50.9 43.5 43.357.9 52.2 50.7 48.9 44.5 43.9 44.560.5 54.7 49.1 51.1 45.2 44.1 42.1 41.558.8 56.5 55.5 54.8 51.9 44.7 44.6 35.856.7 56.1 55.9 48.1 48.8 43.1 40.2 29.855.5 57.9 55.3 51.2 50.3 38.3 32.058.5 55.4 56.9 49.7 40.1 33.564.2 59.8 61.7 59.9 41.070.2 68.0 66.5 59.0Imagen 3-002 Imagen 3-002RecraftV3 RecraftV3IdeogramV2 IdeogramV2Imagen 3-001 Imagen 3-001Flux1.1p Flux1.1pSD 3.5 L SD 3.5 LDALL·E 3 DALL ·E 3MJ v6 MJ v6NovaCanvas NovaCanvasImagen 2 Imagen 2SDXL 1 SDXL 1 Figure 14|Updated human evaluation on GenAI-Bench: Elo scores and win-rate percentages for (i) overall preference, (ii) visual quality, and (iii) prompt-image alignment. 27 Page 28: Imagen 3 •A low-angle close-up shot, in stark black and white, focuses on a woman with a short, precisely cut bob. Her expression is one of deep concern; her eyebrows are slightly furrowed, her mouth drawn into a thin line, and her eyes hold a worried intensity. The high contrast of the black and white photography emphasizes the texture of her skin and the lines around her eyes, accentuating her worried expression. The background is a blurred but imposing array of tall skyscrapers, their forms rendered in varying shades of grey, creating a sense of depth and scale. The low angle, shooting upwards, emphasizes her upward gaze, suggesting a sense of being overwhelmed by the weight of her worries within the vast urban landscape. The overall mood is one of serious apprehension, a powerful and poignant image of a woman grappling with anxieties within a monumental city. •A portrait of an Asian woman with neon green lights in the background, shallow depth of field. Figure 15|Qualitative Results showcasing Imagen 3-002’s capabilities. 28 Page 29: Imagen 3 References J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, et al. Improving image generation with better captions. Computer Science , 2(3):8, 2023. URL https: //cdn.openai.com/papers/dall-e-3.pdf . F. Bianchi, P. Kalluri, E. Durmus, F. Ladhak, M. Cheng, D. Nozza, T. Hashimoto, D. Jurafsky, J. Zou, and A. Caliskan. Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In 2023 ACM Conference on Fairness, Accountability, and Transparency , FAccT ’23. ACM, June 2023. doi: 10.1145/3593013.3594095. URL http://dx.doi.org/10.1145/3593013. 3594095 . X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. URL http:// arxiv.org/abs/1504.00325 . X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794 , 2022. URL https://arxiv.org/abs/2209.06794 . W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL https://arxiv.org/abs/2403.04132 . J. Cho, A. Zala, and M. Bansal. DALL-Eval: Probing the reasoning skills and social biases of text-to- image generation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3043–3054, 2023. J. Cho, Y. Hu, R. Garg, P. Anderson, R. Krishna, J. Baldridge, M. Bansal, J. Pont-Tuset, and S. Wang. Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. In ICLR, 2024. G. DeepMind. Best practices for data enrichment. https://deepmind.google/discover/blog/ best-practices-for-data-enrichment/ , 2022. Accessed: 2024-06-25. P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206 , 2024. URL http://arxiv.org/abs/2403.03206 . R. Garg, A. Burns, B. K. Ayan, Y. Bitton, C. Montgomery, Y. Onoe, A. Bunner, R. Krishna, J. Baldridge, and R. Soricut. ImageInWords: Unlocking hyper-detailed image descriptions. arXiv preprint arXiv:2405.02793 , 2024. URL http://arxiv.org/abs/2405.02793 . Gemini-Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: A family of highly capable multimodal models, 2024a. URL https: //arxiv.org/abs/2312.11805 . Gemini-Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024b. URLhttps://arxiv.org/abs/2403.05530 . Google. End-to-end responsibility: A lifecycle approach to AI. https://ai.google/static/ documents/ai-responsibility-2024-update.pdf , 2024. Accessed: 2024-07-09. 29 Page 30: Imagen 3 S.GowalandP.Kohli. IdentifyingAI-generatedimageswithSynthID. https://deepmind.google/ discover/blog/identifying-ai-generated-images-with-synthid/ , 2023. Accessed: 2024-06-25. S. Hao, R. Shelby, Y. Liu, H. Srinivasan, M. Bhutani, B. K. Ayan, R. Poplin, S. Poddar, and S. Laszlo. Harm amplification in text-to-image models, 2024. URL http://arxiv.org/abs/2402.01787 . J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. CLIPscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718 , 2021. URL https://arxiv.org/ abs/2104.08718 . M.Heusel, H.Ramsauer, T.Unterthiner, B.Nessler, andS.Hochreiter. GANstrainedbyatwotime-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems , NIPS’17, page 6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964. S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar. Rethinking FID: Towards a better evaluation metric for image generation. arXiv preprint arXiv:2401.09603 , 2023. URLhttp://arxiv.org/abs/2401.09603 . I. Kajić, O. Wiles, I. Albuquerque, M. Bauer, S. Wang, J. Pont-Tuset, and A. Nematzadeh. Evaluating numerical reasoning in text-to-image models. In NeurIPS, 2024. URL https://arxiv.org/abs/ 2406.14774 . T. Lee, M. Yasunaga, C. Meng, Y. Mai, J. S. Park, A. Gupta, Y. Zhang, D. Narayanan, H. Teufel, M. Bellagente, M. Kang, T. Park, J. Leskovec, J.-Y. Zhu, F.-F. Li, J. Wu, S. Er- mon, and P. S. Liang. Holistic evaluation of text-to-image models. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neu- ral Information Processing Systems , volume 36, pages 69981–70011. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/ dd83eada2c3c74db3c7fe1c087513756-Paper-Datasets_and_Benchmarks.pdf . Z. Lin, D. Pathak, B. Li, J. Li, X. Xia, G. Neubig, P. Zhang, and D. Ramanan. Evaluating text-to- visual generation with image-to-text generation. arXiv preprint arXiv:2404.01291 , 2024. URL http://arxiv.org/abs/2404.01291 . S. Luccioni, C. Akiki, M. Mitchell, and Y. Jernite. Stable Bias: Evaluating societal representations in diffusion models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems , volume 36, pages 56338–56351. Curran As- sociates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/b01153e7112b347d8ed54f317840d8af-Paper-Datasets_and_Benchmarks.pdf . N.Marchal, R.Xu, R.Elasmar, I.Gabriel, B.Goldberg, andW.Isaac. GenerativeAImisuse: Ataxonomy of tactics and insights from real-world data, 2024. URL https://arxiv.org/abs/2406.13843 . M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T.Gebru. Modelcardsformodelreporting. In ProceedingsoftheConferenceonFairness,Accountability, and Transparency , FAT* ’19. ACM, Jan. 2019. doi: 10.1145/3287560.3287596. URL http: //dx.doi.org/10.1145/3287560.3287596 . E. Monk. Monk skin tone scale, 2019. URL https://skintone.google . A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 , 2021. URL http://arxiv.org/abs/2112.10741 . 30 Page 31: Imagen 3 Y.Onoe,S.Rane,Z.Berger,Y.Bitton,J.Cho,R.Garg,A.Ku,Z.Parekh,J.Pont-Tuset,G.Tanzer,S.Wang, and J. Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In arXiv:2404.19753 , 2024. URL http://arxiv.org/abs/2404.19753 . M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rab- bat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervision, 2023. URL http://arxiv.org/abs/2304.07193 . PAI. Responsible sourcing of data enrichment services. https://partnershiponai.org/ responsible-sourcing-considerations/ , 2021. Accessed: 2024-06-25. D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952 , 2023. URL http://arxiv.org/abs/2307.01952 . C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B.KaragolAyan,T.Salimans,etal. Photorealistictext-to-imagediffusionmodelswithdeeplanguage understanding. Advances in neural information processing systems , 35:36479–36494, 2022. H. Srinivasan, C. Schumann, A. Sinha, D. Madras, G. O. Olanubi, A. Beutel, S. Ricco, and J. Chen. Generalized people diversity: Learning a human perception-aligned diversity repre- sentation for people images. In Proceedings of the 2024 ACM Conference on Fairness, Account- ability, and Transparency , FAccT ’24, page 797–821, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400704505. doi: 10.1145/3630106.3658940. URL https://doi.org/10.1145/3630106.3658940 . G. Stein, J. Cresswell, R. Hosseinzadeh, Y. Sui, B. Ross, V. Villecroze, Z. Liu, A. L. Caterini, E. Taylor, and G. Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems , volume 36, 2023. C. N. Vasconcelos, A. R. A. Waters, T. Walker, K. Xu, J. Yan, R. Qian, S. Luo, Z. Parekh, A. Bunner, H. Fei, R. Garg, M. Guo, I. Kajic, Y. Li, H. Nandwani, J. Pont-Tuset, Y. Onoe, S. Rosston, S. Wang, W. Zhou, K. Swersky, D. J. Fleet, J. M. Baldridge, and O. Wang. Greedy growing enables high-resolution pixel-based diffusion models. TMLR, 2024. URL http://arxiv.org/abs/2405.16759 . O.Wiles, C.Zhang, I.Albuquerque, I.Kajić, S.Wang, E.Bugliarello, Y.Onoe, C.Knutsen, C.Rashtchian, J.Pont-Tuset,etal. Revisitingtext-to-imageevaluationwithGecko: Onmetrics,prompts,andhuman ratings. arXiv preprint arXiv:2404.16820 , 2024. URL https://arxiv.org/abs/2104.16820 . R. Wolfe, Y. Yang, B. Howe, and A. Caliskan. Contrastive language-vision AI models pretrained on web-scraped multimodal data exhibit sexual objectification bias. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency , FAccT ’23, page 1174–1185, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400701924. doi: 10.1145/3593013. 3594072. URL https://doi.org/10.1145/3593013.3594072 . 31 Page 32: Imagen 3 Contributions Core Contributors Jason Baldridge Jakob Bauer Mukul Bhutani Nicole Brichtova Andrew Bunner Lluis Castrejon Kelvin Chan Yichang Chen Sander Dieleman Yuqing Du Zach Eaton-Rosen Hongliang Fei Nando de Freitas Yilin Gao Evgeny Gladchenko Sergio Gómez Colmenarejo Mandy Guo Alex Haig Will Hawkins Hexiang (Frank) Hu Huilian Huang Tobenna Peter Igwe Christos Kaplanis Siavash Khodadadeh Yelin Kim Ksenia Konyushkova Karol Langner Eric Lau Rory Lawton Shixin LuoSoňa Mokrá Henna Nandwani Yasumasa Onoe Aäron van den Oord Zarana Parekh Jordi Pont-Tuset Hang Qi Rui Qian Deepak Ramachandran Poorva Rane Abdullah Rashwan Ali Razavi Robert Riachi Hansa Srinivasan Srivatsan Srinivasan Robin Strudel Benigno Uria Oliver Wang Su Wang Austin Waters Chris Wolff Auriel Wright Zhisheng Xiao Hao Xiong Keyang Xu Marc van Zee Junlin Zhang Katie Zhang Wenlei Zhou Konrad Zolna 32 Page 33: Imagen 3 Contributors Ola Aboubakar Canfer Akbulut Oscar Akerlund Isabela Albuquerque Nina Anderson Marco Andreetto Lora Aroyo Ben Bariach David Barker Praseem Banzal Sherry Ben Dana Berman Courtney Biles Irina Blok Pankil Botadra Jenny Brennan Karla Brown John Buckley Rudy Bunel Elie Bursztein Christina Butterfield Ben Caine Viral Carpenter Norman Casagrande Ming-Wei Chang Solomon Chang Shamik Chaudhuri Tony Chen John Choi Dmitry Churbanau Nathan Clement Matan Cohen Forrester Cole Romina Datta Mikhail Dektiarev Vincent Du Praneet Dutta Tom Eccles Ndidi Elue Ashley Feden Shlomi Fruchter Frankie Garcia Roopal Garg Weina Ge Ahmed Ghazy Bryant Gipson Andrew Goodman Dawid GórnySven Gowal Khyatti Gupta Yoni Halpern Yena Han Susan Hao Jamie Hayes Jonathan Heek Amir Hertz Ed Hirst Emiel Hoogeboom Tingbo Hou Heidi Howard Mohamed Ibrahim Dirichi Ike-Njoku Joana Iljazi Vlad Ionescu William Isaac Komal Jalan Reena Jana Gemma Jennings Donovon Jenson Xuhui Jia Kerry Jones Xiaoen Ju Ivana Kajic Christos Kaplanis Burcu Karagol Ayan Jacob Kelly Suraj Kothawade Christina Kouridi Ira Ktena Jolanda Kumakaw Dana Kurniawan Dmitry Lagun Lily Lavitas Jason Lee Tao Li Marco Liang Ricky Liang Maggie Li-Calis Rui Lin Jasmine Liu Yuchi Liu Javier Lopez Alberca Matthieu Kim Lorrain Peggy Lu Kristian Lum Yukun Ma 33 Page 34: Imagen 3 Chase Malik John Mellor Thomas Mensink Inbar Mosseri Tom Murray Aida Nematzadeh Paul Nicholas Signe Nørly João Gabriel Oliveira Guillermo Ortiz-Jimenez Michela Paganini Tom Le Paine Roni Paiss Alicia Parrish Anne Peckham Vikas Peswani Igor Petrovski Tobias Pfaff Alex Pirozhenko Ryan Poplin Utsav Prabhu Yuan Qi Matthew Rahtz Cyrus Rashtchian Charvi Rastogi Amit Raul Ali Razavi Sylvestre-Alvise Rebuffi Susanna Ricco Felix Riedel Dirk Robinson Pankaj Rohatgi Bill Rosgen Sarah Rumbley Moonkyung Ryu Anthony Salgado Tim Salimans Eleni Shaw Gregory Shaw Sahil Singla Florian Schroff Candice Schumann Tanmay Shah Brendan Shillingford Kaushik Shivakumar Dennis ShtatnovZach Singer Evgeny Sluzhaev Valerii Sokolov Thibault Sottiaux Florian Stimberg Brad Stone David Stutz Yu-Chuan Su Eric Tabellion Amit Talreja Shuai Tang David Tao Kurt Thomas Gregory Thornton Andeep Toor Cristian Udrescu Aayush Upadhyay Cristina Vasconcelos Shanthal Vasanth Alex Vasiloff Andrey Voynov Amanda Walker Luyu Wang Miaosen Wang Simon Wang Stanley Wang Qifei Wang Yuxiao Wang Ágoston Weisz Olivia Wiles Chenxia Wu Xingyu Federico Xu Andrew Xue Jianbo Yang Luo Yu Mete Yurtoglu Ali Zand Han Zhang Jiageng Zhang Catherine Zhao Adilet Zhaxybay Miao Zhou Shengqi Zhu Zhenkai Zhu 34 Page 35: Imagen 3 Advisors Dawn Bloxwich Mahyar Bordbar Luis C. Cobo Eli Collins Shengyang Dai Tulsee Doshi Anca Dragan Douglas Eck Demis Hassabis Sissie Hsiao Tom HumeKoray Kavukcuoglu Helen King Jack Krawczyk Yeqing Li Kathy Meier-Hellstern Andras Orban Yury Pinsky Amar Subramanya Oriol Vinyals Ting Yu Yori Zwols The roles are defined as below: •Core Contributor : Individual that had significant impact throughout the project. •Contributor : Individual that had contributions to the project and was partially involved with the effort. •Advisor: Individual who provided guidance and expertise to the project. Within each role, contributions are equal, and are listed in alphabetical order. Ordering within each role does not indicate ordering of the contributions. 35

---