A machine labeled "SDXL" connected to another machine labeled "GPT-4 Vision".

DALL-E 3 Quality AI Art using GPT-4 Vision & SDXL

Introduction

Today I want to talk about some really cool new AI research that could change how we create images with technology.

You know those apps that can turn regular words into pictures? Like you type “a cute puppy” and it makes a photo of a cute puppy. Well, some smart folks recently figured out how to make those apps WAY better by having two AIs talk to each other!

Let me explain. There’s this new AI called GPT-4 Vision that can look at photos and understand what’s in them. The researchers used that to help the image-making AI make better pictures.

They start with GPT-4 Vision looking at a simple photo. It then tells the image AI what it sees and asks it to make a picture of that. The image AI produces something, but it isn’t perfect yet.

So GPT-4 Vision looks at the messy picture, compares it to the original photo, and tells the image AI how to fix it up. After going back and forth a few times, the image AI gets way better at making what GPT-4 Vision asks for!

It’s like the two AIs are working together, with GPT-4 Vision coaching the image AI on how to improve. I think that’s amazing! In this post, I’ll show you some examples of how much better the pictures get when the AIs talk to each other like this. Let’s get to it!

Overview of the Topic – Image Generation with AI

If you guys remember, some really big recent AI news was OpenAI’s announcement of their GPT-4 Vision model. This gives ChatGPT the ability to actually see images, just like a human.

Some clever researchers have taken this GPT-4 Vision capability and applied it to an AI image generator. Starting with a general image prompt, they show it to the GPT-4 Vision model, which then creates a text prompt to feed into the image generator SDXL.

The output image is fed back into GPT-4 Vision for evaluation, and the process repeats, with GPT-4 Vision continuously refining the text prompts to produce better and better images from SDXL.
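To make that cycle a bit more concrete, here’s a minimal Python sketch of the loop as I understand it. The helper functions (draft_prompt, generate_images, pick_and_critique) are placeholders I’m using for illustration, not the researchers’ actual code, but the structure mirrors the draft, generate, evaluate, revise cycle described above.

```python
def idea_to_image(idea, rounds=3, drafts_per_round=3):
    """Iteratively refine SDXL prompts using GPT-4 Vision feedback.

    `idea` is the user's request (text, optionally with reference images).
    All helper functions are hypothetical placeholders for illustration.
    """
    prompt = draft_prompt(idea)  # GPT-4 Vision writes a first text-to-image prompt
    best_image, feedback = None, ""

    for _ in range(rounds):
        # SDXL renders several candidate images from the current prompt.
        images = generate_images(prompt, n=drafts_per_round)

        # GPT-4 Vision compares the candidates against the original idea,
        # picks the closest one, and explains what still doesn't match.
        best_image, feedback, good_enough = pick_and_critique(idea, images)
        if good_enough:
            break

        # GPT-4 Vision revises the prompt based on its own feedback.
        prompt = draft_prompt(idea, previous_prompt=prompt, feedback=feedback)

    return best_image, prompt
```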

You might be wondering why they didn’t just use DALL-E 3. I think for the researchers it was easier to use SDXL, but using SDXL actually has a hidden benefit – this method produces shockingly good images from SDXL, at what I’d consider almost DALL-E 3 quality levels. That’s because it’s able to prompt SDXL so effectively through this iterative self-learning process.

Introduction to GPT-4 Vision Model

Let’s take a slightly deeper dive into the GPT-4 Vision model that makes this possible. GPT-4 Vision was developed by OpenAI as an extension of their conversational AI chatbot ChatGPT. It gives ChatGPT the ability to actually “see” and understand visual inputs in addition to text.

This is a really big deal – now AI assistants can take in images just like humans can, and leverage vision capabilities to have more natural conversations and be helpful for more real-world tasks.

Currently, GPT-4 Vision is only available to select users through the ChatGPT chatbot and Bing Chat. But the possibilities it unlocks, as we’ll see here, are really exciting!

Researchers’ Approach: Using GPT-4 Vision for Image Generation

A slide presenting GPT-4 Vision Idea2Img, a tool for automatic image design and generation.

In this new research, the Microsoft Azure AI team got special access to GPT-4 Vision through their close partnership with OpenAI. They used it to develop a new iterative approach for improving image generation, which they call “Idea2Img.”

As we’ll see through the examples, this technique produces way better results than just feeding text prompts directly into an image generator like SDXL. The key is GPT-4 Vision’s ability to deeply understand visual concepts from both text and images, allowing it to craft prompts that better communicate what the user is looking for.

Advantages of Using SDXL Over DALL-E 3

You might still be wondering – why use SDXL instead of DALL-E 3 or another more advanced image model? I think for the researchers, SDXL was just easier to integrate with GPT-4 Vision.

However, using SDXL actually shows off the power of this technique. Despite SDXL being less advanced than DALL-E 3, this prompting method gets far better images out of SDXL than you’d expect to be possible. We’re talking DALL-E 3-level quality or better, just by optimizing the prompts!

So this approach could likely supercharge even the most advanced image models out there. The limiting factor is the prompt crafting, not the image model itself.


Examples of Basic Text-to-Image Generation

Let’s look at some examples comparing basic text-to-image generation to the new “Idea2Img” technique.

First, we’ve got a prompt to generate 5 people sitting around a table drinking beer and eating buffalo wings. The raw SDXL output based on just that text is decent, but has some weird glitches and isn’t perfect.

GPT-4 Vision and Idea2Img generated: two images of 5 people sitting around a table drinking beer and eating buffalo wings.

The Idea2Img result looks much more cinematic and coherent. You can clearly see all 5 distinct people, beers, and wings on the table. Still not totally photorealistic due to SDXL’s limitations, but a big improvement!

Improvements with the “Idea2Img” Approach

Here’s another example where Idea2Img really shines at improving text prompt following. The prompt asks for: “A whole cake on the table with the words Azure Research written on the cake.”

A photo of a cake with the words “Azure Research” written on it, surrounded by plates and cups.

Again, raw SDXL doesn’t get it right. But Idea2Img nails it – a nice cake with the text clearly written, just missing one letter R. Pretty impressive considering SDXL has no inherent ability to generate legible text like that!

Next up: “An image of a hand holding an iPhone illustrating how to take a screenshot on an iPhone.” SDXL’s attempt looks glitchy and nonsensical. With Idea2Img, the hand and iPhone are much clearer and properly framed for a “how-to” shot.

A hand holding an iPhone, with text explaining how to take a screenshot on an iPhone.

Text Prompt Following and Its Enhancements

Idea2Img also improves how well logically tricky text prompts are followed, thanks to GPT-4 Vision’s understanding. For example: “A plate that has no bananas on it. There is a glass without orange juice next to it.”

A photo of a white table with a plate of bananas, a glass of orange juice, and a plate with no bananas or orange juice.

Raw SDXL just draws bananas and orange juice since those words are present. Idea2Img understands the intent and gives us an empty plate and glass.

Here’s another example asking for a simple logo.

A visual design process for a stylish hotel logo, showing the idea, the design, and the final product.

SDXL ignores that and draws complex scenes. Idea2Img provides a clean stylized logo.

So you can see how Idea2Img results in significantly better prompt following and coherence!

Style Transfer Using GPT-4 Vision

Incredibly, this technique can even handle style transfer while using only SDXL as the base image model. Typically, style transfer requires specialized AI models trained explicitly on artistic styles.

A graphic of a painting of a corgi dog with a style similar to the one in the image.
Painting of two corgi dogs.

But here, they show GPT-4 Vision an art style sample, then have it optimize prompts for SDXL to emulate that style. As you can see, the results capture the artistic style remarkably well given it’s just guiding a standard image generator.
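Nothing about the loop itself has to change for style transfer; what changes is the instruction handed to GPT-4 Vision along with the style sample. A hypothetical instruction might look like the snippet below. The wording is my own illustration, not the paper’s exact prompt.

```python
# Hypothetical instruction sent to GPT-4 Vision together with the style reference image.
STYLE_INSTRUCTION = (
    "Look at the attached reference image and describe its artistic style "
    "(medium, brushwork, color palette, lighting) in words an image generator "
    "understands. Then write a text-to-image prompt that renders the "
    "following subject in exactly that style: {subject}."
)

prompt_text = STYLE_INSTRUCTION.format(subject="two corgi dogs")
```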

Image Manipulation and Style Changes

It goes beyond style transfer too – Idea2Img can manipulate and modify images like an editing program.

Here they take an image of a tennis player and prompt the system to turn it into a drawing while also changing the background to a beach scene.

A triptych of images showing a tennis player with the background changed to a beach in the third image.

Shockingly, it handles both modifications nearly perfectly!

Again, this is with no image editing capability inherent to SDXL. The prompts optimized by GPT-4 Vision describe the desired edits purely at the text level.

Combining Multiple Concepts in Image Generation

Another benefit is the ability to easily combine disparate concepts and custom inputs.

For example, we can provide an image of Bill Gates, another image of a specific dog breed, and an image of a man in a certain pose and outfit, then prompt the system to generate Bill Gates in that same outfit and pose with the dog next to him.

A collage of three images, one of a man with a dog, one of a dog, and one of a man with a dog waving.

It blends all those elements together remarkably well. This level of compositing complex custom concepts is difficult with a single text prompt.
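With the OpenAI Python client, passing several reference images alongside a text instruction is just a matter of adding more image entries to a single user message. Here’s a rough sketch; the model name, file names, and exact wording are my assumptions, not the paper’s.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def as_data_url(path):
    """Encode a local image as a data URL the vision API accepts."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

content = [
    {"type": "text", "text": (
        "Using the three reference images (a person, a dog breed, and a "
        "pose/outfit), write one SDXL prompt that shows the person wearing "
        "that outfit, in that pose, with the dog standing next to him."
    )},
    {"type": "image_url", "image_url": {"url": as_data_url("person.png")}},
    {"type": "image_url", "image_url": {"url": as_data_url("dog.png")}},
    {"type": "image_url", "image_url": {"url": as_data_url("pose.png")}},
]

resp = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumed model name; use whichever vision model you have access to
    messages=[{"role": "user", "content": content}],
    max_tokens=300,
)
sdxl_prompt = resp.choices[0].message.content
```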

Feedback Loop and How It Enhances Output

So what makes this all work so well? It’s thanks to the iterative feedback loop between GPT-4 Vision and the image generator.

GPT-4 Vision looks at the initial output image and compares it to the original multimodal prompt. It decides how well its prompt conveyed the concepts, where the discrepancies are, and how to revise the text to get an output closer to the desired result.

It can go through multiple rounds of refinement until the generated image matches the intent as closely as possible. This is AI leveraging visual understanding to teach another AI system!
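A minimal version of that reflection step might look like the sketch below: GPT-4 Vision receives the original idea, the current prompt, and the generated image, then returns feedback plus a revised prompt. The message wording and the two-part answer format are assumptions on my part, not the paper’s exact protocol.

```python
def critique_and_revise(client, idea_text, current_prompt, image_data_url):
    """Ask GPT-4 Vision to compare the generated image to the idea and revise the prompt.

    `client` is an openai.OpenAI() instance; `image_data_url` is a base64 data URL.
    """
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{"role": "user", "content": [
            {"type": "text", "text": (
                f"Goal: {idea_text}\n"
                f"Current text-to-image prompt: {current_prompt}\n"
                "Look at the attached generated image. First, under the heading "
                "FEEDBACK, list what does not yet match the goal. Then, under the "
                "heading REVISED PROMPT, write an improved prompt for SDXL."
            )},
            {"type": "image_url", "image_url": {"url": image_data_url}},
        ]}],
        max_tokens=500,
    )
    reply = resp.choices[0].message.content
    feedback, _, revised = reply.partition("REVISED PROMPT")
    return feedback.strip(), revised.strip(": \n")
```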

Prospects for Future Use

This research demonstrates the viability of this approach for enhancing text-to-image generation. Combining a vision-capable large language model like GPT-4 with an image generator, even a less advanced one, produces drastically better results.

The main limitation right now is that GPT-4 Vision is not publicly accessible. But if OpenAI were to release an API, developers could integrate these capabilities to create next-level image generation web apps and creative tools!

This technique also pushes the boundaries of what’s possible with text-to-image models. It unlocks their full potential through optimized prompting.

Requirements for Implementing This Technology

To recap, there are 2 core requirements for implementing this “Idea2Img” system:

  1. Access to an image generator API – this could be SDXL, DALL-E 3, or others.
  2. Access to an API for a vision-capable language model like GPT-4 Vision. This handles evaluating the images and iteratively refining prompts.

Right now, only GPT-4 Vision offers the advanced level of multimodal understanding needed to make the loop work effectively. But future visual AI systems could perhaps replicate these capabilities.

Having both those elements connected allows for this autonomous self-improving image generation process.
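As a rough illustration of what “both elements connected” could look like in practice, here’s a snippet that wires a locally hosted SDXL pipeline (via Hugging Face diffusers) to the OpenAI client. The model IDs and hardware settings are my assumptions; any hosted image-generation API would satisfy the first requirement just as well.

```python
import torch
from diffusers import StableDiffusionXLPipeline
from openai import OpenAI

# Requirement 2: a vision-capable language model (here, OpenAI's API).
vision_llm = OpenAI()

# Requirement 1: an image generator (here, SDXL running locally on a GPU).
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # public SDXL base checkpoint
    torch_dtype=torch.float16,
).to("cuda")

def generate_image(prompt):
    """Render one candidate image from a text prompt."""
    return sdxl(prompt, num_inference_steps=30).images[0]
```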

Significance for the AI Art Generation Community

This research has huge implications for the AI art community. Right now, text-to-image models still require carefully crafted prompts to produce desired results. Idea2Img demonstrates how an AI assistant could refine prompts on its own to match any visual concept.

This could enable totally new creative workflows and collaboration between human creators and AI. We provide high-level visual ideas, and AI handles the tedious fine-tuning of prompts to make that vision a reality.

The possibilities span art, design, advertising, media, and essentially any field where optimized image generation would be valuable!

Conclusion and Final Thoughts

In conclusion, this new technique called “Idea2Img” represents a massive step forward in AI-assisted image generation. Combining the strengths of a vision-capable language model and an image generation model, it creates a self-improving feedback loop that can produce astonishing visuals from simple prompts.

The outputs surpass what either model can achieve alone, and unlock new capabilities like style transfer, compositing, and image editing that aren’t inherent to the base systems at all.

This research highlights the importance of multimodal AI that can connect diverse data types like text, images, and more. And it provides a glimpse of how creatives could one day collaborate with AI to turn imaginative ideas into photorealistic visual content.

I’m excited to see how these capabilities evolve in the future. Let me know your thoughts on this research and what implications you think it could have!
