DALL.E – Creating Images from Text


OpenAI is a well known AI lab for creating the GPT-n series of deep learning pre-trained autoregressive language models similar to BERT/Roberta/XLNet. It has been known to generate almost human level text using just a short sentence.

They have now come up with an interesting new idea – DALL.E – creating images from text. Normally we use computer vision and NLP to extract text from images. The overall explanation is quite funny, but the concept is brilliant and slightly counter-intuitive. Computer vision and NLP algorithms on their own are becoming quite advanced to recognize all kinds of images in language context real-time. Effectively we are moving towards smarter AI systems that can think and write like humans, using human reasoning and ability to connect images to language and vice-versa.

Links to articles and original paper are given below.

In January 2021, OpenAI introduced DALL·E. One year later, our newest system, DALL·E 2, generates more realistic and accurate images with 4x greater resolution.

We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language. https://openai.com/blog/dall-e/

DALL·E 2 can create original, realistic images and art from a text description. It can combine concepts, attributes, and styles. https://openai.com/dall-e-2/

DALL·E 2 has learned the relationship between images and the text used to describe them. It uses a process called “diffusion,” which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image.

The original paper is – Zero-Shot Text-to-Image Generation


Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.