DALL-E2
Article
DALL-E2 is a recurring brand in the Astral Codex Ten archive, appearing 4 times across 4 issues between April 18, 2022 and July 08, 2025. The archive places it in contexts such as “The drop corresponded to three big AI milestones. First, DALL-E2, a new and very impressive art AI”; “DALL-E2: “A giant Statue Of Responsibility, standing in a city harbor””; and “DALL-E2 is bad at “compositionality””. It most often appears alongside China, Gary Marcus, and Google Imagen.
Metadata
- Category: Brands
- Mention count: 4
- Issue count: 4
- First seen: April 18, 2022
- Last seen: July 08, 2025
Appears In
- 22
- Book Review: San Fransicko
- I Won My Three Year AI Progress Bet In Three Months
- Now I Really Won That AI Bet
Related Pages
- China (2 shared issues)
- Gary Marcus (2 shared issues)
- Google Imagen (2 shared issues)
- GPT (2 shared issues)
- MidJourney (2 shared issues)
- Twitter (2 shared issues)
- Vitor (2 shared issues)
- 1978 (1 shared issue)
- 2016 essay (1 shared issue)
- 4o (1 shared issue)
- Abbott (1 shared issue)
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
The drop corresponded to three big AI milestones. First, DALL-E2, a new and very impressive art AI.
Inline links: DALL-E2
This raises the eternal question of “exciting game-changer” vs. “incremental progress at the same rate as always”. These certainly don’t seem to me to be bigger game changers than the original DALL-E or GPT-3, but I’m not an expert and maybe they should be. It’s just weird that they used up half our remaining AI timeline (ie moved the date when we should expect AGI by this definition from 20 years out to 10 years out) when I feel like there have been four or five things this exciting in the past decade.
DALL-E2: “A giant Statue Of Responsibility, standing in a city harbor”. I myself prefer civic monuments that look less like Sauron, but tastes may differ. I don’t really have a strong opinion on this. I can think of ways that people are victims and it’s important to acknowledge that and treat them appropriately, and also ways that taking responsibility and not wallowing in victimhood is psychologically healthy and important. Probably San Francisco progressives are too far to the victimology end of the scale and could benefit from taking Shellenberger’s thoughts on the matter seriously. I still feel conflicted on this without really being able to verbalize why.
DALL-E2 is bad at “compositionality”, ie combining different pieces accurately. For example, here’s its response to “a red sphere on a blue cube, with a yellow pyramid on the right, all on top of a green table”.
At the time, I wrote:
I’m not going to make the mistake of saying these problems are inherent to AI art. My guess is a slightly better language model would solve most of them…for all I know, some of the larger image models have already fixed these issues. These are the sorts of problems I expect to go away with a few months of future research.
This proved controversial. Gary Marcus in particular has emphasized how challenging compositionality is for modern language and image models:
Gary Marcus (@GaryMarcus), April 9, 2022: @sama @gdb @Plinz @ylecun, Each of you ridiculed my recent title, but this is what the article was actually about: compositionality. Yes, there are many kinds of progress in other directions. But compositionality is at the core of intelligence. No AGI without it.
Quoted tweet, Gary Marcus (@GaryMarcus): Compositionality *is* the wall. Even “red cube” and “blue cube” on their own are represented unreliably; not one of ten images correctly captures the full phrasal description. The images are beautiful, but no match for the precision of language. https://t.co/uvoXUtETwi
And one of my commenters, Vitor, asked:
Why are you so confident in this? The inability of systems like DALL-E to understand semantics in ways requiring an actual internal world model strikes me as the very heart of the issue. We can also see this exact failure mode in the language models themselves. They only produce good results when the human asks for something vague with lots of room for interpretation, like poetry or fanciful stories without much internal logic or continuity.
Not to toot my own horn, but two years ago you were naively saying we'd have GPT-like models scaled up several orders of magnitude (100T parameters) right about now (https://slatestarcodex.com/2020/06/10/the-obligatory-gpt-3-post/#comment-912798). I'm registering my prediction that you're being equally naive now. Truly solving this issue seems AI-complete to me. I'm willing to bet on this (ideas on operationalization welcome).
I responded to Marcus here, and I responded to Vitor by making a bet on whether AI image models could draw some compositionality-heavy pictures by 2025. The specific terms we agreed on:
My proposed operationalization of this is that on June 1, 2025, if either of us can get access to the best image generating model at that time (I get to decide which), or convince someone else who has access to help us, we'll give it the following prompts:
1. A stained glass picture of a woman in a library with a raven on her shoulder with a key in its mouth
2. An oil painting of a man in a factory looking at a cat wearing a top hat
3. A digital art picture of a child riding a llama with a bell on its tail through a desert
4. A 3D render of an astronaut in space holding a fox wearing lipstick
5. Pixel art of a farmer in a cathedral holding a red basketball
We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do.
DALL-E can’t do any of these: If I were being kind, I would give it the farmer in the cathedral. But I am being unkind, so the farmer in front of the cathedral doesn’t count.
II. There are now at least four more AI image models available: Google Imagen announced May 2022.
Inline links: https://slatestarcodex.com/2020/06/10/the-obligatory-gpt-3-post/#comment-912798, here, bet, https://substackcdn.com/image/fetch/$s_!_gqe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6138ab0d-3a82-4eb9-a328-bf38ea0f6b10_632x784.png, announced
It still fails the library scene, although it does better than DALL-E2 in realizing that the picture itself should be in the style of stained glass. It still fails the fox scene, although it does better than DALL-E2 in at least realizing that the fox should have the lipstick.
DALL-E2 had just come out, showcasing the potential of AI art. But it couldn’t follow complex instructions; its images only matched the “vibe” of the prompt. For example, here were some of its attempts at “a red sphere on a blue cube, with a yellow pyramid on the right, all on top of a green table”.
We generate 10 images for each prompt, just like DALL-E2 does. If at least one of the ten images has the scene correct in every particular on 3/5 prompts, I win, otherwise you do. Loser pays winner $100, and whatever the result is I announce it on the blog (probably an open thread). If we disagree, Gwern is the judge.
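The win condition in the bet terms above is a simple threshold rule, and a small sketch may make it easier to follow. The function and constant names below are hypothetical, chosen only for illustration. The per-image verdicts are assumed to come from human judgment (with Gwern as the judge in case of disagreement), not from anything the code computes, and the sample verdicts are invented.

```python
from typing import Sequence

# Thresholds taken from the bet terms quoted above.
IMAGES_PER_PROMPT = 10  # "We generate 10 images for each prompt"
PROMPTS_NEEDED = 3      # "on 3/5 prompts"


def prompt_passes(verdicts: Sequence[bool]) -> bool:
    """True if at least one image was judged correct in every particular."""
    if len(verdicts) != IMAGES_PER_PROMPT:
        raise ValueError(
            f"expected {IMAGES_PER_PROMPT} verdicts, got {len(verdicts)}"
        )
    return any(verdicts)


def bet_is_won(verdicts_by_prompt: Sequence[Sequence[bool]]) -> bool:
    """True if enough prompts have at least one fully correct image."""
    passed = sum(prompt_passes(v) for v in verdicts_by_prompt)
    return passed >= PROMPTS_NEEDED


# Hypothetical verdicts, invented for illustration only (not real results).
miss = [False] * 10           # no image matched the prompt
hit = [False] * 9 + [True]    # exactly one image matched the prompt

print(bet_is_won([hit, hit, hit, miss, miss]))   # True
print(bet_is_won([hit, hit, miss, miss, miss]))  # False
```

Under this reading, a single fully correct image out of ten is enough for a prompt to count, so the rule rewards a model that can get the scene right at least occasionally rather than one that gets it right every time.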
Why are you so confident in this? The inability of systems like DALL-E to understand semantics in ways requiring an actual internal world model strikes me as the very heart of the issue. We can also see this exact failure mode in the language models themselves. They only produce good results when the human asks for something vague with lots of room for interpretation, like poetry or fanciful stories without much internal logic or continuity.