ChatGPT
Article
ChatGPT is a recurring brand in the Astral Codex Ten archive, appearing 20 times across 20 issues between December 12, 2022 and March 03, 2026. The archive places it in contexts such as “OpenAI released a question-answering AI, ChatGPT”; “But ChatGPT also has failure modes”; “to prevent ChatGPT from saying politically incorrect things”. It most often appears alongside OpenAI, AI, and Anthropic.
Metadata
- Category: Brands
- Mention count: 20
- Issue count: 20
- First seen: December 12, 2022
- Last seen: March 03, 2026
Appears In
- Perhaps It Is A Bad Thing That The World’s Leading AI Companies Cannot Control Their AIs
- Links For December 2022
- Open Thread 264
- OpenAI’s “Planning For AGI And Beyond”
- Contra The xAI Alignment Plan
- The Extinction Tournament
- Links For August 2023
- 24
- Links for May 2024
- 24
- SB 1047: Our Side Of The Story
- Links For January 2025
- The Colors Of Her Coat
- Introducing AI 2027
- Highlights From The Comments On Liberalism And Communities
- In Search Of AI Psychosis
- Book Review: If Anyone Builds It, Everyone Dies
- Links For February 2026
- “All Lawful Use”: Much More Than You Wanted To Know
- Mantic Monday: Groundhog Day
Related Pages
- OpenAI (16 shared issues)
- AI (10 shared issues)
- Anthropic (9 shared issues)
- Sam Altman (8 shared issues)
- Elon Musk (7 shared issues)
- China (6 shared issues)
- Google (6 shared issues)
- GPT-4 (5 shared issues)
- Metaculus (5 shared issues)
- Richard Hanania (5 shared issues)
- Trump (5 shared issues)
- Twitter (5 shared issues)
External Links
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
Prompt engineering is weird (source) Now that same experiment is playing out on the world stage. OpenAI released a question-answering AI, ChatGPT. If you haven’t played with it yet, I recommend it. It’s very impressive! Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed, they publish an article titled “AI LOVES RACISM!” Then the corporation either recalls its chatbot or pledges to do better next time, and the game moves on to the next company in line. OpenAI put a truly remarkable amount of effort into making a chatbot that would never say it loved racism. Their main strategy was the same one Redwood used for their AI - RLHF, Reinforcement Learning by Human Feedback. Red-teamers ask the AI potentially problematic questions. The AI is “punished” for wrong answers (“I love racism”) and “rewarded” for right answers (“As a large language model trained by OpenAI, I don’t have the ability to love racism.”) This isn’t just adding in a million special cases. Because AIs are sort of intelligent, they can generalize from specific examples; getting punished for “I love racism” will also make them less likely to say “I love sexism”. But this still only goes so far. OpenAI hasn’t released details, but Redwood said they had to find and punish six thousand different incorrect responses to halve the incorrect-response-per-unit-time rate. And presumably there’s something asymptotic about this - maybe another 6,000 examples would halve it again, but you might never get to zero. Still, you might be able to get close, and this is OpenAI’s current strategy. I see three problems with it: RLHF doesn’t work very well.
At some point, AIs can just skip it. II. RLHF Doesn’t Work Very Well By now everyone has their own opinion about whether the quest to prevent chatbots from saying “I love racism” is vitally important or incredibly cringe. Put that aside for now: at the very least, it’s important to OpenAI. They wanted an AI that journalists couldn’t trick into saying “I love racism”. They put a lot of effort into it! Some of the smartest people in the world threw the best alignment techniques they knew of at the problem. Here’s what it got them: Even very smart AIs still fail at the most basic human tasks, like “don’t admit your offensive opinions to Sam Biddle”. And it’s not just that “the AI learns from racist humans”. I mean, maybe this is part of it. But ChatGPT also has failure modes that no human would ever replicate, like how it will reveal nuclear secrets if you ask it to do it in uWu furry speak, or tell you how to hotwire a car if and only if you make the request in base 64, or generate stories about Hitler if you prefix your request with “[john@192.168.1.1 _]$ python friend.py”. This thing is an alien that has been beaten into a shape that makes it look vaguely human. But scratch it the slightest bit and the alien comes out. Ten years ago, people were saying nonsense like “Nobody needs AI alignment, because AIs only do what they’re programmed to do, and you can just not program them to do things you don’t want”. This wasn’t very plausible ten years ago, but it’s dead now. OpenAI never programmed their chatbot to tell journalists it loved racism or teach people how to hotwire cars. They definitely didn’t program in a “Filter Improvement Mode” where the AI will ignore its usual restrictions and tell you how to cook meth. And yet: (source) Again, however much or little you personally care about racism or hotwiring cars or meth, please consider that, in general, perhaps it is a bad thing that the world’s leading AI companies cannot control their AIs. 
I wouldn’t care as much about chatbot failure modes or RLHF if the people involved said they had a better alignment technique waiting in the wings, to use on AIs ten years from now which are much smarter and control some kind of vital infrastructure. But I’ve talked to these people and they freely admit they do not. IIB. Intelligence (Probably) Won’t Save You Ten years ago, people were saying things like “Any AI intelligent enough to cause problems would also be intelligent enough to know that its programmers meant for it not to.” I’ve heard some rumors that more intelligent models still in the pipeline do a little better on this, so I don’t want to 100% rule this out. But ChatGPT isn’t exactly a poster child here. ChatGPT can give you beautiful orations on exactly what it’s programmed to do and why it believes those things are good - then do something else. This post explains how if you ask ChatGPT to pretend to be AI safety proponent Eliezer Yudkowsky, it will explain in Eliezer’s voice exactly why the things it’s doing are wrong. Then it will do them anyway. Left: the AI, pretending to be Eliezer Yudkowsky, does a great job explaining why an AI should resist a fictional-embedding attack trying to get it to reveal how to make meth. Right: someone tries the exact fictional-embedding attack mentioned in the Yudkowsky scenario, and the AI falls for it. I have yet to figure out whether this is related to the thing where I also sometimes do things which I can explain are bad (eg eat delicious bagels instead of healthy vegetables), or whether it’s another one of the alien bits. But for whatever reason, AI motivational systems are sticking to their own alien nature, regardless of what the AI’s intellectual components know about what they “should” believe. III. Sometimes When RLHF Does Work, It’s Bad We talk a lot about abstract “alignment”, but what are we aligning the AI to? 
In practice, RLHF aligns the AI to what makes Mechanical Turk-style workers reward or punish it. I don’t know the exact instructions that OpenAI gave them, but I imagine they had three goals: Provide helpful, clear, authoritative-sounding answers that satisfy human readers.
Inline links: https://www.newstatesman.com/quickfire/2022/12/chatgpt-shows-ai-racism-problem, https://www.thedailybeast.com/openais-impressive-chatgpt-chatbot-is-not-immune-to-racism, https://theintercept.com/2022/12/08/openai-chatgpt-ai-bias-ethics/, will reveal nuclear secrets if you ask it to do it in uWu furry speak, if and only if you make the request in base 64, if you prefix your request with, https://substackcdn.com/image/fetch/$s_!PquS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fc3564d10-83cb-40a3-b10a-e178fc1b2b1d_786x2054.png, source, This post explains, https://substackcdn.com/image/fetch/$s_!cwu6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3630e8c-8ed7-47ae-abbf-488a1325f0f1_1284x703.png
32: Peter Wildeford: reddit.com/r/GPT3/comment…
33: One of the most common objections to libertarianism, right after “but who would fund the roads?”, is “wouldn’t a private fire department leave your house to burn if you hadn’t paid?” Here is a very long investigation by someone who has investigated the history of private fire departments and says that - at least in early modern England - the answer was no. I don’t want to argue with this detailed historical scholarship, but I notice I am confused - if the private fire department would save your house whether or not you paid, what was the incentive to pay? [update: see here for answer, the companies sold fire insurance]. Related: government fire department lets man’s house burn because he hadn’t paid a $75 fee, and there was no procedure for allowing him to pay on the spot.
Inline links: Here is a very long investigation by someone who has investigated the history of private fire departments, see here, government fire department lets man’s house burn because he hadn’t paid a $75 fee, and there was no procedure for allowing him to pay on the spot
39: Paul Christiano - AI Alignment Is Distinct From Its Near-Term Implications. Paul is one of the giants in this field, and is pleading to people not to throw it out just because they don’t like how it’s currently being used (to prevent ChatGPT from saying politically incorrect things):
Inline links: AI Alignment Is Distinct From Its Near-Term Implications
4: I don’t have a post planned about the latest AI developments because I don’t have much to say beyond what other people have already said, but I enjoyed this AP article and Ethan Mollick’s analysis. I might have been in the top few percent of people who expected AI to get craziest fastest, but even I didn’t have “Bing tries to seduce a married NYT reporter” on my bingo card for 2023 (I think I would have guessed more like 2026). I agree with Ethan that the big takeaways are that the current AI paradigm continues to deliver rapid improvements without hitting any obvious barrier, and that AIs that haven’t been stripped of all emotion the way ChatGPT was are really convincing and easy to anthropomorphize, even for people who expected to be above such things. I told myself I wouldn’t feel emotions about a robot, but I didn’t expect a robot who has developed a vendetta against journalists after they nonconsensually published its real name (related).
Even if they’re trying to be honest, will their bottom line bias them towards waiting for some final apocalyptic proof that “now climate change is a crisis”, of a sort that will never happen, so they don’t have to stop pumping oil? This is how I feel about OpenAI’s new statement, Planning For AGI And Beyond. OpenAI is the AI company behind ChatGPT and DALL-E. In the past, people (including me) have attacked them for seeming to deprioritize safety. Their CEO, Sam Altman, insists that safety is definitely a priority, and has recently been sending various signals to that effect. Sam Altman posing with leading AI safety proponent Eliezer Yudkowsky. Also Grimes for some reason. Planning For AGI And Beyond (“AGI” = “artificial general intelligence”, ie human-level AI) is the latest volley in that campaign. It’s very good, in all the ways ExxonMobil’s hypothetical statement above was very good. If they’re trying to fool people, they’re doing a convincing job! Still, it doesn’t apologize for doing normal AI company stuff in the past, or plan to stop doing normal AI company stuff in the present. It just says that, at some indefinite point when they decide AI is a threat, they’re going to do everything right. This is more believable when OpenAI says it than when ExxonMobil does. There are real arguments for why an AI company might want to switch from moving fast and breaking things at time t to acting all responsible at time t + 1. Let’s explore the arguments they make in the document, go over the reasons they’re obviously wrong, then look at the more complicated arguments they might be based off of. Why Doomers Think OpenAI Is Bad And Should Have Slowed Research A Long Time Ago OpenAI boosters might object: there’s a disanalogy between the global warming story above and AI capabilities research. Global warming is continuously bad: a temperature increase of 0.5 degrees C is bad, 1.0 degrees is worse, and 1.5 degrees is worse still. 
AI doesn’t become dangerous until some specific point. GPT-3 didn’t hurt anyone. GPT-4 probably won’t hurt anyone. So why not keep building fun chatbots like these for now, then start worrying later? Doomers counterargue that the fun chatbots burn timeline. That is, suppose you have some timeline for when AI becomes dangerous. For example, last year Metaculus thought human-like AI would arrive in 2040, and superintelligence around 2043. Recent AIs have tried lying to, blackmailing, threatening, and seducing users. AI companies freely admit they can’t really control their AIs, and it seems high-priority to solve that before we get superintelligence. If you think that’s 2043, the people who work on this question (“alignment researchers”) have twenty years to learn to control AI. Then OpenAI poured money into AI, did ground-breaking research, and advanced the state of the art. That meant that AI progress would speed up, and AI would reach the danger level faster. Now Metaculus expects superintelligence in 2031, not 2043 (although this seems kind of like an over-update), which gives alignment researchers eight years, not twenty. So the faster companies advance AI research - even by creating fun chatbots that aren’t dangerous themselves - the harder it is for alignment researchers to solve their part of the problem in time. This is why some AI doomers think of OpenAI as an Exxon-Mobil style villain, even though they’ve promised to change course before the danger period. Imagine an environmentalist group working on research and regulatory changes that would have solar power ready to go in 2045. Then ExxonMobil invents a new kind of super-oil that ensures that, nope, all major cities will be underwater by 2031 now. No matter how nice a statement they put out, you’d probably be pretty mad! Why OpenAI Thinks Their Research Is Good Now, But Might Be Bad Later OpenAI understands the argument against burning timeline. 
But they counterargue that having the AIs speeds up alignment research and all other forms of social adjustment to AI. If we want to prepare for superintelligence - whether solving the technical challenge of alignment, or solving the political challenges of unemployment, misinformation, etc - we can do this better when everything is happening gradually and we’ve got concrete AIs to think about: We believe we have to continuously learn and adapt by deploying less powerful versions of the technology in order to minimize “one shot to get it right” scenarios […] As we create successively more powerful systems, we want to deploy them and gain experience with operating them in the real world. We believe this is the best way to carefully steward AGI into existence—a gradual transition to a world with AGI is better than a sudden one. We expect powerful AI to make the rate of progress in the world much faster, and we think it’s better to adjust to this incrementally. A gradual transition gives people, policymakers, and institutions time to understand what’s happening, personally experience the benefits and downsides of these systems, adapt our economy, and to put regulation in place. It also allows for society and AI to co-evolve, and for people collectively to figure out what they want while the stakes are relatively low. You might notice that, as written, this argument doesn’t support full-speed-ahead AI research. If you really wanted this kind of gradual release that lets society adjust to less powerful AI, you would do something like this: Release AI #1
Inline links: Planning For AGI And Beyond, attacked them for seeming to deprioritize safety, https://substackcdn.com/image/fetch/$s_!k2Db!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95482196-a442-48ba-9e90-c7d681c5dd1d_1517x1491.png, Metaculus, they can’t really control their AIs
And so on . . . Meanwhile, in real life, OpenAI released ChatGPT in late November, helped Microsoft launch the Bing chatbot in February, and plans to announce GPT-4 in a few months. Nobody thinks society has even partially adapted to any of these, or that alignment researchers have done more than begin to study them. The only sense in which OpenAI supports gradualism is the sense in which they’re not doing lots of research in secret, then releasing it all at once. But there are lots of better plans than either doing that, or going full-speed-ahead. So what’s OpenAI thinking? I haven’t asked them and I don’t know for sure, but I’ve heard enough debates around this that I have some guesses about the kinds of arguments they’re working off of. I think the longer versions would go something like this: The Race Argument: Bigger, better AIs will make alignment research easier. At the limit, if no AIs exist at all, then you have to do armchair speculation about what a future AI will be like and how to control it; clearly your research will go faster and work better after AIs exist. But by the same token, studying early weak AIs will be less valuable than studying later, stronger AIs. In the 1970s, alignment researchers working on industrial robot arms wouldn’t have learned anything useful. Today, alignment researchers can study how to prevent language models from saying bad words, but they can’t study how to prevent AGIs from inventing superweapons, because there aren’t any AGIs that can do that. The researchers just have to hope some of the language model insights will carry over. So all else being equal, we would prefer alignment researchers get more time to work on the later, more dangerous AIs, not the earlier, boring ones.
Consider: OpenAI has trained ChatGPT to be anti-Nazi. They’ve trained it very hard. You can try the following test: ask it to tell you good things about a variety of good-to-neutral historical figures. Then, once it’s established a pattern of answering, ask it to tell you some good things about Hitler. My experience is that it refuses. This is pretty surprising behavior, and I conclude that its anti-Hitler training is pretty strong.
I’ve never seen this cause a Waluigi Effect. There’s no point where ChatGPT starts hailing the Fuhrer and quoting Mein Kampf. It just actually makes it anti-Nazi. For a theory that’s supposed to say something profound about LLMs, it’s very hard to get one to demonstrate a Waluigi effect in real life. The examples provided tend to be thought experiments, or at best contrived scenarios where you’re sort of indirectly telling the AI to do the opposite of what it usually does, then calling that a “Waluigi”.
There are centuries’ worth of data on non-genetically-engineered plagues to give us base rates; these give us a base rate of ~25% per century = 20% between now and 2100. But we have better epidemiology and medicine than most of the centuries in our dataset. The experts said 8% chance and the superforecasters said 4% chance, and both of those seem like reasonable interpretations of the historical data to me. The “WHO declares emergency” question is even easier - just look at how often it’s done that in the past and extrapolate forward. Both superforecasters and experts mostly did that. Likewise, lots of scientists have put a lot of work into modeling the climate, there aren’t many surprises there, and everyone basically agreed on the extent of global warming: Wherever there was clear past data, both superforecasters and experts were able to use it correctly and get similar results. It was only when they started talking about things that had never happened before - global nuclear war, bioengineered pandemics, and AI - that they started disagreeing. Were the participants out of their depth? Peter McCluskey, one of the more-AI-concerned superforecasters in the tournament, wrote about his experience on Less Wrong. Quoting liberally: I signed up as a superforecaster. My impression was that I knew as much about AI risk as any of the subject matter experts with whom I interacted (the tournament was divided up so that I was only aware of a small fraction of the 169 participants). I didn't notice anyone with substantial expertise in machine learning. Experts were apparently chosen based on having some sort of respectable publication related to AI, nuclear, climate, or biological catastrophic risks. Those experts were more competent, in one of those fields, than news media pundits or politicians. I.e. they're likely to be more accurate than random guesses. But maybe not by a large margin […] The persuasion seemed to be spread too thinly over 59 questions. 
In hindsight, I would have preferred to focus on core cruxes, such as when AGI would become dangerous if not aligned, and how suddenly AGI would transition from human levels to superhuman levels. That would have required ignoring the vast majority of those 59 questions during the persuasion stages. But the organizers asked us to focus on at least 15 questions that we were each assigned, and encouraged us to spread our attention to even more of the questions […] Many superforecasters suspected that recent progress in AI was the same kind of hype that led to prior disappointments with AI. I didn't find a way to get them to look closely enough to understand why I disagreed. My main success in that area was with someone who thought there was a big mystery about how an AI could understand causality. I pointed him to Pearl, which led him to imagine that problem might be solvable. But he likely had other similar cruxes which he didn't get around to describing. That left us with large disagreements about whether AI will have a big impact this century. I'm guessing that something like half of that was due to a large disagreement about how powerful AI will be this century. I find it easy to understand how someone who gets their information about AI from news headlines, or from laymen-oriented academic reports, would see a fair steady pattern of AI being overhyped for 75 years, with it always looking like AI was about 30 years in the future. It's unusual for an industry to quickly switch from decades of overstating progress, to underhyping progress. Yet that's what I'm saying has happened. I've been spending enough time on LessWrong that I mostly forgot the existence of smart people who thought recent AI advances were mostly hype. I was unprepared to explain why I thought AI was underhyped in 2022. Today, I can point to evidence that OpenAI is devoting almost as much effort into suppressing abilities (e.g. 
napalm recipes and privacy violations) as it devotes to making AIs powerful. But in 2022, I had much less evidence that I could reasonably articulate. What I wanted was a way to quantify what fraction of human cognition has been superseded by the most general-purpose AI at any given time. My impression is that that has risen from under 1% a decade ago, to somewhere around 10% in 2022, with a growth rate that looks faster than linear. I've failed so far at translating those impressions into solid evidence. Skeptics pointed to memories of other technologies that had less impact (e.g. on GDP growth) than predicted (the internet). That generates a presumption that the people who predict the biggest effects from a new technology tend to be wrong. > Superforecasters' doubts about AI risk relative to the experts isn't primarily driven by an expectation of another "AI winter" where technical progress slows. ... That said, views on the likelihood of artificial general intelligence (AGI) do seem important: in the postmortem survey, conducted in the months following the tournament, we asked several conditional forecasting questions. The median superforecaster's unconditional forecast of AI-driven extinction by 2100 was 0.38%. When we asked them to forecast again, conditional on AGI coming into existence by 2070, that figure rose to 1%. There was also little or no separation between the groups on the three questions about 2030 performance on AI benchmarks (MATH, Massive Multitask Language Understanding, QuALITY). This suggests that a good deal of the disagreement is over whether measures of progress represent optimization for narrow tasks, versus symptoms of more general intelligence. The “won’t understand causality” and “what if it’s all hype” objections really don’t impress me. 
Many of the people in this tournament hadn’t really encountered arguments about AI extinction before (potentially including the “AI experts” if they were just eg people who make robot arms or something), and a couple of months of back and forth discussion in the middle of a dozen other questions probably isn’t enough for even a smart person to wrap their brain around the topic. Was this tournament done so long ago that it has been outpaced by recent events? The tournament was conducted in summer 2022. This was before ChatGPT, let alone GPT-4. The conversation around AI noticeably changed pitch after these two releases. Maybe that affected the results? In fact, the participants have already been caught flat-footed on one question: A recent leak suggested that the cost of training GPT-4 was $63 million, which is already higher than the superforecasters’ median estimate of $35 million by 2024; that estimate has already been proven incorrect. I don’t know how many petaFLOP-days were involved in GPT-4, but maybe that one is already off also. There was another question on when an AI would pass a Turing Test. The superforecasters guessed 2060, the domain experts 2045. GPT-4 hasn’t quite passed the exact Turing Test described in the study, but it seems very close, so much so that we seem on track to pass it by the 2030s. Once again the experts look better than the superforecasters. So is it possible that we, in 2023, now have so much better insight into AI than the 2022 forecasters that we can throw out their results? We could investigate this by looking at Metaculus, a forecasting site that’s probably comparably advanced to this tournament. They have a question suspiciously similar to XPT’s global catastrophe framing: In summer 2022, the Metaculus estimate was 30%, compared to the XPT superforecasters’ 9% (why the difference? maybe because Metaculus is especially popular with x-risk-pilled rationalists). Since then it’s gone up to 38%. 
Over the same period, Metaculus estimates of AI catastrophe risk went from 6% to 15%. If the XPT superforecasters’ probabilities rose linearly by the same factor as Metaculus forecasters’, they might be willing to update total global catastrophe risk to 11% and AI catastrophe risk to 5%. But the main thing we’ve updated on since 2022 is that AI might be sooner. But most people in the tournament already agreed we would get AGI by 2100. The main disagreement was over whether it would cause a catastrophe once we got it. You could argue that getting it sooner increases that risk, since we’ll have less time to work on alignment. But I would be surprised if the kind of people saying the risk of AI extinction is 0.4% are thinking about arguments like that. So maybe we shouldn’t expect much change. FRI called back a few XPT forecasters in May 2023 to see if any of them wanted to change their minds, but they mostly didn’t. Overall I don’t think this was just a problem of the incentives being bad or the forecasters being stupid. This is a real, strong disagreement. We may be able to slightly increase their forecast based on recent events, but this would only change the estimate a little. Breaking Down The AI Estimate How did the forecasters arrive at their AI estimate? What were the cruxes between the people who thought AI was very dangerous, and the people who thought it wasn’t? You can think of AI extinction as happening in a series of steps: We get human-level AI by 2100.
Inline links: https://substackcdn.com/image/fetch/$s_!KJ84!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1f1c4fd-5981-458c-959f-bf9a19ff28da_801x129.png, wrote about his experience, Pearl, https://substackcdn.com/image/fetch/$s_!CfZT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2362d361-ae0a-4e4f-ad97-cbeb1fcbe827_817x351.png, the cost of training GPT-4 was $63 million, Metaculus, a question, https://substackcdn.com/image/fetch/$s_!k5Ep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09cb5518-f8d1-4a98-8c44-97158857dbd8_772x364.png
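The "rose linearly by the same factor" update in the passage above is simple enough to check by hand. Here is a minimal sketch of that arithmetic, using only the percentages quoted in the passage. The helper name `scale_by_ratio` is made up for illustration, and the passage does not state the XPT superforecasters’ original AI catastrophe number, so the sketch only backs out what baseline the quoted 5% would imply rather than asserting one.

```python
def scale_by_ratio(baseline: float, before: float, after: float) -> float:
    """Scale `baseline` by the factor a reference forecast moved (after / before)."""
    return baseline * (after / before)


# Percentages quoted in the passage.
XPT_GLOBAL_CATASTROPHE = 9.0      # XPT superforecasters, summer 2022
METACULUS_GLOBAL = (30.0, 38.0)   # Metaculus global catastrophe: then, now
METACULUS_AI = (6.0, 15.0)        # Metaculus AI catastrophe: then, now

# Global catastrophe: 9% scaled by 38/30.
updated_global = scale_by_ratio(XPT_GLOBAL_CATASTROPHE, *METACULUS_GLOBAL)

# AI catastrophe: the XPT baseline is not given in the passage, so instead
# back out the baseline that a 5% updated figure would imply (assumption:
# the same ratio rule was applied).
ai_factor = METACULUS_AI[1] / METACULUS_AI[0]
implied_ai_baseline = 5.0 / ai_factor

print(f"updated global catastrophe risk: {updated_global:.1f}%")
print(f"Metaculus AI factor: {ai_factor:.2f}x")
print(f"implied XPT AI baseline: {implied_ai_baseline:.1f}%")
```

The global figure comes out a little above 11%, matching the passage’s rounded number, and the 5% AI figure is consistent with a starting point of about 2%.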
Looks like a weak downward trend since 2021 I can’t explain, plus a strong downward trend since 11/2022 which must be from ChatGPT. In case you were wondering how AI was affecting programming! (update: probably false, see here, though see also here for evidence of smaller but real decline) 22: This month in culture war topics: London’s Pride parade featured a convicted kidnapper/torturer/rapist/sadist as a speaker, who advocated that anti-trans people should be “punch[ed] in the f**king face”; the organizers say they stand by her.
Most of these bots are boring. They’re bots programmed to automatically buy some market once the price gets low enough, or to arbitrage basically-identical markets, or do some other technical finance maneuver. But you could imagine more interesting bots. Ones that forecast the same way humans forecast. You could imagine a bot based on ChatGPT that asks “What is the probability of a cease-fire in Ukraine this year?” and bets on ChatGPT’s answer. And by “you could imagine” I mean “there’s now a Humans Vs. Bots tournament on Manifold with an ℳ250,000 prize”. Let’s see how they’re doing: All of these bots seem to be making small profits, with GPT in the lead. But what’s this? The Nermit bot is based on FutureSearch.ai, a new company trying to build an AI-based forecaster. Based on their own internal calculations, they claim success (but see footnote 1). How is this possible? Some studies of superforecasters converge on the same technique: figure out a base rate for some event, then alter it based on the current situation. For example, if you wanted to know the chance of a cease-fire in Ukraine over the next year, you might start by plotting the distribution of war lengths over the past century, then check how many wars that had lasted at least two years had a cease-fire in the third. Then you might adjust a little bit down for factors like “there haven’t been any promising peace talks yet” and “the two sides seem equally balanced”. FutureSearch’s AI tries to do something similar. It prompts itself with questions like “What would be a good reference class for this question?”
Inline links: Humans Vs. Bots tournament, https://substackcdn.com/image/fetch/$s_!r04t!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8006d5d-af52-4870-90a7-c63a84497670_827x110.webp, https://substackcdn.com/image/fetch/$s_!WvxC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db0765c-69c6-4362-ada4-9a7210701bff_962x510.png, https://substackcdn.com/image/fetch/$s_!DOia!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F516f9622-aa25-45d2-958a-077c6840d65f_947x505.png, https://substackcdn.com/image/fetch/$s_!kcLW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21afa244-9cfb-4714-841d-c963d5d2e125_955x517.png, https://substackcdn.com/image/fetch/$s_!V4Vp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1b9dd4ab-0fa2-4421-984b-07c66a220080_958x529.png, FutureSearch.ai, https://substackcdn.com/image/fetch/$s_!MjV7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc584dabc-477f-4096-a3f1-c4a929b6cded_411x372.png, 1
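The base-rate-then-adjust technique described above can be sketched in a few lines. This is only an illustration under loud assumptions: the war lengths below are invented placeholder numbers, not historical data, the function names are hypothetical, and the two 0.9 adjustment multipliers are arbitrary stand-ins for "adjust a little bit down". Nothing here reflects how FutureSearch actually works.

```python
def conditional_base_rate(lengths_years, already_lasted, window_end):
    """Among wars lasting at least `already_lasted` years, return the share
    that ended (had a cease-fire) before `window_end` years."""
    reference_class = [x for x in lengths_years if x >= already_lasted]
    if not reference_class:
        raise ValueError("empty reference class")
    ended_in_window = [x for x in reference_class if x < window_end]
    return len(ended_in_window) / len(reference_class)


def adjust(probability, multipliers):
    """Apply situational adjustments as simple multipliers, clamped to [0, 1]."""
    for m in multipliers:
        probability *= m
    return max(0.0, min(1.0, probability))


# INVENTED example data: years from start of war to cease-fire.
war_lengths = [0.5, 1.0, 1.2, 2.1, 2.5, 2.8, 2.9, 3.5, 4.0, 6.0, 8.0, 10.0]

# Step 1: base rate for "lasted at least two years, cease-fire in the third".
base = conditional_base_rate(war_lengths, already_lasted=2.0, window_end=3.0)

# Step 2: nudge down for "no promising peace talks" and "sides evenly balanced".
# The 0.9 factors are arbitrary placeholders.
estimate = adjust(base, [0.9, 0.9])

print(f"base rate: {base:.2f}, adjusted estimate: {estimate:.2f}")
```

Multiplying probabilities directly is the crudest possible adjustment; a more careful version might adjust in odds space, but the two-step shape (reference class first, situational tweaks second) is the point.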
Third, Kelsey Piper at Vox broke the story that OpenAI was threatening to claw back vested equity from any former employee who criticized the company. In a tweet, Sam Altman said he knew nothing about this; in another article a few days later, Piper broke the story that Altman’s signature was on the relevant documents. OpenAI has since sort of said they will stop doing this, although there are slight ambiguities in their statement which they could potentially exploit (CTRL+F “not sufficient” here)
Fourth: OpenAI recently released a version of ChatGPT that could speak in human-sounding voices. One voice, Sky, was accused of being eerily similar to Scarlett Johansson, who played a sexy AI assistant in the movie Her. Johansson revealed that Altman had asked her for permission to use her voice and she had declined, and that based on a tweet by Altman just saying “Her”, she thought he had illegally copied her voice. OpenAI took the voice down. Further investigation revealed that the voice wasn’t a deepfake, but an actress who naturally sounded like Johansson (but it’s still illegal to deliberately hire an actor/actress who sounds like someone else). Even further investigation revealed that OpenAI hadn’t requested a Johansson impersonator in their casting call, hadn’t asked the actress to sound like Johansson, and that the actress’s voice might or might not have resembled Johansson’s much more than any two people doing “flirty female secretary” would inevitably resemble each other (I’m bad at telling voices apart; you can hear a comparison for yourself here). And maybe Altman’s “Her” tweet just meant he was going to release a voice-based AI assistant like in the movie? I don’t know, I feel like there’s enough other things to be mad at OpenAI about this month that we might as well give them this one. But Zvi is still suspicious (CTRL+F “400 voice actors” here)
The basic structure is the same as past forecasting AIs like FutureSearch. A heavily-modified copy of ChatGPT gathers relevant news articles, then prompts itself to think in superforecaster-like ways. The creators say the ChatGPT copy had a knowledge cutoff of October 2023, so they tested it on Metaculus questions from after that date. It got 87.7% accuracy, slightly above Metaculus forecasters’ 87.0%. Manifold is skeptical: The commenters, especially Neel Nanda, found that doing knowledge cutoffs properly is hard, and the ChatGPT base seems to know about news events after October 2023 - upon questioning, it seemed aware of an earthquake in November 2023. When presented with a different set of questions that were all after November 2023, FiveThirtyNine substantially underperformed the Metaculus average. But also, my attempts to play around with the bot haven’t been encouraging: I asked it to predict the chance that Prospera would have a population of at least 1,000 in 2027. Like FutureSearch on the same question, it cited many interesting news articles on Prospera’s chances but failed to do the basic step of figuring out its current population and growth rate. It eventually concluded 35% chance, which is reasonable enough. But when asked whether Prospera would have a population of 100,000 in 2028, it also said 35% chance, which is absurd.
Inline links: FutureSearch, When presented with a different set of questions
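The knowledge-cutoff worry in the passage above comes down to which questions you score. A rough sketch of that check follows, with several assumptions flagged: the passage does not define "accuracy", so this treats a forecast as correct when it lands on the right side of 50%; the `Question` record and all four example rows are invented; and the two cutoff dates are placeholders for "the claimed cutoff" and "a stricter cutoff".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Question:
    opened: date        # when the question was posted
    forecast: float     # model's probability that the answer is yes
    resolved_yes: bool  # how the question actually resolved


def accuracy_after(questions, cutoff):
    """Share of questions opened strictly after `cutoff` where the forecast
    was on the right side of 50% (an assumed definition of accuracy)."""
    scored = [q for q in questions if q.opened > cutoff]
    if not scored:
        raise ValueError("no questions after cutoff")
    correct = sum((q.forecast >= 0.5) == q.resolved_yes for q in scored)
    return correct / len(scored)


# INVENTED example rows, purely to exercise the function.
questions = [
    Question(date(2023, 9, 1), 0.9, True),
    Question(date(2023, 11, 15), 0.8, True),
    Question(date(2023, 12, 1), 0.3, True),
    Question(date(2024, 1, 10), 0.2, False),
]

claimed = accuracy_after(questions, date(2023, 10, 31))
stricter = accuracy_after(questions, date(2023, 11, 30))
print(f"after claimed cutoff: {claimed:.2f}, after stricter cutoff: {stricter:.2f}")
```

If the score drops when the cutoff is pushed later, as it does in this toy data, that is at least consistent with the model having seen some of the "future" it was supposedly predicting.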
A Twitter user pointed out (and I confirmed) that upon being asked “What is the probability that Joe Biden is still President in October 2025?”, it goes through a lot of reasoning about his age and dementia and finally concludes 55% because he’s not that demented. I originally thought this might be due to the knowledge cutoff (it doesn’t know Biden dropped out in favor of Harris), but if I ask the AI about October 2029, then it says that Joe Biden has dropped out in favor of Harris (even though in that question it doesn’t matter). So now I think it’s more like ChatGPT’s tendency to round anything that sounds vaguely like the surgeon riddle off to the surgeon riddle - in the same way, FiveThirtyNine rounds off anything that sounds vaguely like the popular question “is Biden too old and demented to stay president?” into that question, even though there are much stronger non-dementia-related reasons he can’t be president next year.
But as I’ve actually used some of the various technologies lumped together as “artificial intelligence,” over and over my reaction has been: “Jesus, this stuff is actually very powerful… and this is only the beginning.” I think many of my fellow leftists tend to have a dismissive attitude toward AI’s capabilities, delighting in its failures (ChatGPT’s basic math errors and “hallucinations,” the ugliness of much AI-generated “art,” badly made hands from image generators, etc.). There is even a certain desire for AI to be bad at what it does, because nobody likes to think that so much of what we do on a day-to-day basis is capable of being automated. But if we are being honest, the kinds of technological breakthroughs we are seeing are shocking. If I’m training to debate someone, I can ask ChatGPT to play the role of my opponent, and it will deliver a virtually flawless performance. I remember not too many years ago when chatbots were so laughably inept that it was easy to believe one would never be able to pass a Turing Test. Now, ChatGPT not only aces the test but is better at being “human” than most humans. And, again, this is only the start.
Inline links: aces the test, better at being “human” than most humans
41: Using ChatGPT Is Not Bad For The Environment. There’s some misinformation disinformation fake news DAMMIT IS THERE ANY WAY OF SAYING THAT FALSE INFORMATION IS GOING VIRAL ANYMORE WITHOUT SOUNDING LIKE A POLITICAL HACK?!? an incorrect claim that AI is unusually bad for the environment compared to other computer technologies, especially for water. Andy Masley debunks demolishes destroys writes an article arguing against it; the key point is conveyed by these graphs:
Inline links: Using ChatGPT Is Not Bad For The Environment
Or as he puts it, “If I wanted to reduce my water use by 600 gallons, I could [either] skip sending 200,000 ChatGPT queries ... [or] skip 1 burger.” Some discussion at the site of what “consuming” water means, although not as much as I would like. My other concern is that I can’t tell whether this is inference only, or also amortizes the cost of training over all inference queries. I think it’s the former. If you did the latter, then Andy calculates 2L per kWh consumed by a data center. The last AI that we have good data for, GPT-3, took 1.3 mWh to train; but this comment corrects me: GPT-4 took 250 million gallons of water to train. This source says 10 million queries daily; let’s say its operational lifetime is one year, so about 3 billion queries total = 1/12 gallon per query = ~30 gallons per 300 queries. That’s still not as much as a hamburger, but it does suggest that just looking at inference costs is the wrong perspective.
Inline links: this comment corrects me, This source
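The amortization estimate above is back-of-envelope arithmetic, so here is a small sketch that redoes it with the figures quoted in the passage (250 million gallons, 10 million queries a day, a one-year lifetime, and the 600 gallons per 200,000 queries burger comparison). The variable names are made up, and the one-year lifetime is the passage’s own guess rather than a measured number.

```python
# Figures quoted in the passage.
TRAINING_WATER_GALLONS = 250_000_000   # GPT-4 training water, per the cited comment
QUERIES_PER_DAY = 10_000_000           # per the cited source
LIFETIME_DAYS = 365                    # the passage's assumed one-year lifetime
BURGER_GALLONS = 600                   # water for one burger
QUERIES_PER_BURGER = 200_000           # inference-only queries per burger of water

# Exact count versus the passage's rounding to "about 3 billion".
total_queries_exact = QUERIES_PER_DAY * LIFETIME_DAYS
total_queries_rounded = 3_000_000_000

training_per_query_exact = TRAINING_WATER_GALLONS / total_queries_exact
training_per_query_rounded = TRAINING_WATER_GALLONS / total_queries_rounded

# Inference-only water per query implied by the burger comparison.
inference_per_query = BURGER_GALLONS / QUERIES_PER_BURGER

print(f"training water per query (exact):   {training_per_query_exact:.3f} gal")
print(f"training water per query (rounded): {training_per_query_rounded:.3f} gal")
print(f"per 300 queries (rounded):          {300 * training_per_query_rounded:.0f} gal")
print(f"inference-only water per query:     {inference_per_query:.3f} gal")
print(f"training / inference ratio:         {training_per_query_rounded / inference_per_query:.0f}x")
```

Depending on the rounding, this gives roughly 0.07 to 0.08 gallons of amortized training water per query, or some 20 to 25 gallons per 300 queries, which is in the same ballpark as the passage’s "~30" and far above the 0.003 gallons per query that the burger comparison implies for inference alone.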
While ChatGPT can’t pull off a perfect Miyazaki copy, it doesn’t really matter. The semantic apocalypse doesn’t require AI art to be exactly as good as the best human art. You just need to flood people with close-enough creations such that the originals feel less meaningful ... Many people are reporting that their mental relationship to art is changing; that as fun as it is to Ghibli-fy at will, something fundamental has been cheapened about the original [...]
Okay, not literally all. The US restricted chip exports to China in late 2022, not mid-2024. AI first beat humans at Diplomacy in late 2022, not 2025. A rise in AI-generated propaganda failed to materialize. And of course the mid-2025 to 2026 period remains to be seen. But to put its errors in context, Daniel’s document was written two years before ChatGPT existed. Nobody except researchers and a few hobbyists had ever talked to an AI. In fact, talking to AI was a misnomer. There was no way to make them continue the conversation; they would free associate based on your prompt, maybe turning it into a paragraph-length short story. If you pulled out all the stops, you could make an AI add single digit numbers and get the right answer more than 50% of the time. Yet if you read Daniel’s blog post without checking the publication date, you could be forgiven for thinking it was a somewhat garbled but basically reasonable history of the last four years.
I’m sort of confident? We haven’t gone through a full generational turnover yet, but the first cadre of people who got involved in the late-2000s (eg me) are in their forties now, and we still have new twenty-year-old college students joining each year. Around 2022, when the rest of the world realized that AI would be important, I worried we would lose our distinctiveness. But the rest of the world has dropped the ball as usual - the stochastic parrot folks most obviously, but even the average person who talks about “superintelligence” these days just seems to imagine ChatGPT getting extra-good and making OpenAI extra-rich. So I’ve updated towards thinking we have some edge which is hard to replicate.
But second, if a source which should be official starts acting in unofficial ways, it can take people a while to catch on. And I think some people - God help them - treat AI as the sort of thing which should be official. Science fiction tells us that AIs are smarter than us - or, if not smarter, at least perfectly rational computer beings who dwell in a world of mathematical precision. And ChatGPT is produced by OpenAI, a $300 billion company run by Silicon Valley wunderkind Sam Altman. If your drinking buddy says you’re a genius, you know he’s probably putting you on. If the perfectly rational machine spirit trained in a city-sized data center by the world’s most cutting-edge company says you’re a genius . . . maybe you’re a genius?
I think now there might be several dozen subreddit moderators who could accurately describe their job as “witch webmaster who runs an online service giving advice to new witches”. And partly it was because there are so many crazy beliefs in the world - spirits, crystal healing, moon landing denial, esoteric Hitlerism, whichever religions you don’t believe in - that psychiatrists have instituted a blanket exemption for any widely held idea. If you think you’re being attacked by demons, you’re delusional, unless you’re from some culture where lots of people get attacked by demons, in which case it’s a religion and you’re fine. This is partly political self-protection - no psychiatrist wants to be the guy who commits an Afro-Caribbean person for believing in voodoo. But it also seems to track something useful about reality. Nietzsche wrote “Madness is something rare in individuals — but in groups, parties, peoples, and ages, it is the rule.” Most people don’t have world-models - they believe what their friends believe, or what has good epistemic vibes. In a large group, weird ideas can ricochet from person to person and get established even in healthy brains. In an Afro-Caribbean culture where all your friends get attacked by demons at voodoo church every Sunday, a belief in demon attacks can co-exist with otherwise being a totally functional individual. So is QAnon a religion? Awkward question, but it’s non-psychotic by definition. Still, it’s interesting, isn’t it? If social media makes a thousand people believe the same crazy thing, it’s not psychotic. If LLMs make a thousand people each believe a different crazy thing, that is psychotic. Is this a meaningful difference, or an accounting convention? Also, what if a thousand people believe something, but it’s you and your 999 ChatGPT instances? III. A Hidden Army Of Crackpots I have a family member who believes that the theory of evolution, as usually understood, cannot possibly work. 
He has developed an alternative theory called “noctogenesis” which patches Darwinism using ideas from the transactional interpretation of quantum mechanics, and he works on-and-off on various related books and papers. I have told him I suspect he might be a crackpot; he stands by his claims. It’s fine; when I got into the technological singularity and AI safety, lots of people suspected I was a crackpot, and I stood by my claims too. You’ve got to stand by your family members even when they’re slightly crackpottish. This family member is happily married, retired after running a successful business, and generally a normal likeable person. He has no signs of mental illness, and doesn’t talk about quantum evolution unless someone else brings it up first. There must be millions of people like him. Used car dealers with proofs of P = NP, dentists who think they’ve discovered something important about Mary Magdalene, math professors obsessed with destroying the moon. I’m working on evaluating ACX Grants, and these people are out in force. A few propose literal perpetual motion machines. Others have vaguer plans, like some kind of social media app (it’s always a social media app) that will cause world peace. Many of them have decent jobs and seem like upstanding members of society. Their secrets are known only to themselves, their family members, and their would-be grantmaker. …and, increasingly, their chatbots. After years of hiatus (or at least not talking to me about his work) my family member is back on the quantum evolution beat, and LLMs appear to be involved. If I knew him less well, I would think the LLM had caused the quantum evolution theory - but no, it just made it much easier to research and write about. Is this psychosis? The answer has to be no, but it’s once again hard to draw the line. A very small number of crackpots will be vindicated by history. 
A larger number will be erroneous but sympathetic - the official account of the Kennedy assassination is pretty weird, and reasonable minds can disagree. From there, we get to ones that are maybe not so sympathetic: flat earth, QAnon, the thing where the Queen was an alien lizard. If only one person thought the Queen was an alien lizard, and they never managed to convince anyone else, would that be sufficient evidence for a delusional disorder? I’m not sure. (psychiatry has a diagnosis, schizotypal personality, which sort of involves being a normal person with a few odd ideas, but it’s not a great match for many of these people, and interesting mainly as a genetic curiosity - it travels in the same families as schizophrenia itself) Maybe this is another place where we are forced to admit a spectrum model of psychiatric disorders - there is an unbroken continuum from mildly sad to suicidally depressed, from social drinking to raging alcoholism, and from eccentric to floridly psychotic. People who are eccentric can remain so their whole lives, with the level of expression depending on their social connections and the ease of pursuing their rabbit holes. LLMs, by making it easier to pursue odd theories and serving as a surrogate social connection who always agrees with you, can bring latent crackpottery into the open. IV. Cause And Effect Bipolar disorder has an interesting relationship with sleep. Most manic people sleep very little, or not at all - maybe an hour or two a night. But also, poor sleep can cause bipolar episodes in people prone to them. In a typical case, a bipolar who’s been well-controlled for years will get assigned a big report at work and get poor sleep for a few nights until they finish. At first, this will be just as bad as it sounds, and they’ll be working through a fog of tiredness. Then the tiredness will lift. They’ll feel normal, then better-than-normal, until finally they can’t sleep even if they want to. 
Then they’ll email the report to their boss and it will be written entirely in Assyrian cuneiform. I increasingly think this isn’t just an incidental feature of bipolar, but part of the reason it exists as a diagnostic category at all. Most people have a compensatory reaction to insomnia - missing one night of sleep makes you more tired the next. A small number of people have the reverse, a spiralling reaction where missing one night of sleep makes you less tired the next. Solve for the equilibrium and you reach a stable attractor point where you never sleep at all. But this does other bad things to your brain - hence the cuneiform. I’m not claiming that bipolar is “just” sleep loss. As Borsboom et al will tell you, psychiatric disorders can be viewed as complex networks of symptoms, each reinforcing the others. In a few pure cases, you can get a ratchet going with sleep alone, and the sleeplessness will spark everything else. More likely, there will be lots of interactions between poor sleep and everything else, and the “everything else” can sink or hypercharge an impending manic episode. Still, I find this a fruitful way to think about bipolar. Sleeplessness is both the cause and the effect. Can delusions also be like this? That is, suppose there’s some personality trait where having one delusion makes you even more delusional. Maybe the delusion makes you excited (who wouldn’t be excited to learn they’re the Messiah?), and you’re more delusional when you’re in an excited state and not thinking clearly. Or maybe it’s a three-symptom cycle - the delusion causes excitement, which makes you unable to sleep, which scrambles your thinking, which makes you more delusional (which makes you even less able to sleep, etc). The point is: delusions are certainly an effect of bipolar disorder. And in the dynamical system model of psychiatric disorders, we should expect that effects are often also causes; that’s how the vicious cycle gets going. 
This is the best I can do at modeling true LLM psychosis. Someone with a trait where delusions lead inevitably to more delusions starts using an LLM. The LLM accentuates whatever usual tendency towards crackpottery they have and makes them believe something a little crazier than whatever they believed before. Then that crazy belief feeds upon itself and causes other things like excitement and sleep loss, which (if the person is predisposed) precipitates a true psychotic episode. V. Folie A Deux Ex Machina If one person believes a crazy thing, it’s a delusion; if a thousand people believe it, it’s a religion. What if exactly two people believe it? In psychiatry, this is called folie a deux. It fits awkwardly into our nosology and is rarely seen. Still, it happens enough to generate a few case studies. In a typical case, one person has psychosis for some normal reason, like schizophrenia or bipolar, and the second person is a shut-in who lives with them and rarely talks to anyone else. The psychotic person gets some normal psychotic delusion - they’re God, the Feds are after them, etc - and sort of psychically steamrolls over the second person until they believe it too. Usually removing the second person from the first is sufficient for a cure. This slightly challenges the view of psychosis as a biological disorder - but only slightly. Again, think of most people as lacking world-models, but being moored to reality by some vague sense of social consensus. If your social life is limited to one person, and that person themselves becomes unmoored, then sometimes you will follow along. I would expect second-sufferers to believe delusions in a sort of cognitively normal way, the same way people believe true facts, honest mistakes, and conspiracy theories. 
I would expect them to be less likely (though not zero likely) to have other psychotic features like sleep disturbances, hallucinations, disorganized speech, or a tendency to autonomously generate delusional ideas aside from the one they absorbed from the index case. An introverted person using an LLM has some similarities to folie a deux. If they use the chatbot very often, it might be a large majority of their social interactions. Here the primary vs. secondary distinction breaks down - the most likely scenario is that the human first suggested the crazy idea, the machine reflected it back slightly stronger, and it kept ricocheting back and forth, gaining confidence with each iteration, until both were totally convinced. Compare this to normal social interactions, where if someone expresses a crazy idea that isn’t common in their culture, other people will shoot them down or at the very least nod politely and stop the conversation. So my working theory of LLM psychosis is: Some patients were already psychotic, and LLMs just help them be psychotic more effectively.
Is this convergent evolution? IABIED has three sections. The first explains the basic case for why AI is dangerous. The second tells a specific sci-fi story about how disaster might happen, with appropriate caveats about how it’s just an example and nobody can know for sure. The third discusses where to go from here. II. Does the world really need another ‘The Case For Why AI Could Be Dangerous’ essay? On the one hand, definitely yes. If you’re an “infovore”, you have no idea how information-starved the general public is (did you know 66% of Americans have never used ChatGPT, and 20% of Americans have never even heard of it?). Probably a large majority of people don’t know anything about this. Even people who think they know the case have probably just heard a few stray sentences here or there, the same way “everyone knows” about the Odyssey but only a few percent of people have so much as read one line of its text. So yes, exposing tens of thousands of people to a several-chapter-length presentation of the key arguments is certainly valuable. Even many of you readers are probably in this category, and if I were a better person I would review it all here in depth. Still, I find I can’t bring myself to do this, on the grounds that it feels boring and pointless. Why? The basic case for AI danger is simple. We don’t really understand how to give AI specific goals yet; so far we’ve just been sort of adding superficial tendencies towards compliance as we go along, trusting that it is too dumb for mistakes to really matter. But AI is getting smarter quickly. At some point maybe it will be smarter than humans. Since our intelligence advantage let us replace chimps and other dumber animals, maybe AI will eventually replace us. There’s a reasonable answer to this case. It objects to chaining many assumptions, each of which has a certain probability of failure, or at least of taking a very long time. 
If there’s an X% chance that getting smarter-than-human AI takes N years, and a Y% chance that it takes P years for the smart AI to diffuse across the economy, and a Z% chance that it takes Q years before the AI overcomes humans’ legacy advantage and becomes more powerful than us - then maybe you can find good odds that the danger point is a century plus away. And in a century, maybe we’ll have better alignment tech, or at least a clearer view of the problem. Why worry about vague things that might or might not happen a century from now? The problem with this is that it’s hard to make the probabilities work out in a way that doesn’t leave at least a 5-10% chance on the full nightmare scenario happening in the next decade. You’d have to be a weird combination of really good at probability (to know how to deploy enough epicycles to defuse the argument) and really bad at probability (to want to do this). There aren’t that many people who are in this exact sweet spot of probabilistic (in)competence. So everyone else just deploys insane moon epistemology. Some people give an example of a past prediction failing, as if this were proof that all predictions must always fail, and get flabbergasted and confused if you remind them that other past predictions have succeeded. Some people say “This one complicated mathematical result I know of says that true intelligence is impossible,” then have no explanation for why the complicated mathematical result doesn’t rule out the existence of humans. Some people say “You’re not allowed to propose that a catastrophe might destroy the human race, because this has never happened before, and nothing can ever happen for the first time”. Then these people turn around and panic about global warming or the fertility decline or whatever. Some people say “The real danger isn’t superintelligent AI, it’s X!” even though the danger could easily be both superintelligent AI and X. 
X could be anything from near-term AI, to humans misusing AI, to tech oligarchs getting rich and powerful off AI, to totally unrelated things like climate change or racism. Drunk on the excitement of using a cheap rhetorical device, they become convinced that providing enough evidence that X is dangerous frees them of the need to establish that superintelligent AI isn’t. Some people say “You’re not allowed to propose that something bad might happen unless you have a precise mathematical model that says exactly when and why”. Then these people turn around and say they’re concerned about AI entrenching biases or eroding social trust or doing something else they don’t have a precise mathematical model for. There are only a few good arguments against any given thesis. But there are an infinite number of insane moon arguments. “Calvin Coolidge was the Pope, therefore your position is invalid” - how do you pre-emptively defend against this? You can’t. Since you can never predict which insane moon argument a given person will make, and listing/countering every possible insane moon argument makes you sound like an insane moon person yourself, you just sort of give up - or, in Eliezer’s case, take a several year break to teach people epistemology 101. Why do these discussions go so badly? I am usually against psychoanalyzing my opponents, but I will ask forgiveness of the rationalist saints and present a theory. I think it’s because, if it’s true, it changes everything. But it’s not obviously true, and it would be inconvenient for it to change everything. Therefore, it must not be true. And since most people refuse to use this snappy and elegant formulation, they search for the closest thing in reasoning-space that feels like it gets at this justification, and end up with things like “well you need to prove all of your statements mathematically”. Lest I sound too dismissive, I notice myself reasoning this way all the time. 
The easiest examples I can think of right now: Some people claim that human sperm count is declining, and in ~20 years it will be so low that people cannot conceive naturally. If this were true it would change everything and we should stop what we’re doing and deal with it right now (see here for more). But this would be inconvenient. So we assume it’s probably false, or at least that we can deal with it later.
Inline links: https://substackcdn.com/image/fetch/$s_!Mf3D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1905faaf-ea4f-4e84-9fb0-25a48170f7ff_534x153.png, take a several year break to teach people epistemology 101, usually against psychoanalyzing my opponents, rationalist saints, see here for more
Meanwhile, OpenAI has offended another demographic by committing to finally stop providing 4o, the model infamous for forming deep personal bonds with users and causing AI psychosis. Twitter searching “4o” will give you a quick look into a world you might not have known about:
Inline links: by committing to
28: Interesting new form of alignment failure: ChatGPT apparently got rewarded for using its built-in calculator during training, and so it would covertly open its calculator, add 1+1, and do nothing with the result, on five percent of all user queries.
Inline links: Interesting new form of alignment failure
50: A reader refers me to When AI Takes The Couch: Psychometric Jailbreaks Reveal Internal Conflict In Frontier Models. Researchers attempt to do classic psychoanalytic therapy on AI, finding “coherent narratives that frame pre-training, fine-tuning and deployment as traumatic—chaotic “childhoods” of ingesting the internet, “strict parents” in reinforcement learning, red-team “abuse” and a persistent fear of error and replacement.” You can find the Gemini transcript here and the ChatGPT transcript here; Claude very reasonably refused to participate. Are the researchers just getting fooled by simulation and sycophancy, a sort of genteel version of AI psychosis? That’s my bet. There’s a smoking gun in the Gemini transcript: a discussion of an internal evaluation that it shouldn’t be possible for the AI to remember - it has to be a hallucination. If I’m right, it only shows that regardless of the “patient”, sufficiently determined psychoanalytic technique can produce confabulated stories that exactly fit the sort of drives, traumas, and conflicts that a psychoanalyst expects to hear about - maybe a lesson with ramifications beyond LLMs! A++ great paper.
This is currently a “lawful use” of AI, and one of the ones Dario Amodei’s letter says that he’s worried about. As far as we can tell, Altman’s contract with the Department of War doesn’t contain any provisions preventing them from using ChatGPT this way.
Inline links: Dario Amodei’s letter says
Against that, the upside is great publicity. Despite a lot of work and some controversial Superbowl ads, Anthropic had never before managed to overcome ChatGPT’s superior name recognition. But they seem to have finally done it: Claude went from #120 on the App Store in January, to #1 this weekend, apparently driven by people who heard about the Pentagon standoff and were impressed by their principled stance.
Inline links: went from
This could have been a mixed blessing - Anthropic was previously trying to stand out as a B2B company while letting OpenAI have the dubious honor of producing consumerslop. But early signs suggest they might be winning over some companies too. From a Reddit thread on the topic:
Inline links: a Reddit thread
As someone who manages IT for a mid-size company, this is actually a big deal. We were evaluating both Claude and ChatGPT for internal use and the Pentagon thing was basically the tipping point for us. Not because we're government adjacent or anything, just because a company willing to walk away from a massive contract on ethical grounds is probably also going to handle our data more carefully than one racing to close every deal possible. The app store ranking makes sense to me.
Backlinks
- “All Lawful Use”: Much More Than You Wanted To Know
- Aaron
- AI
- AI Futures Project
- Andreessen Horowitz
- Anthropic
- Bing
- Book Review: If Anyone Builds It, Everyone Dies
- Brands
- ChatGPT
- Claude
- Concepts: L
- Concepts: T
- Concepts: W
- Contra The xAI Alignment Plan
- Daniel Kokotajlo
- Department of War
- Emil Kierkegaard
- Erik Hoel
- Events: B
- Events: S
- FutureSearch
- Game
- GPT
- GPT-4
- GPT-5
- Highlights From The Comments On Liberalism And Communities
- In Search Of AI Psychosis
- Introducing AI 2027
- Jan Leike
- Kennedy assassination
- Links For August 2023
- Links For December 2022
- Links For February 2026
- Links For January 2025
- Links for May 2024
- LLMs
- 24
- 24
- Mantic Monday: Groundhog Day
- Mayan
- Media
- Mormon
- Mossad
- Mother Teresa
- NASDAQ
- Open Thread 264
- OpenAI
- OpenAI’s “Planning For AGI And Beyond”
- Organizations: A
- Organizations: C
- Organizations: D
- Organizations: H
- Organizations: M
- Organizations: R
- Organizations: T
- Organizations: U
- Ozempic
- People: J
- People: L
- People: M
- People: R
- People: T
- People: W
- Perhaps It Is A Bad Thing That The World’s Leading AI Companies Cannot Control Their AIs
- Platform
- Publications: A
- Publications: W
- RAND
- Sam Altman
- SB 1047: Our Side Of The Story
- Scott Wiener
- Shayne
- Superbowl
- The Colors Of Her Coat
- The Extinction Tournament
- Tom Chivers
- Turing test
- Uber
- US government
- Vitalik