AI Safety

Article

AI Safety is a recurring concept in the Astral Codex Ten archive, appearing 12 times across 12 issues between December 30, 2021 and November 26, 2025. The archive places it in contexts such as “AI Safety Needs Great Engineers”; “AI safety, which started as the hobbyhorse of a few weird transhumanists in the early 2000s”; and “I’ve seen it in AI safety”. It most often appears alongside Manifold Markets, Sam Altman, and ACX Grants.

Metadata

  • Category: Concepts
  • Mention count: 12
  • Issue count: 12
  • First seen: December 30, 2021
  • Last seen: November 26, 2025

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

December 30, 2021 · Original source
5: Related: AI Safety Needs Great Engineers. “If you could write a pull request for a major ML library, you should apply to one of the groups working on empirical AI safety: Anthropic, Cohere, DeepMind Safety, OpenAI Safety and Redwood Research.”
January 19, 2022 · Original source
The story thus far: AI safety, which started as the hobbyhorse of a few weird transhumanists in the early 2000s, has grown into a medium-sized respectable field. OpenAI, the people responsible for GPT-3 and other marvels, have a safety team. So do DeepMind, the people responsible for AlphaGo, AlphaFold, and AlphaWorldConquest (last one as yet unreleased). So do Stanford, Cambridge, UC Berkeley, etc, etc. Thanks to donations from people like Elon Musk and Dustin Moskowitz, everyone involved is contentedly flush with cash. They all report making slow but encouraging progress.
He admits he's failed most of his persuasion rolls. When he succeeds, it barely helps. He analogizes his quest to arguing against perpetual motion machine inventors. Approach the topic on too shallow a level, and they're likely to respond to criticism by tweaking their designs. Fine, you've debunked that particular scheme, better add a few more pulleys and a waterwheel or two. Eliezer thinks that's the level on which mainstream AI safety has incorporated his criticisms. He would prefer they take a step back, reconsider everything, and maybe panic a little.
Over the past few months, he and his friends have worked on transforming this general disagreement into a series of dialogues. These have been pretty good, and (rare for bigshot AI safety discussions) gotten released publicly. That gives us mere mortals a rare window into what AI safety researchers are thinking.
March 28, 2022 · Original source
This idea comes up in some weird places. I’ve seen it in AI safety: AIs that you “only” ask to predict an event for you aren’t necessarily safe, partly because one easy way to predict an event is to cause it.
November 20, 2022 · Original source
5: Prof. Daniel Kang is looking for PhD/masters students to work with him on practical AI safety at the University of Illinois Urbana-Champaign CS program. Projects include fighting deepfakes with cryptographic techniques, democratizing AI for non-experts, and developing AI-based analytics methods with accuracy guarantees for eg scientific studies and mission-critical workflows. Some potential longer-term implications. See here for more information.
November 28, 2023 · Original source
…and other major AI safety advances, including RLAIF and the foundations of AI interpretability.
Founded the field of AI safety, and incubated it from nothing up to the point where Geoffrey Hinton, Yoshua Bengio, Demis Hassabis, Sam Altman, Bill Gates, and hundreds of others have endorsed it and urged policymakers to take it seriously.
Helped (probably, I have no secret knowledge) the Biden administration pass what they called “the strongest set of actions any government in the world has ever taken on AI safety, security, and trust.”
December 08, 2023 · Original source
An effort to perform rapid replication of results in psychology journals. You can read the full list here.
What are impact certificates / impact markets?
This year’s ACX Grants will be a hybrid design. Most of it will use the traditional funding model. But applicants whose projects don’t get funded by the traditional model will have the option (not requirement! you don’t have to think about this if you don’t want to!) to opt in to an “impact market”, a non-traditional charitable funding institution.
In an impact market, charitable projects offer to sell a sort of “stock”, called “impact certificates”. Investors buy the impact certificates, funding the project. If the project succeeds, a funder (like a grantmaker or foundation) may choose to buy the impact certificates, becoming the “spiritual owner” of the project (ie endorsing it as something good that ought to have been funded, and getting some sort of social credit for funding it). This money goes to the investors, who hopefully profit off of their investment, vindicating their decision to buy the certificates in the first place.
The motivating idea is that a grantmaker might dismiss a charity’s plan as impossible, but an investor might believe they could succeed. The investor can fund the plan, then collect from the grantmaker if they turn out to be right. Everyone benefits: the charity gets funded, the investor makes a profit, and the grantmaker gets more of whatever kind of change they want (since a successful project is able to happen).
We ran a test grant of impact markets earlier this year, along with partner Manifund. You can see the announcement here, the results here, and Manifund’s continuing impact market here.
How will this year’s ACX Grants (optionally!!!) use impact markets?
Most of ACX Grants will happen through the traditional grantmaking structure. But if I don’t fund your grant, you have the option of letting us auto-convert it into an impact certificate and place it on Manifund’s impact market. Then investors might fund your grant. From your perspective, this will just look like you getting the money you wanted, plus an investor who might give you some help and advice if you want. You won’t have to handle any of the impact market details, worry about your “stock price”, or anything like that. (if you do want to do those things, you can work with Manifund to create a bespoke impact certificate contract - but you don’t have to)
I’m hesitant to fund AI safety grants and effective altruism community-building grants myself, both because of difficulty judging these things and because of potential conflicts of interest, so these are more likely to end up on the impact market than other things. This is an exception to the “you don’t have to use impact markets if you don’t want to” rule, sorry.
(there’s some concern that impact markets have skewed incentives for projects that have a risk of doing severe harm, and I understand AI safety and EA community building are especially dangerous here and we’ve avoided them in the past. We’ll be pre-screening projects before allowing them on the impact market, and eliminating ones in that category. Final oracular funders will also be encouraged not to fund projects that they think ex ante could have caused harm.)
We have four potential oracular funders who have expressed interest in impact certificates:
  • Next year’s ACX Grants
Long Term Future Fund and Survival and Flourishing Fund focus on the long-term future, including but not limited to AI safety, forecasting, and long-termist community building. EA Infrastructure Fund focuses on EA community-building. You can find previous lists of grants funded by LTFF, EAIF, SFF, and ACXG.
Final oracular funders will operate on a model where they treat retrospective awards the same as prospective awards, multiplied by a probability of success. For example, suppose LTFF would give a $20,000 grant to a proposal for an AI safety conference, which they think has a 50% chance of going well. Instead, an investor buys the impact certificate for that proposal, waits until it goes well, and then sells it back to LTFF. They will pay $40,000 for the certificate, since it’s twice as valuable as it was back when it was just a proposal with a 50% success chance.
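The pricing rule in the LTFF example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the passage: the function names are hypothetical, and the assumption that a failed project's certificate sells for nothing is ours, not something the source states.

```python
def retrospective_price(prospective_grant: float, success_probability: float) -> float:
    """Price a final oracular funder pays for a certificate once the project has succeeded.

    Follows the rule in the passage: prospective award = retrospective award * P(success),
    so the retrospective award is the prospective award divided by P(success).
    """
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    return prospective_grant / success_probability


def investor_expected_profit(purchase_price: float,
                             prospective_grant: float,
                             success_probability: float) -> float:
    """Expected profit for an investor who funds the project up front.

    Assumption (not in the source): if the project fails, the certificate is worth zero.
    """
    payout_if_success = retrospective_price(prospective_grant, success_probability)
    return success_probability * payout_if_success - purchase_price


if __name__ == "__main__":
    # The passage's example: a $20,000 prospective grant with a 50% chance of success.
    print(retrospective_price(20_000, 0.5))               # 40000.0
    print(investor_expected_profit(20_000, 20_000, 0.5))  # 0.0
```

Under these assumptions, an investor who pays the full prospective amount only breaks even in expectation; they profit when they think success is more likely than the funder does, which matches the motivating idea described earlier in the passage.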
August 19, 2024 · Original source
1: I’m doing some AI safety grantmaking and am curious how other people value different parts of the ecosystem. If you have experience/familiarity with AI grantmaking, AI alignment, or AI policy, can you take this quick (~15 minute) survey?
February 20, 2025 · Original source
St. Madeline Medianus had many interesting opinions on AI safety, but nobody listened because she was ugly, shy, and a bad speaker. She prayed for help, and one night Gwern appeared to her in a dream and told her a personalized supplement stack that would make her beautiful and charismatic. She took the supplements, got invited to all the cool parties, and her theories became the talk of the town. But she realized that her beauty and charisma were making people take her too seriously compared to others, so she lowered the dose until she was exactly average-looking and people would update on her opinions exactly the right amount.
June 18, 2025 · Original source
Then GPT-4 came out and shook up our AI timelines, and we hard-pivoted to AI safety and interpretability research. We rebranded as Confirm Labs, and did work on adversarial attacks and interpretability including here, here, here, and here. Then Ben and I worked at Anthropic on the transformer circuits paper. As of a few weeks ago, I have returned to open research
ACX Grants (almost) always approves of pivoting to AI safety research, but I still wonder what might have been with the original project. Michael says that “The types of software projects in clinical trials that we were initially intending to do seem on track to fall to AI by 2030. We DID succeed at deriving new math techniques, and AI does not yet have a clear path to solving that kind of creative research-level math.”
[Anonymous] got their professorship and is now leading AI safety research at a good university.
August 14, 2025 · Original source
Anti-amyloid drugs (like Aduhelm) don't reverse the disease, and only slow progression a relatively small amount. Opponents call the amyloid hypothesis zombie science, propped up only by pharmaceutical companies hoping to sell off a few more anti-amyloid me-too drugs before it collapses. Meanwhile, mainstream scientists . . . continue to believe it without really offering any public defense. Scott was so surprised by the size of the gap between official and unofficial opinion that he asked if someone from the orthodox camp would speak out in its favor.
I am David Schneider-Joseph, an engineer formerly with SpaceX and Google, now working in AI safety. Alzheimer’s isn’t my field, but I got very interested in it, spent six months studying the literature, and came away believing the amyloid hypothesis was basically completely solid. I thought I’d share that understanding with current skeptics.
The ATN model
The most plausible variant of the amyloid hypothesis is the A → T → N model: amyloid causes tau causes neurodegeneration.
1: Amyloid
The common entrypoint, typically at least 15 years before clinically detectable symptoms [1], is accumulation of amyloid-β deposits (especially Aβ42, one of several variants). Amyloid-β is a peptide produced in healthy human beings and many other animals, probably for antimicrobial purposes [2, 3].
Factors which cause overproduction of amyloid also cause Alzheimer’s. Factors that cause decreased clearance of amyloid also cause Alzheimer’s. The clearest relationship is various genes which massively increase amyloid production (while doing nothing else); these genes are Alzheimer’s risk factors, with some of the rarer and more severe ones causing extreme versions of the disease that manifest at otherwise almost-never-seen ages.
One of the clearest examples is Down syndrome, which is caused by three (rather than the usual two) copies of chromosome 21. People with Down syndrome are at much higher risk of Alzheimer’s than the general population: two-thirds will have the condition by age sixty, and 15% have it by age forty. APP, the gene for the amyloid precursor protein, is on chromosome 21. This means that people with Down syndrome will have an extra copy. This extra copy has been observed to lead to higher-than-normal amyloid levels.
But there are many genes on chromosome 21; do we have additional evidence that it’s the amyloid one that’s involved? Yes. Dozens of other mutations on APP cause the same sort of extremely young and severe Alzheimer’s. So do mutations on PSEN1 and 2, the genes for the enzyme that processes amyloid precursor protein into amyloid. So do mutations on several other amyloid-related genes. [6, 91 - 96]
Researchers call these autosomal-dominant Alzheimer’s, meaning Alzheimer’s cases that get inherited from a single parent in a simple fashion typical of single-gene disorders. They make up about 1% of all cases, and are our strongest evidence for the causal role of amyloid in the disorder. To my knowledge, there is no serious claim that these genes could be working through any pathway other than their shared role in the amyloid system.
But these autosomal-dominant cases only make up about 1% of all Alzheimer’s patients. Might they be a different disease than the usual sporadic Alzheimer’s that strikes people without strong family histories at normal ages? Probably not: the presentation and trajectory of autosomal-dominant and sporadic Alzheimer’s cases are strikingly similar. Both show an initial appearance of amyloid pathology starting in intrinsic connectivity networks in both autosomal-dominant [14] and sporadic [15–18] types, cortical tau appearing first in the medial temporal lobe and with the exact same fold in both disease types [97] (despite human tauopathies having at least seven other possible characteristic folds [36]), that tau pathology worsening and spreading outside this region only once amyloid pathology reaches sufficient severity [65], neurodegeneration progressing closely in step with the tau pathology, and the same usual approximate trajectory of cognitive symptoms due to the sequence of affected regions.
So it’s as if two bank robberies occurred hours apart, in the same town, and in a highly similar and idiosyncratic manner, and we can positively identify the culprit of one on security camera footage. It’s a good bet the culprit of the other is the same.
Increased amyloid production → Alzheimer’s is an especially clear and simple pathway, but any other change in amyloid can also cause the disease. For example:
  • Overproduction or reduced clearance of amyloid due to impaired slow wave sleep. Aβ production is neuronal activity-dependent, and toxins (perhaps including Aβ) are cleared from the brain during sleep via the glymphatic system. Thus Aβ can accumulate if the brain is more active and/or has less opportunity for clearance. [7, 8, 9, 10, 11]
August 26, 2025 · Original source
I think now there might be several dozen subreddit moderators who could accurately describe their job as “witch webmaster who runs an online service giving advice to new witches”.
And partly it was because there are so many crazy beliefs in the world - spirits, crystal healing, moon landing denial, esoteric Hitlerism, whichever religions you don’t believe in - that psychiatrists have instituted a blanket exemption for any widely held idea. If you think you’re being attacked by demons, you’re delusional, unless you’re from some culture where lots of people get attacked by demons, in which case it’s a religion and you’re fine. This is partly political self-protection - no psychiatrist wants to be the guy who commits an Afro-Caribbean person for believing in voodoo. But it also seems to track something useful about reality. Nietzsche wrote “Madness is something rare in individuals - but in groups, parties, peoples, and ages, it is the rule.”
Most people don’t have world-models - they believe what their friends believe, or what has good epistemic vibes. In a large group, weird ideas can ricochet from person to person and get established even in healthy brains. In an Afro-Caribbean culture where all your friends get attacked by demons at voodoo church every Sunday, a belief in demon attacks can co-exist with otherwise being a totally functional individual.
So is QAnon a religion? Awkward question, but it’s non-psychotic by definition. Still, it’s interesting, isn’t it? If social media makes a thousand people believe the same crazy thing, it’s not psychotic. If LLMs make a thousand people each believe a different crazy thing, that is psychotic. Is this a meaningful difference, or an accounting convention? Also, what if a thousand people believe something, but it’s you and your 999 ChatGPT instances?
III. A Hidden Army Of Crackpots
I have a family member who believes that the theory of evolution, as usually understood, cannot possibly work. He has developed an alternative theory called “noctogenesis” which patches Darwinism using ideas from the transactional interpretation of quantum mechanics, and he works on-and-off on various related books and papers. I have told him I suspect he might be a crackpot; he stands by his claims. It’s fine; when I got into the technological singularity and AI safety, lots of people suspected I was a crackpot, and I stood by my claims too. You’ve got to stand by your family members even when they’re slightly crackpottish.
This family member is happily married, retired after running a successful business, and generally a normal likeable person. He has no signs of mental illness, and doesn’t talk about quantum evolution unless someone else brings it up first.
There must be millions of people like him. Used car dealers with proofs of P = NP, dentists who think they’ve discovered something important about Mary Magdalene, math professors obsessed with destroying the moon. I’m working on evaluating ACX Grants, and these people are out in force. A few propose literal perpetual motion machines. Others have vaguer plans, like some kind of social media app (it’s always a social media app) that will cause world peace. Many of them have decent jobs and seem like upstanding members of society. Their secrets are known only to themselves, their family members, and their would-be grantmaker.
…and, increasingly, their chatbots. After years of hiatus (or at least not talking to me about his work) my family member is back on the quantum evolution beat, and LLMs appear to be involved. If I knew him less well, I would think the LLM had caused the quantum evolution theory - but no, it just made it much easier to research and write about.
Is this psychosis? The answer has to be no, but it’s once again hard to draw the line. A very small number of crackpots will be vindicated by history. A larger number will be erroneous but sympathetic - the official account of the Kennedy assassination is pretty weird, and reasonable minds can disagree. From there, we get to ones that are maybe not so sympathetic: flat earth, QAnon, the thing where the Queen was an alien lizard. If only one person thought the Queen was an alien lizard, and they never managed to convince anyone else, would that be sufficient evidence for a delusional disorder? I’m not sure.
(psychiatry has a diagnosis, schizotypal personality, which sort of involves being a normal person with a few odd ideas, but it’s not a great match for many of these people, and interesting mainly as a genetic curiosity - it travels in the same families as schizophrenia itself)
Maybe this is another place where we are forced to admit a spectrum model of psychiatric disorders - there is an unbroken continuum from mildly sad to suicidally depressed, from social drinking to raging alcoholism, and from eccentric to floridly psychotic. People who are eccentric can remain so their whole lives, with the level of expression depending on their social connections and the ease of pursuing their rabbit holes. LLMs, by making it easier to pursue odd theories and serving as a surrogate social connection who always agrees with you, can bring latent crackpottery into the open.
IV. Cause And Effect
Bipolar disorder has an interesting relationship with sleep. Most manic people sleep very little, or not at all - maybe an hour or two a night. But also, poor sleep can cause bipolar episodes in people prone to them. In a typical case, a bipolar who’s been well-controlled for years will get assigned a big report at work and get poor sleep for a few nights until they finish. At first, this will be just as bad as it sounds, and they’ll be working through a fog of tiredness. Then the tiredness will lift. They’ll feel normal, then better-than-normal, until finally they can’t sleep even if they want to. Then they’ll email the report to their boss and it will be written entirely in Assyrian cuneiform.
I increasingly think this isn’t just an incidental feature of bipolar, but part of the reason it exists as a diagnostic category at all. Most people have a compensatory reaction to insomnia - missing one night of sleep makes you more tired the next. A small number of people have the reverse, a spiralling reaction where missing one night of sleep makes you less tired the next. Solve for the equilibrium and you reach a stable attractor point where you never sleep at all. But this does other bad things to your brain - hence the cuneiform.
I’m not claiming that bipolar is “just” sleep loss. As Borsboom et al will tell you, psychiatric disorders can be viewed as complex networks of symptoms, each reinforcing the others. In a few pure cases, you can get a ratchet going with sleep alone, and the sleeplessness will spark everything else. More likely, there will be lots of interactions between poor sleep and everything else, and the “everything else” can sink or hypercharge an impending manic episode. Still, I find this a fruitful way to think about bipolar. Sleeplessness is both the cause and the effect.
Can delusions also be like this? That is, suppose there’s some personality trait where having one delusion makes you even more delusional. Maybe the delusion makes you excited (who wouldn’t be excited to learn they’re the Messiah?), and you’re more delusional when you’re in an excited state and not thinking clearly. Or maybe it’s a three-symptom cycle - the delusion causes excitement, which makes you unable to sleep, which scrambles your thinking, which makes you more delusional (which makes you even less able to sleep, etc). The point is: delusions are certainly an effect of bipolar disorder. And in the dynamical system model of psychiatric disorders, we should expect that effects are often also causes; that’s how the vicious cycle gets going.
This is the best I can do at modeling true LLM psychosis. Someone with a trait where delusions lead inevitably to more delusions starts using an LLM. The LLM accentuates whatever usual tendency towards crackpottery they have and makes them believe something a little crazier than whatever they believed before. Then that crazy belief feeds upon itself and causes other things like excitement and sleep loss, which (if the person is predisposed) precipitates a true psychotic episode.
V. Folie A Deux Ex Machina
If one person believes a crazy thing, it’s a delusion; if a thousand people believe it, it’s a religion. What if exactly two people believe it?
In psychiatry, this is called folie a deux. It fits awkwardly into our nosology and is rarely seen. Still, it happens enough to generate a few case studies. In a typical case, one person has psychosis for some normal reason, like schizophrenia or bipolar, and the second person is a shut-in who lives with them and rarely talks to anyone else. The psychotic person gets some normal psychotic delusion - they’re God, the Feds are after them, etc - and sort of psychically steamrolls over the second person until they believe it too. Usually removing the second person from the first is sufficient for a cure.
This slightly challenges the view of psychosis as a biological disorder - but only slightly. Again, think of most people as lacking world-models, but being moored to reality by some vague sense of social consensus. If your social life is limited to one person, and that person themselves becomes unmoored, then sometimes you will follow along. I would expect second-sufferers to believe delusions in a sort of cognitively normal way, the same way people believe true facts, honest mistakes, and conspiracy theories. I would expect them to be less likely (though not zero likely) to have other psychotic features like sleep disturbances, hallucinations, disorganized speech, or a tendency to autonomously generate delusional ideas aside from the one they absorbed from the index case.
An introverted person using an LLM has some similarities to folie a deux. If they use the chatbot very often, it might be a large majority of their social interactions. Here the primary vs. secondary distinction breaks down - the most likely scenario is that the human first suggested the crazy idea, the machine reflected it back slightly stronger, and it kept ricocheting back and forth, gaining confidence with each iteration, until both were totally convinced. Compare this to normal social interactions, where if someone expresses a crazy idea that isn’t common in their culture, other people will shoot them down or at the very least nod politely and stop the conversation.
So my working theory of LLM psychosis is:
  • Some patients were already psychotic, and LLMs just help them be psychotic more effectively.
November 26, 2025 · Original source
If we worry too much about AI safety, will this make us “lose the race with China”?
(here “AI safety” means long-term concerns about alignment and hostile superintelligence, as opposed to “AI ethics” concerns like bias or intellectual property.)
Leverage their applications advantage as hard as possible. They imagine that sure, maybe America will have AI that’s 1-2 years more advanced than theirs. But if our smarter AI is still just sitting in a data center answering user queries - and their dumber AI is already integrated with tens of thousands of humanoid robots, automated drones, missile targeting systems, etc - then they still win. This is a very practical strategy from a very practical country.
The Chinese don’t really believe in recursive self-improvement or superintelligence. If they did, they wouldn’t be so blasé about the possibility of America having AIs 1-2 years more advanced than theirs - if our models pass the superintelligence threshold while theirs are still approaching it, then their advantage in humanoids and drones no longer seems so impressive.
What is the optimal counter-strategy for America? We’re still debating specifics, but a skeletal, obvious-things-only version might be to preserve our compute advantage as long as possible, protect our technological secrets from Chinese espionage, and put up as much of a fight as possible on the application layer.
The State Of AI Safety Policy
It’s worth being specific about what we mean by “AI safety regulation”. The two most discussed AI safety bills of the past year - California’s SB53 and New York’s RAISE Act - as well as Dean Ball’s proposed federal AI safety preemption bill - all focus on a few key topics:
  • The biggest companies (eg OpenAI, Anthropic, Google) must disclose their model spec, ie the internal document saying what their models are vs. aren’t banned from doing.