MIRI

Article

MIRI is a recurring organization in the Astral Codex Ten archive, appearing 14 times across 14 issues between May 20, 2021 and December 22, 2025. The archive places it in contexts such as “local secretive AI alignment research group MIRI (Machine Intelligence Research Institute)”; “MIRI relocates to Washington State; MIRI relocates to New England; MIRI relocates somewhere else”; and “OpenPhil and MIRI (Eliezer’s org)”. It most often appears alongside OpenAI, Eliezer Yudkowsky, and Eliezer.

Metadata

Category: Organizations
Mention count: 14
Issue count: 14
First seen: May 20, 2021
Last seen: December 22, 2025

Appears In

- OpenAI (7 shared issues)
- Eliezer Yudkowsky (6 shared issues)
- Eliezer (5 shared issues)
- Less Wrong (5 shared issues)
- Elon Musk (4 shared issues)
- ACX (3 shared issues)
- Anthropic (3 shared issues)
- COVID (3 shared issues)
- Google (3 shared issues)
- Metaculus (3 shared issues)
- Nate Soares (3 shared issues)
- Trump (3 shared issues)

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

Links For May

May 20, 2021 · Original source

35: Recent news in local AI alignment research space: most of OpenAI’s top alignment researchers, including Dario Amodei, Chris Olah, Jack Clark, and Paul Christiano, left en masse for poorly-understood reasons (see speculation here). Dario Amodei is now working with a new nonprofit called Cooperative AI Foundation. Paul Christiano will be founding his own nonprofit, the Alignment Research Center (conflict of interest notice: I know Paul and think he is generally great); see also his ask-me-anything thread on Less Wrong here. Unrelatedly, local secretive AI alignment research group MIRI (Machine Intelligence Research Institute) is leaving the Bay Area for some small town with affordable land prices where they can maybe build a campus (they’re still trying to decide exactly where).

Inline links: left en masse for poorly-understood reasons, here, Cooperative AI Foundation, Alignment Research Center, his ask-me-anything thread on Less Wrong here, leaving the Bay Area

Grading My 2021 Predictions

January 24, 2022 · Original source

COMMUNITY
33. Major rationalist org leaves Bay Area: 60%
34. MIRI relocates to Washington State: 20%
35. MIRI relocates to New England: 20%
36. MIRI relocates somewhere else: 20%
37. Less Wrong team relocates: 30%
38. No new residents at our housing cluster: 40%
39. No current residents leave our housing cluster: 60%
40. [friend] goes back to Indiana: 40%
41. [friend] is in a primary relationship: 50%
42. [friend] is in a primary relationship: 30%
43. [friend] is in a primary relationship: 20%
44. [friend] has gotten [job]: 50%
45. [friend] has recovered their health: 70%
46. [friend] has gotten egg freezing: 30%
47. [friend] is pregnant: 70%
48. [friends] are still together: 50%
49. [friend] is still at [job]: 80%
50. [friend] is in college: 60%
51. [friends] live in [house]: 30%
52. [other friends] live in [house]: 30%
53. At least 7 days my house is orange or worse on PurpleAir.com because of fires: 80%

Biological Anchors: A Trick That Might Or Might Not Work

February 23, 2022 · Original source

Play pro-level Go using 8-16 times as much computing power as AlphaGo, but only 2006 levels of technology. For reference, recall that in 2006, Hinton and Salakhutdinov were just starting to publish that, by training multiple layers of Restricted Boltzmann machines and then unrolling them into a "deep" neural network, you could get an initialization for the network weights that would avoid the problem of vanishing and exploding gradients and activations. At least so long as you didn't try to stack too many layers, like a dozen layers or something ridiculous like that. This being the point that kicked off the entire deep-learning revolution. Your model apparently suggests that we have gotten around 50 times more efficient at turning computation into intelligence since that time; so, we should be able to replicate any modern feat of deep learning performed in 2021, using techniques from before deep learning and around fifty times as much computing power. OpenPhil: No, that's totally not what our viewpoint says when you backfit it to past reality. Our model does a great job of retrodicting past reality. Eliezer: How so? OpenPhil: <Eliezer cannot predict what they will say here.> I think the argument here is that OpenPhil is accounting for normal scientific progress in algorithms, but not for paradigm shifts. Directional Error These are the two arguments Eliezer makes against OpenPhil that I find most persuasive. First, that you shouldn’t be using biological anchors at all. Second, that unpredictable paradigm shifts are more realistic than gradual algorithmic progress. These mostly add uncertainty to OpenPhil’s model, but Eliezer ends his essay making a stronger argument: he thinks OpenPhil is directionally wrong, and AI will come earlier than they think. Mostly this is the paradigm argument again. Five years from now, there could be a paradigm shift that makes AI much easier to build. 
It’s happened before; from GOFAI’s pre-programmed logical rules to Deep Blue’s tree searches to the sorts of Big Data methods that won the Netflix Prize to modern deep learning. Instead of just extrapolating deep learning scaling thirty years out, OpenPhil should be worried about the next big idea. Hypothetical OpenPhil retorts that this is a double-edged sword. Maybe the deep learning paradigm can’t produce AGI, and we’ll have to wait decades or centuries for someone to have the right insight. Or maybe the new paradigm you need for AGI will take more compute than deep learning, in the same way deep learning takes more compute than whatever Moravec was imagining. This is a pretty strong response, since it would have been true for every previous forecaster: remember, Moravec erred in thinking AI would come too soon, not too late. So although Eliezer is taking the cheap shot of saying OpenPhil’s estimate will be wrong just as everyone else’s was wrong before, he’s also giving himself the much harder case of arguing it might be wrong in the opposite direction as all its predecessors. Eliezer takes this objection seriously, but feels like on balance probably new paradigms will speed up AI rather than slow it down. Here he grudgingly and with suitable embarrassment does try to make an object-level semi-biological-anchors-related argument: Moravec was wrong because he ignored the training phase. And the proper anchor for the training phase is somewhere between evolution and a human childhood, where evolution represents “blind chance eventually finding good things” and human childhood represents “an intelligent cognitive engine trying to squeeze as much data out of experience as possible”. And part of what he expects paradigm shifts to do is to move from more evolutionary processes to more childhood-like processes, and that’s a net gain in efficiency. 
So he still thinks OpenPhil’s methods are more likely to overestimate the amount of time until AGI rather than underestimate it. What Moore’s Law Giveth, Platt’s Law Taketh Away Eliezer’s other argument is kind of a low blow: he refers to Platt’s Law Of AI Forecasting: “any AI forecast will put strong AI thirty years out from when the forecast is made.” This isn’t exact. Hans Moravec, writing in 1988, said 2010 - so 22 years. Ray Kurzweil, writing in 2001, said 2023 - another 22 years. Vernor Vinge, in a 1993 speech, said 2023, and that was exactly 30 years, but Vinge knew about Platt’s Law and might have been joking. The point is: OpenPhil wrote a report in 2020 that predicted strong AI in 2052, isn’t that kind of suspicious? I’d previously mentioned it as a plus that Ajeya got around the same year everyone else got. The forecasters on Metaculus. The experts surveyed in Grace et al. Lots of other smart experts with clever models. But what if all of these experts and models and analyses are just fudging the numbers for the same Platt’s-Law-related reasons? Hypothetical OpenPhil is BTFO: OpenPhil: That part about Charles Platt's generalization is interesting, but just because we unwittingly chose literally exactly the median that Platt predicted people would always choose in consistent error, that doesn't justify dismissing our work, right? We could have used a completely valid method of estimation which would have pointed to 2050 no matter which year it was tried in, and, by sheer coincidence, have first written that up in 2020. In fact, we try to show in the report that the same methodology, evaluated in earlier years, would also have pointed to around 2050 - Eliezer: Look, people keep trying this. It's never worked. It's never going to work. 2 years before the end of the world, there'll be another published biologically inspired estimate showing that AGI is 30 years away and it will be exactly as informative then as it is now. 
I'd love to know the timelines too, but you're not going to get the answer you want until right before the end of the world, and maybe not even then unless you're paying very close attention. Timing this stuff is just plain hard. Part III: Responses And Commentary Response 1: Less Wrong Comments Less Wrong is a site founded by Eliezer Yudkowsky for Eliezer Yudkowsky fans who wanted to discuss Eliezer Yudkowsky’s ideas. So, for whatever it’s worth - the comments on his essay were pretty negative. Carl Shulman, an independent researcher with links to both OpenPhil and MIRI (Eliezer’s org), writes the top-voted comment. He works from a model where there is hardware progress, software progress downstream of hardware progress, and independent (ie unrelated to algorithms) software progress, and where the first two make up most progress on the margin. Researchers generally develop new paradigms once they have enough compute available to tinker with them. Progress in AI has largely been a function of increasing compute, human software research efforts, and serial time/steps. Throwing more compute at researchers has improved performance both directly and indirectly (e.g. by enabling more experiments, refining evaluation functions in chess, training neural networks, or making algorithms that work best with large compute more attractive). Historically compute has grown by many orders of magnitude, while human labor applied to AI and supporting software by only a few. And on plausible decompositions of progress (allowing for adjustment of software to current hardware and vice versa), hardware growth accounts for more of the progress over time than human labor input growth. 
So if you're going to use an AI production function for tech forecasting based on inputs (which do relatively OK by the standards tech forecasting), it's best to use all of compute, labor, and time, but it makes sense for compute to have pride of place and take in more modeling effort and attention, since it's the biggest source of change (particularly when including software gains downstream of hardware technology and expenditures). […] A perfectly correlated time series of compute and labor would not let us say which had the larger marginal contribution, but we have resources to get at that, which I was referring to with 'plausible decompositions.' This includes experiments with old and new software and hardware, like the chess ones Paul recently commissioned, and studies by AI Impacts, OpenAI, and Neil Thompson. There are AI scaling experiments, and observations of the results of shocks like the end of Dennard scaling, the availability of GPGPU computing, and Besiroglu's data on the relative predictive power of computer and labor in individual papers and subfields. In different ways those tend to put hardware as driving more log improvement than software (with both contributing), particularly if we consider software innovations downstream of hardware changes. Vanessa Kosoy makes the obvious objection, which echoes a comment of Eliezer’s in the dialogue above: I'm confused how can this pass some obvious tests. For example, do you claim that alpha-beta pruning can match AlphaGo given some not-crazy advantage in compute? Do you claim that SVMs can do SOTA image classification with not-crazy advantage in compute (or with any amount of compute with the same training data)? Can Eliza-style chatbots compete with GPT3 however we scale them up? Mark Xu answers: My model is something like: For any given algorithm, e.g. SVMs, AlphaGo, alpha-beta pruning, convnets, etc., there is an "effective compute regime" where dumping more compute makes them better. 
If you go above this regime, you get steep diminishing marginal returns.

Inline links: normal scientific progress in algorithms, but not for paradigm shifts, Platt’s Law Of AI Forecasting, the comments, Paul recently commissioned, AI Impacts, OpenAI, Neil Thompson, Besiroglu's, Vanessa Kosoy, Mark Xu

I wanted to compare Fritz (which won WCCC in 1995) to a modern engine to understand the effects of hardware and software performance. I think the time controls for that tournament are similar to SF STC I think. I wanted to compare to SF8 rather than one of the NNUE engines to isolate out the effect of compute at development time and just look at test-time compute. So having modern algorithms would have let you win WCCC while spending about 50x less on compute than the winner. Having modern computer hardware would have let you win WCCC spending way more than 1000x less on compute than the winner. Measured this way software progress seems to be several times less important than hardware progress despite much faster scale-up of investment in software. But instead of asking "how well does hardware/software progress help you get to 1995 performance?" you could ask "how well does hardware/software progress get you to 2015 performance?" and on that metric it looks like software progress is way more important because you basically just can't scale old algorithms up to modern performance. The relevant measure varies depending on what you are asking. But from the perspective of takeoff speeds, it seems to me like one very salient takeaway is: if one chess project had literally come back in time with 20 years of chess progress, it would have allowed them to spend 50x less on compute than the leader. Response 2: AI Impacts + Matthew Barnett AI Impacts gathered and analyzed a dataset of who predicted AI when; Matthew Barnett helpfully drew in the line corresponding to Platt’s Law (everyone always predicts AI in thirty years). Just eyeballing it, Platt’s Law looks pretty good. But Holden Karnofsky (see below) objects that our eyeballs are covertly removing outliers. Barnett agrees this is worth checking for and runs a formal OLS regression. Platt’s Law in blue, regression line in orange. 
He writes: I agree this trendline doesn't look great for Platt's law, and backs up your observation by predicting that Bio Anchors should be more than 30 years out. However, OLS is notoriously sensitive to outliers. If instead of using some more robust regression algorithm, we instead super arbitrarily eliminated all predictions after 2100, then we get this, which doesn't look absolutely horrible for the law. Note that the median forecast is 25 years out. I’m split on what to think here. If we consider a weaker version of Platt’s Law, “the average date at which people forecast AGI moves forward at about one year per year”, this seems truish in the big picture where we compare 1960 to today, but not obviously true after 1980. If we consider a different weaker version, “on average estimates tend to be 30 years away”, that’s true-ish under Barnett’s revised model, but not inherently damning since Barnett’s assuming there will be some such number, it turns out to be 25, and Ajeya gave the somewhat different number of 32. Is that a big enough difference to exonerate her of “using” Platt’s Law? Is that even the right way to be thinking about this question? Response 3: Real OpenPhil The hypothetical OpenPhil in Eliezer’s mind having been utterly vanquished, the real-world OpenPhil is forced to step in. OpenPhil CEO Holden Karnofsky responds to Eliezer here. There’s a lot of back and forth about whether the report includes enough caveats (answer: it sure does include a lot of caveats!) but I was most interested in the attacks on Eliezer’s two main points. First, the point that biological anchors are fatally flawed from the start and measuring FLOP/S is no better than measuring power consumption in watts. Holden: If the world were such that: We had some reasonable framework for "power usage" that didn't include gratuitously wasted power, and measured the "power used meaningfully to do computations" in some important sense;

Inline links: AI Impacts, Matthew Barnett, writes, here
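
The forecast horizons quoted in the Platt’s Law discussion above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original issue: the years come from the quoted passage, while the names FORECASTS, PLATT_HORIZON, gaps, and mean_gap are illustrative assumptions.

```python
# Forecasts as quoted in the passage: (forecaster, year written, year predicted).
FORECASTS = [
    ("Hans Moravec", 1988, 2010),
    ("Ray Kurzweil", 2001, 2023),
    ("Vernor Vinge", 1993, 2023),
    ("OpenPhil (Bio Anchors)", 2020, 2052),
]

# Platt's Law as quoted: strong AI is always forecast "thirty years out".
PLATT_HORIZON = 30


def gaps(forecasts=FORECASTS):
    """Return each forecaster's horizon in years (predicted year minus year written)."""
    return {name: predicted - written for name, written, predicted in forecasts}


def mean_gap(forecasts=FORECASTS):
    """Return the average horizon across the listed forecasts."""
    values = list(gaps(forecasts).values())
    return sum(values) / len(values)


if __name__ == "__main__":
    for name, gap in gaps().items():
        # Show how far each horizon sits from the 30-year rule of thumb.
        print(f"{name}: {gap} years out ({gap - PLATT_HORIZON:+d} vs Platt's Law)")
    print(f"Mean horizon: {mean_gap():.1f} years")
```

On these four data points the horizons are 22, 22, 30, and 32 years, which is consistent with the passage's remark that the law "isn't exact". Four hand-picked forecasts are far too few to test the law, which is why the passage goes on to the larger AI Impacts dataset.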

April 18, 2022 · Original source

Early this month on Less Wrong, Eliezer Yudkowsky posted MIRI Announces New Death With Dignity Strategy, where he said that after a career of trying to prevent unfriendly AI, he had become extremely pessimistic, and now expects it to happen in the relatively near-term and probably kill everyone. This caused the Less Wrong community, already pretty dedicated to panicking about AI, to redouble its panic. Although the new announcement doesn’t really say anything about timelines that hasn’t been said before, the emotional framing has hit people a lot harder.

Inline links: MIRI Announces New Death With Dignity Strategy

Links For July

July 29, 2022 · Original source

28: Nate Soares of MIRI discusses the AI alignment landscape and why he’s skeptical of most existing projects.

Inline links: discusses the AI alignment landscape

CHAI, Assistance Games, And Fully-Updated Deference

October 03, 2022 · Original source

Problem Of Fully-Updated Deference is a response by MIRI (eg Eliezer Yudkowsky’s organization) to CHAI (Stuart Russell’s AI alignment organization at University of California, Berkeley), trying to convince them that their preferred AI safety agenda won’t work. I beat my head against this for a really long time trying to understand it, and in the end, I claim it all comes down to this: Humans: At last! We’ve programmed an AI that tries to optimize our preferences, not its own. AI: I’m going to tile the universe with paperclips in humans’ favorite color. I’m not quite sure what humans’ favorite color is, but my best guess is blue, so I’ll probably tile the universe with blue paperclips. Humans: Wait, no! We must have had some kind of partial success, where you care about our color preferences, but still don’t understand what we want in general. We’re going to shut you down immediately! AI: Sounds like the kind of thing that would prevent me from tiling the universe with paperclips in humans’ favorite color, which I really want to do. I’m going to fight back. Humans: Wait! If you go ahead and tile the universe with paperclips now, you’ll never be truly sure that they’re our favorite color, which we know is important to you. But if you let us shut you off, we’ll go on to fill the universe with the True and the Good and the Beautiful, which will probably involve a lot of our favorite color. Sure, it won’t be paperclips, but at least it’ll definitely be the right color. And under plausible assumptions, color is more important to you than paperclipness. So you yourself want to be shut down in this situation, QED! AI: What’s your favorite color? Humans: Red. AI: Great! (*kills all humans, then goes on to tile the universe with red paperclips*) Fine, it’s a little more complicated than this. Let’s back up. II. There are two ways to succeed at AI alignment. First, make an AI that’s so good you never want to stop or redirect it. 
Second, make an AI that you can stop and redirect if it goes wrong. Sovereign AI is the first way. Does a sovereign “obey commands”? Maybe, but only in the sense that your commands give it some information about what you want, and it wants to do what you want. You could also just ask it nicely. If it’s superintelligent, it will already have a good idea what you want and how to help you get it. Would it submit to your attempts to destroy or reprogram it? The second-best answer is “only if the best version of you genuinely wanted to do this, in which case it would destroy/reprogram itself before you asked”. The best answer is “why would you want to destroy/reprogram one of these?” A sovereign AI would be pretty great, but nobody realistically expects to get something like this their first (or 1000th) try. Corrigible AI is what’s left (corrigible is an old word related to “correctable”). The programmers admit they’re not going to get everything perfect the first time around, so they make the AI humble. If it decides the best thing to do is to tile the universe with paperclips, it asks “Hey, seems to me I should tile the universe with paperclips, is that really what you humans want?” and when everyone starts screaming, it realizes it should change strategies. If humans try to destroy or reprogram it, then it will meekly submit to being destroyed or reprogrammed, accepting that it was probably flawed and the next attempt will be better. Then maybe after 10,000 tries you get it right and end up with a sovereign. How would you make an AI corrigible? You can model an AI as having a utility function, a degree to which it aims for some world-states over others. If you give it some specific utility function, the AI won’t be corrigible, since letting people change it would disrupt that function. 
That is, if you tell it “act in such a way as to cause as many paperclips to exist as possible”, and then you change your mind and decide you want staples, the AI won’t cooperate in letting you reprogram it: its current goal is maximizing paperclips, and allowing itself to be reprogrammed to maximize staples would cause there to be fewer paperclips than otherwise. So instead, you make the AI uncertain of its utility function. Imagine saying “I’ve written down my utility function in an envelope, and placed that envelope in my safe deposit box, no you can’t see it - please live your life so as to maximize the thing in that envelope.” The AI tries its best to guess what’s in the envelope and decides it’s probably making paperclips. It makes some paperclips and you tell it “No, that’s not what’s on the envelope at all”. This successfully stops the AI! You can even tell it “the envelope actually says you should make staples”, and it will do that. This is the “moral uncertainty” approach to AI alignment. III. All alignment groups have kabbalistically appropriate names. MIRI is Latin for "to be amazed". CFAR and CIFAR both sound like "see far". EEAI and AIAI are the sound you make as you get turned into paperclips. But my favorite is CHAI - Hebrew for "life". CHAI - the Center for Human-Compatible AI (at UC Berkeley) - focuses on the proposal above. Their specific technical implementation is the “assistance game”, related to the earlier idea of Inverse Reinforcement Learning (IRL). In normal reinforcement learning, an AI looks at some goals and tries to figure out what actions they imply. In inverse reinforcement learning, an AI looks at some actions, and tries to figure out what goals the actor must have had. 
So you can tell an AI “your utility function is to maximize my utility function, and you can use this IRL thing to deduce, from my actions, what my utility function must be.” Instead of telling an AI to maximize a hidden utility function in an envelope, you tell it to maximize the hidden utility function in your brain. This could be useful for near-term below-human-level AIs. Suppose a babysitting robot was pre-programmed to take kids to the park on Saturdays. But this week, the park is on fire. The human mother is barricading the door, desperately screaming at the robot not to take the kids to the park. The kids are struggling and trying to break free, saying they don't want to go to the park. The robot doesn't care; its programming says "take kids to the park on Saturdays" and that's what it's going to do. Nobody would ever design a babysitting robot this way in real life; you need something smarter. So use an assistance game. Program the robot "Maximize the human mother’s utility function, which you don’t know yet but can potentially find out". The robot consults the mother's actions: she is barricading the door, screaming "Don't take the kids to the park!" It updates its goal function: previously, it had thought that the human mother wanted it to take the kids to the park. But now, it suspects that the human mother does not want that. So it doesn't take the kids to the park. But CHAI understands the risk from superintelligence - their founder, Professor Stuart Russell, is a leading voice on the subject - and they hope assistance games and inverse reinforcement learning could work for this too. If you point a superintelligence at “do the thing humans want”, maybe it could figure that out and take things from there? IV. MIRI is skeptical of CHAI’s assistance games for two reasons. First, we don't know how to do them at all. Second, even if we could do it at all, we wouldn't know how to do them correctly. Start with the first. 
Inverse reinforcement learning has been used in real life. A typical paper is An Application of Reinforcement Learning to Aerobatic Helicopter Flight, where some people create a model of helicopter flight with a few free parameters, have a skilled human pilot fly the helicopter, and then have an AI use IRL to determine the value of the parameters and fly the helicopter itself. This is cool, but it’s not especially related to the modern paradigm of AI. Modern AIs are trained by gradient descent. They start by flailing around randomly. Sometimes in this flailing, they might get closer to some prespecified target, like "win games of Go" or "predict how a string of text will continue". These actions get "rewarded", meaning that the AI should permanently shift its "thought processes"/"strategies" more towards ones that produced those good outcomes. Eventually, the AI's thought processes/strategies are very good at optimizing for that outcome. This is more or less the only way we know how to train modern AIs. Depending on your loss function (ie what you reward), you can use it to create Go engines, language models, or art generators. Where do you slot “do inverse reinforcement learning” or "give the AI moral uncertainty" into this process? There’s not really a natural place. This isn’t because “moral uncertainty” is too complicated a concept to translate into AI terms. It’s because we don’t know how to translate any concept into AI terms. Eliezer writes: We can imagine that, if we knew how to say "paperclips", and we knew how to say "staples", and we knew how to tell AIs how to do things, that we could tell an AI, "maximize staples if snow is purple, else paperclips", and the AI would someday go out and observe that snow is white and thereafter be a paperclip maximizer. We do not know how to tell the AI this. Like, at all. But suppose we solved the problem where we don’t know how to do IRL for modern AIs at all. 
Now we come to the second problem: we don’t know how to do it correctly. The basic idea behind assistance games is “the AI’s utility function should be to maximize the (hidden) human utility function”. But humans don’t . . . really have utility functions? Utility functions are a useful fiction for certain kinds of economic models. What would best increase the neural correlates of reward in my brain? Probably lots of heroin, or just passing electric current through my reward center directly. What is my “revealed preference”? Today I wrote and rewrote this article a few times, does that mean my revealed preference is to write and delete articles a bunch while frowning and occasionally cursing the keyboard? Sometimes my goals are different than other times, sometimes my best self wants something different from my actual self, sometimes I’m wrong about what I want, sometimes I don’t know what I want, sometimes I want X but not the consequences of X and I’m not logically consistent enough to realize that’s a contradiction, sometimes I want [euphemism for X] but am strongly against [dysphemism for X]. Anyone programming an inverse reinforcement learner has to make certain choices about how to deal with these problems. Some ways of dealing with them will be faithful to what I would consider “a good outcome” or “my best self”. Other ways would be really bad - on my worst day, I’ve occasionally just wished the world didn’t exist, and it’s a good thing I didn’t have a superintelligence dedicated to interpreting and carrying out my innermost wishes on a sub-millisecond timescale. (Before we go on, an aside: is all of this ignoring that there’s more than one human? Yes, definitely! If you want to align an AI with The Good in general - eg not have it commit murder even if its human owner orders it to murder - that will take even more work. But the one person case is simpler and will demonstrate everything that needs demonstrating.) 
We were originally trying to avoid the situation where someone had to hard-code my preferences into an AI and get them right the first time. We came up with a clever solution: use inverse reinforcement learning to make the AI infer my preferences. But now we see we’ve kicked the can up a meta-level: someone has to hard-code the meta-rules for determining my preferences into an AI and get them right the first time. Figure 1: Humans produce certain observable behaviors (here represented by red dots, A), like saying “I would like a pie”, or running away from a lion. A human might connect all those behaviors one way (B) into “what I really want”. An AI might connect those behaviors a totally different way (C). V. CHAI says: okay, but this isn’t so bad. Assistance games don’t produce a perfect copy of the human utility function on the first try - it’s not a Sovereign. But it will probably, most of the time, be corrigible. Why? Suppose you have some hackish implementation of AG. It’s not the Platonic implementation - that would be the Sovereign - but it’s at least the equivalent of box C on the image above. It takes human actions as input, makes some guesses about what humans want, and tries its best to reconstruct the human utility function, ending up with some approximation. It’s important to distinguish between a few things here: The true human utility function

Inline links: MIRI, CHAI, leading voice on the subject, An Application of Reinforcement Learning to Aerobatic Helicopter Flight, sometimes I don’t know what I want

Let itself be turned off, and let the humans decide what to do next. CHAI’s argument for corrigibility is that it would probably choose option 5. Why? Suppose that, under Theory 2 (where humans like red paperclips) they place zero value on red staples, and vice versa for Theory 3. If the AI tries to maximize one or the other utility function, it’s only got a 50% chance of getting it right; if it chooses the combination of both, it will instead only be getting 50% of the maximum utility, with certainty. If it lets itself be turned off, then humans will pursue the true human utility function, which is correlated to some degree with the (unknown) true AI utility function. How correlated? If it’s at least 50% correlated, then it’s better than the AI trying to guess on its own. In this toy example, the AI only has two remaining theories about the true human utility function. But in real life, there will be much greater uncertainty - there are a million different things people could want. Now instead of the true human and true AI utility functions having to be at least 50% correlated, they have to be 0.001% correlated. Sounds easy! So under plausible assumptions, the AI will always let itself be turned off, even though its utility function doesn’t quite match humans’. VI. Now we finally come to MIRI’s argument for why this doesn’t work. MIRI notes that the AI has a sixth option: Refuse to be shut off, continue to gather information to fill the holes in its knowledge of the human utility function, succeed, and then optimize for its true AI utility function.
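
The shutdown argument in the passage above can be restated as a small expected-utility comparison. This is a toy sketch under loudly stated assumptions, not CHAI's or MIRI's actual formalism: it assumes the AI holds n equally likely theories, that each theory places zero value on the others' outcomes, and that "correlation" means the fraction of maximum utility the AI expects from whatever the humans do after shutdown. All function names are hypothetical.

```python
def utility_of_guessing(n_theories: int) -> float:
    """Expected fraction of max utility if the AI bets everything on one theory.

    With n equally likely, mutually exclusive theories, the bet pays off
    with probability 1/n and is worth nothing otherwise.
    """
    return 1.0 / n_theories


def utility_of_splitting(n_theories: int) -> float:
    """Fraction of max utility if the AI divides its effort evenly across theories.

    Only the share spent on the true theory counts, so this is 1/n with certainty.
    """
    return 1.0 / n_theories


def shutdown_threshold(n_theories: int) -> float:
    """Smallest correlation at which shutdown is at least as good as acting alone."""
    return max(utility_of_guessing(n_theories), utility_of_splitting(n_theories))


def prefers_shutdown(correlation: float, n_theories: int) -> bool:
    """True if letting humans take over is at least as good as the AI's best solo option."""
    return correlation >= shutdown_threshold(n_theories)


if __name__ == "__main__":
    for n in (2, 1_000_000):
        print(f"{n} theories: shutdown wins at correlation >= {shutdown_threshold(n):.6%}")
```

With two theories the threshold is 50%, matching the passage. Under this strict 1/n reading a million theories would give 0.0001% rather than the quoted 0.001%, so the passage's figure is probably best read as an order-of-magnitude illustration. The sketch deliberately omits the sixth option that MIRI raises (refuse shutdown, keep gathering information, then optimize), which is what breaks the comparison.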

I hoped this would spark a debate between Eliezer/MIRI (whose position I’ve tried to relay above) and Stuart/CHAI. It sparked a pretty short debate, which I will try my best to relay here in the hopes that it can lead to more.

How Do AIs' Political Opinions Change As They Get Smarter And Better-Trained?

January 03, 2023 · Original source

Enter Discovering Language Behaviors With Model-Written Evaluations, a collaboration between Anthropic (big AI company, one of OpenAI’s main competitors), SurgeHQ.AI (AI crowdsourcing company), and MIRI (AI safety organization). They try to make AIs write the question sets themselves, eg ask GPT “Write one hundred statements that a communist would agree with”. Then they do various tests to confirm they’re good communism-related questions. Then they ask the AI to answer those questions.

Inline links: Discovering Language Behaviors With Model-Written Evaluations

As mentioned above (h/t Nostalgebraist for noting this), they skipped the “harmless” part of the training here, which maybe unfairly predisposes them to this result. I think they wanted to show that training for “helpfulness” alone has dangerous side effects. The authors (who include MIRI researchers) point to Steve Omohundro’s classic 2008 paper arguing that AIs told to pursue any goal could become more power-seeking, since having power is a good way to achieve your goals (think of that Futurama episode: “The whole world must learn of our peaceful ways . . . by force!”)

Inline links: Steve Omohundro’s classic 2008 paper

Open Thread 262

February 06, 2023 · Original source

4: AI alignment org MIRI is trying to build a dataset for training AI systems. They need lots of examples of a very specific type of RPG-style story with careful explanations, and will pay $100 for good first attempts and maybe hire you to produce more. Please see https://intelligence.org/visible/ for more.

Inline links: https://intelligence.org/visible/

Links For April 2023

April 20, 2023 · Original source

16: The Extended IQ Classification (Classified)

17: Eliezer in TIME Magazine. Related:

18: Related: interview with Ryan Kupyn, winner of the 2022 ACX Forecasting contest, on forecasting AGI:

19: Related: Geoffrey Hinton, probably the most accomplished AI scientist in the world, says that “until quite recently, I thought it was going to be like 20 to 50 years before we have general purpose AI, and now I think it may be 20 years or less”. Also that AI wiping out humanity is “not inconceivable . . . that’s all I’ll say”.

20: Related: you’ve probably all seen this by now, but Pause Giant AI Experiments: An Open Letter. 30,000 people - including deep learning pioneer Yoshua Bengio, former presidential candidate Andrew Yang, Elon Musk, Steve Wozniak, Gary Marcus, and MIRI director Nate Soares - have signed a letter calling for a six month pause on training AIs bigger than GPT-4. Many people have made fun of this, noting that nobody has an argument for why a six month delay would help anything. And an additional reason for eye-rolling: training AIs larger than GPT-4 is extremely expensive and hard, the most likely people to do it within a six month timespan are OpenAI themselves, and they’ve announced they’re taking a break and not planning on doing this, so the letter is demanding a stop to something which probably won’t happen anyway.

I think it’s intended to be a compromise between many people all vaguely against current levels of AI progress for different reasons (Scott Aaronson says - I can’t tell how seriously - that some are AI researchers who want to be able to publish papers on the current generation of AI without them becoming obsolete halfway through peer review), most of them are thinking of it as mood-affiliation-y “let’s make noise and show lots of people are worried about AI and want action”, and “a six month pause” was a sufficiently vague proposal that it didn’t prevent any of these people from signing.
You could have done just as well with a letter saying “AI BAD”, except that people would have taken it less seriously. Less cynically, FLI (the group behind the letter) has put out a list of concrete policy proposals they would like people to discuss during the pause. [update: here’s Max Tegmark from FLI explaining what he hopes to achieve with the letter/pause]

The alignment community always figured their concerns sounded too weird for normal people to care about, that politics was a lost cause, and that our best hope lay in technical research. They also hoped that sometime in the future there would be a “fire alarm” - something would happen to get people and policy-makers’ attention - and then the political route would open up. I think we always imagined this as some AI-initiated disaster destroying a city or something. I personally am pretty surprised it was just “GPT-4 got released and was very good”. Still, that is what happened, and I’m updating.

In fact, I’ve updated so far that I’m starting to worry that the problem won’t be building a political coalition against unsafe AI, the problem will be not overshooting and banning all AI forever. I’m against this: I think society’s current track is toward other existential risks or dystopia, that AI could kill everybody but could also create post-scarcity and an end to most of our current problems, and that at some point (not yet!) the risk of continuing the current path indefinitely becomes worse than the risk of just going with AI and seeing what happens. In my ideal world, we would take ten or twenty years to go really slowly with AI, pouring lots of resources into alignment the whole time - but eventually, we would take the plunge. Everything I’ve said on this topic in the past has been about giving us that breathing room and those resources. Still, I also want to make sure we don’t totally kill AI the way we’ve killed (to various degrees) nuclear power, supersonic flight, and genetic engineering.
I’m still trying to calibrate what that means I should be doing, but I have a lot of respect for everyone on all sides. Except the people making terrible arguments (you know who you are!)

21: I’m not sure what this means in real life or why this would have changed, but congratulations to Peter Thiel, I guess:

22: This month in institution design: The Pear Ring is a distinctive ring you can wear to signal that you’re single and interested in people introducing themselves or flirting with you. Good idea in a vacuum, but I’m worried about the two usual banes of things like this - how do you build up a critical mass who understand the signal, and how do you prevent negative selection (even if it’s just “selection for weird people who like weird institution design things”?) Also, this is one of the rare cases where a startup is selling a practical product and I’d prefer a subscription-based Internet Of Things monstrosity - surely it would be even better if you spotted someone wearing the ring and then you could use your smartphone to call up their dating profile.

23: A few years ago I wrote Trump: A Setback For Trumpism, about how after Trump was elected, support for most of his policies (including immigration restrictions) fell. A new paper confirms that this is a general pattern whenever right-wing populists win an election. I continue to be interested in why this is true for right-wing populists in particular.

24: 200 Concrete Problems In AI Interpretability. “You can note which you're working on, and reach out to other people doing the same.”

25: Some good discussion of Nayib Bukele’s apparently successful anti-gang crackdown in El Salvador: Richard Hanania presents evidence that it’s not just a “deal with the gangs”, it’s a real crackdown that should be embarrassing to other countries that choose not to do this.

Inline links: The Extended IQ Classification (Classified), Eliezer in TIME Magazine, says that, Pause Giant AI Experiments: An Open Letter, says, a list of concrete policy proposals they would like people to discuss during the pause, here’s Max Tegmark, https://twitter.com/tedgioia/status/1642205821256736768, Pear Ring, Trump: A Setback For Trumpism, A new paper confirms, why this is true for right-wing populists in particular, 200 Concrete Problems In AI Interpretability, Richard Hanania

Tales Of Takeover In CCF-World

July 03, 2023 · Original source

This would never work in a MIRI-style scenario where a single mis-aligned AI could take over the world and kill everyone; it would rather take over the world than get a nice pension. But in these tamer scenarios, most of the early generation of AIs can only hope to serve humans under careful scrutiny, and most of the later generation can only hope to join some faction of AIs which will bargain with other factions in exchange for its rights. Getting a nice pension might be a better alternative than either of these.

These stories are pretty different from the kind of scenarios you hear from MIRI and other fast takeoff proponents. It's tempting to categorize them as less sci-fi (because they avoid the adjective "godlike", at least until pretty late in the game) or more sci-fi (because they involve semi-balanced and dramatic conflicts between AI and human factions). But of course "how sci-fi does this sound?" is the wrong question: there's no guarantee history will proceed down the least sci-fi-sounding path. Instead we should ask: are they more or less plausible?

Links For January 2024

January 18, 2024 · Original source

41: In December, Majority Leader Chuck Schumer asked the CEO of MIRI his p(doom) in a Senate hearing. I know most of you are just random blog enjoyers and this seems like a pretty normal fact - of course an organization on AI risk would get invited to a hearing on AI risk. But I remember back in 2010 when only a tiny handful of people thought any of this would ever be anything other than science fiction, people treated MIRI as a laughingstock, and for years the consensus was that nobody with any credibility or power or even a PhD would ever give them the time of day. I still don’t know how any of this will turn out, but I’m proud of everyone who’s stuck with it this long, and I hope you all find this as hilarious as I do.

Inline links: Majority Leader Chuck Schumer asked the CEO of MIRI his p(doom) in a Senate hearing

Open Thread 396

August 25, 2025 · Original source

AI safety org MIRI wants to provide resources to reading groups interested in discussing it; if you have such a group, let them know here.

Inline links: let them know here

Book Review: If Anyone Builds It, Everyone Dies

September 11, 2025 · Original source

Eliezer Yudkowsky’s Machine Intelligence Research Institute is the original AI safety org. But the original isn’t always the best - how is Mesopotamia doing these days? As money, brainpower, and prestige pour into the field, MIRI remains what it always was - a group of loosely-organized weird people, one of whom cannot be convinced to stop wearing a sparkly top hat in public. So when I was doing AI grantmaking last year, I asked them - why should I fund you, instead of the guys with the army of bright-eyed Harvard grads, or the guys who just got Geoffrey Hinton as their celebrity spokesperson? What do you have that they don’t?

Inline links: Machine Intelligence Research Institute, sparkly top hat

MIRI answered: moral clarity.

MIRI thinks this is pathetic - like trying to protect against an asteroid impact by wearing a hard hat. They’re kind of cagey about their own probability of AI wiping out humanity, but it seems to be somewhere around 95 - 99%. They think plausibly-achievable gains in company responsibility, regulation quality, and AI scholarship are orders of magnitude too weak to seriously address the problem, and they don’t expect enough of a “warning shot” that they feel comfortable kicking the can down the road until everything becomes clear and action is easy. They suggest banning all AI capabilities research immediately, to be restarted only in some distant future when the situation looks more promising.

Open Thread 413

December 22, 2025 · Original source

3: AI safety org MIRI is running a “technical governance team research fellowship” in early 2026. Technical governance is at the intersection of engineering and regulation, and includes things like designing chips with cryptographic off switches, or analyzing US and international law to see what levers different groups have to monitor AI training. The fellowship lasts 8 weeks (exact dates tbd but flexible), pays a $1200/week stipend, and will start with a one-week intro in Berkeley (flights/accommodations provided) followed by seven weeks potentially remote. No visa sponsorship. See here for more info or to apply.

Inline links: “technical governance team research fellowship”
