AI Alignment

Article

AI Alignment is a recurring concept in the Astral Codex Ten archive, appearing 10 times across 10 issues between December 30, 2021 and July 21, 2025. The archive places it in contexts such as “I’m an independent researcher working on AI alignment and the theory of agency”; “urgent opportunities in AI alignment, biosecurity”; and “to highlight the depth of the AI alignment people’s involvement here”. It most often appears alongside ACX, Aella, and AI.

Metadata

  • Category: Concepts
  • Mention count: 10
  • Issue count: 10
  • First seen: December 30, 2021
  • Last seen: July 21, 2025

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

December 30, 2021 · Original source
4: John Wentworth on How To Get Into Independent Research On AI Alignment. "I’m an independent researcher working on AI alignment and the theory of agency. I’m 29 years old, will make about $90k this year, and set my own research agenda. I deal with basically zero academic bullshit...best of all, I work on some really cool technical problems which I expect are central to the future of humanity. If your reaction to that is 'Where can I sign up?', then this post is for you."
July 18, 2022 · Original source
4: Thanks to everyone who applied last week to Spencer Greenberg’s grants round. It is closing soon, and I won’t be doing another ACX Grants for at least a few months, but if any of you want to pursue urgent opportunities in AI alignment, biosecurity, or similar fields before then, and find that some 4-5 digit amount of money would help, please send me an email at scott@slatestarcodex.com and I will try to connect you to relevant funders.
August 08, 2022 · Original source
Why this history lesson? Partly to highlight the depth of the AI alignment people’s involvement here. It’s not just that they’re not fighting AI companies, it’s that they keep creating them and leading investment in them. But also…
Might we be able to strike an agreement with China on AI, much as countries have previously made arms control or climate change agreements? This is . . . not technically prevented by the laws of physics, but it sounds really hard. When I bring this challenge up with AI policy people, they ask “Harder than the technical AI alignment problem?” Okay, fine, you win this one.
August 23, 2022 · Original source
AI alignment is a central example of a supposedly long-termist cause.
December 28, 2022 · Original source
39: Paul Christiano - AI Alignment Is Distinct From Its Near-Term Implications. Paul is one of the giants in this field, and is pleading to people not to throw it out just because they don’t like how it’s currently being used (to prevent ChatGPT from saying politically incorrect things):
If we succeed at the technical problem of AI alignment, AI developers would have the ability to decide whether their systems generate sexual content or opine on current political events, and different developers can make different choices. Customers would be free to use whatever AI they want, and regulators and legislators would make decisions about how to restrict AI. In my personal capacity, I have views on what uses of AI are more or less beneficial and what regulations make more or less sense, but in my capacity as an alignment researcher I don’t consider myself to be in the business of pushing for or against any of those decisions.
July 21, 2023 · Original source
“Residential real estate has historically returned significantly below equity markets over long time horizons”
But I’m not so sure that these lessons are directly applicable to other areas of life. Some of the best things in life come from lashing yourself to the mast, burning the boats behind you, willingly giving up liquidity. The deepest monogamous relationships are built from an irrational investment in one other person, saying “In sickness and in health, until death do us part.” How many scientific problems were solved because one person had an irrational willingness to: Just. Keep. Going. Sometimes it’s powerful to use the sunk cost fallacy to your advantage. Investing in relationships, subject matter expertise, even putting down roots via *gulp* homeownership reduces your liquidity, but also leads to some of the best (if intangible) things in life.
5: Edge
If you can’t explain your edge in five minutes, you don’t have a very good one. OR The long-term profitability of an edge is inversely proportional to how long it takes to explain it.
The Efficient Market Hypothesis is one of the core concepts taught in Finance 101. The Efficient Market Hypothesis is a lie. The person that better understands the nature of a small sliver of the world (e.g. Apple’s share price) will make more money than others. Modern financial markets are exceedingly competitive. This means that the bigger you think your edge is, the more likely it is that you’re wrong.
“Evolutionary thinking applies quite directly when thinking about the evolution of markets. Having an edge in a mature market means understanding the world better than other traders, even ones who are already highly skilled. In fact, the marginal trader in modern financial markets is quite sophisticated and skilled indeed.”
Lebron here warns us of getting too cute with data, of changing variables. Enough randomness will produce an “edge” that is likely to break down the second a trading strategy hits the real world.
You can always find a statistical correlation if you change enough variables. But this is fundamentally the same problem facing the replication crisis in social sciences. Lebron argues that we need stories here. Edge is expressed in stories: an edge does not exist without a clear mental representation of that edge. Pure linear algebra does not suffice. I’m not so sure. It seems like AI companies are pushing forward technology in a way that suggests that mental representations are not the only path to intelligence. Lebron discounts “black box” trading strategies without much discussion of their potential merits. Are all of RenTech’s models explainable by a story? The firm is notoriously secretive, so I don’t know, but I’d guess not.
“Frequently a good trade appears, has a seemingly insurmountable difficulty, and it is mere persistence that knocks down the final barrier. There may have been many others who looked at the idea, wanted to do it, but couldn’t get past that last hurdle.”
Before Sam Bankman-Fried was the face of Why Effective Altruism is Bad, before he even founded FTX, he made money arbitraging the difference between Bitcoin prices on Japanese and American exchanges. I’m reminded of that trade here. It isn’t a particularly elegant trade, it doesn’t require deep technical knowledge or any models. It was a schlep. It was all operational work: figuring out how to open a Japanese bank account, transferring money between the US and Japan, standing in line for hours every day at both US and Japanese banks (presumably this wasn’t the same person). In as technical a field as trading, sheer willpower is often what gets things done in the end.
6: Models
The model expresses the edge.
Lebron drills into us that a model is the tool for expressing an edge. The model is not the edge. The model does not give us unique knowledge about the world. The map is not the territory. He dives into the difference between generative (G) and phenomenological (P) models.
G models express a worldview and fit data into that way of thinking, whereas P models solely look at the empirical data to build a worldview. Models of the world differ from models of markets, though. Markets have quick feedback loops, are explicit in terms of what they measure, and are easy to quantify at a specific point in time. Most of our models for the world, though, are ill-defined and explicit. Models are only as good as our assumptions. As an aside, this is a common criticism of rationality or Effective Altruism – you can justify any worldview if you assign your model input weights in just the right way. I also tend to think that “traditional” EA is overly dependent on P models, and doesn’t embrace the G models that led to economic reforms in India in the 1990s or the economic policies that led to rapid economic development in Southeast Asia in the second half of the 20th Century. Interestingly, I think a lot of longtermist EA, specifically AI alignment, leans the other way, relying on G models which explicitly assume a certain P(doom) and work backwards from there. (Though I won’t pretend to be an expert here or to understand everything, so take this with a grain of salt.)
Overall, startups and tech seem to take heed to Lebron’s lesson much better than the folks hanging out on this part of the internet:
“Even if a model makes good predictions about some future value or event, that knowledge is useless without also knowing how to take advantage of that prediction.”
Now we get a bit philosophical. By acting, you change the nature of the market. Your model predicts things that might not be true as soon as you start trading (and changing the environment) based on it. When you’re right, everyone else sees the same trades that your model does and will beat you to them. When your model is wrong, others don’t act, meaning adverse selection rears its ugly head once again.
So your model shows you with an edge, but in practice you only make the trades where you don’t have an edge. Lebron closes by arguing that G models are best for understanding other people, and are good in and of themselves:
“You can also see connections to traditional moral philosophy in thinking about modeling the behavior of others. To have a good G model about someone else is to have some measure of empathy and compassion for that person: what they’re like, what they think and feel, putting yourself in their shoes. Pragmatically, developing the skill of empathy and compassion for others is, aside from a moral good in itself, an excellent way to understand better the people who surround you. More people working to develop good G models of others is surely a small step to a better world.”
7: Costs and Capacity
If you think your costs are negligible relative to your edge, you’re wrong about at least one of them.
This section of the book displayed a good amount of epistemic humility, words that I didn’t expect to be typing in the context of a book about trading. Lebron tells us that trades don’t exist independently in the universe — in the n-dimensional space of all possible trades seeking to optimize profitability, if you have a gigantic mountain of profitability, someone else has probably at least discovered the base. So you probably don’t have a profitable trade; rather, you are misunderstanding something about your trade. You’ve either overestimated profitability or underestimated cost.
Lebron highlights four types of trading costs:
[graph that didn’t show up correctly here: two axes and four quadrants, with the axes being visible ←→ invisible costs and linear ←→ nonlinear costs]
Here, we’ll focus on Quadrant 4, where he highlights a few interesting phenomena.
Herding. It’s likely that if you have a profitable trading strategy, either: Other firms discovered a similar strategy independently and/or
August 30, 2023 · Original source
I think a really important implication of this is that, contra a fundamental plank in AI alignment risk arguments, it's not the case that we should expect greater intelligence to mean greater coherence.
On the other hand, "rewarding the AIs for doing certain behaviors" is a terrible idea, from an AI alignment perspective. We want to reward the AI for doing the right thing FOR THE RIGHT REASON, not just for doing the right thing full stop. Otherwise we reward the AI for being deceptive.
My own main technical AI alignment research interest is to figure out the nuts and bolts of how the genome makes people (sometimes) nice to each other — https://www.alignmentforum.org/posts/qusBXzCpxijTudvBB/my-agi-safety-research-2022-review-23-plans#2__Second_half_of_2022__1_3___My_main_research_project .
August 19, 2024 · Original source
1: I’m doing some AI safety grantmaking and am curious how other people value different parts of the ecosystem. If you have experience/familiarity with AI grantmaking, AI alignment, or AI policy, can you take this quick (~15 minute) survey?
July 09, 2025 · Original source
Byrnes also has a wide range of writing on other areas of neuroscience and on AI alignment.
July 21, 2025 · Original source
“Concept vectors in AI alignment! Did you know you can just prompt an AI to think about ‘misaligned behavior’ a bunch of different ways, and see which weights get activated consistently? Then you know where in the neural net it represents the concept of ‘misalignment’, and you can monitor those particular weights to see when the AI is plotting against you.”