Rafael Harth
Rafael Harth is a recurring person in the Astral Codex Ten archive, appearing 3 times across 3 issues between April 04, 2022 and October 05, 2023. The archive places him in contexts such as “Rafael Harth tried to get the same information with a simple survey”; “Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition”; “Many other people (eg Rafael Harth, Steven Byrnes) suggested this would produce deceptive alignment”. He most often appears alongside Eliezer Yudkowsky, AlphaGo, and DeepMind.
Metadata
- Category: People
- Mention count: 3
- Issue count: 3
- First seen: April 04, 2022
- Last seen: October 05, 2023
Appears In
- Yudkowsky Contra Christiano On AI Takeoff Speeds
- Deceptively Aligned Mesa-Optimizers: It’s Not Funny If I Have To Explain It
- Pause For Thought: The AI Pause Debate
Related Pages
- Eliezer Yudkowsky (3 shared issues)
- AlphaGo (2 shared issues)
- DeepMind (2 shared issues)
- Eliezer (2 shared issues)
- FDA (2 shared issues)
- Gwern (2 shared issues)
- Matthew Barnett (2 shared issues)
- 2013 (1 shared issue)
- Agricultural Revolution (1 shared issue)
- AI (1 shared issue)
- AI alignment theory (1 shared issue)
- AI Is Centralizing By Default, Let’s Not Make It Worse (1 shared issue)
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
Rafael Harth tried to get the same information with a simple survey, and got similar results: on a scale of 1 (strongly Paul) to 9 (strongly Eliezer), the median moved from a 5 to a 7.
Inline links: Rafael Harth tried
Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition,
Inline links: Inner Alignment: Explain Like I’m 12 Edition
Nora thought that success at making language models behave (eg refuse to say racist things even when asked) suggests alignment is going pretty well so far. Many other people (eg Rafael Harth, Steven Byrnes) suggested this would produce deceptive alignment, ie AI that says nice things to humans who have power over it, but secretly has different goals, and so success in this area says nothing about true alignment success and is even kind of worrying. The question remained unresolved.
Inline links: Rafael Harth, Steven Byrnes