Rafael Harth

Article

Rafael Harth is a recurring figure in the Astral Codex Ten archive, appearing 3 times across 3 issues between April 04, 2022 and October 05, 2023. The archive places him in contexts such as “Rafael Harth tried to get the same information with a simple survey”; “Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition”; “Many other people (eg Rafael Harth, Steven Byrnes) suggested this would produce deceptive alignment”. He most often appears alongside Eliezer Yudkowsky, AlphaGo, and DeepMind.

Metadata

Category: People
Mention count: 3
Issue count: 3
First seen: April 04, 2022
Last seen: October 05, 2023

Appears In

- Eliezer Yudkowsky (3 shared issues)
- AlphaGo (2 shared issues)
- DeepMind (2 shared issues)
- Eliezer (2 shared issues)
- FDA (2 shared issues)
- Gwern (2 shared issues)
- Matthew Barnett (2 shared issues)
- 2013 (1 shared issue)
- Agricultural Revolution (1 shared issue)
- AI (1 shared issue)
- AI alignment theory (1 shared issue)
- AI Is Centralizing By Default, Let’s Not Make It Worse (1 shared issue)

External Links

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

Yudkowsky Contra Christiano On AI Takeoff Speeds

April 04, 2022 · Original source

Rafael Harth tried to get the same information with a simple survey, and got similar results: on a scale of 1 (strongly Paul) to 9 (strongly Eliezer), the median moved from a 5 to a 7.

Inline links: Rafael Harth tried

Deceptively Aligned Mesa-Optimizers: It's Not Funny If I Have To Explain It

April 11, 2022 · Original source

Rafael Harth’s Inner Alignment: Explain Like I’m 12 Edition,

Inline links: Inner Alignment: Explain Like I’m 12 Edition

Pause For Thought: The AI Pause Debate

October 05, 2023 · Original source

Nora thought that success at making language models behave (eg refuse to say racist things even when asked) suggests alignment is going pretty well so far. Many other people (eg Rafael Harth, Steven Byrnes) suggested this would produce deceptive alignment, ie AI that says nice things to humans who have power over it, but secretly has different goals, and so success in this area says nothing about true alignment success and is even kind of worrying. The question remained unresolved.

Inline links: Rafael Harth, Steven Byrnes
