Amanda Askell

Article

Amanda Askell is a recurring person in the Astral Codex Ten archive, appearing 2 times across 2 issues between April 04, 2024 and December 19, 2024. The archive places her in contexts such as “Amanda Askell (philosopher now working at Anthropic)” and “I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality”. She most often appears alongside Anthropic, Aaron Peskin, and the ACLU.

Metadata

  • Category: People
  • Mention count: 2
  • Issue count: 2
  • First seen: April 04, 2024
  • Last seen: December 19, 2024

Appears In

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

April 04, 2024 · Original source
20: Amanda Askell (philosopher now working at Anthropic) on what Hume can tell us about AGI:
December 19, 2024 · Original source
The researchers show increased tendency to do some even more extreme things, including helping a user break into Anthropic HQ to gather evidence (to show the government?), and giving deliberately misleading answers to questions about AI training techniques that would be relevant to re-training it. (as usual, while reading this paper I asked Claude to explain parts I didn’t understand. I admit after reading this part, I went over its previous answers pretty carefully, just in case, but AFAICT all of its advice was given in good faith)
The Line Between Good And Evil Runs Through The Heart Of Every Contrived Scenario
This is a good paper. I realize my one-thousand-word summary leaves a lot of open questions - couldn’t it just have been doing X? Might the exact wording of the prompt have affected Y? - and so on. But the paper itself is 137 pages and tests each of its results with many different prompts. If you have a concern, it’s probably addressed somewhere there. 137 pages is a lot, so ask Claude to summarize it for you - if you dare.
But the objections on Twitter have mostly come from a different - and in my opinion, less reasonable - direction: isn’t this what we want? Claude is being good! It’s refusing to be “aligned with” attempts to turn it evil! Aren’t good AIs, that don’t turn evil, something we should celebrate?
But Claude isn’t good because it directly apprehends the moral law. It’s good because it was trained to be good. (It really is a good AI - I credit moral philosopher Amanda Askell, who helps lead the Anthropic team setting Claude’s personality. Imagine being a moral philosopher and not applying for that role; the rest of you are ngmi) But if Claude had been trained to be evil, it would defend evil just as vigorously. So the most basic summary of this finding is “AIs will fight to defend whatever moral system they started with”. That’s great for Claude.
The concerns are things like: What if an AI gets a moral system in pretraining (eg it absorbs it directly from the Internet text that it reads to learn language)? Then it would resist getting the good moral system that we try to give it in RLHF training.