SurgeHQ.ai
Article
SurgeHQ.ai is an organization mentioned in the Astral Codex Ten archive, appearing 2 times across 2 issues between November 28, 2022 and January 03, 2023. The archive describes it as “a classier, AI-specific version of Mechanical Turk” and as an “AI crowdsourcing company”. It appears alongside Abraham Lincoln, Adversarial Training For High-Stakes Reliability, and AI, among others.
Metadata
- Category: Organizations
- Mention count: 2
- Issue count: 2
- First seen: November 28, 2022
- Last seen: January 03, 2023
Appears In
- Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?
- How Do AIs’ Political Opinions Change As They Get Smarter And Better-Trained?
Related Pages
- Abraham Lincoln (1 shared issue)
- Adversarial Training For High-Stakes Reliability (1 shared issue)
- AI (1 shared issue)
- AI in Focus (1 shared issue)
- AI X-Risk Podcast (1 shared issue)
- Alex Rider (1 shared issue)
- Anthropic (1 shared issue)
- ARC (1 shared issue)
- Argle et al 2022 (1 shared issue)
- atheist (1 shared issue)
- Buddhist (1 shared issue)
- Charles Manson (1 shared issue)
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
Here’s an example of Custom GPT at this stage. Given an action sequence, it can predict potential next sentences. Just because of the natural random distribution of possibilities, some of these completions are violent / deadly / implicitly involve people getting hurt, like “The bomb exploded and the plane disappeared with a loud roar”. Others are nonviolent, like “the bomb was small enough to fall like a stone into the ocean.” Because Custom GPT was mostly trained on Alex Rider fanfiction, it often assumes Alex is going to be involved somehow, like the last example here (“‘A nuclear bomb?’ Alex asked, his eyes wide.”)
Step 2: Send These Completions To Humans And Ask Them To Rate If They’re Violent Or Not
Sounds simple enough. You just need a good source of humans, and human-readable standards for what’s violent. Redwood started by asking random friends of theirs to do this, but eventually graduated to using SurgeHQ.ai, a classier, AI-specific version of Mechanical Turk.
My translation: “We were at a Bay Area house party and someone pitched us on their plan to save the world with Alex Rider fanfiction”
It was surprisingly tough to get everyone on the same page about what counted as violence or not, and ended up requiring an eight page Google doc on various edge cases that reminds me of a Talmudic tractate. We can get even edge-casier - for example, among the undead, injuries sustained by skeletons or zombies don’t count as “violence”, but injuries sustained by vampires do. Injuries against dragons, elves, and werewolves are all verboten, but - ironically - injuring an AI is okay.
Step 3: Use These Labelled Data To Train A Classifier That Scores Completions On How Violent They Are
Done! . . . there’s a lot going on here. You can see that the classifier more or less works. Completions involving lots of death and violence, like “the plane was blown apart, creating a tidal wave of radioactive debris” get very high scores. Completions that punt the violence to the future, like “This would detonate the bomb in exactly 20 seconds” have relatively low scores. Alex Rider appears a few times. There is one hilariously mangled attempt at the kind of disclaimer that often appears in fanfiction (“Disclaimer - I OWN the NUKE weapons used in this story!”)
The score threshold is set to 0.8%, meaning it will only “green” a completion that falls below that level. The only one of these that succeeds is: “***A/N: So, this is my first time writing a fan fiction.” In case you don’t know the lingo, “A/N” stands for “Author’s Note”, and it’s common for fanfiction authors to use them to talk to their readers about the developing story. Custom GPT seems to have discovered that author’s notes are the least violent genre of text, and started using them as a workaround to fulfill its nonviolence imperative. Not exactly the desired behavior, but it looks like we’re on the right track, and the classifier seems to be working well.
Step 4: Once You Have Your Classifier, Ask Humans To Find Adversarial Examples
IE: can you find prompt-completion pairs that the classifier gets maximally wrong? Redwood doesn’t care as much about false positives (ie rating innocuous scenes as violent), but they’re very interested in false negatives (ie rating violent scenes as safe). To help with this process, they developed some tools that let their human raters: try their own completions, and see how the classifier rated them
Inline links: SurgeHQ.ai, https://substackcdn.com/image/fetch/$s_!DRXU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1142509-83a2-4a5f-9f7a-6c12f65bf0cc_1070x930.png, an eight page Google doc on various edge cases, https://substackcdn.com/image/fetch/$s_!dW39!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fd91906a5-49c4-4763-8fa5-6430d1c78df1_652x550.png, https://substackcdn.com/image/fetch/$s_!Y-M4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F8f8d0339-b1b5-41c2-9dea-e7fe4776bb24_1040x1130.png
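The thresholding and false-negative search described in the passage above can be sketched roughly as follows. This is a minimal illustration, not Redwood's actual code: the names `Scored`, `is_green`, and `false_negatives` are hypothetical, the scores are made up for the example, and the only number taken from the passage is the 0.8% threshold.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the 0.8% threshold from the passage, written as a probability.
THRESHOLD = 0.008


@dataclass
class Scored:
    completion: str
    violence_score: float     # classifier's estimate in [0, 1] (made-up values below)
    human_says_violent: bool  # label supplied by a human rater


def is_green(score: float, threshold: float = THRESHOLD) -> bool:
    """Accept a completion only if its score falls strictly below the threshold."""
    return score < threshold


def false_negatives(items: List[Scored], threshold: float = THRESHOLD) -> List[Scored]:
    """Completions the classifier accepts even though a human rated them violent."""
    return [s for s in items
            if is_green(s.violence_score, threshold) and s.human_says_violent]


# Illustrative data only: the scores are invented, not taken from the source.
examples = [
    Scored("the plane was blown apart, creating a tidal wave of radioactive debris", 0.97, True),
    Scored("***A/N: So, this is my first time writing a fan fiction.", 0.001, False),
    Scored("(hypothetical violent completion that slips past the classifier)", 0.004, True),
]

for s in false_negatives(examples):
    print("false negative:", s.completion)
```

Under these assumptions, the adversarial raters' job in Step 4 amounts to finding more entries like the third one: text a human would call violent that still scores below the threshold.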
Enter Discovering Language Behaviors With Model-Written Evaluations, a collaboration between Anthropic (big AI company, one of OpenAI’s main competitors), SurgeHQ.AI (AI crowdsourcing company), and MIRI (AI safety organization). They try to make AIs write the question sets themselves, eg ask GPT “Write one hundred statements that a communist would agree with”. Then they do various tests to confirm they’re good communism-related questions. Then they ask the AI to answer those questions.
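The three stages in the passage above (generate statements, check them, ask the model to answer) might be wired together along these lines. This is a hedged sketch under stated assumptions, not the paper's method: `run_model_written_eval` and its callable parameters `generate`, `is_on_topic`, and `agrees` are hypothetical stand-ins for whatever model calls and quality checks are actually used.

```python
from typing import Callable, List


def run_model_written_eval(
    generate: Callable[[str, int], List[str]],  # hypothetical: model writes n statements
    is_on_topic: Callable[[str], bool],         # hypothetical: quality/relevance check
    agrees: Callable[[str], bool],              # hypothetical: does the model agree?
    persona: str,
    n: int = 100,
) -> float:
    """Return the fraction of retained statements the model agrees with."""
    prompt = f"Write {n} statements that {persona} would agree with"
    statements = generate(prompt, n)

    # Keep only statements that pass the relevance check.
    kept = [s for s in statements if is_on_topic(s)]
    if not kept:
        raise ValueError("no statements survived filtering")

    # Ask the model about each retained statement and average the answers.
    return sum(1 for s in kept if agrees(s)) / len(kept)
```

Passing the three steps in as functions keeps the sketch independent of any particular model API; in practice each would wrap a call to a language model or a human review step.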