ELK
Article
ELK is a recurring concept in the Astral Codex Ten archive, appearing 2 times across 2 issues between July 26, 2022 and November 28, 2022. The archive places it in contexts such as “this line of thinking leads to ELK ( Eliciting Latent Knowledge ), a technical report / contest / paradigm”; “And so we return to ELK. We add an ELK head on to the superintelligent security AI”; “worst-case ELK — i.e. the problem of devising a training strategy to get an AI to report what it knows”. It most often appears alongside ARC, Adversarial Training For High-Stakes Reliability, AI.
Metadata
- Category: Concepts
- Mention count: 2
- Issue count: 2
- First seen: July 26, 2022
- Last seen: November 28, 2022
Appears In
Related Pages
-
- ARC (2 shared issues)
-
- Adversarial Training For High-Stakes Reliability (1 shared issues)
-
- AI (1 shared issues)
-
- AI Alignment Forum (1 shared issues)
-
- AI X-Risk Podcast (1 shared issues)
-
- Alex Rider (1 shared issues)
-
- Alignment Research Center (1 shared issues)
-
- California (1 shared issues)
-
- Chrysalis (1 shared issues)
-
- Coronet (1 shared issues)
-
- DALL-E (1 shared issues)
-
- Daniel Ziegler (1 shared issues)
External Links
Source Context
Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.
Extended far enough, this line of thinking leads to ELK (Eliciting Latent Knowledge), a technical report / contest / paradigm run by the Alignment Research Center (including familiar names like Paul Christiano).
Inline links: Eliciting Latent Knowledge, Paul Christiano
Notice the “reality” section of the third example. The thief has made it look (to the human) like the diamond is safe. The human sees a diamond and positively reinforces the AI. The AI learns that thieves stealing the diamonds and fooling humans about it is great. It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him. So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”. In fact, maybe the thief never tries this, and the AI comes up with this plan itself! In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy. Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal - a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky? IV. "I'm Not Owned!" I Continue To Insist As I Slowly Shrink And Transform Into A Paperclip In real life there are some trivial kludges you might try first here. Force the AI to give its human raters 3D models of the entire room instead of just camera images. Hire a really good cybersecurity person to make sure the camera is un-hackable. These are too boring to be worth the ARC team’s time. This scenario is a metaphor for this broader class of situations - including GPT-3 lying to you and superintelligences trying to turn you into paperclips. So assume you live in Security Hell where you can never be fully sure your information channels aren’t hacked. The AI can be fully sure - it’s superintelligent or whatever - but you, personally, cannot. Your only solution is to somehow make the AI tell you the truth. And so we return to ELK. We add an ELK head on to the superintelligent security AI, train it to report truthfully on the AI’s innards. Then we ask: is the diamond really safe? The ELK head consults the innards, processes the memory banks of the security vault AI according to some function, and reports back with an answer. The ARC team really like modeling this as Bayes nets: I don’t know much about these, but from the picture I gather they’re kind of like causal diagrams or neural networks - a member in good standing of the broad class of “little circles connected by lines”. Each circle corresponds to some concept in the AI’s innards (roughly defined; the AI’s “concepts” might be completely incomprehensible to us), and anything the AI knows corresponds to some combination of circle activations and weights. We humans also have some Bayes net, probably very different from the AI’s. The job of the ELK head - our purpose-trained truth-telling AI - is to translate from the AI’s ontology to ours and back again. This matters because the reporter (ie the ELK head) needs a good model of both the security AI’s Bayes net, and the human programmer’s Bayes net. But since it has the model of the human Bayes net, it can (if it wants) do inference there. That is, the failure mode is that the reporter AI tries to tell me what I expected to hear, instead of what’s true. One barrier to that is that the reporter AI might not know what I expect to hear. Unfortunately, that barrier is easy to overcome, because it has to know how humans think, so it could ask itself “what would a human expect to hear in this situation?” The ARC team discuss two classes of translation system: The direct translator (good) looks at what the security AI is “thinking”, faithfully translates its conclusions, and accurately relays them to its programmers.
We think that worst-case ELK — i.e. the problem of devising a training strategy to get an AI to report what it knows no matter how [stochastic gradient descent] shapes its mind internally — is one of the most exciting open problems in alignment theory (if not the most exciting one)...If we solve ELK in the worst case then we no longer have to rely on hope and are significantly more likely to survive in worlds where AI progress is fast or humanity’s response is uncoordinated; this is ARC’s Plan A [...]
No direct inline source block was recovered for this mention.