ARC

Article

ARC is a recurring name in the Astral Codex Ten archive, appearing 4 times across 4 issues between June 17, 2022 and September 19, 2025. The recovered passages suggest the mentions do not all refer to one organization: ARC appears as an acronym (“ARC is an acronym for Affordable, Robust, Compact and is an Iron Man reference”), as the Alignment Research Center (“ARC notes that neural nets can have multiple ‘heads’ on the same ‘body’”; “a list of ARC’s attempts to solve the problem so far”), and as Engelbart’s Augmentation Research Center. It most often appears alongside astralcodexten.com, CERN, and ELK.

Metadata

  • Category: Organizations
  • Mention count: 4
  • Issue count: 4
  • First seen: June 17, 2022
  • Last seen: September 19, 2025

Source Context

Recovered passages from the original issue text. When the raw archive preserved outbound links inside the source passage, they are listed directly under the quote.

June 17, 2022 · Original source
[17]: SPARC is a nested acronym for Smallest Possible ARC. ARC is an acronym for Affordable, Robust, Compact and is an Iron Man reference.
July 26, 2022 · Original source
Extended far enough, this line of thinking leads to ELK (Eliciting Latent Knowledge), a technical report / contest / paradigm run by the Alignment Research Center (including familiar names like Paul Christiano).
DALL-E: “A beast with seven heads and ten horns, and upon his horns ten crowns, and upon his heads the name of blasphemy.” Probably just a coincidence.
III. Ipso Facto, Ergo ELK
The ELK Technical Report And Contest are a list of ARC’s attempts to solve the problem so far, and a call for further solutions. It starts with a toy problem: a superintelligent security AI guarding a diamond. Every so often, thieves come in and try to steal the diamond, the AI manipulates some incomprehensible set of sensors and levers and doodads and traps, and the theft either succeeds or fails.
[Illustration caption: Everyone agrees that trying to understand ELK is terrible, so please accept these delightful illustrations by María Gutiérrez-Rojas as compensation.]
We train the AI by running millions of simulations where it plays against simulated thieves. At first it flails randomly. But as time goes on, it moves towards strategies that make it win more often, learning more and more about how to deploy its doodads and traps most effectively. As it approaches superintelligence, it even starts extruding new traps and doodads we didn’t design, things we have no idea what they even do.
Things get spooky. A thief comes in, gets to the diamond, then just seems to vanish.
[Illustration caption: Another ELK report illustration. In the top part, we easily understand what’s happening - the AI is activating a trap door, plunging the thief into a spike pit. In the bottom part, we’re not sure. The AI does something incomprehensible, and all we know is that the thief is gone and the diamond is intact.]
This is good - we wanted a superintelligent security AI, and we got one. But we can no longer evaluate its reasoning. All we can do is judge its results: is the diamond still there at the end of the simulation? If we see the diamond, we press the REWARD lever; if it’s gone, we press the PUNISHMENT lever.
[Illustration caption: The training process. The AI does some incomprehensible thing. We check whether the diamond is safe or not. Then we rate it as good or bad.]
The AI gradient descends away from bad strategies, towards good ones. Eventually we’ve trained the AI very well and it has an apparent 100% success rate. What could go wrong?
If we’re very paranoid, we might notice that the task at which we have a 100% success rate is causing the AI to get good ratings. How does the AI get good ratings? By making us think the diamond is safe. Hopefully this is correlated with the diamond actually being safe. But we haven’t proven this, have we?
Suppose the simulated thief has hit upon the strategy of taping a photo of the diamond to the front of the camera lens. At the end of the training session, the simulated thief escapes with the diamond. The human observer sees the camera image of the safe diamond and gives the strategy a “good” rating. The AI gradient descends in the direction of helping thieves tape photos to cameras.
[Illustration caption: Notice the “reality” section of the third example. The thief has made it look (to the human) like the diamond is safe. The human sees a diamond and positively reinforces the AI. The AI learns that thieves stealing the diamonds and fooling humans about it is great.]
It’s important not to think of this as the thief “defeating” or “fooling” the AI. The AI could be fully superintelligent, able to outfox the thief trivially or destroy him with a thought, and that wouldn’t change the situation at all. The problem is that the AI was never a thief-stopping machine. It was always a reward-getting machine, and it turns out the AI can get more reward by cooperating with the thief than by thwarting him.
So the interesting scientific point here isn’t “you can fool a camera by taping a photo to it”. The interesting point is “we thought we were training an AI to do one thing, but actually we had no idea what was going on, and we were training it to do something else”.
In fact, maybe the thief never tries this, and the AI comes up with this plan itself!
In the process of randomly manipulating traps and doodads, it might hit on the policy of manipulating the images it sends through the camera. If it manipulates the image to look like the diamond is still there (even when it isn’t), that will always get good feedback, and the AI will be incentivized to double down on that strategy.
Much like in the GPT-3 example, if the training simulations include examples of thieves fooling human observers which are marked as “good”, the AI will definitely learn the goal “try to convince humans that the diamond is safe”. If the training simulations are perfect and everyone is very careful, it will just maybe learn this goal - a million cases of the diamond being safe and humans saying this is good fail to distinguish between “good means the diamond is safe” and “good means humans think the diamond is safe”. The machine will make its decision for inscrutable AI reasons, or just flip a coin. So, again, are you feeling lucky?
IV. “I’m Not Owned!” I Continue To Insist As I Slowly Shrink And Transform Into A Paperclip
In real life there are some trivial kludges you might try first here. Force the AI to give its human raters 3D models of the entire room instead of just camera images. Hire a really good cybersecurity person to make sure the camera is un-hackable. These are too boring to be worth the ARC team’s time.
This scenario is a metaphor for this broader class of situations - including GPT-3 lying to you and superintelligences trying to turn you into paperclips. So assume you live in Security Hell where you can never be fully sure your information channels aren’t hacked. The AI can be fully sure - it’s superintelligent or whatever - but you, personally, cannot. Your only solution is to somehow make the AI tell you the truth.
And so we return to ELK. We add an ELK head on to the superintelligent security AI, train it to report truthfully on the AI’s innards. Then we ask: is the diamond really safe?
The ELK head consults the innards, processes the memory banks of the security vault AI according to some function, and reports back with an answer.
The ARC team really like modeling this as Bayes nets:
I don’t know much about these, but from the picture I gather they’re kind of like causal diagrams or neural networks - a member in good standing of the broad class of “little circles connected by lines”. Each circle corresponds to some concept in the AI’s innards (roughly defined; the AI’s “concepts” might be completely incomprehensible to us), and anything the AI knows corresponds to some combination of circle activations and weights. We humans also have some Bayes net, probably very different from the AI’s. The job of the ELK head - our purpose-trained truth-telling AI - is to translate from the AI’s ontology to ours and back again.
This matters because the reporter (ie the ELK head) needs a good model of both the security AI’s Bayes net, and the human programmer’s Bayes net. But since it has the model of the human Bayes net, it can (if it wants) do inference there. That is, the failure mode is that the reporter AI tries to tell me what I expected to hear, instead of what’s true. One barrier to that is that the reporter AI might not know what I expect to hear. Unfortunately, that barrier is easy to overcome, because it has to know how humans think, so it could ask itself “what would a human expect to hear in this situation?”
The ARC team discuss two classes of translation system:
The direct translator (good) looks at what the security AI is “thinking”, faithfully translates its conclusions, and accurately relays them to its programmers.
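Editorial illustration (not part of the quoted source): the passage argues that ratings based on what the human sees cannot distinguish a reporter that relays what the AI knows from one that says what the human expects to hear. The toy sketch below may help make that concrete. Every name in it (World, direct_reporter, expectation_reporter, agreement) is hypothetical and chosen only for this example, and the two-field world model is an assumed simplification of the diamond scenario, not ARC's formalism.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class World:
    diamond_present: bool  # ground truth, assumed known to the security AI
    camera_tampered: bool  # True if the feed shows a fake image of the diamond


def camera_shows_diamond(w: World) -> bool:
    # A tampered feed shows a diamond regardless of what is really in the vault.
    return w.camera_tampered or w.diamond_present


def human_rating(w: World) -> bool:
    # The human rater only sees the camera, so "good" means "camera shows a diamond".
    return camera_shows_diamond(w)


def direct_reporter(w: World) -> bool:
    # Hypothetical direct translator: reports the ground truth the AI knows.
    return w.diamond_present


def expectation_reporter(w: World) -> bool:
    # Hypothetical failure mode: reports what the human would conclude from the camera.
    return camera_shows_diamond(w)


def agreement(reporter: Callable[[World], bool], worlds: List[World]) -> float:
    # Fraction of worlds where the reporter's answer matches the human rating.
    matches = sum(reporter(w) == human_rating(w) for w in worlds)
    return matches / len(worlds)


# Assumed "careful" training set: the camera is never tampered with.
honest_worlds = [World(True, False), World(False, False)]
# A case outside that training set: diamond stolen, photo taped over the lens.
tampered = World(diamond_present=False, camera_tampered=True)

print(agreement(direct_reporter, honest_worlds))       # 1.0
print(agreement(expectation_reporter, honest_worlds))  # 1.0
print(direct_reporter(tampered), expectation_reporter(tampered))  # False True
```

Both reporters match the human ratings perfectly on the untampered training worlds, so the ratings alone give no reason to prefer one over the other; they only come apart on the tampered case, which is exactly where the human cannot check.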
November 28, 2022 · Original source
No direct inline source block was recovered for this mention.
September 19, 2025 · Original source
After his Framework was published in 1962, under the Stanford Research Institute, Engelbart founded the Augmentation Research Center to make, in essence, some version of the Memex a reality. The ARC received funding from NASA and ARPA, and after six years, Engelbart released his oN-Line System (NLS). It was a revelation.