Why can't we just turn the AI off if it starts to misbehave?

2 min read

We can shut down weaker systems, and this is a useful guardrail against certain types of problems caused by narrow AI. However, many people expect that we won't be able to turn an AGI off unless it was corrigible (i.e., willing to let humans adjust it), and that AGI won't be corrigible by default There may be a period in the early stages of an AGI's development where it would try to convince us not to shut it down while hiding itself and/or recursively self-improving and/or making copies of itself onto every server on Earth.

One key reason it wouldn't be simple to shut down a non-corrigible system is that self-preservation is an instrumentally convergent goal. An AI with basically any goal — say, bringing you coffee — will have an instrumental reason to avoid being turned off because you can’t fetch the coffee if you are dead. A sufficiently-intelligent AI could do this by taking control of systems put in place to control it or by deceiving us about its true intentions.

Aren't there easy solutions to AI alignment?

What is corrigibility?