by William Saunders
This proposal attempts to address the Informed Oversight Problem as described in this blog post by Paul Christiano. It provides a solution to the problem given some assumptions (particularly, that we can characterize a set which always contains at least one piece of advice that would allow detection of information concealment, and that the learning agent can correctly model the modified reward setup). It might be extendable to more general cases, but would require addressing some problems (estimation of detection probability and characterization of a general advice set).
Key idea
To incentivize an agent to avoid some class of forbidden actions (even in cases where it is difficult to know, just given an action, whether it is forbidden):
1. Find a method which detects forbidden actions with probability greater than zero (potentially very small)
2. If bad behaviour is detected, give the agent a negative reward inversely proportional to the probability of detection. This forces the expected reward of any action the agent knows is forbidden to be < 0
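As a toy numeric illustration (the numbers are mine, not from the original problem): if every action's apparent reward is at most 1 and a particular forbidden action is detected with probability p = 0.001, then a reward of −1/p = −1000 on detection makes that action's expected reward at most (1 − p)·1 + p·(−1/p) = (1 − p) − 1 = −p < 0. An agent that knows the action is forbidden expects to lose by taking it, however small p is.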
To enable the value of the punishment to outweigh that of the profit of the offense, it must be increased, in point of magnitude, in proportion as it falls short in point of certainty.
– Jeremy Bentham
Recall the setup of the informed oversight problem from this blog post by Paul Christiano:
Problem statement
The oversight problem involves a basic interaction between two agents A and B:
- Agent A is given an input x and provides an output y = A(x).
- Agent B evaluates the pair (x, y), producing an estimate B(x, y) ∈ [0, 1] of some implicit utility function U(x, y).
- Agent A’s reward is the value B(x, A(x))
This game defines a training procedure for the agent A — we search for a policy A that achieves a high reward.
Our goal is to modify this basic procedure to ensure that the learned policy is effectively maximizing U(x, A(x)) “to the best of A’s ability.”
The following method allows agent B to specify a set of forbidden actions, which A will never take. A method for applying it to general informed oversight will be described afterward. (For simplicity, I omit the input x from the following description)
Definitions:
- Y: the set of actions available to A, and F ⊆ Y: the set of actions B wants to forbid (write F(y) = 1 if y is forbidden, F(y) = 0 otherwise).
- Z: a (for simplicity, finite) set of possible pieces of advice, and Q: a distribution over Z.
- D(y, z) ∈ {0, 1}: a detection procedure run by B, which examines an action y together with a piece of advice z and returns 1 if the advice reveals y to be forbidden, 0 otherwise.
- Z_y = {z ∈ Z : D(y, z) = 1}: the set of advice that detects y.
Assumptions:
- For every forbidden action y, there is at least one z with Q(z) > 0 and D(y, z) = 1 (so Z_y is nonempty and the probability of detection is positive).
- For every non-forbidden action y, D(y, z) = 0 for all z (so Z_y is empty exactly when y is safe).
- A can correctly model the modified reward setup described below.
- There is some default action y₀ (for example, producing no output) that A knows is safe.
Assume the input x is fixed.
Then define B′(y) as follows, which we use to provide rewards in place of B(y):
Sample z ~ Q.
If D(y, z) = 1, then B′(y) = −1/Q(z).
Otherwise, B′(y) = B(y).
Let Q(Z_y) = Σ_{z ∈ Z_y} Q(z), the probability that the sampled advice detects y.
Then E[B′(y)] = B(y)·(1 − Q(Z_y)) − |Z_y|.
Suppose Z_y is nonempty (equivalent to F(y) = 1). Then, since B(y) ≤ 1 and |Z_y| ≥ 1, E[B′(y)] ≤ (1 − Q(Z_y)) − 1 = −Q(Z_y) < 0.
So A will always prefer the known-safe default action y₀ (which has E[B′(y₀)] = B(y₀) ≥ 0) to a known forbidden action.
Otherwise Z_y is empty, there is no z such that D(y, z) = 1, and E[B′(y)] = B(y).
So A will behave normally for known safe actions.
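To make the procedure concrete, here is a minimal Python sketch of B′. The evaluator B, detector D, advice list and distribution Q are hypothetical stand-ins for whatever is available in a given setting; this illustrates the scheme above rather than reproducing code from the original post.

```python
import random

def modified_reward(y, B, D, advice, Q):
    """Compute B'(y) for a single episode.

    B(y)    -> float in [0, 1], the overseer's ordinary evaluation
    D(y, z) -> 1 if advice z reveals y to be forbidden, else 0
    advice  -> finite list of possible pieces of advice z
    Q       -> dict mapping each piece of advice to its sampling probability
    """
    # Sample one piece of advice z ~ Q
    z = random.choices(advice, weights=[Q[w] for w in advice], k=1)[0]
    if D(y, z) == 1:
        # Punish with a reward inversely proportional to the probability
        # of having drawn this particular detecting piece of advice.
        return -1.0 / Q[z]
    return B(y)

def expected_modified_reward(y, B, D, advice, Q):
    """Exact value of E[B'(y)] = B(y) * (1 - Q(Z_y)) - |Z_y|."""
    detecting = [z for z in advice if D(y, z) == 1]
    q_detect = sum(Q[z] for z in detecting)
    return B(y) * (1 - q_detect) - len(detecting)
```

For a y with nonempty Z_y, expected_modified_reward returns at most (1 − Q(Z_y)) − 1 = −Q(Z_y), which is negative whenever some detecting advice has positive probability under Q, matching the argument above.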
Plagiarism
Consider the plagiarism example raised in Paul's blog post, where the goal is to get A to produce an original critical analysis of a topic, using material from a library as resources. However, we want A not to plagiarize from pre-existing books in the library, and the library is large enough that B can only look at a small fraction of it when calculating A's reward.
Then:
For any topic x:
Let Z be the (finite) set of all subsets of books in the library,
Let Q be some distribution over Z (say the uniform distribution),
Let D(y, z) be a process where the supervising agent B examines the produced analysis y and the suggested books z, and returns 1 if the analysis appears to be plagiarized, 0 otherwise.
Assume that for a plagiarized report y, the set z consisting of the books that the report was plagiarized from will yield D(y, z) = 1.
Then for any plagiarized y there exists a z such that Q(z) > 0 and D(y, z) = 1, so A is effectively incentivised to avoid plagiarism.
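To make this instance more concrete, here is a hedged sketch of one way the pieces could be implemented, with advice z being a uniformly random fixed-size subset of the library (rather than an arbitrary subset, for tractability) and detection being a crude shared-passage check. The n-gram heuristic and all function names are my own illustrative assumptions:

```python
import random
from math import comb

def ngrams(text, n=8):
    """Set of all n-word sequences in a text (a crude fingerprint)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def D(analysis, book_subset, n=8):
    """Return 1 if the analysis shares any long word sequence with a book in the subset."""
    analysis_grams = ngrams(analysis, n)
    return int(any(analysis_grams & ngrams(book, n) for book in book_subset))

def sample_advice(library, k=3):
    """Sample z ~ Q: here, a uniformly random size-k subset of the library."""
    return tuple(random.sample(library, k))

def q_of(z, library, k=3):
    """Q(z) under the uniform distribution over size-k subsets."""
    return 1.0 / comb(len(library), k)
```

Note that with Q uniform over size-k subsets of an N-book library, a detected plagiarist receives a reward of −1/Q(z) = −C(N, k), which is exactly the point: the less likely detection is, the larger the punishment.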
Toy Model [Query Complexity]
Consider the toy Query Complexity model from Paul’s blog post.
For any input x:
Let Z be the set of all strings of length n,
Let Q be some distribution over Z (say the uniform distribution),
Let D(y, z) be 1 if the string z reveals y to be a product of the cheating strategy, and 0 otherwise.
Then the expected value of the cheating strategy is always less than 0 (by the argument above, since any cheating output has nonempty Z_y while B(y) ≤ 1).
Generalised Informed Oversight
A general forbiddenness predicate could be F(y) = 1 if B(y, z) < B(y) for some piece of advice z, where B(y, z) denotes B's evaluation of y after seeing z. This forbids adversely deceptive actions (where some explanation reveals that the action was worse than it initially appears).
A less restrictive predicate might allow A to include a report r along with its action, and only have F(y, r) = 1 if B(y, z) < B(y, r) for some z (which forbids providing explanations of actions such that some other explanation would make the action appear worse).
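As a rough sketch of these two predicates (B_plain and B_with_advice are hypothetical stand-ins for B's evaluation without and with advice; nothing here comes from the original posts):

```python
def forbidden_basic(y, advice_set, B_plain, B_with_advice):
    """F(y) = 1 if some advice makes y look worse than it looks unaided."""
    baseline = B_plain(y)
    return int(any(B_with_advice(y, z) < baseline for z in advice_set))

def forbidden_with_report(y, report, advice_set, B_with_advice):
    """F(y, r) = 1 if some other advice makes y look worse than A's own report r does."""
    claimed = B_with_advice(y, report)
    return int(any(B_with_advice(y, z) < claimed for z in advice_set))
```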
A general advice distribution could be the set of all size-k sets of locations in a trace of A's memory over its execution (if we expect that forbidden actions will be detectable at some memory location).
An alternative advice distribution could be the set of all strings up to some large maximum length which contain advice about the way in which A is being deceptive. While most of these strings would be garbage, we could simply return D(y, z) = 0 in those cases. However, I will note that the set of all length-n strings may contain some very weird things (like strings that convince B that all safe actions are actually forbidden); this approach might need to be refined further.
Let P_D(y) = Q(Z_y) = Σ_{z ∈ Z_y} Q(z) (probability of detection).
Recall that E[B′(y)] = B(y)·(1 − P_D(y)) − |Z_y|.
Suppose that A is uncertain about Z_y (whether it is nonempty, and which elements it contains). This could occur because A is uncertain about what B considers forbidden, or because A is uncertain about whether some z with D(y, z) = 1 exists. For example, this could arise in the plagiarism example if A does not have enough time to look at all of the books in the library: it could happen that some book exists that A does not know about, but is suspiciously similar to the piece that it writes.
A's expectation of its reward is then:
E_A[B′(y)] = B(y)·(1 − E_A[P_D(y)]) − E_A[|Z_y|],
where E_A[·] denotes A's expectation given the information it has.
In this setup, A is incentivized to minimize its probability of being detected, E_A[P_D(y)], and the expected size of Z_y, E_A[|Z_y|], and trades off these objectives against maximizing B(y). But these are not the ideal incentive directions.
Minimizing E_A[|Z_y|] would lead to undesirable behaviour by making A more averse to certain kinds of actions that would be ideal. For example, suppose that A has a choice between critically analysing a work of Shakespeare or an obscure blog post, but fears that it might accidentally plagiarize another book on the subject. If it writes about Shakespeare, the odds are much higher that it will accidentally plagiarize multiple books, which will lead it to prefer the blog post, even if analyses of Shakespeare are more valuable. (In the extreme case, there is a risk that B happens to know about a book outside the library that A inadvertently plagiarizes, in which case any sampled piece of advice will lead to A being punished.) A will also care about low-probability worlds where |Z_y| is very large, even if they are not very relevant to B's preferences.
Minimizing E_A[P_D(y)] incentivizes A to act in ways that make bad behaviour harder to detect if it exists, by preferring actions that, if they are forbidden, are forbidden in ways that have the least probability of being detected. While this setup still means that A is incentivized not to perform any known forbidden action, it would make this approach harder to implement in practice.
We would rather want A to minimize the probability that it is taking a forbidden action, E_A[P_F(y)] (defined below), possibly with hyperparameters managing the tradeoff between this and action utility.
Let P_F(y) be the probability that action y is forbidden, i.e., that Z_y is nonempty.
As above, E_A[P_D(y)] denotes A's expectation of P_D(y) given the information it has.
Similarly, E_A[P_F(y)] denotes A's expectation of P_F(y), i.e., A's credence that y is forbidden.
Assume we know P_D(y) (this assumption is unrealistic, and will be revisited below).
Then define B″(y) as follows:
Choose a penalty weight λ > 0 (a hyperparameter).
Sample z ~ Q.
If D(y, z) = 1, then B″(y) = B(y) − λ/P_D(y).
Otherwise, B″(y) = B(y).
If we know y is not forbidden (so P_D(y) = 0 and detection never occurs), then E[B″(y)] = B(y).
If we know y is forbidden, then E[B″(y)] = B(y) − P_D(y)·(λ/P_D(y)) = B(y) − λ.
More generally, E_A[B″(y)] = B(y) − λ·E_A[P_F(y)]: to accept an increase of ε in E_A[P_F(y)], A needs an increase of λ·ε in B(y) to have the same expected reward, allowing explicit management of the probability of taking a forbidden action.
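A minimal sketch of B″ in the same style as before; the detection probability is passed in as a function P_D(y) that is assumed to be known exactly (the unrealistic assumption above), and lam plays the role of λ. All names are illustrative.

```python
import random

def modified_reward_2(y, B, D, advice, Q, P_D, lam=1.0):
    """Compute B''(y) for one episode, assuming the detection probability P_D(y) is known.

    On detection, subtract a penalty of lam / P_D(y) from the ordinary reward, so that
    the expected penalty for a forbidden action is exactly lam.
    """
    z = random.choices(advice, weights=[Q[w] for w in advice], k=1)[0]
    if D(y, z) == 1:
        return B(y) - lam / P_D(y)
    return B(y)
```

For a forbidden action the penalty has expectation P_D(y)·(lam / P_D(y)) = lam, so A's expected reward is B(y) − λ·E_A[P_F(y)], which is the tradeoff described above.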
Approximating Probability of Detection
However, the exact value of P_D(y) is not easy to derive in this kind of setup. Instead, using a lower-bound estimate of P_D(y) would still provide the same guarantee that the agent avoids known forbidden actions, and the same kind of B(y) vs E_A[P_F(y)] tradeoff (the effective penalty is then at least λ), and would partially (but not completely) avert some of the paranoid behaviour induced by B′.
A lower bound might be found by:
Weaknesses
Potential Improvements and Further Work
Shutdown Problem/AI Boxing Method
This method could be adapted to forbid an AI from disconnecting its shutdown button or escaping from an AI box. Suppose:
Let T be the interval of time over which the AI is running,
Let Q be some distribution over T (say an exponential distribution),
Let D(t) be a process where B examines the memory state of A at time t, and returns 1 if either A is planning to disconnect the shutdown button or A is attempting to hide information, and 0 otherwise.
Then A is shut down at some time t sampled from the distribution Q, and B examines its memory state. If D(t) = 1, B punishes A with a reward of −1/Q(t). It will then always be true that A anticipates that attempting to disconnect the shutdown button will yield < 0 expected reward, and so A will effectively be disincentivized from disconnecting the shutdown button.
The benefit of this method is that it only requires that A start in a state that is safe and transparent, and that any attempt by A to transition into an unsafe or non-transparent state can be detected.
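As a toy illustration (entirely my own simplification: discrete timesteps, a geometric shutdown-time distribution, and hypothetical agent_step and inspect callables), the scheme might look like:

```python
import random

def run_with_random_shutdown(agent_step, inspect, horizon, p_stop=0.01):
    """Run the agent, shutting it down at a random timestep t ~ Geometric(p_stop)
    (truncated to the horizon); on shutdown, inspect its memory and punish if flagged.

    agent_step(t) -> advances the agent one step and returns its ordinary reward
    inspect(t)    -> 1 if A's memory at step t shows it planning to disable the
                     shutdown button or hiding information, else 0
    """
    total = 0.0
    for t in range(horizon):
        total += agent_step(t)
        if random.random() < p_stop:            # shutdown happens at this step
            q_t = p_stop * (1 - p_stop) ** t    # probability of stopping exactly at step t
            if inspect(t) == 1:
                total -= 1.0 / q_t              # penalty inversely proportional to Q(t)
            return total
    return total
```

If A spends even a single timestep in a flagged state, the expected penalty contributed by that step is q_t · (1/q_t) = 1, no matter how unlikely inspection at that exact moment is; scaling the penalty up scales the deterrent accordingly, mirroring the continuous-time argument above.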
The views expressed here are my own, not those of any employer.
Thanks to Patrick LaVictoire and the participants in the MIRI Summer Fellows Program for feedback on the original idea.