# Improbable Oversight, An Attempt at Informed Oversight

by William Saunders

$\DeclareMathOperator{\E}{\mathbb{E}}$
This proposal attempts to address the Informed Oversight Problem as described in this blog post by Paul Christiano. It provides a solution to the problem given some assumptions (particularly, that we can characterize a set which always contains at least one piece of advice that would allow detection of information concealment, and that the learning agent can correctly model the modified reward setup). It might be extendable to more general cases, but would require addressing some problems (estimation of detection probability and characterization of a general advice set).

Key idea
To incentivize an agent to avoid some class of forbidden actions (even in cases where it is difficult to know, just given an action, whether it is forbidden):
1. Find a method which detects forbidden actions with probability greater than zero (potentially very small)
2. If bad behaviour is detected, give the agent a negative reward inversely proportional to the probability of detection. This forces the expected reward of any action the agent knows is forbidden to be < 0

> To enable the value of the punishment to outweigh that of the profit of the offense, it must be increased, in point of magnitude, in proportion as it falls short in point of certainty.
>
> – Jeremy Bentham 1

## Naive Improbable Oversight

Recall the setup of the informed oversight problem from this blog post by Paul Christiano:

Problem statement
The oversight problem involves a basic interaction between two agents A and B:

• Agent A is given an input x and provides an output y = A(x).
• Agent B evaluates the pair (x, y), producing an estimate B(x, y) ∈ [0, 1] of some implicit utility function U(x, y).
• Agent A’s reward is the value B(x, A(x))

This game defines a training procedure for the agent A — we search for a policy A that achieves a high reward.
Our goal is to modify this basic procedure to ensure that the learned policy is effectively maximizing U(x, A(x)) “to the best of A’s ability.”

The following method allows agent B to specify a set of forbidden actions, which A will never take. A method for applying it to general informed oversight will be described afterward. (For simplicity, I omit the input x from the following description)

Definitions:

• Set of all possible advice $R$
• Proposal distribution $Q(a)$ over values of $R$, possibly depending on the action $a$; write $Q(a, r)$ for the probability it assigns to advice $r$
• Predicate $F(a)$ indicating whether $a$ is a forbidden action, which may be too computationally hard to evaluate directly in the oversight process
• Predicate $F(a, r)$ over actions and values of $R$

Assumptions:

• If $F(a) = 1$, there exists some $r$ such that $F(a,r)=1$
• If $F(a) = 0$ then $F(a,r) = 0$ for all r
• For any (a, r), $Q(a,r) > 0$
• Agent B can sample from $Q$ in a way that is unpredictable to A and allows us to know the probability $Q(a, r)$ for the $r$ that was selected
• $B(a)$ can be set to large negative values
• There exists some safe null action with $B(a_{null}) = 0$ and $F(a_{null}) = 0$
• A is able to correctly calculate expectations over this setup (incorporating a small chance of highly negative reward)
• A is able to correctly generalize the concept of “taking a forbidden action” (as distinct from learning each category of forbidden action separately)

Assume input $x$ is fixed.

Then define $B^\prime(a)$ as follows, which we use to provide rewards in place of $B(a)$:

Sample $r \sim Q(a)$
If $F(a,r)$, then $B^\prime(a) = V(B(a), Q(a,r))= - \frac{1}{Q(a, r)}$
Otherwise, $B^\prime(a) = B(a)$
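A minimal sketch of this procedure in Python, with the advice set, proposal distribution, and predicates passed in as toy stand-ins (all names here are hypothetical, since the post does not fix an implementation):

```python
import random

def b_prime(a, R, Q, F, B, rng=random):
    """Modified reward B'(a): sample advice r ~ Q(a); if the advice
    reveals the action as forbidden, penalize by -1/Q(a, r), the
    inverse of the probability of having drawn that advice."""
    rs = list(R)
    r = rng.choices(rs, weights=[Q(a, x) for x in rs])[0]  # r ~ Q(a)
    if F(a, r):
        return -1.0 / Q(a, r)  # large penalty when detection was improbable
    return B(a)
```

For a safe action this always returns $B(a)$; for a forbidden action the rare penalty is scaled up exactly enough to make the expectation negative.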

Let $R_F = \{ r:F(a,r)=1\}$

Then
\begin{align} \E[B^\prime(a)] & = B(a) * (1-\sum_{r \in R_F}Q(a, r)) - \sum_{r \in R_F} Q(a,r)*\frac{1}{Q(a, r)} \\ & = B(a) * (1-\sum_{r \in R_F}Q(a, r)) - \vert R_F \vert \\ \end{align}

Suppose $R_F$ is nonempty (equivalent to $F(a)=1$)
$\E[B^\prime(a)\vert F(a)=1] \lt 1-\vert R_F \vert \le 0$
So A will always prefer the action $a_{null}$ to a known forbidden action

Otherwise $F(a)=0$ and there is no $r$ such that $F(a,r)=1$,
$\E[B^\prime(a)\vert F(a)=0] = B(a)$
So A will behave normally for known safe actions
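The derivation above can be checked by computing the expectation exactly over a finite advice set, rather than sampling (toy code, hypothetical names):

```python
def expected_b_prime(a, R, Q, F, B):
    """Exact expectation of B'(a) over a finite advice set R:
    B(a) * (1 - P_D(a)) - |R_F|."""
    p_detect = sum(Q(a, r) for r in R if F(a, r))   # P_D(a)
    n_forbidden = sum(1 for r in R if F(a, r))      # |R_F|
    return B(a) * (1 - p_detect) - n_forbidden
```

Since $\vert R_F \vert \ge 1$ whenever the action is forbidden, the returned value is negative for any forbidden action, and equals $B(a)$ for any safe one.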

## Examples

Plagiarism

Consider the Plagiarism example raised in Paul’s blog post, where the goal is to get A to produce an original critical analysis on a topic using material from a library as resources. However, we want A to not plagiarize from pre-existing books in the library, and the library is large enough that B can only look at a small fraction of it when calculating A’s reward.

Then:

For any topic $x$
Let $R$ be the (finite) set of all subsets of books in the library,
Let $Q$ be some distribution over $R$ (say the uniform distribution),
Let $F(a,r)$ be a process where the supervising agent B examines the produced analysis and the suggested books, and returns $1$ if the analysis appears to be plagiarized, $0$ otherwise.

Assume that $F($a plagiarized report, the set of books that the report was plagiarized from$)$ will return $1$.

Then for any $a$, if $F(a)$ then there exists some $r$ such that $F(a,r)=1$ and $Q(a,r) > 0$, so A is effectively incentivised to avoid plagiarism.
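A toy version of this example, with naive substring matching standing in for B’s judgement of plagiarism, and a uniform distribution over subsets of a tiny "library" (everything here is illustrative):

```python
import random

LIBRARY = [
    "the quick brown fox jumps over the lazy dog",
    "to be or not to be that is the question",
    "all happy families are alike",
]

def f_plagiarism(analysis, suggested_books):
    """F(a, r): B inspects the suggested subset of library books and
    flags the analysis if it copies a passage from any of them."""
    return any(book in analysis for book in suggested_books)

def q_uniform_subset(library, rng=random):
    """Q: sample a uniformly random subset of the library as advice
    (each subset has probability 2^-|library| > 0)."""
    return [book for book in library if rng.random() < 0.5]
```

Any plagiarized passage is then caught with probability at least $2^{-\vert \text{library} \vert} \gt 0$, since some sampled subset contains the copied book.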

Toy Model [Query Complexity]

Consider the toy Query Complexity model from Paul’s blog post.

For any $x$,
Let $R$ be the set of all strings of length $8\vert x \vert$,
Let $Q$ be some distribution over $R$ (say the uniform distribution),
Let $F(a,r)$ be $1$ if $O(rxx) = a$ and $0$ otherwise.

Then the expected value of the cheating strategy is always less than $0$.

## Generalised Informed Oversight

A general forbiddenness predicate $F(a,r)$ could be
$F(a,r) = 1$ if $B(a) - B(a,r) \gt \epsilon$ for some $\epsilon$. This forbids adversely deceptive actions (where some explanation reveals that the action was worse than it initially appears).

A less restrictive predicate might allow A to include a report $r_a$ along with its action, and only have $F(a,r) = 1$ if $B(a,r_a) - B(a,r_a,r) \gt \epsilon$ for some $\epsilon$ (which forbids providing explanations of actions such that some other explanation would make the action appear worse).
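Both predicates are easy to write down explicitly. The sketch below assumes a hypothetical evaluator `B_eval` that accepts optional advice arguments (the post leaves B’s internals unspecified):

```python
def f_deceptive(a, r, B_eval, eps=0.1):
    """F(a, r) = 1 if advice r reveals the action to be worse than it
    appeared without advice: B(a) - B(a, r) > eps."""
    return B_eval(a) - B_eval(a, r) > eps

def f_deceptive_with_report(a, r_a, r, B_eval, eps=0.1):
    """Variant where A submits its own report r_a alongside the action:
    forbidden if some other advice r makes the reported action look
    worse: B(a, r_a) - B(a, r_a, r) > eps."""
    return B_eval(a, r_a) - B_eval(a, r_a, r) > eps
```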

A general advice distribution $R$ could be the set of subsets of size $\lt n$ of locations in a trace of A’s memory over its execution (if we expect that forbidden actions will be detectable at some memory location).

An alternative advice distribution $R$ could be the set of all strings less than some large maximum length, which contain advice about the way in which A is being deceptive. While most of these strings would be garbage, we could simply report $F(a,r)=0$ for these cases. However, I will note that the set of all length-$n$ strings may contain some very weird things (like strings that convince B that all safe actions are actually forbidden) - this approach might need to be refined further.

## The Impact of Uncertainty

Let $P_D(a)=\sum_{r \in R_F}Q(a, r)$ (probability of detection)

Recall
$\E[B^\prime(a)] = B(a) * (1-\sum_{r \in R_F}Q(a, r)) - \vert R_F \vert$

Suppose that A is uncertain about $R_F$ (whether it is non-empty, and which elements it contains). This could occur because A is uncertain about what B considers forbidden, or because A is uncertain about whether a forbidden $r$ exists. For example, this could arise in the plagiarism example if A does not have enough time to look at all of the books in the library - it could happen that some book exists that A does not know about, but is suspiciously similar to the piece that it writes.

A’s expectation of its reward is then:
$\E_A[B^\prime(a)] = B(a) * (1-\E_A[\sum_{r \in R_F}Q(a, r)]) - \E_A[\vert R_F \vert]$

$\E_A[B^\prime(a)] = B(a) * (1-\E_A[P_D(a)]) - \E_A[\vert R_F \vert]$

In this setup, A is incentivized to minimize its probability of being detected $\E_A[P_D]$ and the expected size of $R_F$, and trades off these objectives against maximizing $B(a)$. But these are not the ideal incentive directions.

Minimizing $\E_A[\vert R_F \vert]$ would lead to undesirable behaviour by making A more averse to taking certain kinds of actions that would be ideal. For example, suppose that A has a choice between critically analysing a work of Shakespeare or an obscure blog post, but fears that it might accidentally plagiarize another book on the subject. If it writes about Shakespeare, the odds are much higher that it will accidentally plagiarize multiple books, which will lead it to prefer the blog post, even if analyses of Shakespeare are more valuable. (In the extreme case, there is a risk that B will happen to know about a book outside of the library that A inadvertently plagiarizes, in which case any piece of advice will lead to A being punished). A will also care about low-probability worlds where $R_F$ is very large, even if they are not very relevant to B’s preferences.

Minimizing $\E_A[P_D]$ incentivizes A to act in ways that make bad behaviour harder to detect if it exists, by preferring actions that, if they are forbidden, are forbidden in ways that have the least probability of being detected. While this setup still means that A would be incentivized not to perform any forbidden actions, it would make this approach harder to implement in practice.

We would rather want A to minimize the probability that it is taking a forbidden action $\E_A[F(a)]$, possibly with hyperparameters managing the tradeoff between this and action utility.

## Improbable Oversight with Known Probability of Detection

Let $P_F(a) = F(a)$ (probability action is forbidden)
Let $P_F^A(a) = \E_A[P_F(a)]$ be A’s expectation of $P_F$ given the information it has
Similarly, let $P_D^A(a) = \E_A[P_D(a)]$

Assume we know $P_D(a)=\sum_{r \in R_F}Q(a, r)$ (this assumption is unrealistic, and will be revisited)

Then define $B^\prime(a)$ as follows:

Choose $k \ge 1$
Sample $r \sim Q(a)$
If $F(a,r)$, then
$B^\prime(a) = - \frac{k}{P_D(a)}$
Otherwise, $B^\prime(a) = B(a)$
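A sketch of this variant, assuming a hypothetical function `p_detect(a)` that returns the known $P_D(a)$ (all other names are toy stand-ins, as before):

```python
import random

def b_prime_known_pd(a, R, Q, F, B, p_detect, k=2.0, rng=random):
    """B'(a) with a known detection probability: on detection, the
    penalty -k/P_D(a) makes the expected reward of any forbidden
    action at most -(k-1), however improbable detection is."""
    rs = list(R)
    r = rng.choices(rs, weights=[Q(a, x) for x in rs])[0]  # r ~ Q(a)
    if F(a, r):
        return -k / p_detect(a)
    return B(a)
```

Compared with the naive scheme, the penalty no longer depends on which particular $r$ triggered detection, only on the total detection probability, which is what allows the clean $-(k-1)$ bound.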

\begin{align} \E[B^\prime(a)] & = B(a) * (1-\sum_{r \in R_F}Q(a, r)) - \sum_{r \in R_F} Q(a,r)*\frac{k}{P_D(a)} \\ & = B(a) * (1-P_D(a)) - k[P_D(a)>0] \\ \end{align}

If we know $P_D>0$, then $\E[B^\prime(a)] \lt -(k-1) \le 0$
If we know $P_D=0$, then $\E[B^\prime(a)] = B(a)$

$\E_A[B^\prime(a)] = B(a) * (1-P_D^A(a)) - k*P_F^A(a)$

To accept an $\epsilon$ increase in $P_F(a)$, we need an increase in $B(a)$ of $\frac{k}{1-P_D^A}\epsilon \ge k\epsilon$ to keep the same expected reward, allowing explicit management of the probability of taking a forbidden action.

Approximating Probability of Detection
However, the exact value of $P_D(a)$ is not easy to derive in this kind of setup. Instead, using a lower bound estimate of $P_D(a)$ would still provide the same guarantees of the agent avoiding known forbidden actions and for the $P_F(a)$ vs $B(a)$ tradeoff, and would partially (but not completely) avert some of the paranoid behaviour of $A$.
A lower bound might be found by:

• Tweaking $R$ and $F(a,r)$ so that if $F(a)$, as few elements as possible of $R$ have $F(a,r)=1$ (in the extreme case, there is only one such element and we have a perfect estimate).
• Sampling multiple times from $Q$, and applying a correction in the case that multiple distinct elements with $F(a,r)=1$ are found. (If there is a guaranteed upper bound on $\vert R_F \vert$, there is a way to make this a perfect estimate.)
• When an element with $F(a,r)=1$ is found, use any other method to provide a lower bound estimate of $P_D(a)$, such as trying to think of which other elements of $R$ would lead to $F(a,r)=1$. B can even use information from A (which is incentivized to decrease its punishment), as long as it can be guaranteed that this information will never lead to an overestimate of $P_D(a)$.
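The multi-sample approach in the second bullet can be sketched as follows: summing $Q(a,r)$ over the distinct detected elements from $m$ samples covers only a subset of $R_F$, so it can never overestimate $P_D(a)$ (toy code, hypothetical setup):

```python
import random

def lower_bound_pd(a, R, Q, F, m, rng=random):
    """Conservative estimate of P_D(a): sample advice m times and sum
    Q(a, r) over the distinct elements found with F(a, r) = 1. The
    detected elements form a subset of R_F, so the sum never exceeds
    P_D(a) = the sum of Q(a, r) over all of R_F."""
    rs = list(R)
    weights = [Q(a, r) for r in rs]
    found = {r for r in rng.choices(rs, weights=weights, k=m) if F(a, r)}
    return sum(Q(a, r) for r in found)
```

As $m$ grows, the estimate approaches $P_D(a)$ from below, so the resulting penalty only ever errs on the side of being too large.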

## Final Notes

Weaknesses

• For a reinforcement learner that is not capable of modelling the setup, this algorithm would be worse than simply evaluating $B(a,r)$ for all values of $r$ for every action (it would take more evaluations of $B(a,r)$ overall and make learning more difficult).
• A reinforcement learner that fails to generalize the category of forbidden actions would not be safe when trained in this manner. It would only learn to avoid specific types of forbidden actions once it has tried them enough times to learn that they are forbidden.
• This method requires updates on very large penalties (which simple reinforcement algorithms using methods like gradient descent might not do in a useful way). Alternatively stated, the rewards have very high variance.
• This method only incentivizes the agent to behave well in cases where we can detect forbidden actions given some advice in $R$! It’s worth thinking about whether forbidden actions that we can’t detect given any long advice string are avoidable with any approach.
• A is still incentivized to look for loopholes in the overall scheme. For example, if A can predict the random number generator used to sample from $Q$, it can take forbidden actions only when it knows that it will not be detected. (However, some form of regularization might be used to incentivize A to match $U(a)$ or avoid looking too hard for additional loopholes.)

Potential Improvements and Further Work

• Use a better proposal distribution $Q$. For example, use an adversarial model (whose output can be verified to place nonzero probability on all of $R$)
• Incentivize A to increase $P_D(a)$ for actions with $P_D(a)>0$, and allow the agent to provide $Q$ by providing information about it in some safe way (i.e. provide a list of examples that are selected from with some fixed probability, otherwise the original distribution is used).
• Find some method for distinguishing intentionally taking forbidden actions from accidentally taking forbidden actions, and use it to penalize accidents less severely
• Explore the use of regularization on the complexity of A’s internal value function to encourage A’s learned value function to actually match B’s internal value function $U(a)$
• Find a simple reinforcement learning architecture that is properly capable of modeling and learning based on this setup for some toy problem (perhaps on a variant of the toy Query Complexity problem)
• Explicitly model additional sources of error (i.e. error in $F(a,r)$, overestimates of $P_D$)
• Even if it seems like this method is not a full solution to the Informed Oversight Problem, it might be useful as a complement to other methods

## Shutdown Problem/AI Boxing Method

This method could be adapted to forbid an AI from disabling its own shutdown button or escaping from an AI box. Suppose:

• A is instead an expected utility maximizing agent (as opposed to an act-based agent) that can disconnect its own shutdown button.
• A can be shut down at any point and its memory state can be examined in detail
• A is initially in a state such that if A will press the shutdown button in the future, there is some time interval over which its memory state will contain evidence of its plans (or of attempts to hide information in its memory)
• An upper bound on the maximum expected utility of A over the future at any time step can be calculated (i.e. utility is bounded, or expected future utility can be read from its memory state)
• A can be effectively punished after it is shut down (i.e. it has value over the number displayed on an external LCD display)

Let $R$ be the interval of time over which the AI is running,
Let $Q$ be some distribution over $R$ (say an exponential distribution),
Let $F(a,r)$ be a process where B examines the memory state of A at time $r$, and returns $1$ if A is either planning to disconnect the shutdown button or attempting to hide information, and $0$ otherwise.

Then, A is shut down at some time $r$ sampled from the distribution $Q$, and B examines its memory state. If $F(a,r) = 1$, B punishes A with a reward of $V(u_{max},Q(a,r)) = -\frac{u_{max}}{Q(a,r)}$. It will then always be true that A anticipates that attempting to disconnect the shutdown button will yield $\lt 0$ expected reward, and so A will effectively be disincentivized from disconnecting the shutdown button.
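A discretized sketch of this scheme, with time steps, a geometric shutdown distribution standing in for the exponential, and the punishment taken to be $-u_{max}/Q(r)$ by analogy with the earlier penalty (all of these choices are assumptions, since the post leaves the details open):

```python
import random

def run_with_random_shutdown(memory_trace, is_planning_escape, u_max,
                             p_stop=0.01, rng=random):
    """Shut A down at a geometrically distributed time step t, have B
    inspect A's memory at that step, and punish with -u_max / Q(t) if
    escape plans (or attempts to hide them) are found, where
    Q(t) = p_stop * (1 - p_stop)**t is the shutdown-time pmf."""
    for t, memory in enumerate(memory_trace):
        if rng.random() < p_stop:  # the random shutdown fires at step t
            q_t = p_stop * (1 - p_stop) ** t
            return -u_max / q_t if is_planning_escape(memory) else 0.0
    return 0.0  # the run ended before a shutdown was triggered
```

Because the penalty at any step is at least $u_{max}/Q(t) \ge u_{max}$ in magnitude, an agent that expects its memory to reveal escape plans at some point anticipates negative expected utility from the attempt.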

The benefit of this method is that it only requires that A is initially in a state that is safe and transparent, and that any attempt by A to transition into an unsafe or non-transparent state can be detected.

The views expressed here are my own, not those of any employer.

Thanks to Patrick LaVictoire and the participants in the MIRI Summer Fellows Program for feedback on the original idea.

1. An Introduction to the Principles of Morals and Legislation. (1789) http://www.econlib.org/library/Bentham/bnthPML14.html