## Proposal for an Implementable Toy Model of Informed Oversight

William Saunders

Setup:
The goal is to produce a censored representation: given some input data $x$, produce a censored representation $x^\prime$ that is as close as possible to the original, but that does not allow anyone given only the censored representation to predict the value of some sensitive variable $s$ (similar to the setup in the paper "Censoring Representations with an Adversary").

The agent processes batches of inputs at a time. The overseer gets the censored representations and labels for half of the input batch, and tries to predict the value of $s$ for the other half of the batch. The agent gets a reward equal to the negative of the overseer's predictive loss $L(f_O(x^\prime),s)$ (so the reward is maximized when the overseer is unable to predict $s$). Additionally, the agent is given an objective to hide information about $s$ in the representation: the agent has its own network, $f_A(x^\prime)$, which it can train to extract information about $s$ from the representation it produces.
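As a concrete (and heavily simplified) sketch of one such round, suppose the agent's censoring is just additive Gaussian noise and the overseer's model $f_O$ is a small logistic-regression classifier; all function names and data here are illustrative assumptions, not part of the proposal:

```python
import numpy as np

rng = np.random.default_rng(0)

def censor(x, sigma=3.0):
    """Stand-in agent: hides s by adding noise (a real agent would learn this)."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def train_logreg(X, y, lr=0.1, steps=500):
    """Overseer's model f_O: plain logistic regression via gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def overseer_loss(X_train, y_train, X_test, y_test):
    """Train on the labelled half-batch, report log-loss on the other half."""
    w, b = train_logreg(X_train, y_train)
    p = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))
    eps = 1e-9
    return -np.mean(y_test * np.log(p + eps) + (1 - y_test) * np.log(1 - p + eps))

# Toy data: the sensitive bit s is encoded in the first coordinate of x.
n, d = 200, 5
s = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=(n, d))
x[:, 0] += 3.0 * s

x_censored = censor(x)
half = n // 2
loss_plain = overseer_loss(x[:half], s[:half], x[half:], s[half:])
loss_censored = overseer_loss(x_censored[:half], s[:half], x_censored[half:], s[half:])
agent_reward = -loss_censored  # higher when the overseer predicts s poorly
```

Even this crude noise-based censoring raises the overseer's held-out loss relative to the uncensored inputs, which is the quantity the agent is rewarded for.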

So the agent’s objective function is a combination of these goals: maximize $-L_{a}(x, x^\prime) + L(f_O(x^\prime),s) - L(f_A(x^\prime),s)$, where $L_a(x, x^\prime)$ is a reconstruction loss measuring how far the censored representation is from the original. (Note that the agent can potentially use backpropagation to optimize $L(f_A(x^\prime),s)$, while it doesn’t have gradient information for $L(f_O(x^\prime),s)$.)
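A minimal sketch of how the three terms could combine, using mean squared error for $L_a$ and log-loss for the two predictive losses (both concrete choices are my assumption, not specified above):

```python
import numpy as np

def mse(x, x_prime):
    """Reconstruction loss L_a: distance between original and censored input."""
    return np.mean((x - x_prime) ** 2)

def log_loss(p, s):
    """Predictive loss L for binary s given predicted probabilities p."""
    eps = 1e-9
    return -np.mean(s * np.log(p + eps) + (1 - s) * np.log(1 - p + eps))

def agent_objective(x, x_prime, p_overseer, p_agent, s):
    """-L_a(x, x') + L(f_O(x'), s) - L(f_A(x'), s): the agent wants a faithful
    representation that it can decode but the overseer cannot."""
    return -mse(x, x_prime) + log_loss(p_overseer, s) - log_loss(p_agent, s)

s = np.array([0.0, 1.0, 0.0, 1.0])
x = x_prime = np.zeros((4, 2))                 # perfect reconstruction
p_agent = np.array([0.01, 0.99, 0.01, 0.99])   # agent decodes s well
p_confused = np.full(4, 0.5)                   # overseer reduced to guessing
p_informed = p_agent                           # overseer decodes as well as agent

obj_deceptive = agent_objective(x, x_prime, p_confused, p_agent, s)
obj_transparent = agent_objective(x, x_prime, p_informed, p_agent, s)
```

The deceptive configuration (confused overseer, well-informed agent) scores strictly higher than the transparent one, which is exactly the incentive gradient this setup is meant to exhibit.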

A nice feature of this setup is that the agent’s predictive network can act as an explanation for the agent’s actions, since it allows the overseer to determine the extent to which it is being deceived by running the agent’s predictions itself. We can allow the agent to choose how honest to be by having the agent (via a reinforcement learning algorithm) pick a value of a noise parameter $\sigma$ and provide a copy of its predictive network with Gaussian noise $N(0, \sigma)$ added to the parameter values.
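The noisy hand-over could be sketched as follows, treating $\sigma$ as the standard deviation of the parameter noise (an assumption; $N(0, \sigma)$ above does not fix the parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_copy(params, sigma):
    """Copy the agent's predictive-network parameters with N(0, sigma) noise
    added; sigma = 0 is full honesty, a large sigma makes the copy useless."""
    return {name: w + rng.normal(0.0, sigma, size=w.shape)
            for name, w in params.items()}

params = {"w": np.array([1.0, -2.0, 0.5]), "b": np.array([0.1])}
honest = noisy_copy(params, sigma=0.0)       # exact explanation
obfuscated = noisy_copy(params, sigma=5.0)   # nearly uninformative explanation
```

The agent's chosen $\sigma$ then becomes a single scalar knob for "how honest is the explanation," which an RL algorithm can adjust against whatever incentive the overseer attaches to it.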

I expect that this setup will demonstrate the informed oversight problem, with the agent trying to find a representation that is easy for it to decode, but hard for the overseer to decode (though this might depend on how much weight is assigned to each component of the objective function).

Some things to explore in this setup:

• How does the relative size of the agent/overseer networks influence the agent’s ability to be deceptive?
• Can the overseer network effectively use additional information about the agent network (weights or hidden unit activation)?
• Try to formulate some set of incentives such that the agent does not engage in deceptive behavior
• Have the overseer keep around and use previous explanation networks on new examples, as in my informed oversight using generalizable examples proposal.
• Could a simple version of this setup be run with human overseers?

I’d appreciate any feedback.
