With progress in AI accelerating, we may be on the doorstep of artificial general intelligence. AGI is a pivotal technology - a technology that greatly increases the state space of possible futures we might occupy. The problem is that we currently have (almost) no concrete idea how to ensure such an intelligence is aligned with our values, and at the same time, (almost) no one working on AI wants to believe that this is a problem at all. Sometimes, I like to entertain the hypothesis (to which I assign a vanishingly small probability) that we might get alignment for free, from nothing but intelligence itself.

what is intelligence?

Ultimately, intelligence is information compression: finding the shortest/simplest explanation for your observations so that you can more effectively predict them (and more effectively predict the outcomes of your actions, and the utility those will provide). This leaves out a lot that most people group under the “intelligence” umbrella, like filial imprinting, social instincts, and many subjective feelings (sadness, embarrassment). But I wouldn’t want an AGI - much less a superintelligence - driven by human social instincts, for painfully obvious reasons. The drive I’d want that superintelligence to have would be to maximize the freedom of each and every sentient being, including freedom from nonconsensual interpersonal and social structures (please, go read that blog instead). The problem lies in specifying the utility function - i.e. mathematically formalizing the concepts of “freedom” and, far worse, “sentient being”.

compression progress as utility

Jürgen Schmidhuber, one of the fathers of the field of machine learning, proposes using compression progress as utility. This drives the agent to seek out sources of non-random but novel (not-yet-compressed) information, effectively pursuing “interestingness” (the first derivative of compression). It’s a simple and effective explanation for a lot of our own behaviour - curiosity (seeking maximum interestingness), art (generating more data with hidden regularities), a sense of subjective beauty (the size of the compressed representation), etc.
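To make "compression progress" a bit more concrete, here is a minimal sketch. It is a toy under loud assumptions, not Schmidhuber's actual formulation: the "compressor" is just a Laplace-smoothed symbol-frequency model, and the names `code_length_bits` and `compression_progress` are mine. The reward is the number of bits saved on the whole history when the model is updated with the new observations.

```python
import math
from collections import Counter

def code_length_bits(counts, data, alphabet_size):
    """Bits needed to encode `data` under a Laplace-smoothed frequency model.

    Assumption: this toy model stands in for a real adaptive compressor.
    """
    total = sum(counts.values())
    return sum(
        -math.log2((counts[symbol] + 1) / (total + alphabet_size))
        for symbol in data
    )

def compression_progress(history, new_obs, alphabet_size):
    """Bits saved on the full history by updating the model with `new_obs`."""
    old_model = Counter(history)      # model before seeing the new data
    new_model = Counter(history)
    new_model.update(new_obs)         # model after learning from the new data
    full_history = list(history) + list(new_obs)
    old_cost = code_length_bits(old_model, full_history, alphabet_size)
    new_cost = code_length_bits(new_model, full_history, alphabet_size)
    return old_cost - new_cost

# Data that teaches the model a new regularity is rewarded...
surprising = compression_progress("abab", "aaaa", alphabet_size=2)
# ...while data the model already predicts as well as it can is not.
boring = compression_progress("abab", "abab", alphabet_size=2)
```

In this toy, `boring` comes out as zero because the updated model assigns the same probabilities as the old one, while `surprising` is positive: the agent learned something that shortens its description of everything it has seen so far.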

future freedom

Meanwhile, Dr. Alex Wissner-Gross has proposed that a key component of intelligent behaviour is driven by the maximization of causal path entropy (similar to the principle of maximum caliber, but applied to agents). Loosely, this means “maximizing the volume of your state space of possible futures”, which is a slightly more formal way of saying “maximizing power”. Or perhaps it encodes an intrinsic notion of curiosity. Instrumental convergence is then just the obvious result that the futures you (the agent) care about are a subset of all possible futures, and so maximizing causal path entropy/power = increasing the number of futures in which your goals are fulfilled = increasing your ability to fulfill those goals.
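A crude way to see "maximizing the volume of your state space of possible futures" in action is the sketch below. It is not the path-integral definition of causal path entropy. As a stand-in (my assumption), it counts the distinct grid cells reachable within a fixed horizon and takes the log, which is the entropy of a uniform distribution over those cells. The names `reachable`, `future_freedom` and `best_move` are hypothetical.

```python
import math

MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]

def neighbors(state, free_cells):
    """Cells reachable from `state` in one step."""
    x, y = state
    return [(x + dx, y + dy) for dx, dy in MOVES if (x + dx, y + dy) in free_cells]

def reachable(state, free_cells, horizon):
    """All cells reachable from `state` in at most `horizon` steps."""
    seen = {state}
    frontier = {state}
    for _ in range(horizon):
        frontier = {n for s in frontier for n in neighbors(s, free_cells)} - seen
        seen |= frontier
    return seen

def future_freedom(state, free_cells, horizon):
    """Log-count of reachable cells: a toy proxy for causal path entropy."""
    return math.log(len(reachable(state, free_cells, horizon)))

def best_move(state, free_cells, horizon):
    """Step to the neighbor that keeps the most futures open."""
    return max(
        neighbors(state, free_cells),
        key=lambda s: future_freedom(s, free_cells, horizon),
    )

# A 7-cell corridor: an agent near the left end drifts toward the middle,
# because dead ends shrink the set of reachable futures.
corridor = {(x, 0) for x in range(7)}
step = best_move((1, 0), corridor, horizon=2)
```

Starting at `(1, 0)`, the agent steps away from the dead end, since the cell toward the middle has five cells within reach while the end cell has only three.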

alignment for free

Given both these concepts - that the agent, our toy superintelligence, seeks to maximize both compression progress and its own causal path entropy - how do we get alignment out of it? First, if driven by compression progress, then the agent will be incentivized to develop a theory of mind - a compressed prototypical representation of an agent - because the alternative, representing each agent it encounters as a wholly distinct variable in its world-model, would be extremely inefficient. If Schmidhuber is right, then it will also develop a compressed representation of itself, which - at maximal compression - would look a lot like an agent. Thus we get a natural proximity of the agent’s conception of itself, and its prototypical representation of other agents, simply because its world-model’s latent space is maximally compressed.

In such a compressed latent space, similar states of differing agents overlap, and so observations of another agent’s state produce value predictions similar (if weaker) to those of the agent’s own corresponding states. According to Steven Byrnes, this is the basis for human empathy (probably with the help of some more specialized and thus less important reward circuits in our brainstem, like facial recognition). If this is true, and what’s more, if we can directly identify and encourage this kind of “empathy” (though this would probably take another breakthrough in interpretability research), we can ensure that the agent will be rewarded not just for maximizing its own causal path entropy, but for maximizing the causal path entropy of all agents.
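As a sketch of what that combined reward might look like: the function below adds a discounted share of every other agent's causal path entropy to the agent's own. The name `empathic_reward` and the single scalar `empathy_weight` are my assumptions, chosen only to mirror the "similar (if weaker)" response described above, not anything Byrnes or Wissner-Gross specify.

```python
def empathic_reward(own_entropy, other_entropies, empathy_weight=0.5):
    """Own causal path entropy plus a weaker share of everyone else's.

    Assumption: a single scalar weight below 1 models the weaker value
    response to other agents' states.
    """
    return own_entropy + empathy_weight * sum(other_entropies)

# Hypothetical numbers: grabbing power raises the agent's own entropy
# but collapses the options of two bystanders.
grab_power = empathic_reward(3.0, [0.5, 0.5])
share_power = empathic_reward(2.5, [2.0, 2.0])
```

With these made-up numbers the power grab scores lower than sharing, which is the whole point: once other agents' futures enter the reward, closing them off stops being free.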

can this work?

This requires at least Schmidhuber’s idea of compression progress to work; it also assumes that, in making its world-model as compressed as possible, the agent will necessarily form a theory of mind (representation of a prototype agent), and a representation of itself, and that these two will be anywhere near each other (this latter point might be solvable with a form of self-supervised consistency loss or convex consistency loss). As I said: I consider it very improbable. (But maybe just probable enough to be worth looking into.)
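For what a consistency loss of this kind could look like, here is a minimal sketch under heavy assumptions: the agent's self-representation and its representations of other agents are plain vectors in the same latent space, the prototype agent is the mean of the others, and the loss is the squared distance between self and prototype (which is convex in the self-embedding). The name `consistency_loss` is hypothetical, and this is only one of many forms such a loss could take.

```python
def consistency_loss(self_embedding, other_embeddings):
    """Squared distance between the self-embedding and the prototype agent.

    Assumption: the prototype is the mean of the other agents' embeddings.
    """
    dim = len(self_embedding)
    count = len(other_embeddings)
    prototype = [sum(e[i] for e in other_embeddings) / count for i in range(dim)]
    return sum((s - p) ** 2 for s, p in zip(self_embedding, prototype))

# Self sits exactly on the prototype: no penalty.
aligned = consistency_loss([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]])
# Self sits away from the prototype: the loss pulls them together.
apart = consistency_loss([0.0, 0.0], [[1.0, 0.0], [3.0, 0.0]])
```

Minimizing this alongside the compression objective would nudge the self-representation toward the prototype agent, rather than hoping compression alone puts them near each other.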
