

*guest post by Marc Harper*

In John’s information geometry series, he mentioned some of my work in evolutionary dynamics. Today I’m going to tell you about some exciting extensions!

First a little refresher. For a population of $n$ replicating types, such as individuals with different eye colors or a gene with $n$ distinct alleles, the ‘replicator equation’ expresses the main idea of natural selection: the relative rate of growth of each type should be proportional to the difference between the fitness of the type and the mean fitness in the population.

To see why this equation should be true, let $P_i$ be the population of individuals of the $i$th type, which we allow to be any nonnegative real number. We can list all these numbers and get a vector:

$P = (P_1, \dots, P_n)$

The **Lotka–Volterra equation** is a very general rule for how these numbers can change with time:

$\displaystyle{ \frac{d P_i}{d t} = f_i(P) P_i }$

Each population grows at a rate proportional to itself, where the ‘constant of proportionality’, $f_i(P)$, is not necessarily constant: it can be any real-valued function of $P$. This function is called the **fitness** of the $i$th type.

Let $p_i$ be the fraction of individuals who are of the $i$th type:

$\displaystyle{ p_i = \frac{P_i}{\sum_{j =1}^n P_j } }$

These numbers $p_i$ are between 0 and 1, and they add up to 1. So, we can also think of them as probabilities: $p_i$ is the probability that a randomly chosen individual is of the $i$th type. This is how probability theory, and eventually entropy, gets into the game.

Again, we can bundle these numbers into a vector:

$p = (p_1, \dots, p_n)$

which we call the **population distribution**. It turns out that the Lotka–Volterra equation implies the **replicator equation**:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(P) - \langle f(P) \rangle \right) \, p_i }$

where

$\langle f(P) \rangle = \sum_{i =1}^n f_i(P) p_i$

is the **mean fitness** of all the individuals. You can see the proof in Part 9 of the information geometry series.
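To get a feel for this equation, here's a minimal numerical sketch of my own (the constant fitnesses are made up for illustration): a two-type population evolving by Euler steps of the replicator equation. The fitter type's share grows logistically toward 1.

```python
import numpy as np

def replicator_step(p, f, dt=0.01):
    """One Euler step of the replicator equation
    dp_i/dt = (f_i(p) - <f(p)>) p_i."""
    fitness = f(p)
    mean_fitness = np.dot(fitness, p)
    return p + dt * (fitness - mean_fitness) * p

# Made-up constant fitnesses: type 0 is fitter than type 1.
f = lambda p: np.array([2.0, 1.0])

p = np.array([0.5, 0.5])
for _ in range(2000):          # integrate out to t = 20
    p = replicator_step(p, f)

print(p[0] > 0.99)  # True: the fitter type has nearly taken over
```

Note that the Euler step preserves $\sum_i p_i = 1$ exactly, since the increments sum to zero.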

By the way: if each fitness $f_i(P)$ only depends on the fraction of individuals of each type, not the total numbers, we can write the replicator equation in a simpler way:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }$

From now on I’ll assume the fitness depends only on the population distribution $p$, and use this simpler form of the replicator equation.

Anyway, the take-home message is this: the replicator equation says the fraction of individuals of any type changes at a rate proportional to the fitness of that type minus the mean fitness.

Now, it has been known since the late 1970s or early 1980s, thanks to the work of Akin, Bomze, Hofbauer, Shahshahani, and others, that the replicator equation has some very interesting properties. For one thing, it often makes ‘relative entropy’ decrease. For another, it’s often an example of ‘gradient flow’. Let’s look at both of these in turn, and then talk about some new generalizations of these facts.

I mentioned that we can think of a population distribution as a probability distribution. This lets us take ideas from probability theory and even information theory and apply them to evolutionary dynamics! For example, given two population distributions $p$ and $q$, the **information** of $q$ **relative to** $p$ is

$I(q,p) = \displaystyle{ \sum_i q_i \ln \left(\frac{q_i}{p_i }\right)}$

This measures how much information you gain if you have a hypothesis about some state of affairs given by the probability distribution $p$, and then someone tells you “no, the best hypothesis is $q$!”

It may seem weird to treat a *population distribution* as a *hypothesis*, but this turns out to be a good idea. Evolution can then be seen as a learning process: a process of improving the hypothesis.

We can make this precise by seeing how the relative information changes with the passage of time. Suppose we have two population distributions $q$ and $p$. Suppose $q$ is fixed, while $p$ evolves in time according to the replicator equation. Then

$\frac{d}{d t} I(q,p) = \displaystyle{ \sum_i f_i(P) (p_i - q_i) }$

For the proof, see Part 11 of the information geometry series.
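If you like checking identities numerically, here is a small sketch (the linear fitness and its payoff matrix are made up for illustration) comparing a finite-difference estimate of $dI(q,p)/dt$ along the replicator flow with the formula above:

```python
import numpy as np

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return float(np.sum(q * np.log(q / p)))

def replicator_velocity(p, f):
    """Right-hand side of the replicator equation."""
    fitness = f(p)
    return (fitness - np.dot(fitness, p)) * p

# Made-up linear fitness f(p) = A p.
A = np.array([[0.5, 2.0, 0.0],
              [0.0, 0.5, 2.0],
              [2.0, 0.0, 0.5]])
f = lambda p: A @ p

q = np.array([1/3, 1/3, 1/3])   # fixed distribution
p = np.array([0.5, 0.3, 0.2])   # evolving distribution

# Compare a finite-difference estimate of dI/dt
# with the formula sum_i f_i(p) (p_i - q_i).
dt = 1e-6
p_next = p + dt * replicator_velocity(p, f)
numeric = (relative_information(q, p_next) - relative_information(q, p)) / dt
formula = float(np.dot(f(p), p - q))
print(abs(numeric - formula) < 1e-4)  # True
```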

So, the information of $q$ relative to $p$ will decrease as $p$ evolves according to the replicator equation if

$\displaystyle{ \sum_i f_i(P) (p_i - q_i) } \le 0$

If $q$ makes this true for all $p$, we say $q$ is an **evolutionarily stable state**. For some reasons why, see Part 13.

What matters now is that when $q$ is an evolutionarily stable state, $I(q,p)$ says how much information the population has ‘left to learn’—and we’re seeing that this always *decreases*. It turns out that we always have

$I(q,p) \ge 0$

Furthermore, $I(q,p) = 0$ precisely when $p = q$.

People summarize all this by saying that relative information is a ‘Lyapunov function’. Very roughly, a Lyapunov function is something that decreases with the passage of time, and is zero only at the unique stable state. To be a bit more precise, suppose we have a differential equation like

$\frac{d}{d t} x(t) = v(x(t))$

where $x(t) \in \mathbb{R}^n$ and $v$ is some smooth vector field on $\mathbb{R}^n$. Then a smooth function

$V : \mathbb{R}^n \to \mathbb{R}$

is a **Lyapunov function** if

• $V(x) \ge 0$ for all $x$,

• $V(x) = 0$ iff $x$ is some particular point $x_0$,

and

• $\displaystyle{ \frac{d}{d t} V(x(t)) \le 0 }$ for every solution of our differential equation.

In this situation, the point $x_0$ is a stable equilibrium for our differential equation: this is **Lyapunov’s theorem**.

The basic idea of Lyapunov’s theorem is that when a ball likes to roll downhill and the landscape has just one bottom point, that point will be the unique stable equilibrium for the ball. The idea of gradient flow is similar, but different: sometimes things like to roll downhill *as efficiently as possible*: they move in exactly the *best direction* to make some quantity smaller! Under certain conditions, the replicator equation is an example of this phenomenon.

Let’s fill in some details. For starters, suppose we have some function

$V : \mathbb{R}^n \to \mathbb{R}$

Think of $V$ as ‘height’. Then the **gradient flow equation** says how a point $x(t) \in \mathbb{R}^n$ will move if it’s always trying its very best to go downhill:

$\frac{d}{d t} x(t) = - \nabla V(x(t))$

Here $\nabla$ is the usual gradient in Euclidean space:

$\displaystyle{ \nabla V = \left(\partial_1 V, \dots, \partial_n V \right) }$

and $\partial_i$ is short for the partial derivative with respect to the $i$th coordinate.

The interesting thing is that under certain conditions, the replicator equation is an example of a gradient flow equation… but typically *not* one where $\nabla$ is the usual gradient in Euclidean space. Instead, it’s the gradient on some other space, the space of all population distributions, which has a non-Euclidean geometry!

The space of all population distributions is a simplex:

$\{ p \in \mathbb{R}^n : \; p_i \ge 0, \; \sum_{i = 1}^n p_i = 1 \} .$

For example, it’s an equilateral triangle when $n = 3$. The equilateral triangle looks flat, but if we measure distances another way it becomes round, exactly like a portion of a sphere (see picture), and that’s the non-Euclidean geometry we need!

In fact this trick works in any dimension. The idea is to give the simplex a special Riemannian metric, the ‘Fisher information metric’. The usual metric on Euclidean space is

$\delta_{i j} = \left\{\begin{array}{ccl} 1 & \mathrm{ if } & i = j \\
0 &\mathrm{ if } & i \ne j \end{array} \right.$

This simply says that two standard basis vectors like $(0,1,0,0)$ and $(0,0,1,0)$ have dot product zero if the 1’s are in different places, and one if they’re in the same place. The **Fisher information metric** is a bit more complicated:

$\displaystyle{ g_{i j} = \frac{\delta_{i j}}{p_i} }$

As before, $g_{i j}$ is a formula for the dot product of the $i$th and $j$th standard basis vectors, but now it depends on where you are in the simplex of population distributions.

We saw how this formula arises from information theory back in Part 7. I won’t repeat the calculation, but the idea is this. Fix a population distribution $p$ and consider the information of another one, say $q$, relative to this. We get $I(q,p)$. If $q = p$ this is zero:

$\left. I(q,p)\right|_{q = p} = 0$

and this point is a local minimum for the relative information. So, the first derivative of $I(q,p)$ as we change $q$ must be zero:

$\left. \frac{\partial}{\partial q_i} I(q,p) \right|_{q = p} = 0$

But the second derivatives are not zero. In fact, since we’re at a local minimum, it should not be surprising that we get a positive definite matrix of second derivatives:

$g_{i j} = \left. \frac{\partial^2}{\partial q_i \partial q_j} I(q,p) \right|_{q = p}$

And, this is the Fisher information metric! So, the Fisher information metric is a way of taking dot products between vectors in the simplex of population distributions that’s based on the concept of relative information.
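Here’s a quick numerical way to see this (a sketch of my own, not from the series): compute the matrix of second derivatives of $I(q,p)$ in $q$ at $q = p$ by finite differences, and compare it with $\delta_{ij}/p_i$.

```python
import numpy as np

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.5, 0.3, 0.2])
n, h = len(p), 1e-5

# Numerical Hessian of I(q, p) with respect to q, evaluated at q = p,
# using a centered second-difference stencil.
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        for si, sj, sign in [(h, h, 1), (h, -h, -1), (-h, h, -1), (-h, -h, 1)]:
            q = p.copy()
            q[i] += si
            q[j] += sj
            H[i, j] += sign * relative_information(q, p)
        H[i, j] /= 4 * h * h

g = np.diag(1 / p)  # Fisher information metric: g_ij = delta_ij / p_i
print(np.allclose(H, g, atol=1e-3))  # True
```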

This is not the place to explain Riemannian geometry, but any metric gives a way to measure angles and distances, and thus a way to define the gradient of a function. After all, the gradient of a function should point at right angles to the level sets of that function, and its length should equal the slope of that function:

So, if we change our way of measuring angles and distances, we get a new concept of gradient! This new gradient turns out to be given by

$(\nabla_g V)^i = g^{i j} \partial_j V$

where $g^{i j}$ is the inverse of the matrix $g_{i j}$, and we sum over the repeated index $j$. (As a sanity check, make sure you see why this is the usual Euclidean gradient when $g_{i j} = \delta_{i j}$.)
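In coordinates this is easy to compute, since $g_{ij} = \delta_{ij}/p_i$ is diagonal: its inverse is $g^{ij} = \delta_{ij} \, p_i$, so the Fisher gradient just multiplies each partial derivative by $p_i$. A tiny sketch (the values of $\partial_j V$ are made up):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
g = np.diag(1 / p)            # Fisher metric: g_ij = delta_ij / p_i
g_inv = np.linalg.inv(g)      # inverse metric: g^{ij} = delta_ij * p_i

V_grad = np.array([1.0, -2.0, 0.5])   # made-up partial derivatives d_j V
nabla_g_V = g_inv @ V_grad            # (nabla_g V)^i = g^{ij} d_j V

# Raising the index with this metric just rescales component i by p_i.
print(np.allclose(nabla_g_V, p * V_grad))  # True
```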

Now suppose the fitness is the good old Euclidean gradient of some function. Then it turns out that the replicator equation is a special case of gradient flow on the space of population distributions… but where we use the Fisher information metric to define our concept of gradient!

To get a feel for this, it’s good to start with the Lotka–Volterra equation, which describes how the total number of individuals of each type changes. Suppose the fitness is the Euclidean gradient of some function $V$:

$f_i(P) = \frac{\partial V}{\partial P_i}$

Then the Lotka–Volterra equation becomes this:

$\displaystyle{ \frac{d P_i}{d t} = \frac{\partial V}{\partial P_i} \, P_i }$

This doesn’t look like the gradient flow equation, thanks to that annoying $P_i$ on the right-hand side! It certainly ain’t the gradient flow coming from the function $V$ and the usual Euclidean gradient. However, it *is* gradient flow coming from $V$ and some *other* metric on the space

$\{ P \in \mathbb{R}^n : \; P_i \ge 0 \}$

For a proof, and the formula for this other metric, see Section 3.7 in this survey:

• Marc Harper, Information geometry and evolutionary game theory.

Now let’s turn to the replicator equation:

$\displaystyle{ \frac{d p_i}{d t} = \left( f_i(p) - \langle f(p) \rangle \right) \, p_i }$

Again, if the fitness is a Euclidean gradient, we can rewrite the replicator equation as a gradient flow equation… but again, not with respect to the Euclidean metric. This time we need to use the Fisher information metric! I sketch a proof in my paper above.
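Here’s a numerical sketch of this fact (my own illustration, with a made-up potential $V$): compute the gradient of $V$ on the simplex in the Fisher metric directly from the definition, and compare it with the right-hand side of the replicator equation.

```python
import numpy as np

def replicator_rhs(p, grad_V):
    """Replicator equation with fitness f_i = dV/dp_i."""
    f = grad_V(p)
    return p * (f - np.dot(p, f))

def fisher_gradient(p, grad_V):
    """Gradient of V on the simplex in the Fisher metric, from the
    definition: the tangent vector v with g(v, w) = dV(w) for every
    tangent vector w (i.e. every w with sum_i w_i = 0)."""
    n = len(p)
    # Basis of the tangent space: e_i - e_n for i = 0, ..., n-2.
    B = np.vstack([np.eye(n - 1), -np.ones(n - 1)])   # shape (n, n-1)
    G = B.T @ np.diag(1 / p) @ B      # Fisher metric in this basis
    b = B.T @ grad_V(p)               # dV paired with each basis vector
    return B @ np.linalg.solve(G, b)  # back to ambient coordinates

# Made-up potential V(p) = sum_i c_i p_i^2 / 2, so f_i = dV/dp_i = c_i p_i.
c = np.array([1.0, 2.0, 3.0])
grad_V = lambda p: c * p

p = np.array([0.5, 0.3, 0.2])
print(np.allclose(fisher_gradient(p, grad_V), replicator_rhs(p, grad_V)))  # True
```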

In fact, both these results were first worked out by Shahshahani:

• Siavash Shahshahani, *A New Mathematical Framework for the Study of Linkage and Selection*, *Memoirs of the AMS* **17**, 1979.

All this is just the beginning! The ideas I just explained are unified in information geometry, where distance-like quantities such as the relative entropy and the Fisher information metric are studied. From here it’s a short walk to a very nice version of Fisher’s fundamental theorem of natural selection, which is familiar to researchers both in evolutionary dynamics and in information geometry.

You can see some very nice versions of this story for maximum likelihood estimators and linear programming here:

• Akio Fujiwara and Shun-ichi Amari, Gradient systems in view of information geometry, *Physica D: Nonlinear Phenomena* **80** (1995), 317–327.

Indeed, this seems to be the first paper discussing the similarities between evolutionary game theory and information geometry.

Dash Fryer (at Pomona College) and I have generalized this story in several interesting ways.

First, there are two famous ways to generalize the usual formula for entropy: Tsallis entropy and Rényi entropy, both of which involve a parameter $q$. There are Tsallis and Rényi versions of relative entropy and the Fisher information metric as well. Everything I just explained about:

• conditions under which relative entropy is a Lyapunov function for the replicator equation, and

• conditions under which the replicator equation is a special case of gradient flow

generalize to these cases! However, these generalized entropies give modified versions of the replicator equation. When we set $q=1$ we get back the usual story. See

• Marc Harper, Escort evolutionary game theory.

My initial interest in these alternate entropies was mostly mathematical—what is so special about the corresponding geometries?—but now researchers are starting to find populations that evolve according to these kinds of modified population dynamics! For example:

• A. Hernando *et al.*, The workings of the Maximum Entropy Principle in collective human behavior.

There’s an interesting special case worth some attention. Lots of people fret about the relative entropy not being a distance function obeying the axioms that mathematicians like: for example, it doesn’t obey the triangle inequality. Many people describe the relative entropy as a *distance-like* function, and in many contexts that’s a fair way to think of it. On the other hand, the $q=0$ relative entropy is one-half the Euclidean distance squared! In this case the modified version of the replicator equation looks like this:

$\displaystyle{ \frac{d p_i}{d t} = f_i(p) - \frac{1}{n} \sum_{j = 1}^n f_j(p) }$

This equation is called the **projection dynamic**.
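Here’s a small sketch of the projection dynamic (with a made-up fitness that stabilizes the uniform distribution), checking that the $q = 0$ relative entropy, i.e. half the squared Euclidean distance, decreases along the flow:

```python
import numpy as np

q = np.array([1/3, 1/3, 1/3])    # interior equilibrium
f = lambda p: q - p              # made-up stabilizing fitness

def projection_step(p, dt=0.01):
    """Euler step of the projection dynamic
    dp_i/dt = f_i(p) - (1/n) sum_j f_j(p)."""
    fp = f(p)
    return p + dt * (fp - fp.mean())

p = np.array([0.7, 0.2, 0.1])
d0 = 0.5 * np.sum((p - q) ** 2)  # the q = 0 'relative entropy'
for _ in range(500):
    p = projection_step(p)
d1 = 0.5 * np.sum((p - q) ** 2)
print(d1 < d0)  # True: half the squared distance has decreased
```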

Later, I showed that there is a reasonable definition of relative entropy for a much larger family of geometries that satisfies a similar *distance minimization* property.

In a different direction, Dash showed that you can change the way that selection acts by using a variety of alternative ‘incentives’, extending the story to some other well-known equations describing evolutionary dynamics. By replacing the terms $p_i f_i(p)$ in the replicator equation with a variety of other functions, called incentives, we can generate many commonly studied evolutionary dynamics. For instance, if we exponentiate the fitness landscape (to make it always positive), we get what is commonly known as the logit dynamic. This amounts to changing the fitness function as follows:

$f_i \to e^{\beta f_i},$

where $\beta$ is known as an **inverse temperature** in statistical thermodynamics and as an **intensity of selection** in evolutionary dynamics. There are lots of modified versions of the replicator equation, like the best-reply and projection dynamics, more common in economic applications of evolutionary game theory, and they can all be captured in this family. (Note that there are other ways to simultaneously capture such families, such as Bill Sandholm’s revision protocols, which were introduced earlier to explore the microfoundations of game dynamics.)
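Here’s a sketch of the logit dynamic in the form it usually takes, $\dot p_i = e^{\beta f_i}/\sum_j e^{\beta f_j} - p_i$ (the constant fitness landscape is made up for illustration), showing the population relaxing to the Boltzmann distribution of the fitnesses:

```python
import numpy as np

def logit_rhs(p, f, beta=1.0):
    """Logit dynamic: the population moves toward the softmax
    (Boltzmann) distribution of the fitnesses,
    dp_i/dt = exp(beta f_i) / sum_j exp(beta f_j) - p_i."""
    w = np.exp(beta * f(p))
    return w / w.sum() - p

# Made-up constant fitness landscape.
f = lambda p: np.array([1.0, 0.5, 0.0])

p = np.array([1/3, 1/3, 1/3])
for _ in range(2000):                    # Euler steps out to t = 20
    p = p + 0.01 * logit_rhs(p, f, beta=2.0)

target = np.exp(2.0 * f(p))
target /= target.sum()                   # Boltzmann distribution at beta = 2
print(np.allclose(p, target, atol=1e-5))  # True
```

Here the intensity of selection $\beta$ plays the role of an inverse temperature: as $\beta \to \infty$ the equilibrium concentrates on the fittest type.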

Dash showed that there is a natural generalization of evolutionarily stable states to ‘incentive stable states’, and that for incentive stable states, the relative entropy decreases to zero as the trajectories approach the equilibrium. For the logit and projection dynamics, incentive stable states are simply evolutionarily stable states; this happens frequently, but not always.

The third generalization is to look at different ‘time-scales’—that is, different ways of describing time! We can make up the symbol $\mathbb{T}$ for a general choice of ‘time-scale’. So far I’ve been treating time as a real number, so

$\mathbb{T} = \mathbb{R}$

But we can also treat time as coming in discrete evenly spaced steps, which amounts to treating time as an integer:

$\mathbb{T} = \mathbb{Z}$

In this case there’s a nice discrete version of the replicator equation, and the whole story described so far generalizes to it, if we look at *differences* of the relative entropy at successive times, rather than its derivative.

There is a nice way to simultaneously describe the cases $\mathbb{T} = \mathbb{R}$ and $\mathbb{T} = \mathbb{Z}$ using the time-scale calculus and time-scale derivatives. For the time-scale $\mathbb{T} = \mathbb{R}$ the time-scale derivative is just the ordinary derivative. For the time-scale $\mathbb{T} = h\mathbb{Z}$, the time-scale derivative is given by the difference quotient from first-year calculus

$f^{\Delta}(z) = \frac{f(z+h) - f(z)}{h},$

and using this as a substitute for the derivative gives difference equations like the discrete-time replicator equation. There are many other choices of time-scale, such as the **quantum time-scale** given by $\mathbb{T} = q^{\mathbb{Z}}$, in which case the time-scale derivative is called the *q*-derivative, but that’s a tale for another time. In any case, the fact that the successive relative entropies are decreasing can be simply stated as having negative $\mathbb{T} = \mathbb{Z}$ time-scale derivative, and the continuous case we started with corresponds to $\mathbb{T} = \mathbb{R}$.
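Here’s a sketch of the $\mathbb{T} = \mathbb{Z}$ case (with a made-up positive fitness whose equilibrium is the uniform distribution): iterate the discrete-time replicator map and watch the relative entropy shrink, so its differences at successive times (its $\mathbb{T} = \mathbb{Z}$ time-scale derivative) are negative.

```python
import numpy as np

def discrete_replicator(p, f):
    """One step of the discrete-time replicator map
    p_i -> p_i f_i(p) / <f(p)>, for positive fitnesses."""
    w = f(p) * p
    return w / w.sum()

def relative_information(q, p):
    """I(q, p) = sum_i q_i ln(q_i / p_i)."""
    return float(np.sum(q * np.log(q / p)))

# Made-up positive fitness with the uniform distribution as equilibrium.
f = lambda p: 2.0 - p

q = np.ones(3) / 3
p = np.array([0.9, 0.05, 0.05])

entropies = [relative_information(q, p)]
for _ in range(50):
    p = discrete_replicator(p, f)
    entropies.append(relative_information(q, p))

print(entropies[-1] < entropies[0])  # True: the relative entropy has shrunk
```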

Remarkably, Dash and I were able to show that you can combine all three of these generalizations into one theorem, and even allow for multiple interacting populations! This produces some really neat population trajectories, such as the following two populations with three types, with fitness functions corresponding to the rock-paper-scissors game. On the left we have the replicator equation, which goes along with the Fisher information metric; on the right we have the logit dynamic, which goes along with the Euclidean metric on the simplex:

From our theorem, it follows that the relative entropy (ordinary relative entropy on the left, the $q = 0$ entropy on the right) converges to zero along the population trajectories.

The final form of the theorem is loosely as follows. Pick a Riemannian geometry given by a metric $g$ (with some mild conditions) and an incentive for each population, as well as a time-scale ($\mathbb{R}$ or $h\mathbb{Z}$) for every population. This gives an evolutionary dynamic with a natural generalization of evolutionarily stable states, and a suitable version of the relative entropy, such that if there is such a state in the interior of the simplex, the time-scale derivative of the sum of the relative entropies for the populations is negative, and the trajectories converge to the stable state.

When there isn’t such a stable state, we still get some interesting population dynamics, like the following:

See this paper for details:

• Marc Harper and Dashiell E. A. Fryer, Stability of evolutionary dynamics on time scales.

Next time we’ll see how to make the main idea work in finite populations, without derivatives or deterministic trajectories!