The Azimuth Project
Experiments in Bayesian priors and maximum entropy

People sometimes advocate using maximum entropy priors in Bayesian statistics. This seems a dangerous practice to me.

A classic example is that you have some sort of device or process which can produce the numbers 1 to 6 with probabilities $p_1 \dots p_6$, and the only thing you know about the device or process is that the mean is $m$. You can then use the principle of maximum entropy to choose the $p_i$. The result is of the form $p_i = \alpha \beta^i$, and using $\sum p_i = 1$ and $\sum i p_i = m$ you can solve (numerically) for $\alpha$ and $\beta$.
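
The exponential form can be recovered with Lagrange multipliers. The following is a sketch of the standard argument; the multiplier names $\lambda$ and $\mu$ are introduced here for illustration and do not appear elsewhere on this page.

```latex
\begin{aligned}
L &= -\sum_{i=1}^{6} p_i \log p_i
     - \lambda \Bigl( \sum_{i=1}^{6} p_i - 1 \Bigr)
     - \mu \Bigl( \sum_{i=1}^{6} i \, p_i - m \Bigr), \\
\frac{\partial L}{\partial p_i} &= -\log p_i - 1 - \lambda - \mu i = 0
  \quad\Longrightarrow\quad
  p_i = e^{-1-\lambda} \, e^{-\mu i} = \alpha \beta^i,
  \qquad \alpha = e^{-1-\lambda}, \quad \beta = e^{-\mu}.
\end{aligned}
```

The two constraints then fix $\alpha$ and $\beta$.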

Why dangerous? Suppose $m = 1.01$. The result has $p_6 \approx 10^{-10}$. Although the method is based on the idea of assuming as little as possible, it manages to produce, from very little input, the rather dramatic conclusion that you’re almost certain never to see a 6. Humans tend to underestimate the probability of rare events and get into trouble because of it, and this principle seems to encourage such mistakes.
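
The numerical solution can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `maxent_die` and the choice of bisection are assumptions made here. It relies on the mean of the distribution $p_i \propto \beta^i$ increasing with $\beta$.

```python
def maxent_die(m, faces=6, iterations=200):
    """Maximum entropy distribution on 1..faces with mean m, p_i = alpha * beta**i.

    Hypothetical helper for illustration; solves for beta by bisection.
    """
    if not 1 < m < faces:
        raise ValueError("mean must lie strictly between 1 and the number of faces")
    idx = range(1, faces + 1)

    def mean_for(beta):
        # The common factor beta cancels in the ratio, so use beta**(i - 1);
        # this keeps the expression well defined at beta = 0 (mean 1).
        weights = [beta ** (i - 1) for i in idx]
        return sum(i * w for i, w in zip(idx, weights)) / sum(weights)

    # Bracket the root: the mean increases with beta.
    lo, hi = 0.0, 1.0
    while mean_for(hi) < m:
        hi *= 2.0

    # Bisect until the bracket is as tight as floating point allows.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < m:
            lo = mid
        else:
            hi = mid

    beta = 0.5 * (lo + hi)
    alpha = 1.0 / sum(beta ** i for i in idx)  # normalisation: sum p_i = 1
    probs = [alpha * beta ** i for i in idx]
    return alpha, beta, probs


alpha, beta, probs = maxent_die(1.01)
print("beta =", beta)
print("p_6  =", probs[5])
```

For $m = 1.01$ this should give $\beta$ close to 0.01 and $p_6$ of order $10^{-10}$, in line with the figure quoted above.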

What’s going wrong here?

An applied/empirical answer

The situation as described is self-inconsistent. To an applied statistician ‘a device or process’ is an actual real world thing. It might be the result of throwing a conjuror’s die. It might be the number of foetuses in a randomly chosen human pregnancy. It might be an obscure sports statistic. It might be the category of the next hurricane. Even though ‘a device or process’ is as vague as can be, knowing even this much is not compatible with ‘only thing you know about the device or process is that the mean is $m$.’ You can’t be completely ignorant. You know from experience that sixes happen more often than 1 in ten billion. Quantifying exactly how much more often is hard, but you can do better than the maximum entropy principle (MEP).

A theoretical answer

What we are really ignorant about is the distribution of $p_1 \dots p_6$. After all, if the purpose of the exercise was to estimate $p_1 \dots p_6$ using data sampled from this device or process, we would regard the $p_i$ as parameters and we would start with a prior distribution for them. Without a constraint on the mean, a Dirichlet prior would be a popular choice. That still leaves a hyperparameter to be decided: perhaps MEP should be used here. But then what measure? Lebesgue? And how to apply the constraint on the mean?

The following seems a reasonable procedure. It is a sort of vanilla Bayesian approach. The constraints $p_i \geq 0$, $\sum p_i = 1$ and $\sum i p_i = m$ define a 4-dimensional polyhedron $T$ in $R^6$. Assume a uniform prior (and Lebesgue measure) over $T$. With no data, and a further assumption of a squared error loss function, our Bayesian estimate for $p_1 \dots p_6$ will be the centroid of $T$.

Set $w = m - 1$. If $w \leq 1$, none of $p_2 \dots p_6$ can be 1, and $T$ is a ‘slice’ through the unit simplex, which is an (irregular) 4-simplex, the four-dimensional analogue of a tetrahedron. The 5 vertices of $T$ can be found by setting all but one of $p_2 \dots p_6$ to be zero in turn. These are then

$(1-w, w, 0, 0, 0, 0)$
$(1-w/2, 0, w/2, 0, 0, 0)$
$(1-w/3, 0, 0, w/3, 0, 0)$
$(1-w/4, 0, 0, 0, w/4, 0)$
$(1-w/5, 0, 0, 0, 0, w/5)$

and their mean, the centroid of T, is

$(1/300)(300 - 137w, 60w, 30w, 20w, 15w, 12w)$

There is a different formula if $m > 2$. If $m = 1.01$, so $w = 0.01$, this is

$(.9954333, .002, .001, .0006667, .0005, .0004)$

So now we think that the probability of a six is 1 in 2,500, instead of 1 in ten billion.
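
The centroid calculation can be checked with exact rational arithmetic. The sketch below assumes $0 < w \leq 1$ (so that $T$ is a simplex and its centroid is the mean of its vertices); the function name `centroid_of_T` is made up for this illustration.

```python
from fractions import Fraction


def centroid_of_T(w, faces=6):
    """Centroid of T for w = m - 1 with 0 < w <= 1, as the mean of its vertices.

    Hypothetical helper for illustration; pass w as a string or Fraction
    to keep the arithmetic exact.
    """
    w = Fraction(w)
    if not 0 < w <= 1:
        raise ValueError("this vertex construction assumes 0 < w <= 1")

    vertices = []
    for k in range(2, faces + 1):
        # Vertex where only p_1 and p_k are nonzero.
        # The mean constraint gives (k - 1) * p_k = w.
        p = [Fraction(0)] * faces
        p[k - 1] = w / (k - 1)
        p[0] = 1 - p[k - 1]
        vertices.append(p)

    n = len(vertices)
    return [sum(v[i] for v in vertices) / n for i in range(faces)]


centroid = centroid_of_T("0.01")
print([float(c) for c in centroid])
print("p_6 =", centroid[5])  # exact fraction
```

Using `Fraction` avoids rounding, so the result can be compared directly with the closed form $(1/300)(300 - 137w, 60w, 30w, 20w, 15w, 12w)$.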


For the original problem of deciding ‘at once’ what $p_1 \dots p_6$ should be, it seems the counting measure is the only reasonable choice. (But see later.) When the parameter space is continuous, the choice of measure is less obvious. For the whole real line, Lebesgue measure is often justified by a translation argument: we want $[a,b]$ to have the same measure as $[a+x,b+x]$ for all $x$. But for a bounded interval that argument doesn’t work. The following priors have all been suggested for a binomial parameter $q \in [0,1]$.

$\pi_1(q) = 1$
$\pi_2(q) \propto q^{-1}(1-q)^{-1}$
$\pi_3(q) \propto q^{-1/2}(1-q)^{-1/2}$
$\pi_4(q) \propto q^{q}(1-q)^{1-q}$

(Berger, p89, section 3.3.4.) They are all reasonable according to Berger. I think they all have ‘informational’ justifications. $\pi_1$ is flat, the others U-shaped. $\pi_2$ is extremely U-shaped, and improper (which does not seem reasonable to me). $\pi_3$ is very U-shaped, $\pi_4$ is mildly U-shaped. $\pi_3$ is a beta distribution, and $\pi_2$ is a limiting case of a beta distribution. $\pi_3$ generalises to a Dirichlet distribution, and $\pi_2$ to a limiting case of a Dirichlet distribution in higher dimensions, and I guess those would have the same justifications. I don’t know what the equivalent of $\pi_4$ is in higher dimensions.
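
One rough way to compare how U-shaped the four priors are is to look at the ratio of the unnormalised density near the edge to its value at the centre $q = 1/2$. The sketch below is only an illustration: the function names and the choice of $q = 0.01$ as ‘near the edge’ are assumptions, not anything taken from Berger.

```python
def prior_shapes(q):
    """Unnormalised values of the four priors at a point q in (0, 1)."""
    return {
        "pi_1": 1.0,
        "pi_2": 1.0 / (q * (1.0 - q)),
        "pi_3": 1.0 / (q * (1.0 - q)) ** 0.5,
        "pi_4": q ** q * (1.0 - q) ** (1.0 - q),
    }


def edge_to_centre_ratio(q_edge=0.01):
    """Ratio of each prior at q_edge to its value at q = 1/2.

    A ratio of 1 means flat; larger ratios mean a more pronounced U shape.
    Normalising constants cancel, so improper priors can be included.
    """
    centre = prior_shapes(0.5)
    edge = prior_shapes(q_edge)
    return {name: edge[name] / centre[name] for name in centre}


for name, ratio in edge_to_centre_ratio().items():
    print(name, round(ratio, 3))
```

The ratios should come out in the order $\pi_1 < \pi_4 < \pi_3 < \pi_2$, matching the description of flat, mildly, very and extremely U-shaped.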


If $[0,1]$ should have a U-shaped distribution, then shouldn’t $\{1,2, \dots, N\}$ have one too?

I don’t know a name for turning a point into a distribution. I’ll call it distributization. It is reminiscent of going from classical to quantum mechanics, and there is a ‘second distributization’ waiting out there… Possibly relevant: Leonard, T. (1978); Silverman (1978); Leonard and Hsu’s book, pp249-250; Silver, Martz, Wallstrom (1993, 1996).

Distributization is also similar to the common practice of using a parametric distribution (with parameter(s) $\eta$, say) for a parameter $\theta$, and then using another distribution, a hyperprior, for $\eta$. And so on. Usually only a few new parameters are introduced at each stage, so measure theoretic problems don’t arise.

When should we stop doing distributization or adding hyperpriors? When we stop, is a maximum entropy principle appropriate for the last step?


  • Berger, J O. Statistical Decision Theory and Bayesian Analysis, Springer; 2nd ed. 1985

  • Leonard, T and Hsu, J. Bayesian Methods, CUP 1999

  • Leonard, T. (1978). Density estimation, stochastic processes and prior information (with discussion), Journal Royal Statistical Society, Ser. B, 40: 113-146.

  • Silverman, B W. (1978). Density ratios, empirical likelihood and cot death, Applied Statistics 27: 26-33.

  • Silver, R N, Martz, H F, and Wallstrom, T (1993). Quantum statistical inference for density estimation, ASA Proceedings of the Section on Bayesian Statistical Science, pp 131-139.

  • Silver, R N, Wallstrom, T, and Martz, H F (1996). Density estimation by maximum quantum entropy.

category: experiments