People sometimes advocate using maximum entropy priors in Bayesian statistics. This seems a dangerous practice to me.
A classic example is that you have some sort of device or process which can produce the numbers 1 to 6 with probabilities $p_1 \dots p_6$, and the only thing you know about the device or process is that the mean is $m$. You can then use the principle of maximum entropy (MEP) to choose the $p_i$. The result is of the form $p_i = \alpha \beta^i$, and using $\sum p_i = 1$ and $\sum i p_i = m$ you can solve (numerically) for $\alpha$ and $\beta$.
Why dangerous? Suppose $m=1.01$. The result has $p_6 \approx 10^{-10}$. Although the method is based on the idea of assuming as little as possible, it manages to produce, from very little input, the rather dramatic conclusion that you’re almost certain never to see a 6. Humans tend to underestimate the probability of rare events and get into trouble because of it, and this principle seems to encourage such mistakes.
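To make the numbers concrete, here is a minimal sketch of the numerical solve in Python, using only the standard library. The function name `maxent_die` and the choice of bisection on $\log \beta$ are assumptions made for illustration; any root finder would do. It relies on the fact that the mean of $p_i \propto \beta^i$ increases with $\beta$.

```python
import math

def maxent_die(m, faces=6):
    """Return (alpha, beta, p) with p[i-1] = alpha * beta**i, sum(p) = 1, mean = m.

    Hypothetical helper for illustration; requires 1 < m < faces.
    """
    if not 1 < m < faces:
        raise ValueError("mean must lie strictly between 1 and the number of faces")

    def mean_for(log_beta):
        # Mean of the distribution p_i proportional to beta**i.
        # Subtracting the largest exponent avoids overflow for large beta.
        logs = [i * log_beta for i in range(1, faces + 1)]
        top = max(logs)
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        return sum(i * wt for i, wt in zip(range(1, faces + 1), weights)) / total

    # The mean increases monotonically with beta, so bisect on log(beta).
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < m:
            lo = mid
        else:
            hi = mid

    beta = math.exp(0.5 * (lo + hi))
    alpha = 1.0 / sum(beta ** i for i in range(1, faces + 1))
    p = [alpha * beta ** i for i in range(1, faces + 1)]
    return alpha, beta, p

alpha, beta, p = maxent_die(1.01)
print("beta =", beta)
print("p_6  =", p[5])  # of order 1e-10
```

With $m = 3.5$ the same function returns $\beta = 1$ and the uniform distribution, which is a quick sanity check.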
What’s going wrong here?
The situation as described is self-inconsistent. To an applied statistician ‘a device or process’ is an actual real world thing. It might be the result of throwing a conjuror’s die. It might be the number of foetuses in a randomly chosen human pregnancy. It might be an obscure sports statistic. It might be the category of the next hurricane. Even though ‘a device or process’ is as vague as can be, knowing even this much is not compatible with ‘only thing you know about the device or process is that the mean is $m$.’ You can’t be completely ignorant. You know from experience that sixes happen more often than 1 in ten billion. Quantifying exactly how much more often is hard, but you can do better than MEP.
What we are really ignorant about is the distribution of $p_1 \dots p_6$. After all, if the purpose of the exercise were to estimate $p_1 \dots p_6$ using data sampled from this device or process, we would regard the $p_i$ as parameters and we would start with a prior distribution for them. Without a constraint on the mean, a Dirichlet prior would be a popular choice. That still leaves a hyperparameter to be decided: perhaps MEP should be used here. But then what measure? Lebesgue? And how to apply the constraint on the mean?
The following seems a reasonable procedure. It is a sort of vanilla Bayesian approach. The constraints $p_i \geq 0$, $\sum p_i = 1$ and $\sum i p_i = m$ define a 4-dimensional polyhedron T in $\mathbb{R}^6$. Assume a uniform prior (with respect to Lebesgue measure) over T. With no data, and a further assumption of a squared error loss function, our Bayesian estimate for $p_1 \dots p_6$ will be the centroid of T.
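Set $w = m - 1$, so that the mean constraint becomes $p_2 + 2p_3 + 3p_4 + 4p_5 + 5p_6 = w$. If $w \leq 1$, the constraint $p_i \leq 1$ never binds for $p_2 \dots p_6$, and T is a ‘slice’ through the unit simplex, which is an (irregular) 4-dimensional simplex, the analogue of a tetrahedron. The 5 vertices of T can be found by setting all but one of $p_2 \dots p_6$ to be zero in turn. These are then

$$
\begin{aligned}
&(1-w,\ w,\ 0,\ 0,\ 0,\ 0) \\
&(1-\tfrac{w}{2},\ 0,\ \tfrac{w}{2},\ 0,\ 0,\ 0) \\
&(1-\tfrac{w}{3},\ 0,\ 0,\ \tfrac{w}{3},\ 0,\ 0) \\
&(1-\tfrac{w}{4},\ 0,\ 0,\ 0,\ \tfrac{w}{4},\ 0) \\
&(1-\tfrac{w}{5},\ 0,\ 0,\ 0,\ 0,\ \tfrac{w}{5})
\end{aligned}
$$

and their mean, the centroid of T, is

$$
\left(1-\tfrac{137w}{300},\ \tfrac{w}{5},\ \tfrac{w}{10},\ \tfrac{w}{15},\ \tfrac{w}{20},\ \tfrac{w}{25}\right).
$$

There is a different formula if $m > 2$. If $m=1.01$ so $w=0.01$ this is approximately

$$
(0.99543,\ 0.002,\ 0.001,\ 0.00067,\ 0.0005,\ 0.0004).
$$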
So now we think that the probability of a six is 1 in 2,500, instead of 1 in ten billion.
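The centroid calculation can be checked with exact arithmetic. This is a small sketch, assuming $1 < m \leq 2$ as above; the function names `vertices_of_T` and `centroid` are made up for this example. It builds each vertex by letting a single face $k \geq 2$ carry all of the excess mean $w$, then averages the vertices, which gives the centroid because T is a simplex.

```python
from fractions import Fraction

def vertices_of_T(m, faces=6):
    """Vertices of T for 1 < m <= 2 (hypothetical helper; m as a str, int or Fraction)."""
    w = Fraction(m) - 1
    if not 0 < w <= 1:
        raise ValueError("this construction assumes 1 < m <= 2")
    verts = []
    for k in range(2, faces + 1):
        p = [Fraction(0)] * faces
        p[k - 1] = w / (k - 1)   # face k alone accounts for the excess mean w
        p[0] = 1 - p[k - 1]      # the remaining probability goes to face 1
        verts.append(p)
    return verts

def centroid(verts):
    """Coordinate-wise mean of the vertices; for a simplex this is its centroid."""
    n = len(verts)
    return [sum(col) / n for col in zip(*verts)]

c = centroid(vertices_of_T("1.01"))
print(c[5])  # 1/2500
```

Passing the mean as the string `"1.01"` keeps the arithmetic exact; a float would bring in its binary rounding error.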
For the original problem of deciding ‘at once’ what $p_1 \dots p_6$ should be, it seems the counting measure is the only reasonable choice. (But see later.) When the parameter space is continuous, the choice of measure is less obvious. For the whole real line, Lebesgue measure is often justified by a translation argument: we want $[a,b]$ to have the same measure as $[a+x,b+x]$ for all $x$. But for a bounded interval that argument doesn’t work. The following priors have all been suggested for a binomial parameter $q \in [0,1]$.

$$
\begin{aligned}
\pi_1(q) &= 1 \\
\pi_2(q) &= q^{-1}(1-q)^{-1} \\
\pi_3(q) &\propto [q(1-q)]^{-1/2} \\
\pi_4(q) &\propto q^{q}(1-q)^{1-q}
\end{aligned}
$$
(Berger, p89, section 3.3.4.) They are all reasonable according to Berger. I think they all have ‘informational’ justifications. $\pi_1$ is flat, the others U-shaped. $\pi_2$ is extremely U-shaped, and improper (which does not seem reasonable to me). $\pi_3$ is very U-shaped, $\pi_4$ is mildly U-shaped. $\pi_3$ is a beta distribution, and $\pi_2$ is a limiting case of a beta distribution. $\pi_3$ generalises to a Dirichlet distribution, and $\pi_2$ to a limiting case of a Dirichlet distribution in higher dimensions, and I guess those would have the same justifications. I don’t know what the equivalent of $\pi_4$ is in higher dimensions.
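To get a feel for the shapes, here is a rough sketch that evaluates the four priors at a few points. It assumes the forms given in Berger's example, namely $\pi_1(q) = 1$, $\pi_2(q) = q^{-1}(1-q)^{-1}$, $\pi_3(q) \propto [q(1-q)]^{-1/2}$ and $\pi_4(q) \propto q^q (1-q)^{1-q}$. Only $\pi_3$ is normalised here (as a Beta(1/2, 1/2) density); $\pi_2$ cannot be normalised and $\pi_4$ is left unnormalised, so compare shapes rather than heights. The function names are made up for this example.

```python
import math

def pi1(q):
    return 1.0                                   # flat

def pi2(q):
    return 1.0 / (q * (1.0 - q))                 # improper: integral over (0, 1) diverges

def pi3(q):
    return 1.0 / (math.pi * math.sqrt(q * (1.0 - q)))  # Beta(1/2, 1/2) density

def pi4(q):
    return q ** q * (1.0 - q) ** (1.0 - q)       # unnormalised

for q in (0.01, 0.1, 0.5, 0.9, 0.99):
    print(f"q={q:<5} pi1={pi1(q):.3f} pi2={pi2(q):8.3f} pi3={pi3(q):.3f} pi4={pi4(q):.3f}")
```

The ratio of each prior near the edge to its value at $q = 1/2$ gives a crude measure of how U-shaped it is.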
If $[0,1]$ should have a U-shaped distribution, then shouldn’t $\{1,2, \dots, N\}$ have one too?
I don’t know a name for turning a point into a distribution. I’ll call it distributization. It is reminiscent of going from classical to quantum mechanics, and there is a ‘second distributization’ waiting out there… Possibly relevant: the Silver, Martz and Wallstrom references below.
Distributization is also similar to the common practice of using a parametric distribution (with parameter(s) $\eta$ say) for a parameter $\theta$, and then using another distribution, a hyperprior, for $\eta$. And so on. Usually only a few new parameters are introduced at each stage so measure theoretic problems don’t arise.
How do we know when to stop doing distributization or adding hyperpriors? When we stop, is a maximum entropy principle appropriate for the last step?
Berger, J O (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer.
Silver, R N, Martz H F, and Wallstrom T (1993). Quantum statistical inference for density estimation, ASA Proceedings of the Section on Bayesian Statistical Science, pp 131-139.
Silver, R N, Wallstrom T, and Martz H F (1996). Density estimation by maximum quantum entropy.