How to think optimally ignorantly

I often return to Jaynes’s wonderful 1957 paper “Information theory and statistical mechanics”. Starting from the entropy as defined by Shannon, Jaynes lucidly explains how to infer the correct probability distribution describing a statistical phenomenon. The correct distribution follows from whatever pieces of information we have about the phenomenon: using those pieces, and otherwise obeying a “principle of maximum ignorance”, the proper distribution can be computed.

The form of the definition of Shannon entropy is interesting in itself. Shannon defined the entropy as the expected value of the information contained in a message. The information of a message \mathcal{M} with probability P(\mathcal{M}) is quantified by -\log_2 P(\mathcal{M}), chosen so that information is additive over independent messages and varies continuously with the probabilities. For example, a message consisting of a single fair coin toss carries one bit of information, -\log_2 1/2 = 1, while two coin tosses carry two bits, -\log_2(1/2 \cdot 1/2) = 2(-\log_2 1/2) = 2. The average information, now called the entropy, of a probability distribution on an n-state system is thus S=-\sum_i p_i \log p_i

A single flip of a fair coin has equal probability for heads and tails and thus S = -1/2 \log_2 1/2 - 1/2 \log_2 1/2 = 1. The maximum entropy occurs, in fact, when the probabilities are even, and we take forward this idea of entropy as maximal evenness. As an aside, the base of the logarithm is unimportant: base two is convenient for systems measured in bits, while in statistical mechanics the natural logarithm is the usual choice.
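As a quick numerical check (a sketch in Python; the helper name entropy_bits is my own), we can confirm that a fair coin carries one bit, that two independent flips carry two, and that an uneven split carries less than the even one:

```python
import math

def entropy_bits(p):
    """Shannon entropy S = -sum_i p_i log2 p_i of a distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A single fair coin flip carries exactly one bit.
print(entropy_bits([0.5, 0.5]))  # 1.0

# Two independent fair flips: four equally likely outcomes, two bits.
print(entropy_bits([0.25] * 4))  # 2.0

# Any uneven split over two outcomes has less entropy than the even one.
print(entropy_bits([0.9, 0.1]) < entropy_bits([0.5, 0.5]))  # True
```

The guard `if pi > 0` uses the standard convention 0 log 0 = 0, which follows from the limit p log p → 0 as p → 0.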

Let x be a discrete variable drawn from a set of n values \{x_1, x_2, \ldots, x_n\}. We don’t know the probability distribution p(x_i). But what if we assume we do know the mean of some function f of x? By the rules of discrete probability, this mean is

\langle f(x) \rangle = \sum_i p(x_i)f(x_i).

Can we now infer the mean of some other function \langle g(x)\rangle? At first, Jaynes says, such an inference seems impossible. Even with the normalization condition \sum_i p(x_i) = 1, we have only two equations for the n unknowns p(x_1),\ldots,p(x_n) — we are still short n-2 equations.

Laplace answered such questions with his ‘principle of insufficient reason’: absent any reason to favor one outcome over another, all outcomes should be assigned equal probability. Using his principle, if no information is given about the probability distribution p(x_i), we must take all x_i to be equally likely; x is then a uniform random variable and its probability distribution is flat.

When we additionally know the mean \langle f \rangle, we can maximize the entropy subject to two constraints, each enforced with a Lagrange multiplier:

\lambda_0: the constraint that the total probability is unity.
\lambda_1: the constraint that fixes the mean.

Writing p(x_i)=p_i for brevity, the entropy together with the Lagrange-multiplier terms is

S = -\sum_i p_i \log p_i - \lambda_0\left(\sum_i p_i - 1 \right) - \lambda_1\left(\sum_i p_i f_i - \langle f \rangle\right)

where, to be maximal, we must have \partial S/\partial p_i = 0 for each i. The values f_i = f(x_i) do not depend on the distribution we are seeking, so \partial \langle f \rangle/\partial p_i = f_i. Computing the derivative:

\frac{\partial S}{\partial p_i} = -\log p_i - 1 - \lambda_0 - \lambda_1 f_i = 0

This must hold for each i. Solving for p_i leads to

p(x) = e^{-1-\lambda_0 - \lambda_1 f(x)} = c\,e^{-\lambda_1 f(x)}

so that the maximally entropic distribution is an exponential function of f. This should look familiar to anyone who has seen the Boltzmann distribution for the canonical ensemble in statistical physics. The constant c is then fixed by the normalization condition \sum_i p_i = 1.
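We can carry out this construction numerically. The sketch below (my own helper, maxent_dist) uses Jaynes’s favorite example: a six-sided die whose average roll is constrained to 4.5 rather than the fair value 3.5. Since p_i \propto e^{-\lambda_1 x_i} and the mean is monotone in \lambda_1, a simple bisection finds the multiplier:

```python
import math

def maxent_dist(xs, mean_target, lo=-50.0, hi=50.0, iters=200):
    """Find the maximum-entropy distribution p_i = c * exp(-lam * x_i)
    whose mean matches mean_target, by bisection on lam."""
    def mean_for(lam):
        w = [math.exp(-lam * x) for x in xs]
        z = sum(w)  # normalization constant: c = 1/z
        return sum(x * wi for x, wi in zip(xs, w)) / z

    # mean_for is monotone decreasing in lam, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > mean_target:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * x) for x in xs]
    z = sum(w)
    return lam, [wi / z for wi in w]

# A die whose average roll is constrained to 4.5 instead of 3.5:
lam, p = maxent_dist([1, 2, 3, 4, 5, 6], 4.5)
print(lam)  # negative: higher faces must be favored
print(p)    # probabilities rise exponentially with the face value
```

Note that no face gets probability zero: maximum ignorance spreads the distribution as evenly as the mean constraint allows, rather than, say, putting all weight on 4 and 5.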

What if we add another constraint? For example, if we also fix the standard deviation, thereby constraining the second moment, we end up with the ubiquitous Gaussian distribution. It is really cool that the “max-ent” method for inferring distributions agrees with assumptions commonly made by intuition but with less mathematical rigor. This note is meant to illustrate this powerful technique and to show how maximum ignorance can actually be defined when making inferences.
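A quick sanity check of that claim, using the known closed-form differential entropies (in nats) of three continuous distributions tuned to the same variance \sigma^2: at equal mean and variance the Gaussian should beat the alternatives, here a uniform and a Laplace distribution:

```python
import math

sigma = 1.0  # common standard deviation for all three distributions

# Closed-form differential entropies (in nats) at equal variance sigma^2:
h_gaussian = 0.5 * math.log(2 * math.pi * math.e * sigma**2)
h_uniform = math.log(math.sqrt(12) * sigma)      # width sqrt(12)*sigma
h_laplace = 1 + math.log(math.sqrt(2) * sigma)   # scale sigma/sqrt(2)

print(h_gaussian, h_uniform, h_laplace)
print(h_gaussian > h_laplace > h_uniform)  # True
```

The ordering holds for any sigma, since rescaling shifts all three entropies by the same log(sigma).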
