Machine Learning (Theory)

1/23/2006

On Coding via Mutual Information & Bayes Nets

Say we have two random variables X,Y with mutual information I(X,Y). Let’s say we want to represent them with a bayes net of the form X< -M->Y, such that the entropy of M equals the mutual information, i.e. H(M)=I(X,Y). Intuitively, we would like our hidden state to be as simple as possible (entropy wise). The data processing inequality means that H(M)>=I(X,Y), so the mutual information is a lower bound on how simple the M could be. Furthermore, if such a construction existed it would have a nice coding interpretation — one could jointly code X and Y by first coding the mutual information, then coding X with this mutual info (without Y) and coding Y with this mutual info (without X).

It turns out that such a construction does not exist in general (Thx Alina Beygelzimer for a counterexample! see below for the sketch).

What are the implications of this? Well, it’s hard for me to say, but it does suggest to me that the ‘generative’ model philosophy might be burdened with a harder modeling task. If all we care about is a information theoretic, compact hidden state, then constructing an accurate Bayes net might be harder, due to the fact that it takes more bits to specify the distribution of the hidden state. In fact, since we usually condition on the data, it seems odd that we should bother specifying a (potentially more complex) generative model. What are the alternatives? The information bottleneck seems interesting, though this has peculiarities of its own.

Alina’s counterexample:

Here is the joint distribution P(X,Y). Sample binary X from an unbiased coin. Now choose Y to be the OR function of X and some other ‘hidden’ random bit (uniform). So the joint is:

P(0,0)=1/4
P(0,1)=1/4
P(1,0)=0
P(1,1)=1/2

Note P(X=1)=1/2 and P(Y=1)=3/4. Here,

I(X,Y)= 3/4 log (4/3) ~= 0.31

The rest of the proof showing that this is not achievable in a ‘compact’ Bayes net is in a comment.

3 Comments to “On Coding via Mutual Information & Bayes Nets”
  1. sham says:

    Here is a proof sketch.

    Recall P(X,Y). Sample binary X from an unbiased coin. Now choose Y to be the OR function of X and some other ‘hidden’ random bit (uniform). So the joint is: P(0,0)=1/4 P(0,1)=1/4 P(1,0)=0 P(1,1)=1/2 Note P(X=1)=1/2 and P(Y=1)=3/4. Here,

    I(X,Y)= 3/4 log (4/3) ~= 0.31

    1) First, one can show that we can restrict ourselves to binary M.

    sketch: Basically, consider the set of (possibly non-binary) settings of M which could give rise to X=1. For these settings, Y =1 with probability 1, else we’d have (1,0) possible. Hence, we can collapse this set of M, say to be say M=1. Similarly, we can collapse the other set to be M=0.

    2) If M is binary, one can show that, wlog,:

    P(X=1|M=1)>0 & P(Y=1|M=1)=1
    P(Y=0|M=0)>0 & P(X=0|M=0)=1

    sketch: Let M=1 be a setting in which X=1 is possible. Hence, if X=1 is possible for M=1, then Y=0 must be impossible, so P(Y=1|M=1)=1. Similarly, if Y=0 is a possibility (when M=0, wlog), then X=1 must be impossible, and we have P(X=0|M=0)=1.

    3) Now we can show that:

    1/2 < = P(M=1) <=3/4

    sketch: If P(M=1) is less than 1/2, then P(X=1) is less than 1/2 (by 2). But this is a contradiction, since under P(X,Y), P(X)=1/2. If P
    (M=1) is greater than 3/4, then P(Y=1) is greater than 3/4. But under
    P(X,Y), P(Y)=3/4.

    4) We are done since 1>=H(M)>=0.81, by 3. Hence, H(M) must be greater
    than 0.31.

    5) conjecture. This examples extends to the block coding case. Where P
    (X’,Y’) is an iid sequence drawn under P.

  2. Anonymous says:

    Common information is far less than mutual inf for “almost all” joint distributions P(X,Y), see
    G\’acs, P., and K\”orner, J. (1973). Common information is far less than mutual information, Probl. Inform. Control, 2(2), 149-162.
    And more recent refs at http://citeseer.ifi.unizh.ch/context/562456/0

    But many practical P(X,Y) luckily fit to that “almost nothing” :)

  3. If the goal is to minimize future mis-classification rate E_X[error], then knowing true P(X) would be useful, because it can tell us to “try harder” for particular X (perhaps force larger margin). If we don’t know P(X), we can model it instead. So couldn’t this be a reason to prefer the “harder” generative modelling philosophy?

Leave a Reply


Powered by WordPress