Textbooks invariably seem to carry the proof that uses Markov’s inequality, moment-generating functions, and Taylor approximations. Here’s an easier way.
For $latex p,q \in (0,1)$, let $latex K(p,q)$ be the KL divergence between a coin of bias $latex p$ and one of bias $latex q$: $latex K(p,q) = p \ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}.$
Theorem: Suppose you do $latex n$ independent tosses of a coin of bias $latex p$. The probability of seeing $latex qn$ heads or more, for $latex q > p$, is at most $latex \exp(-nK(q,p))$. So is the probability of seeing $latex qn$ heads or less, for $latex q < p$.
Remark: By Pinsker’s inequality, $latex K(q,p) \geq 2(p-q)^2$.
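Combined with the theorem, this recovers the familiar Hoeffding-style form: for $latex q > p$, the probability of seeing $latex qn$ heads or more is at most $latex \exp(-2n(q-p)^2)$.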
Proof: Let’s do the $latex q > p$ case; the other is identical.
Let $latex \theta_p$ be the distribution over $latex \{0,1\}^n$ induced by a coin of bias $latex p$, and likewise $latex \theta_q$ for a coin of bias $latex q$. Let $latex S$ be the set of all sequences of $latex n$ tosses which contain $latex qn$ heads or more. We’d like to show that $latex S$ is unlikely under $latex \theta_p$.
Pick any $latex \bar{x} \in S$, with say $latex k \geq qn$ heads. Then:
[latex size="2"] \frac{\theta_q(\bar{x})}{\theta_p(\bar{x})} = \frac{q^k(1-q)^{n-k}}{p^k(1-p)^{n-k}} \geq \frac{q^{qn}(1-q)^{n-qn}}{p^{qn}(1-p)^{n-qn}} = \left( \frac{q}{p} \right)^{qn} \left( \frac{1-q}{1-p}\right)^{(1-q)n} = e^{n K(q,p)}.[/latex]
Since $latex \theta_p(\bar{x}) \leq \exp(-nK(q,p)) \theta_q(\bar{x})$ for every $latex \bar{x} \in S$, we have [latex]\theta_p(S) \leq \exp(-nK(q,p)) \theta_q(S) \leq \exp(-nK(q,p))[/latex] and we’re done.
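A quick numerical sanity check, not part of the original argument: the short Python sketch below compares the exact binomial tail with the bound $latex \exp(-nK(q,p))$. The helper functions and the particular values of $latex n, p, q$ are arbitrary choices of mine.

[sourcecode language="python"]
import math

def kl(q, p):
    # KL divergence K(q, p) between coins of bias q and bias p, as defined above.
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def tail(n, p, k):
    # Exact probability of seeing k or more heads in n tosses of a coin of bias p.
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Arbitrary example: n tosses of a coin of bias p, tail at q > p.
n, p, q = 100, 0.3, 0.5
exact = tail(n, p, math.ceil(q * n))
bound = math.exp(-n * kl(q, p))
print(f"exact tail: {exact:.3e}   bound exp(-nK(q,p)): {bound:.3e}")
# The theorem guarantees exact <= bound.
[/sourcecode]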
How do you justify the step where you replace k with qn?
Since q/p > 1 and k >= qn, (q/p)^k >= (q/p)^(qn). The other factor is similar.
The other factor is similar, but doesn’t it give the wrong direction of inequality? I get 1-q ((1-q)/(1-p))^k <= ((1-q)/(1-p))^(qn)
Broken formatting in my previous comment; it should read: 1-q less than 1-p implies ((1-q)/(1-p))^k <= ((1-q)/(1-p))^(qn).
You dropped the minus sign on k: the exponent is n-k, not k. Since (1-q)/(1-p) < 1 and n-k <= (1-q)n, the inequality goes the right way: ((1-q)/(1-p))^(n-k) >= ((1-q)/(1-p))^((1-q)n).
John, this is really nice!
Sanjoy, actually
Very cool! Thank you, Sanjoy. My students would be so happy to know that they don’t need to go through the lengthy proof that was given last week.
Isn’t this what they call the method of types in info theory books?
Is there a way to get a Chernoff-type bound for more than 2 outcomes? Cover/Thomas use Sanov’s Theorem for that, but it seems much looser than the Chernoff-type bound.
This is similar to the idea used by Impagliazzo and Kabanets in a recent paper on a combinatorial version of the Chernoff-Hoeffding bound. In that paper, they also generalize this argument (as Yaroslav asks). Nice!
Excellent exposition! Would you please tell me where you found this proof, since I assume it predated the proof by Impagliazzo-Kabanets? Many thanks in advance!
I’m not sure which paper you are referring to.
My understanding is that this proof is original, although I can’t swear it’s the first time it’s been discovered.
@jl: That’s excellent! I have no doubt about its originality, but I thought it was officially published somewhere else, either by you or others. Such a nice and simple proof should be better known!