Textbooks invariably seem to carry the proof that uses Markov’s inequality, moment-generating functions, and Taylor approximations. Here’s an easier way.
For $latex p,q \in (0,1)$, let $latex K(p,q)$ be the KL divergence between a coin of bias $latex p$ and one of bias $latex q$: $latex K(p,q) = p \ln \frac{p}{q} + (1-p) \ln \frac{1-p}{1-q}.$
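For concreteness, here is a minimal Python rendering of this quantity (the name `kl_bernoulli` is mine, and it assumes $latex p, q$ strictly between 0 and 1):

```python
import math

def kl_bernoulli(p: float, q: float) -> float:
    """Binary KL divergence K(p, q) in nats, for p, q strictly in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```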
Theorem: Suppose you do $latex n$ independent tosses of a coin of bias $latex p$. The probability of seeing $latex qn$ heads or more, for $latex q > p$, is at most $latex \exp(-nK(q,p))$. So is the probability of seeing $latex qn$ heads or less, for $latex q < p$.
Remark: By Pinsker’s inequality, $latex K(q,p) \geq 2(p-q)^2$.
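To get a feel for how tight this is, here is a small numerical sanity check (my own sketch, not part of the argument; it assumes Python 3.8+ for `math.comb`, and the helper names are mine) comparing the exact binomial tail with the bound $latex \exp(-nK(q,p))$ and the weaker Pinsker-style bound $latex \exp(-2n(q-p)^2)$:

```python
import math

def kl_bernoulli(p, q):
    # Binary KL divergence K(p, q), as defined above.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def exact_upper_tail(n, p, q):
    # Exact probability of seeing qn heads or more in n tosses of a bias-p coin.
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(math.ceil(q * n), n + 1))

n, p, q = 100, 0.5, 0.6
print("exact tail   ", exact_upper_tail(n, p, q))
print("KL bound     ", math.exp(-n * kl_bernoulli(q, p)))
print("Pinsker bound", math.exp(-2 * n * (q - p) ** 2))
```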
Proof: Let’s do the $latex q > p$ case; the other is identical.
Let $latex \theta_p$ be the distribution over $latex \{0,1\}^n$ induced by a coin of bias $latex p$, and likewise $latex \theta_q$ for a coin of bias $latex q$. Let $latex S$ be the set of all sequences of $latex n$ tosses which contain $latex qn$ heads or more. We’d like to show that $latex S$ is unlikely under $latex \theta_p$.
Pick any $latex \bar{x} \in S$, with say $latex k \geq qn$ heads. Then:
[latex size="2"] \frac{\theta_q(\bar{x})}{\theta_p(\bar{x})} = \frac{q^k(1-q)^{n-k}}{p^k(1-p)^{n-k}} \geq \frac{q^{qn}(1-q)^{n-qn}}{p^{qn}(1-p)^{n-qn}} = \left( \frac{q}{p} \right)^{qn} \left( \frac{1-q}{1-p}\right)^{(1-q)n} = e^{n K(q,p)}.[/latex]
Since $latex \theta_p(\bar{x}) \leq \exp(-nK(q,p)) \theta_q(\bar{x})$ for every $latex \bar{x} \in S$, we have [latex]\theta_p(S) \leq \exp(-nK(q,p)) \theta_q(S) \leq \exp(-nK(q,p))[/latex] and we’re done.
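If it helps to see the mechanism of the proof, here is a brute-force sketch (mine, purely illustrative) that enumerates every sequence in $latex S$ for a small $latex n$, checks the per-sequence likelihood-ratio inequality above, and then confirms $latex \theta_p(S) \leq \exp(-nK(q,p)) \theta_q(S) \leq \exp(-nK(q,p))$:

```python
import math
from itertools import product

def check_proof(n=12, p=0.3, q=0.5):
    # K(q, p): KL divergence between a coin of bias q and one of bias p.
    K = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
    theta_p_S = theta_q_S = 0.0
    for seq in product([0, 1], repeat=n):
        k = sum(seq)
        if k < q * n:          # keep only sequences in S (qn heads or more)
            continue
        prob_p = p**k * (1 - p)**(n - k)   # theta_p of this sequence
        prob_q = q**k * (1 - q)**(n - k)   # theta_q of this sequence
        # Per-sequence step of the proof: theta_q / theta_p >= exp(n K(q, p)).
        assert prob_q / prob_p >= math.exp(n * K) * (1 - 1e-12)
        theta_p_S += prob_p
        theta_q_S += prob_q
    # Conclusion: theta_p(S) <= exp(-n K) * theta_q(S) <= exp(-n K).
    assert theta_p_S <= math.exp(-n * K) * theta_q_S <= math.exp(-n * K)
    return theta_p_S, math.exp(-n * K)

print(check_proof())
```

For $latex n = 12$ this enumerates only $latex 2^{12}$ sequences, so it runs instantly; it is meant to mirror the two steps of the proof, not to be an efficient computation.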
How do you justify the step where you replace k with qn?
Since q/p > 1 and k >= qn, (q/p)^k >= (q/p)^(qn). The other factor is similar.
The other factor is similar, but doesn’t it give the wrong direction in inequality? I get 1-q ((1-q)/(1-p))^k<=((1-q)/(1-p))^qn
Broken formatting; it should read: 1-q < 1-p implies ((1-q)/(1-p))^k <= ((1-q)/(1-p))^(qn)
You dropped the minus sign on k: the exponent is n-k, not k.
John, this is really nice!
Sanjoy, actually 🙂
Very cool! Thank you, Sanjoy. My students would be so happy to know that they don’t need to go through the lengthy proof that was given last week. 🙂
Isn’t this what they call the method of types in info theory books?
Is there a way to get a Chernoff-type bound for more than 2 outcomes? Cover/Thomas use Sanov’s Theorem for that, but it seems much looser than a Chernoff-type bound.
This is similar to the idea used by Impagliazzo and Kabanets in a recent paper on a combinatorial version of the Chernoff bound. In that paper, they also generalize this argument (as Yaroslav asks). Nice!
Excellent exposition! Would you please tell me where you found this proof, since I assume that it predated the proof by Impagliazzo-Kabanets? Many thanks in advance!
I’m not sure which paper you are referring to.
My understanding is that this proof is original, although I can’t swear it’s the first time it’s been discovered.
@jl: That’s excellent! I have no doubt about its originality, but I thought it was officially published somewhere else, either by you or by others. Such a nice and simple proof should be better known!