Dean Foster and Daniel Hsu had a couple of observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error-correcting output codes (see the tutorial, page 11).
Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs. the other classes, yet they are not comfortable with more complex error-correcting codes because they fear that such codes create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the linear case should imply comfort with more complex codes.
In particular, if there exists a set of weight vectors wi such that P(i|x) = <wi,x>, then for any invertible error-correcting output code C, there exist weight vectors wc whose predictions decode to exactly the probability of each class. The proof is simple and constructive: each weight vector wc is the linear superposition of the wi implied by the code, so <wc,x> is the same superposition of the P(i|x), and invertibility means that a correct encoding implies a correct decoding.
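A minimal NumPy sketch of that construction, under my own assumptions: the sizes, the random weights W, and the particular 3x3 code matrix C are all hypothetical, and decoding is taken to mean applying the inverse of C. The random W does not produce real probabilities; it only stands in for the linear predictors <wi,x>, since linearity is all the argument uses.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 4  # number of classes and features (hypothetical sizes)

# Hypothetical one-against-all weights: row i is w_i, standing in for
# a linear predictor with P(i|x) = <w_i, x>.
W = rng.normal(size=(k, d))

# A hypothetical invertible code: row c says which classes code bit c sums.
C = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])

# Superposition implied by the code: w_c = sum_i C[c, i] * w_i.
W_code = C @ W

x = rng.normal(size=d)
class_scores = W @ x       # <w_i, x> for each class
code_scores = W_code @ x   # <w_c, x> for each code bit

# Decoding inverts the code and recovers the per-class values.
decoded = np.linalg.solve(C, code_scores)
print(np.allclose(decoded, class_scores))
```

Nothing here depends on the particular C beyond invertibility; any other invertible matrix could be swapped in.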
This observation extends to all-pairs-like codes, which compare subsets of choices to subsets of choices using “don’t cares”.
One further observation is that under ridge regression (a special case of linear regression), for any code, there exists a setting of parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that the advantages of codes more complex than one-against-all are confined to other prediction methods.
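Here is a small numerical check of one case of this, as I read it; it is a sketch under stated assumptions, not the general argument. I assume closed-form ridge regression with the same regularization strength lam for every output, an invertible code matrix C, and decoding by inverting C. The data, sizes, and the helper ridge_fit are all made up for illustration. Because the ridge solution is linear in the targets, training on coded targets and then decoding gives the same predictions as training one-against-all directly.

```python
import numpy as np

def ridge_fit(X, Y, lam):
    """Closed-form ridge: minimizes ||X W - Y||^2 + lam * ||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(1)
n, d, k = 50, 5, 3                 # hypothetical sizes
X = rng.normal(size=(n, d))        # synthetic features
labels = rng.integers(0, k, size=n)
Y = np.eye(k)[labels]              # one-against-all targets, n x k

# Hypothetical invertible code: row c says which classes code bit c sums.
C = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])
lam = 0.1                          # shared regularization (an assumption)

W_ova = ridge_fit(X, Y, lam)         # one column per class
W_code = ridge_fit(X, Y @ C.T, lam)  # one column per code bit

X_new = rng.normal(size=(4, d))
ova_pred = X_new @ W_ova
code_pred = X_new @ W_code

# Decode the code-bit predictions back to per-class predictions.
decoded_pred = np.linalg.solve(C, code_pred.T).T
print(np.allclose(decoded_pred, ova_pred))
```

The check relies on the shared lam: the regularized solution is then the same linear map applied to either set of targets, so the code passes straight through it.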