Machine Learning (Theory)

11/16/2008

Observations on Linearity for Reductions to Regression

Tags: Machine Learning, Reductions — jl @ 6:54 pm

Dean Foster and Daniel Hsu had a couple of observations about reductions to regression that I wanted to share. This will make the most sense for people familiar with error correcting output codes (see the tutorial, page 11).

Many people are comfortable using linear regression in a one-against-all style, where you try to predict the probability of choice i vs. the other classes, yet they are not comfortable with more complex error correcting codes because they fear those codes create harder problems. This fear turns out to be mathematically incoherent under a linear representation: comfort in the one-against-all case should imply comfort with more complex codes.

In particular, if there exists a set of weight vectors wi such that P(i|x) = <wi,x>, then for any invertible error correcting output code C, there exist weight vectors wc which decode to perfectly predict the probability of each class. The proof is simple and constructive: each wc can be constructed as the linear superposition of the wi implied by the code, and invertibility guarantees that a correct encoding decodes to the correct probabilities.
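The constructive argument can be checked numerically. A minimal sketch (all names and dimensions here are illustrative, not from the post): take per-class weight vectors wi as the rows of a matrix W, form the coded weight vectors as the superposition C @ W for an invertible code matrix C, and verify that decoding the coded predictions recovers P(i|x) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 6  # number of classes, input dimension (illustrative)

# Assume class probabilities are exactly linear: P(i|x) = <w_i, x>,
# with one weight vector per class stored as the rows of W.
W = rng.normal(size=(k, d))

# Any invertible k x k error correcting output code C.
C = rng.normal(size=(k, k))
assert np.linalg.matrix_rank(C) == k

# Coded weight vectors: each w_c is the linear superposition of the w_i
# implied by the code, i.e. the rows of C @ W.
W_coded = C @ W

x = rng.normal(size=d)
p = W @ x                  # one-against-all predictions P(i|x)
coded_preds = W_coded @ x  # predictions of the coded regressors

# Invertibility: decoding the coded predictions recovers P(i|x) exactly.
decoded = np.linalg.solve(C, coded_preds)
assert np.allclose(decoded, p)
```

The key point the sketch illustrates is that the coded regression problems are no harder than the one-against-all problems: their exact solutions are linear combinations of the one-against-all solutions.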

This observation extends to all-pairs-like codes, which compare subsets of choices to subsets of choices using “don’t cares”.

Using this observation, Daniel created a very short proof of the PECOC regret transform theorem (here, and Daniel’s updated version).

One further observation is that under ridge regression (a special case of linear regression), for any code, there exists a setting of parameters such that you might as well use one-against-all instead, because you get the same answer numerically. The implication is that the advantages of codes more complex than one-against-all are confined to other prediction methods.
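This equivalence is easy to verify, since the ridge solution W = (XᵀX + λI)⁻¹XᵀT is linear in the targets T: fitting on coded targets and decoding gives the same predictions as one-against-all. A hypothetical demonstration (the setup, data, and helper `ridge` are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 5, 3
lam = 0.1  # the same ridge parameter for every output

X = rng.normal(size=(n, d))
Y = np.eye(k)[rng.integers(k, size=n)]  # one-hot class labels

def ridge(X, T, lam):
    """Closed-form ridge regression: (X'X + lam*I)^{-1} X'T."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ T)

C = rng.normal(size=(k, k))  # any invertible code

W_ova = ridge(X, Y, lam)          # one-against-all regressors
W_code = ridge(X, Y @ C.T, lam)   # regressors trained on coded targets

# Because the ridge solution is linear in the targets, decoding the
# coded predictions reproduces the one-against-all answer exactly.
x = rng.normal(size=d)
pred_ova = x @ W_ova
pred_decoded = np.linalg.solve(C, W_code.T @ x)
assert np.allclose(pred_decoded, pred_ova)
```

With a shared regularization parameter across outputs, the coded solution is just W_ova composed with the code, so the code adds nothing numerically; any advantage of more complex codes must come from prediction methods that are not linear in the targets.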

2 Comments to “Observations on Linearity for Reductions to Regression”
  1. anonymous says:

    Isn’t linear regression actually a special case of ridge regression, in which the ridge parameter is 0?

  2. jl says:

You are probably right, at least in some contexts. The question is whether ‘linear regression’ means a particular algorithm or a particular representation. I think of it as the latter (a representation), but others may think of it as the former (an algorithm).
