Not EM for clustering at COLT

One standard approach for clustering data with a set of gaussians is using EM. Roughly speaking, you pick a set of k random guassians and then use alternating expectation maximization to (hopefully) find a set of guassians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.

One cool observation is that this can often be solved via other algorithm which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting.

A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form:

Project to low dimensional space.
Pick out gross structure.
Project gross structure into the high dimensional space.
Run EM (or some other local improvement algorithm) to find a final fit.

The effects of steps 1-3 is to “seed” the local optimization algorithm in a good place from which a global optima is plausibly reachable.

13 Replies to “Not EM for clustering at COLT”

Paul Deignan says:

7/11/2005 at 2:48 am

Projecting Gaussians down to lower dimension may result in an increase in complexity in that reduced dimensional space. So the result of this can be to introduce local minima where minima do not occur.
John Langford says:

7/11/2005 at 5:50 am

The theorems say this can only be true when the gaussians are not well-separated. So, roughly speaking, the projection method is guaranteed to work anytime the gaussians are distinct.
Paul Deignan says:

7/11/2005 at 9:53 pm

Well-seperated?! Using a distance metric that reqires more than the dimension being projected? What sort of a metric would that be?

Decreases in complexity are not achieved by projections unless the projected dimension is absolutely redundant.
John Langford says:

7/12/2005 at 5:54 am

The meaning of well-separated is given in the papers. Have you read them?

I’m not sure what “absolutely redundant” means, but I expect something like “you can run PCA on the dataset to remove dimensions without introducing any distortion in the dataset”. (In other words, all points fall in a planar subspace.) In that case, this statement:

Decreases in complexity are not achieved by projections unless the projected dimension is absolutely redundant.

is incorrect.
Paul Deignan says:

7/12/2005 at 11:33 pm

Absoulutely redundant — does not change the minimal encoding of the distribution.

So the existence of the dimension itself in the formulation is an additional one-parameter codeword.

(PCA? A projected mixture of Gaussians is not necessarily Gaussian. If the problem is to fit an unknown number of Gaussians, then PCA is not much help. Again, you can see this spatially.)

There are thousands of papers written in data mining claiming to simplify this or that. However, authors tend not to hold themselves to any uniform basis for evaluation (encoding is one universal method) so the game becomes a matter of finding where a fudge factor was tweaked or an assumption made.

Fortunately, optimization, separation, and certainly Gaussians all have an easily comprehended geometric interpretation. We can visualize Gaussians being projected onto each other producing local minima (just project two along a common axis but with different widths/heights). So I am skeptical. If the definition of well-separated relies on a metric (and it should) then there are also other assumptions involved as the previous comment alludes.
John Langford says:

7/13/2005 at 7:59 am

There are 3 assumptions underlying the current work: 1) the subdistributions are “log concave” (which includes gaussians) 2) the means of these distributions are separated 3) that every distribution has a reasonable representation in the mixture. Subject to these distributions, the theorem states that you can succesfully extract the components of a mixture distribution (to some error rate) with sufficient samples.

If you can’t reconcile this with your intuitions, I’d recommend taking a look.
Paul Deignan says:

7/13/2005 at 11:30 pm

As I said, projection meay lead to a multiplicity of local minima (increased complexity).

The converse, unfolding in higher dimensions will in general simplify seperable distributions if the extention is in informative dimension.

I believe this is well-known.

BTW, Gaussians are not compact. They cannot be seperated in the topological sense of the word.
John Langford says:

7/14/2005 at 10:16 am

Again, I encourage you to reconcile your intuitions with the math. Math is never wrong, although it is sometimes misplaced. I don’t see how this math can be misplaced.
Paul Deignan says:

7/24/2005 at 12:56 am

As I read your claim, this is an algorithm for avoiding local minima in EM clustering based on projecting to find “gross structure” and then lifting the projection.

I’m sure many things can be mathematically proved. That all very nice but hardly worth the time digging through the details if the overall algorithm is useless due to unstated or obscured assumptions.

What is “well-seperated”? What is “gross structure”? These are not mathematically rigorous terms. What “proof” can be constructed based on these and to what ends?

You should be wary of claims in projections and lifting of distribustions with finite data that are not statistically quantified (i.e. the real problem that limits work in this area is the practical problems pf the curse of dimensionality and finite data). All claims that are not qualified are and should be suspect.

The authors will likely face this same criticism when they go to write this up for a journal — the field is repleat with these examples.

At least I agree with you that math is not wrong.
John Langford says:

7/24/2005 at 8:30 am

There are two ways this discussion can be resolved.

1) You can read, understand, and reconcile the mathematics with your intuitions. Note the assumptions are more explicit in comment (6) and even more explicit in comment in the paper. (The only reason for vagueness in the original post is that the precise assumptions varied across the work summarized.)

2) You can come up with a formal model of your own in which you can prove your intuitions are valid, and then anyone can do the reconciliation.
Paul Deignan says:

7/25/2005 at 2:45 pm

John, I don’t think you are either understanding or accepting the point that I am trying to make. Let me be more direct.

Each year there are thousands of papers in datamining that are published. This is the nature of the work–we have an illposed problem, hence a multitude of possible solutions. Although all of these papers have some demonstrated value (after all, someone wrote them and someone else published them), very few have real value to the field. (If they did, our datamining problems would be solved by now.)

How do we filter the constructive ideas from the ideas that lack potential? That’s were you come in as a blog writer (as opposed to your classroom instructor duties). As the proponent of a work, you can describe its utility or validity. You attempted to do some of this in summarizing the paper. That got my interest. But that is not the end of it.

The review of each paper takes about 2-4 hours depending on its breadth and depth. If we are going to invest that sort of time (and not just me–everyone reading this blog), there should be a good reason–a positive utility expectation. We need to query and feedback the analysis of these papers in the process of research, otherwise the community of researchers would make little progress. This is the meta-datamining process involved in the generation of new knowledge.

The primary utility of the blog format is to allow a venue where this meta-process can be done efficiently. If I ask a question and you give a response, others can see the process in full view. They can refine their own meta-processes and adjust their expected utility functions. We can all learn how to better learn. Perhaps some other reader could also jump in with an observation so that the risk and cost of learning is distributed among experts. This format also allows that process and hopefully without the crutches of appeals to authority that constrain free academic discourse in the classroom.

So, my input at this stage of the process is to make you aware of ideas that prevent me from investing the time in reading the paper in detail. (I have a negative expected utility.) Now, the paper may or may not be worthwhile. If it is, I also expect that you could easily overcome my objections as the paper’s proponent (if I am efficient in making you aware of my analysis).

Now, what you call intution I call couterexamples. This completes my immediate and meta-datamining processes.
John Langford says:

7/26/2005 at 8:02 pm

You are probably correct—I am probably not understanding your point. There are several difficulties.

1) No counterexample has been stated precisely enough for me to understand it.
2) I strongly suspect the theorem can not be shown false.
John says:

8/18/2005 at 3:42 am

Nice blog.I like this site.
John

Comments are closed.