One standard approach for clustering data with a set of gaussians is using EM. Roughly speaking, you pick a set of k random guassians and then use alternating expectation maximization to (hopefully) find a set of guassians that “explain” the data well. This process is difficult to work with because EM can become “stuck” in local optima. There are various hacks like “rerun with t different random starting points”.
One cool observation is that this can often be solved via other algorithm which do not suffer from local optima. This is an early paper which shows this. Ravi Kannan presented a new paper showing this is possible in a much more adaptive setting.
A very rough summary of these papers is that by projecting into a lower dimensional space, it is computationally tractable to pick out the gross structure of the data. It is unclear how well these algorithms work in practice, but they might be effective, especially if used as a subroutine of the form:
- Project to low dimensional space.
- Pick out gross structure.
- Project gross structure into the high dimensional space.
- Run EM (or some other local improvement algorithm) to find a final fit.
The effects of steps 1-3 is to “seed” the local optimization algorithm in a good place from which a global optima is plausibly reachable.