Manifold based dimension-reduction algorithms share the following general outline.

Given: a metric *d()* and a set of points *S*

- Construct a graph with a point in every node and every edge connecting to the node of one of the
*k*-nearest neighbors. Associate with the edge a weight which is the distance between the points in the connected nodes. - Digest the graph. This might include computing the shortest path between all points or figuring out how to linearly interpolate the point from it’s neighbors.
- Find a set of points in a low dimensional space which preserve the digested properties.

Examples include LLE, Isomap (which I worked on), Hessian-LLE, SDE, and many others. The hope with these algorithms is that they can recover the low dimensional structure of point sets in high dimensional spaces. Many of them can be shown to work in interesting ways producing various compelling pictures.

Despite doing some early work in this direction, I suffer from a motivational problem: Why do we want to recover the low dimensional structure? One answer is “for better data visualization”. This is compelling if you have data visualization problems. However, I don’t — I want to make machines that can better predict the future, which generally appears to be a sound goal of learning. Reducing the dimensionality of a dataset is not obviously helpful in accomplishing this. In fact, doing so violates one of the basic intuitions of applied learning algorithms “avoid double approximation”. (One approximation = the projection into the low dimensional space, another approximation = the classifier learned on that space.)

Another answer is “for robots”. Several people have experimented with using a vision sensor and a dimension reduction technique in an attempt to extract the manifold of pose space. These attempts have not generally worked well, basically because the euclidean distance on pixels is not particularly good at predicting which things are “nearby”. However, we might be able to do considerably better if we learn the distance. At the 1-bit level, we might learn a predictor from image pairs to “nearby” or “far”. Any stream *S* of images *i _{1}, i_{2}, i_{3}, …, i_{n}* can be transformed into a binary problem according to:

*{((i*. In unmath “the binary problem formed by predicting whether images are adjacent in the chain of experience”. (*) A good solution to this binary problem would give us an

_{j},i_{k}),1 – I(j = k+1 or k = j+1): i_{j},i_{k}in S}interesting 1-bit metric. Using regression and counting numbers of transitions might provide a more conventional multibit metric.

This metric, if well solved, has a concrete meaning: the minimum distance in terms of actuator transitions between positions. A shortest path in this space is a sequence of actuator movements leading from a position A to a position B. A projection of this space into low dimensions provides some common format which both the human and the robot can understand. Commanding the robot to go to some location is just a matter of pointing out that location in the low dimensional projection.

This is a possible use for manifold based dimension reduction techniques which I find compelling, if it works out. (Anyone interested in playing with this should talk to Dana Wilkinson who is considering experimenting with this approach.)

(*) We probably would want to tweak the positive/negative ratio to reflect the pattern encountered in usage.

(**) Post tweaked to fix an oversight.

John, your description is quite incorrect.

Most NLDR algorithms (e.g., LLE) do neither 2 nor 3.

mb

You are right (sorry, I was hurried). I will tweak the description a bit.

Could you appeal to the NFL thms as motivation? I have a learning algo that’s good, but not universally, so I want to massage my bad data into a form that’s good for my algo? (FWIW, I agree that low dimensional embedding often seems under motivated, but this might be a reasonable argument.)

I don’t understand how this argument works. The composition of (projection, learning algorithm) is a learning algorithm which implies that any theorem about all learning algorithms applies to the composed algorithm.

Nice blog.I read some of you’re articles and they are really nice.

Clearly the full composition won’t work, but if you have a learning algo you like, and you get data that isn’t likely to work in your learning algo, you can either develop a new learning algo, or you can attempt to map the data into a form that is better for the algo you have right now. So you selectively apply the projection depending on the data that comes in, using your human bias.

… and then of course the composed (human, dimension reduction, learning algorithm) algorithm is a learning algorithm.

One of the issues I find with dimensionality reduction is that the use of kernels in SVMs often does the opposite of dimensionality reduction: the kernels often project into very high-dimensional spaces. How do we reconcile the apparent advantage of increasing dimension (where we can gain linear separability) with the desire to reduce the data to its intrinsic dimensionality? Is there something to be gained by projecting data to a lower dimension, and then projecting it from there into a higher dimension? If not, is dimensionality reduction at all useful for regression/classification problems? Also, it seems as though most dimensionality reduction algorithms do not usually provide a principled way to predict the low-dimensional representation of a newly seen data point without re-solving the entire problem (or usually this is treated as an afterthought).

Let’s assume manifold learning is useless. What about hidden variable models? Two recent papers on parsing have basically improved performance by halucinating additional parse tree structure. That is, there are features that are not observed, and weights are learned on these features (which are inferred from EM).

This strikes me as odd, largely for the same reason that manifold learning strikes me as odd (though the latter has the advantage that it can be trained on unlabeled data, which may or may not be beneficial). Whatever features we’re using to guess the values of the halucinated structure should give us sufficient information to make the prediction without the halucinated structure. It seems that performance benefits, if real, are probably due to the halucinated structure having the effect of conjoining different features. Thoughts?

Another use of manifold learning is to learn a distance function. For example, you have features for each sample, but you have another distance matrix on this data. This distance

measure is very different from any simple metric on the original feature space. So you

need to find a distance function on the original feature space but be consistent with

the given distance matrix.

To answer the title of your essay: “Why Manifold-Based Dimension Reduction Techniques?” I would say compression for one. Right now, there are classifiers operating on vectors in D dimensions (D large). What if you can get the same (or better) classification results using d dimensions? (d

I would say that training a classifier in a high dimensional space can make it easier to find a good linear classifier but at the expense of a potentially very large computational overhead and potentially bigger model size.

On the other hand, kernel based methods allow for the virtual mapping of a low dimension dataset (non linearly separable) into a high dimension feature space without having to explicitly compute with vectors of very large dimensions (the kernel trick). Hence the kernel based method allow for training better classifiers (linear separability in the feature space) without the computational overhead of the explicit mapping.