Machine Learning (Theory)

6/19/2007

How is Compressed Sensing going to change Machine Learning ?

Compressed Sensing (CS) is a new framework developed by Emmanuel Candes, Terry Tao and David Donoho. To summarize, if you acquire a signal in some basis that is incoherent with the basis in which you know the signal to be sparse in, it is very likely you will be able to reconstruct the signal from these incoherent projections.

Terry Tao, the recent Fields medalist, does a very nice job at explaining the framework here. He goes further in the theory description in this post where he mentions the central issue of the Uniform Uncertainty Principle. It so happens that random projections are on average incoherent, within the UUP meaning, with most known basis (sines, polynomials, splines, wavelets, curvelets …) and are therefore an ideal basis for Compressed Sensing. [ For more in-depth information on the subject, the Rice group has done a very good job at providing a central library of papers relevant to the growing subject: http://www.dsp.ece.rice.edu/cs/ ]

The Machine Learning community has looked at Random Projections before, for instance:

  • Experiments with Random Projections for Machine Learning by Fradkin and Madigan (KDD-03.)
  • Face Recognition Experiments with Random Projection by Goel, Bebis and Nefian
  • Dimensionality reduction by random mapping: Fast similarity computation for clustering by S. Kaski (Proceedings of IEEE International Joint Conference on Neural Networks, 1998.)
  • but while they seem to give somewhat comparable results with regards to PCA, the number of contributions on the subject does not seem overwhelming. Maybe one of the reason is that in most papers cited above, the main theoretical reason for using Random projections lies with the Johnson-Lindenstrauss (JL) lemma. As a consequence, most random matrices used in these publications come from the Database world and not from the newer framework of Compressed Sensing (a list of these matrices and their properties can be found in the middle of this page). The uncanny reliance on Random projections within the JL lemma and in the Compressed Sensing setting was explained by Richard Baraniuk, Mark Davenport, Ronald DeVore, and Michael Wakin in this paper entitled: A simple proof of the restricted isometry property for random matrices. However, the most interesting fallout from this comparison between JL and CS comes in the form of the contribution by Richard Baraniuk and Michael Wakin in Random projections of smooth manifolds. I’ll let the abstract speak for itself:

    We propose a new approach for nonadaptive dimensionality reduction of manifold-modeled data, demonstrating that a small number of random linear projections can preserve key information about a manifold-modeled signal……As our main theoretical contribution, we establish a sufficient number M of random projections to guarantee that, with high probability, all pairwise Euclidean and geodesic distances between points on M are well-preserved under the mapping \phi. Our results bear strong resemblance to the emerging theory of Compressed Sensing (CS), in which sparse signals can be recovered from small numbers of random linear measurements. As in CS, the random measurements we propose can be used to recover the original data in RN. Moreover, like the fundamental bound in CS, our requisite M is linear in the “information level” K and logarithmic in the ambient dimension N; we also identify a logarithmic dependence on the volume and curvature of the manifold. In addition to recovering faithful approximations to manifold-modeled signals, however, the random projections we propose can also be used to discern key properties about the manifold. We discuss connections and contrasts with existing techniques in manifold learning, a setting where dimensionality reducing mappings are typically nonlinear and constructed adaptively from a set of sampled training data.

    It looks as though, as a result, Universal Dimensionality Reduction is achieved by some properly chosen random projections. In the case of data living in a low dimensional manifold, the JL lemma states that the number of random projections is proportional to the number of points or samples from that manifold, on the other hand, CS seems to show that the number of random projections is proportional to the characteristic of the manifold only.

    The results highlighted by Wakin and Baraniuk are very compelling but there is another appealing reason to Random Projections: Robustness. While trying to mimick nature in the learning process, one cannot but be amazed at the reliability of the biological system. On the other hand, even the researchers that model these processes do not realize or point out that this robustness is in part due to random projections. Case in point, the excellent work of Thomas Serre, Aude Oliva and Tomaso Poggio culminating in a paper describing a biology inspired model of brain that shows its ability to process information in a feedforward fashion. The modeling is new in that, in this area of science, there is a central issue as to whether the brain works in a one-way process or with many loops. In the paper, the feature dimension reduction model (which is what this process is) uses random projections as I pointed out recently.

    Because of the intrinsic dimension reduction capability, Mike Wakin has also shown efficient nearest neighbor searches using few random projections (see figure 3 of this paper). I could go on but the point is that since CS is a revolution in the world of signal/feature acquisition and processing (see the analog-to-information A2I site ) one cannot but wonder aloud how this will affect Machine Learning in general.

    19 Comments to “How is Compressed Sensing going to change Machine Learning ?”
    1. jl says:

      My understanding of compressed sensing is relative to error correcting codes.

      In error correcting codes, you take an unstructured object of some number of bits, create a structured object with more bits, then measure only a subset of the bits, and reconstruct the original object. We can think of these steps as Encode, Measure, Reconstruct.

      In compressed sensing, it seems we start with a known structure, make some measurements, and then try to reconstruct the original. This means we keep the Measure and Reconstruct steps from error correcting codes. Since we don’t have any control (in general) over the structure, the process of reconstruction might be substantially more difficult than in error correcting codes. However, it seems that some of the similar intuitions apply.

      The most obvious application of error correcting codes in machine learning is Error Correcting Output Codes. One of the complaints about ECOC systems is that they result in learning problems which are not “natural” and (hence) may be difficult. I’m not sure if compressed sensing can help with that. On one hand, there may exist a natural structure imposed by a learning algorithm/problem pair, but on the other hand it seems that compressed sensing systems prefer measurement with respect to a basis which may still be unnatural. It’s also important to take into account consistency issues which probably don’t come up in normal compressed sensing.

    2. Igor Carron says:

      This is very interesting and I need to look deeper at your point (i.e read it multiple times). In the meantime, Mark Rudelson and Roman Vershynin wrote “Geometric approach to error correcting codes and reconstruction of signals”

      I personally like the idea in the first paper for the following reason. In the MIT paper I mentionned, they laboriously devised a very good way of reducing the dimension of the feature space (which is enormous in the first place: many gabor filters at different location and scales….) through random projections. As I recall they used an SVM to do the learning. The SVM was not supposed to represent any biological entity per se (just a convenient way of doing the classification). If one were to use an ECC classifier then we would touch upon the issue of classification as a “communication problem” (as in your first paper) not unlike how memory is stored and updated in our biological equivalent. This is appealing because our cognitive abilities are known to be dependent on the connectivity of the brain network and a “communication problem” would probably have a good fit to that. Case in point autism:

      Autism and Abnormal Development of Brain Connectivity written by Matthew K. Belmonte, Greg Allen, Andrea Beckel-Mitchener, Lisa M. Boulanger, Ruth A. Carper, and Sara J. Webb, The Journal of Neuroscience, October 20, 2004 • 24(42):9228 –9231

      I also like the robustness aspect of ECC as presented in that paper. What is considered not natural in using ECOC in learning problems, do you have an example in mind ?

      Igor.

    3. Piotr says:

      Nice post! For completeness, a comment: a parallel development about “universal” dimensionality reduction was obtained in

      Piotr Indyk, Assaf Naor, “Nearest neighbor preserving embeddings”, to appear in the ACM Transactions on Algorithms,
      (preprint here)

      It shows that a random projection of any low-dimensional set with “intrinsic dimension” d, into O(d)-dimensional space, preserves an approximate nearest neighbor of any given point with constant probability. The “intrinsic dimension” used here is the so-called “doubling dimension”, used in several earlier papers, incl. Krautghamer-Lee’04, Beygelzimer-Kakade-Langford’06, etc.

    4. rif says:

      I frequently wonder/worry about the practicality of any of these results. In my experience so far, dimensionality reduction via random projections has been one of those areas where the order notation hides some cardinal sins: the theory says I can project from d to O(log d) dimensions and pretty much preserve most distances, but this doesn’t end up meaning that I can project from 1000 to say 20 dimensions and pretty much preserve most distances. (Is it just me, or do the pictures from the single-pixel camera using even 40% as many images as the original look pretty blurry?)

    5. Igor Carron says:

      Rif,

      The theory which you are refering to i.e. projecting from d to log(d) is typically the Johnson-Lindenstrauss result. As I mentioned in the entry and the reason for it, is that compressed sensing goes beyond JL: The reduced dimension scales with the underlying manifold not the number of samples from that manifold. The biggie would be to guess that dimension in the first place.

      With regards to the blurring ( as can be seen here: http://www.dsp.rice.edu/~wakin/pcs-camera.pdf )
      I agree even with the 40% being blurry but the Rice CS camera is not a purely analog camera: i.e the picture is projected on some Haar basis (every pixel shining is a step function over that area). In effect in order to have good reconstruction, one needs to have good incoherence between the target space and the CS signal space. Haar bases are good at decomposing natural images so I would not expect very good incoherence between the Haar basis and the wavelet basis (used for the reconstruction). If the Rice CS camera were to be replicating diracs instead of Haar functions, you’d get less blurry images.
      A similar note was made by Terry Tao in one of the comments in his blog:
      http://terrytao.wordpress.com/2007/04/15/ostrowski-lecture-the-uniform-uncertainty-principle-and-compressed-sensing/#comment-787
      Also you have similar blurry reconstructions from live images from the Random Lens Imager at MIT that uses the compressed sensing framework:
      http://dspace.mit.edu/handle/1721.1/33962
      ( I talked about different hardware implementation of compressed sensing hardware here: http://nuit-blanche.blogspot.com/2007/03/compressed-sensing-hardware.html)
      But as you can see in the excellent Random Lens Imager paper from MIT, one of the reason is that we are not using 3d functions to do the reconstruction as the camera is certainly sensitive to depth as well. Please note the need for Machine Learning techniques to do the calibration of these imagers.

      Igor.

    6. Piotr says:

      Hi,

      Both JL and dimensionality reduction in manifolds utilize similar phenomena, namely concentration of measure. So the constants in the dimension bound are more or less the same; in particular, the dependence on (1+eps)-approximation is 1/eps^2 (this holds even if you just want to embed TWO points). So, if you need very good approximation, the reduced dimension might be quite large. Still, it might be much lower than the original dimension, esp. for images where the original dimension can easily be 1000×1000.

      On the other hand, in compressed sensing (and related “heavy-hitters” algorithms in data streaming) the dependence on eps is actually smaller, for technical reasons. So the practical lessons learned for JL lemma might not extend to this case.

    7. hal says:

      My impression is that there’s more to compressed sensing that just dimensionality reduction though (which seems to be the characterization on the blog). One thing, for instance, that CS people have studied is the difference between L1 and L0 regularization.
      Typically we want to do L0, but this is NP-hard, so we resort to L1 since it typically leads to linear programs, which are convex and
      poly-time solvable. CS people have studied this in a different setting and come up with characterizations that state that in certain
      (essentially over-determined systems) L1 minimization actually will give you the same result as if you had done the NP-hard L0 minimization.

      This is also obviously applicable to ML, but is not just randomized projections.

    8. Igor Carron says:

      Hal,

      Obviously the L1 versus L0 distance replacement is very appealing in a by itself. The small caveat is that it does not seem to hold all the time as initially guessed. I pointed out in a previous entry the newer papers by David Donoho and Victoria Stodde entitled: Breakdown Point of Model When the Number of Variables Exceeds the Observations where there is a nice curve showing a phase transition of some kind where L1 cannot answer the L0 question. But as Piotr was pointing out the heavy hitters from streaming are directly taking a stab at L0 (combinatorial), so framing compressed sensing only in the L0-L1 correspondance would not do justice to it. Also, the main issue is really that the decomposing basis be incoherent with the signal acquisition one, random projections is just one, albeit very nice, example but if you have prior information on the dataset you are looking at, then maybe there is a way to apply compressed sensing that does not require random projections.

      Igor.

    9. Piotr says:

      To followup on Igor’s and Hal’s comments. Indeed, dimensionality reduction is only one part of the equation. Essentially, dimensionality reduction in itself only provides “information theoretic” guarantees: it guarantees that the projected vector Ax contains enough information to recover an approximation to the original vector x. But recovering x from Ax *efficiently* is another matter. The Basis Pursuit approach (a.k.a. L1 minimization) provides a good and efficient way to do that; other methods exist as well.

    10. hal says:

      (incidentally, thanks for the great EMNLP talk, Piotr!)

    11. I am still trying to wrap my head around this and it seems it will take a long time. Anyway, you say:

      In the case of data living in a low dimensional manifold, the JL lemma states that the number of random projections is proportional to the number of points or samples from that manifold, on the other hand, CS seems to show that the number of random projections is proportional to the characteristic of the manifold only.

      From the Baraniuk and Wakin paper I understand that the signal, being sparse in the CS (decoding) basis, can be expressed in a finite number of points to which you can apply the JL lemma. That is how they relate JL and CS. Doesn’t that mean that the number of points needed by JL is proportional to some characteristic of the manifold?

    12. Igor Carron says:

      Muhammad,

      In the paper making the connection between JL and CS, this is what the authors write:

      “.. First, we show an intimate linkage between the CS theory
      and the JL lemma. Second, we exploit this linkage to provide simple proofs for a fundamental CS construct, the so-called Restricted Isometry Property (RIP). In particular, we show how elementary concentration of measure inequalities for random inner products developed for proving the JL lemma in [1] together with simple covering arguments provide a simple and direct avenue to obtain core results for both n-widths and CS. The concentration inequalities are easy to verify for standard probability distributions such as Bernoulli, Gaussian, and other distributions [1]….”

      In other words, techniques used for the JL lemma are also used for CS but seem to provide a tighter bound in that latter case (see also Piotr’s answer in his comment on 2007-06-22 11:43:26 )

      Also please note that the smooth manifold paper is really not about the initial “compressed sensing” which has to do with sparse signals from libraries. They extend the concept to signals living on a manifold. So I would expect that somebody eventually makes the connection between initial CS and the diffusion methods on manifolds of Lafon Lafon , Coifman and Maggioni.

      Igor.

    13. Igor Carron says:

      One more thing. The probability distributions fulfilling the RIP are generally not the ones that the Machine learning community seems to be using. One can see in all the sample papers I mentioned, sparser matrices are generally used ( Dimitris Achlioptas in 2000 ( here)). Terry Tao just came up with a good description on trying to build matrices that have the RIP (or UUP) property. The hope is to be able to build deterministic sparse projection matrices.

      Igor.

    14. [...] A few weeks ago, Igor proposed several leads on hunch.net. To summarize : [...]

    15. Igor Carron says:

      Looks like an interesting outgrowth/parallel of the Compressed Sensing framework in the area of machine learning techniques:

      Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization by Benjamin Recht, Maryam Fazel and Pablo Parrilo

      http://arxiv.org/abs/0706.4138

      Igor.

    16. [...] has been tempered by powerful additions from the algorithms and distributed systems communities: compressed sensing, feature hashing and feature sharding across multiple cores have significantly lowered [...]

    17. [...] has been tempered by powerful additions from the algorithms and distributed systems communities:compressed sensing,feature hashing andfeature sharding across multiple cores have significantly lowered computational [...]

    18. [...] has been tempered by powerful additions from the algorithms and distributed systems communities: compressed sensing, feature hashing and feature sharding across multiple cores have significantly lowered [...]

    Sorry, the comment form is closed at this time.

    Powered by WordPress