Let’s define a learning problem as making predictions given past data. There are several ways to attack the learning problem which seem to be equivalent to solving the learning problem.
- Find the Invariant This viewpoint says that learning is all about learning (or incorporating) transformations of objects that do not change the correct prediction. The best possible invariant is the one which says “all things of the same class are the same”. Finding this is equivalent to learning. This viewpoint is particularly common when working with image features.
- Feature Selection This viewpoint says that the way to learn is by finding the right features to input to a learning algorithm. The best feature is the one which is the class to predict. Finding this is equivalent to learning for all reasonable learning algorithms. This viewpoint is common in several applications of machine learning. See Gilad’s and Bianca’s comments.
- Find the Representation This is almost the same as feature selection, except internal to the learning algorithm rather than external. The key to learning is viewed as finding the best way to process the features in order to make predictions. The best representation is the one which processes the features to produce the correct prediction. This viewpoint is common for learning algorithm designers.
- Find the Right Kernel The key to learning is finding the “right” kernel. The optimal kernel is the one for which K(x, z)=1 when x and z have the same class and 0 otherwise. With the right kernel, an SVM(or SVM-like optimization process) can solve any learning problem. This viewpoint is common for people who work with SVMs.
- Find the Right Metric The key to learning is finding the right metric. The best metric is one which states that features with the same class label have distance 0 while features with different class labels have distance 1. With the best metric, the nearest neighbor algorithm can solve any problem.
Each of these viewpoints seems to be “right”, and each seems to have some utility in it’s context. It also seems important to realize that these are just different versions of the same problem. One consequence of this observation is that “wrapper methods” which try to automatically find a subset of features to feed into a learning algorithm in order to improve learning performance are simply trying to repair weaknesses in the learning algorithm.