One of the remarkable things about machine learning is how diverse it is. The viewpoints of Bayesian learning, reinforcement learning, graphical models, supervised learning, unsupervised learning, genetic programming, etc… share little enough overlap that many people can and do make their careers within one without touching, or even necessarily understanding, the others.
There are two fundamental reasons why this is possible.
- For many problems, many approaches work in the sense that they do something useful. This is true empirically, where for many problems we can observe that many different approaches yield better performance than any constant predictor. It’s also true in theory, where we know that for any set of predictors representable in a finite amount of RAM, minimizing training error over the set of predictors does something nontrivial when there are a sufficient number of examples (a standard bound along these lines is sketched after this list).
- There is nothing like a unifying problem defining the field. In many other areas there are unifying problems for which finer distinctions in approaches matter.
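To make that theoretical claim a bit more concrete, here is the standard finite-class (Occam's razor) bound it is gesturing at; the exact form varies across statements, so treat this as a representative sketch rather than the specific result the post has in mind. For a finite set of predictors H and m i.i.d. examples, Hoeffding's inequality plus a union bound give, with probability at least 1 - δ:

```latex
% Finite-class generalization bound: true error vs. training error.
\[
\forall h \in H:\qquad
\bigl|\,\mathrm{err}(h) - \widehat{\mathrm{err}}(h)\,\bigr|
\;\le\;
\sqrt{\frac{\ln|H| + \ln(2/\delta)}{2m}}
\]
```

So minimizing training error over H returns a predictor whose true error is within twice this slack of the best predictor in H, which is nontrivial once m comfortably exceeds ln|H|; any predictor representable in a finite amount of RAM lives in such a finite H.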
The implications of this observation agree with inspection of the field.
- Particular problems are often “solved” by the first technique applied to them. This is especially common when some other field first adopts machine learning, as people there may not be aware of other approaches. A popular example of this is Naive Bayes for spam filtering. In other fields, the baseline method is a neural network, an SVM, a graphical model, etc…
- The analysis of new learning algorithms is often subject to assumptions designed for the learning algorithm. Examples include large margins for support vector machines, a ‘correct’ Bayesian prior, correct conditional independence assumptions for graphical models, etc… Given such assumptions, it’s unsurprising to learn that the algorithm is the right method, and justifying a new algorithm becomes an exercise in finding a natural-sounding assumption under which the algorithm performs well. This assumption set selection problem is the theoretician’s version of the data set selection problem.
A basic problem is: How do you make progress in a field with this (lack of) structure? And what does progress even mean? Some possibilities are:
- Pick an approach and push it. [Insert your favorite technique] everywhere.
- Find new real problems and apply ML. The fact that ML is easy means there is a real potential for doing great things this way.
- Find a hard problem and work on it. Although almost anyone can do something nontrivial on most problems, achieving best-possible performance on some problems is not at all easy.
- Make the problem harder. Create algorithms that work fast online, in real time, with very few or no labeled examples, but very many examples, very many features, and very many things to predict.
I am least fond of approach (1), although many people successfully follow approach (1) for their career. What’s frustrating about approach (1) is that there does not seem to be any single simple philosophy capable of solving all the problems we might recognize as machine learning problems. Consequently, people following approach (1) are at risk of being outpersuaded by someone sometime in the future.
Approach (2) is perhaps the easiest way to accomplish great things, and in some sense many advances come from new applications.
Approach (3) seems solid, promoting a different kind of progress than approach (2).
Approach (4) seems particularly cool to me at the moment. It is not as specialized as (2) or (3), and many of the constraints seem complementary. For example, large scale learning essentially means online learning.
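To illustrate the “large scale learning = online learning” point, here is a minimal sketch of online logistic regression trained by stochastic gradient updates: one pass, one example at a time, constant memory in the number of examples. The stream, dimension, and learning rate below are made-up placeholders, not a reference to any particular system.

```python
import numpy as np

def online_logistic_regression(stream, dim, lr=0.1):
    """Single pass over a stream of (x, y) pairs with y in {0, 1}.

    Memory is O(dim) regardless of how many examples arrive, which is
    what makes the approach viable at large scale.
    """
    w = np.zeros(dim)
    for x, y in stream:
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))  # predicted probability
        w -= lr * (p - y) * x                    # gradient step on log loss
    return w

# Hypothetical usage on a synthetic stream.
rng = np.random.default_rng(0)

def synthetic_stream(n=10000, dim=20):
    true_w = rng.normal(size=dim)
    for _ in range(n):
        x = rng.normal(size=dim)
        y = int(rng.random() < 1.0 / (1.0 + np.exp(-np.dot(true_w, x))))
        yield x, y

w = online_logistic_regression(synthetic_stream(), dim=20)
```

The point of the sketch is the shape of the loop: nothing about it requires the data set to fit in memory, which is why the large-scale and online constraints line up so naturally.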
A similar line-up for ML vendors and customers would be interesting; the list is similar, but not identical. Most companies focus on a combination of (1) and (2), with (4) forced upon them. We find we’re always facing semi-supervised, many-outcome, ill-defined problems that have to scale.
Breck likes to tell customers that there’s “no magic pixie dust”. That is, don’t worry about the fine details of (3).
I don’t mind researchers focusing on one approach (1) — it doesn’t seem optimal for everyone to be a general practitioner. How else do we squeeze all of the value out of an approach? Almost(?) all currently practical approaches to ML take massive amounts of “tuning” either in terms of features/predictors or parametric settings.
Because of the fiddly issues of optimizing even one approach, I don’t like evaluations scripted as: “I implemented techniques X, Y, Z, …, evaluated on data set A, B, C,… with settings i, j, k… and system X outperformed all the others 13/17 times.” You can’t even trust the systems for which the authors are specialists because you’re often seeing post-hoc optimal setting performance reported.
Public bakeoffs with real held-out data are better at comparing approaches, but subject to extremely high variance in terms of participation (are the best people even in the game?) and time limitations (we often just submit baseline entries because of time). And they often focus on new problems, so everything’s preliminary.
Your observation is very true! One consequence of machine learning being easy is that in papers introducing a new method, the final section showing how the new method solved an important practical problem, producing results ten times better than what the application-area experts had been able to achieve, is actually of no interest, at least from the point of view of evaluating the merits of the new method. But it’s hard to break journals of the habit of wanting such a final section (or, more commonly, a watered-down “practical” application on some stale data set).
No small part of the problem is that the education that students receive in CS departments is nearly completely irrelevant to machine learning. I think that the field would benefit greatly if the practitioners had more mathematical maturity. We would see a lot more generalization and perhaps the myriad different subfields with their own terminology would just go away.
Approach (2), finding new applications, is really fun and often leads to work along the lines of approach (4), “making the problem harder”, in order to reach far enough into the practical settings where additional constraints (time, lack of labels, budgets, etc.) exist.
As someone who is in category 2, and thus somewhat of a “consumer” of ML techniques, I wonder how you recommend identifying the most appropriate ML technique for problems. Most books tend to focus on the techniques themselves, not necessarily the pros and cons of each for different problems.
Approach 1 is undoubtedly the best because it is approach-centric. It is new approaches, rather than new problems, that represent concrete progress in the field. We can find all the “hard problems” we want, but if we simply keep applying the same approaches to them, we are going nowhere. It is when an approach is pushed to a new level that an advance occurs. Thus “push it” is the right answer.
In fact, push it even if it doesn’t outperform some other approach. Each chain of advances along any line of approach represents a trajectory through the search space of machine learning algorithms. No one has the oracle to say that such a chain will not lead to greatness even if the current link in the chain is less robust than some other link in some very different chain. As a research field, it’s where we’re going that matters more than where we are. We should not forget that. Leave the trivia of whether X outperforms Y to the practitioners. The job of the researcher is to convert X to X’ regardless of what Y does.
Otherwise, the entire field is nothing but a naive hill-climbing search through the space of algorithms based on performance alone. You’d think we would see the folly in that. Let us encourage without bias smart people to trust their gut instincts and push their favored approach to its limit. Then we have a parallel search of much greater fortitude.
Another issue with public bakeoffs is that in trying to be as fair and clean as possible, they tend to rely on over-simplistic metrics, typically a real-valued score such that it’s easy to rank systems.
Typical examples of this are multiply-averaged precision measures (such as mean average precision) used e.g. in IR, or BLEU scores used in machine translation evaluations.
This encourages tweaking and metric gaming at the expense of designing methods/models with new useful features (how do you assign a score to having an interpretable classification decision?).
It also promotes the notion of statistically significant difference, at the expense of practically significant difference.
Actually both points are potential pitfalls for approach (3) above: achieving best-possible performance assumes that there is a clearly defined performance measure that is relatively easy to compute, and that achieving best performance, as opposed to being within a fraction of best performance, actually makes a difference.
As a practitioner I find that the effectiveness of classifier techniques is overwhelmingly dependent on the extraction, pre-processing, and derivation of good feature data.
In fact, recently I was involved in a grad class (http://ecee.colorado.edu/~fmeyer/Courses/ecen5012/ECEN_5012.html) that explored dimensionality reduction, both random projections and Isomap/Laplacian eigenmaps. For a project we broke up into groups and set out to beat Pampalk’s (http://www.pampalk.at/ma/) score on the music genre classification problem (MIREX). As a class we attempted SVMs, Laplacian eigenmaps with linear classification, KNN, Naive Bayes, neural nets, decision tree learning, and several more I can’t remember. The interesting thing was we all used the same basic feature set: mel-frequency cepstral coefficients (MFCCs). This was due to three factors: (1) Pampalk had code that extracted the data, (2) the MFCCs were very effective, and (3) we had limited time to complete the project. There were a few other attempts at feature extraction, but they were very similar to MFCCs.
The classification choices were all based on the students’ backgrounds. The interesting thing was that the results were all very close: ~89-91% correct genre selection. Pretty good, right? Well, Pampalk achieved 95% with a basic KNN. The difference was overwhelmingly due to his work on feature extraction. He used the MFCC features to start, then clustered the songs based on their mean and covariance, and then applied a very computationally intensive Earth Mover’s Distance algorithm to determine the distance between each pair of songs.
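For concreteness, here is a rough sketch of the kind of pipeline described above: per-song MFCC summaries (mean plus covariance) fed to a KNN classifier. The libraries (librosa, scikit-learn), the helper names, and the (audio_path, genre) input format are my own assumptions for illustration, and the Earth Mover’s Distance step is deliberately omitted; this is not Pampalk’s code or the class’s code.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def song_features(path, n_mfcc=20):
    """Summarize a song by the mean and covariance of its MFCC frames."""
    y, sr = librosa.load(path, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)
    mean = mfcc.mean(axis=1)
    cov = np.cov(mfcc)
    # Flatten the mean and the upper triangle of the covariance into one vector.
    iu = np.triu_indices(n_mfcc)
    return np.concatenate([mean, cov[iu]])

def train_genre_knn(labeled_songs, k=5):
    """labeled_songs: list of (audio_path, genre) pairs (hypothetical format)."""
    X = np.array([song_features(path) for path, _ in labeled_songs])
    y = [genre for _, genre in labeled_songs]
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, y)
    return clf
```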
The basic principle I use is to select a classifier based on its operational advantages: for example, the amount of labeled training data, the CPU/memory load, and the adaptive nature of the algorithm. Oh yeah, and make sure the data doesn’t violate the core assumptions of the algorithm… for the most part. For example, I at times apply Naive Bayes even though I know the data is correlated and isn’t Gaussian-distributed. Why? Because the CPU/memory load is very manageable and the algorithm is very adaptive.
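As a sketch of that trade-off, here is what a cheap, adaptive Naive Bayes setup can look like using scikit-learn’s GaussianNB with partial_fit, so the model updates incrementally as new labeled batches arrive. The batch sizes, feature dimensions, and labels are placeholders, not anyone’s actual data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Naive Bayes keeps only per-class, per-feature sufficient statistics,
# so memory stays small and partial_fit lets it adapt to new batches.
classes = np.array([0, 1])            # placeholder label set
clf = GaussianNB()

rng = np.random.default_rng(0)
for batch in range(10):               # pretend these batches arrive over time
    X = rng.normal(size=(100, 5))     # placeholder features
    y = rng.integers(0, 2, size=100)  # placeholder labels
    clf.partial_fit(X, y, classes=classes if batch == 0 else None)

print(clf.predict(rng.normal(size=(3, 5))))
```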