Concerns about the Large Scale Learning Challenge

The large scale learning challenge for ICML interests me a great deal, although I have concerns about the way it is structured.

From the instructions page, several issues come up:

Large Definition My personal definition of dataset size is:
1. small A dataset is small if a human could look at the dataset and plausibly find a good solution.
2. medium A dataset is mediumsize if it fits in the RAM of a reasonably priced computer.
3. large A large dataset does not fit in the RAM of a reasonably priced computer.
By this definition, all of the datasets are medium sized. This might sound like a pissing match over dataset size, but I believe it is more than that.

The fundamental reason for these definitions is that they correspond to transitions in the sorts of approaches which are feasible. From small to medium, the ability to use a human as the learning algorithm degrades. From medium to large, it becomes essential to have learning algorithms that don’t require random access to examples.
No Loading Time The medium scale nature of the datasets is tacitly acknowledged in the rules which exclude data loading time. My experience is that parsing and loading large datasets is often the computational bottleneck. For example when comparing Vowpal Wabbit to SGD I used wall-clock time which makes SGD look a factor of 40 or so worse than Leon’s numbers only using training time after loading. This timing difference is entirely due to the overhead of parsing, even though the format parsed is a carefully optimized binary language. (No ‘excluding loading time’ number can be found for VW, of course, because loading and learning are intertwined.)
Optimal Parameter Time The rules specify that the algorithm should be timed with optimal parameters. It’s very common for learning algorithms to have a few parameters controlling learning rate or regularization. However, no constraints are placed on the number or meaning of these parameters. As an extreme form of abuse, for example, your initial classifier could be declared a parameter. With an appropriate choice of this initial parameter (which you can freely optimize on the data), training time is zero.
Parallelism One approach to dealing with large amounts of data is to add computers that operate in parallel. This is very natural (the brain is vastly parallel at the neuron level), and there are substantial research questions in parallel machine learning. Nevertheless it doesn’t appear to be supported by the contest. There are good reasons for this: parallel architectures aren’t very standard yet, and buying multiple computers is still substantially more expensive than buying the RAM to fit the dataset sizes. Nevertheless, it’s disappointing to exclude such a natural avenue. The rules even appear unclear on whether or not the final test run is on an SMP machine.

As a consequence of this design, the contest prefers algorithms that load all data into memory then operate on it. It also essentially excludes parallel algorithms. These design decisions discourage large scale algorithms (where large is as defined above) in favor of medium scale learning algorithms. The design also favors highly parameterized learning algorithms over less parameterized algorithms, which is the opposite of my personal preference for research direction.

Many of these issues are eliminatable or at least partially addressable. Limiting the parameter size to ’20 characters on the commandline’ or in some other reasonable way seems essential. It’s probably too late to get large datasets, but using wall-clock time would at least avoid bias against large scale algorithms. If the final evaluation is going to take place on an SMP machine, at least detailing that would be helpful.

Despite these concerns, it’s important to be clear that this is an interesting contest. Even without any rule changes, it’s outcome tells us something about which sorts of algorithms work at a medium scale. That’s good information to know if you are interested in tackling larger scale algorithms. The datasets are also large enough to break every Theta(m²) algorithm. We should also respect the organizers: setting up any contest of this sort is quite a bit of work that’s difficult to nail down perfectly in advance.

update: Soeren has helped setup an SMP parallel track which address some of the concerns above. See the site for details, and see you there.

14 Replies to “Concerns about the Large Scale Learning Challenge”

hal says:

5/1/2008 at 7:57 am

I’d like to second your concerns…. I was actually going to blog about this myself, but never got around to it. The other thing that I think is missing is a large, high dimensional data set. The only data set with even a modest number of features is the webspam data, but it’s really quite small (by today’s standards). We (NLPers) frequently have to deal with 100k+ features and 10m+ data points and it would be very easy to get lots of data sets that fit this mold.
Soeren Sonnenburg says:

5/1/2008 at 12:01 pm

Note that the number of features really depends a lot on the data
representation. The 55 million DNA dataset may actually be over a
billion dimensional (using the weighted degree kernels feature
representation – that is know to work best so far). For webspam we just
decided to use 3 grams… and the raw strings for the training set are
10GB in size. So we are close to what fits in memory and what we can
distribute. Still I would like to learn about such *public* datasets for
next time (I guess challenges with such big datasets require you sending
in harddisks), very much appreciated.
Elad Yom-Tov says:

5/1/2008 at 12:03 pm

As you write in your blog entry, designing a competition which properly judges all possible systems is difficult, especially at for the first time. Let me try and explain some of the considerations we had when designing this competition, which might either address your concern or open the possibility for suggesting improvements.

One of the models we looked at when we organized the competition was the Text REtrieval Conference (TREC). Each year NIST organizes a conference (but really, a competition) which tests different information retrieval systems against each other. During the TREC meeting the most interesting approaches are discussed in talks and posters, so that they are disseminated to all members of the community. Over the years, TREC has had a significant impact on the accuracy of information retrieval systems. If we are successful, the large scale competition will help achieve something similar.

In my view, one of the lessons of TREC is that one should have clear and easy to optimize goals. They don’t necessarily address all possibilities, and indeed, sometimes change after submission of the results, to the dissatisfaction of some of the participants. Because of this, we chose to have two tracks at the competition: An SVM track, which is very narrow in scope and has limited room for maneuver, and an open track, where any system can participate. Granted, parallel systems are not addressed as prominently as they deserve, but if there is enough interest (and enough submissions), we can try to highlight interesting comparisons of parallel solvers.

When selecting the datasets we used Bottou and Bousquet’s (NIPS 2007) definition for large-scale data, namely, that a small-scale learning problem is constrained by the maximal number of training examples, whereas a large-scale learning problem is limited by the maximal computing time (and resources) available. We hope that a few million of examples will suffice to present a significant time constraint to algorithms, but if we find that this is not the case, we’ll have to go to larger data sets next year…

Finally, there are arguments for and against discarding loading time: you were very correct in stating why loading time should be used, and as a developer of software I am well aware of the importance of efficient I/O. The reason for not using loading times is that it focuses on the complexity of the algorithm, and less on its implementation. This year we chose the latter.

This competition is somewhat of an experiment for us. We are very hopeful that it will create enough interest this year, and will blossom into a yearly competition which will have an effect similar to that of TREC in supporting progress the field of large scale classification. If you have ideas on measures which would highlight aspects which we have missed, please do tell us. We will be happy to consider them.

See you in Helsinki!
Soeren Sonnenburg says:

5/1/2008 at 12:04 pm

Dear John,

first of all thanks for you interest and comments to the Large Scale
Learning Challenge. We had to make many decision while designing the
challenge and they all have side effects. In an ideal world we wouldn’t
have to do this. But including a subjective time efficiency term (as
opposed to a fixed accuracy measure) makes everything just much more
difficult, so I hope to answer some of your concerns:

Ã¢â‚¬Â¢ Large Scale Definition:
Ã¢Ë†Ëœ It would be great if we could come up with a LS definition to agree
on. While I like your idea of categorizing by paradigm shift, I thought
more along the lines of:
Ã¢â‚¬Â£ One may define a large scale problem to be a problem which to solve it
reaches current computers limits be it computational, memory or transfer
costs wise. For machine learning this translates to high effort
algorithms (e.g.O(N^3)), large number of data points or high
dimensionality.
Ã¢Ë†Ëœ Why might this definition make sense?
Ã¢â‚¬Â£ Consider for example some image segmentation task. Usual practical
datasets have a couple of hundred pictures and can be easily solved by
human beings. In your definition this is small. However for ML methods
this is large scale as the effort of these algorithms is just too high.
Ã¢â‚¬Â£ A dataset that has billions of examples but few dimensions may
actually be easily “solved” by a human just visualizing it in 2d
(though it might be impossible to hold the data in memory at once). So
in your definition it is large and/or small.
Ã¢â‚¬Â¢ So, why did we choose datasets that still fit in memory?
Ã¢Ë†Ëœ Machine Learners usually work on very small scale problems. As we want
to get ML people to participate, and not exclude all the classic ML
(i.e. not online) methods this became a requirement.
Ã¢Ë†Ëœ Nowadays 64GB machines are affordable (e.g. MPI Tuebingen has a couple
of those). So for most problems data fits in memory.
Ã¢Ë†Ëœ It is very difficult to find datasets of size >64G, where really all
the data is required to get best performance.
Ã¢Ë†Ëœ It is close to impossible to distribute such big datasets (we already
ask big companies who all didn’t help us distributing the current ones
due to legal concerns).
Ã¢Ë†Ëœ Dataset size to a large extend depends on data representation, e.g.
storing the splice dataset as SVM light format leads to a dataset of
100G in size (which we have in the challenge btw). Otherwise it is just
a few GB and could probably be reduced to 10^6 dimensional feature space on DNA 🙂
Ã¢â‚¬Â¢ Loading Time:
Ã¢Ë†Ëœ Loading time depends on a lot of factors which we find hard to control
at all. There is disk speed, operating system, memory size and how much
of the system memory is used as cache. There is network speed and then
again data representation, implementation and so on. For example shogun
parses svm-light format several times faster than svmlight – but does
that make the method (svm-light) any faster? I agree that the situation
changes completely if you cannot keep the whole dataset in memory and
multiple passes through the data are necessary (the algorithm will
completely change to be efficient…). Still I don’t see how we could
come up with fair, comparable timings for this. It is then not only CPU
speed/cache size but also lots of other factors hard to control.
Nevertheless it would be great to do this another time, but it requires
some serious thinking how this could be done maybe by imposing an
artificial memory limit (say a G and a G of file cache).
Ã¢â‚¬Â¢ Optimal Parameter time:
Ã¢Ë†Ëœ We discussed that point for a long time but did not find a 100%
satisfactory solution. So we decided to exclude pre-processing time (if
it is not the dominant part time-wise). Also the pre-processing step has
to be mentioned in the abstract later. So your extreme example is a very
expensive feature selection strategy and the time it takes needs to be
included in the training time. So why didn’t we exclude this completely?
Normally people explore a small subset of the data first then draw some
conclusions on what methods/features make sense. How would you exclude
that expert knowledge? Note that this is only allowed in the wild
competition, not the SVM track where a predefined features have to be
used. Also in the end it is still performance that counts most (simply
being able to process such big datasets and to improve with more data
helps: if you are >5% better in relative performance to the other
competitors you will receive a high score). Unfortunately it is hard to
draw the line here. For model selection we could add this no more than X
parameters limitation to the instructions. However I don’t think that
anyone will perform model selection over thousands or so of
hyper-parameters on such large datasets and in case someone manages to
do this in time for the deadline he deserves to win (so I would claim
there is little danger that highly parameterized algorithms are favored
given the tight deadline). However preprocessing, feature selection,
choice of representation might all give you a big leap..
Ã¢â‚¬Â¢ Parallelism:
Ã¢Ë†Ëœ We would have wanted this but again way to hard to evaluate for now.
You are still allowed to submit in the challenge although we won’t be
able to do a re-evaluation. The only way to make it fair IMHO is to get
a cluster accessible for all participants and defining to use MPI on a
fixed number of CPUs and then measure wall clock time. Well maybe we
could also do this for shared memory archs when quad/oct whatever cores
are more widespread (next year?). But for the current challenge how
would you compare a single CPU algorithm to a parallelized one? Measure
wall-clock time again or the sum of the computational time on each of
the processors?
Ã¢â‚¬Â¢ wall-clock time:
Ã¢Ë†Ëœ Just to clarify we ask for the wall-clock training time (excluding
data loading time though).
Ã¢â‚¬Â¢ Finally we are organizing this challenge for the first time (there
hasn’t been one on that topic before). Unfortunately we did not get much
feedback on the LS-NIPS’08 workshop, where we clearly stated that we
only take into account training time. Note that no one wanted to
organize a parallel track or wall-clock-time track, so we simply focused
on the simplest thing. Still it is good that we get some feedback now
(although it might be too late for the current challenge). But we could
do another one (next year?) with a shifted focus – and as before we are
still looking for co-organizers with bright ideas *hint*, *hint*.

Soeren
Olivier Chapelle says:

5/1/2008 at 4:06 pm

John,
I agree with some of your concerns and it is a good idea to express them. On the other hand, we have to keep in mind that it is a very difficult challenge to organize (not the traditional test error as a performance measure) and there will always be arbitrary decisions to make. But even if it’s not perfect, I think that anyone interested in large scale learning should participate in this challenge: there will be a lot to learn from the results.

– Small vs large scale learning: I wouldn’t make the difference according to the size of the dataset, but more according to how long is the training (cf the NIPS’07 paper of Bottou-Bousquet). In particular, a dataset of 1M samples might be small for a linear method (a linear SVM can process it in less than a minute), but large for non-linear methods (if optimal decision boundary is very complex, it might take more than a day to learn).
So I agree with you that for linear methods, the datasets are too small: it is only of very limited practical interest of having a method running in 10s vs 30s. In that case, we want to explore larger datasets than cannot fit into memory and the loading time will be a crucial factor

– Optimal Parameter Time: I don’t see that as a problem. In the end, the competitors have to disclose their methods and no serious researcher would like to look foolish by doing the extreme abuse you mentioned. Since there is no money involved, the main motivation of the competitors will be to make discover to our community an algorithm they think is good.

A final general comment (also addressing the parallelism): in the end, if you want the competition to be successful, you need to have a lot of participants taking part. And since most work published in the literature is about single processor and data that can fit into memory (how unsatisfactory that might be), I think the challenge has to follow this trend to attract as many people as possible. If the competition turns out to be a medium scale one, so let it be.

Thanks to the organizers for all the efforts they have put in this competition and thanks John for your comments: they will certainly be helpful in designing the next competition!
jl says:

5/2/2008 at 7:47 am

On a definition of large as “hitting computational constraints”, I think it’s important to understand that computational constraints can be hit with essentially any dataset by finding an appropriately slow algorithm. Some of these computationally intense algorithms can even be pretty good. For example, testing every possible decision tree and making a final decision according to a carefully weighted vote according to the performance of each tree is severely intense, and it might even work well. So, a definition of large based on hitting computational limits seems questionable to me.

The suggestions I made in the second to last paragraph still seem reasonable to me.
jl says:

5/2/2008 at 8:32 am

On the image segmentation task: I regard the task as writing code which does segmentation (i.e. solving the segmentation prediction problem), rather than simply doing the segmentation on the test set. I think humans aren’t very good at writing this code.

For a very large dataset with very few relevant bits for prediction, I wouldn’t regard it as an inherently large scale learning problem because no benefit accrues from using all the samples.

Imposing constraints tighter than 64GB is easy in the existing framework, because you are already planning to evaluate the top contestants on your own hardware. It’s also reasonable, because ML is often used as part of a larger system and dedicating the entire system to ML is often impractical.

On loading time, you are worried about coming up with a fair comparable timing, yet I thought this was the point of running code on your own machine?

On optimal parameters, I don’t quite understand your response. In general, finding the optimal setting of learning rate, etc… seems to dominate the time to learn for the learning systems I’m familiar with. Finding the “right” initial weight vector seems like it would take the same amount of time. I’m also unclear on why this is regarded as feature selection. I also don’t recall discussion of preprocessing time being added in for the instructions, but I can’t check right now because the server is down. (Have you considered using P2P systems for distributing the data?)

For paralellism, the suggestion in the post seems reasonable: just detail the system you are going to do evaluations on. Wall clock time seems like a fair unbiased judgement for any good system.
jl says:

5/2/2008 at 8:46 am

The current rules seem both against online learning algorithms (because they prefer algorithms which load all the data into RAM) and map-reduce style algorithms (because there is no provision for parallelism) for dealing with large datasets. Both of these approaches seem like credible contenders for large scale learning with substantial research interest behind them.
Gordon Rios says:

5/2/2008 at 6:06 pm

Map-reduce is certainly a “credible contender” as evidenced by it’s widespread use by Google to solve large-scale problems and convincingly defeat all competitors in web search and online ad targeting. I think that the field of ML would be well served by maintaining contact with approaches that have been demonstrated in industry.
Soeren Sonnenburg says:

5/2/2008 at 10:49 pm

As there was limited interest on the LS-NIPS’08 workshop in having a parallel track in the challenge, we made the decision to not support parallel architectures. As a consequence the evaluation criteria are tuned to single cpus – it won’t make sense to apply those 1:1 to parallelized algoritms. But again if someone is willing to take the lead on this and think it all through, we can easily (and would be happy to) create a parallel track. Re’ online algorithms vs. batch: I don’t agree. Online algorithms still have the big advantage of having learned something long before the whole batch of examples is processed.
Soeren Sonnenburg says:

5/2/2008 at 11:09 pm

Re definition: This is a problem oriented definition. No one said that you should find the slowest algorithm of its kind to solve this problem. So the task here would be find an “accurate classifier”, free choice of weapons. If the task is overall very hard it might render the problem large scale even with few examples.

Re suggestions: Limiting the number of parameters makes sense. And (by design decision), yes we don’t punish algorithms that do inefficient data loading reducing cache misses, as data loading times are completely excluded.
jl says:

5/4/2008 at 9:39 am

There are a few definitions of online learning which might be slightly confusing things here. Online optimization approaches certainly enjoy an advantage. Online algorithms operating in an online informational setting appear to be disadvantaged to me, because they generally mix parsing and learning.
Amir Saffari says:

5/27/2008 at 11:51 am

I think one other interesting aspect which could have been used in the challenge is to consider multi-class or even multi-label problems: many real world large scale problems (e.g. image/video/text retrieval/classification) are multi-class, however many algorithms which do not have a natural extension from binary to multi-class cases (e.g. solving it by 1-vs-1, 1-vs-rest, error correcting codes, etc) have to repeat the same algorithm over different representations of data, which again make them not suitable in practice. So this might be also an area to move in next set of challenges.
Pingback: ?????????????? | Canapé

Comments are closed.