ML Symposium and Strata/Hadoop World

The New York ML symposium was last Friday. There were 303 registrations, up a bit from last year. I particularly enjoyed talks by Bill Freeman on vision and ML, Jon Lenchner on strategy in Jeopardy, and Tara N. Sainath and Brian Kingsbury on deep learning for speech recognition. If anyone has suggestions or thoughts for next year, please speak up.

I also attended Strata + Hadoop World for the first time. This is primarily a trade conference rather than an academic conference, but I found it pretty interesting as a first-time attendee. This is ground zero for the Big Data buzzword, and I now see why. It’s about data, and the word “big” is so ambiguous that everyone can lay claim to it. There were essentially zero academic talks. Instead, the focus was on war stories, product announcements, and education. The general level of education is much lower: explaining Machine Learning to the SQL educated is the primary operating point. Nevertheless, that’s happening, and the fact that machine learning is considered a necessary technology for industry is a giant step for the field. Over time, I expect the industrial side of Machine Learning to grow, and perhaps surpass the academic side, in the same sense as has already occurred for chip design. Amongst the talks I could catch, I particularly liked the Github, Zillow, and Pandas talks. Ted Dunning also gave a particularly masterful talk, although I have doubts about the core Bayesian Bandit approach(*). The streaming k-means algorithm they implemented does look quite handy.

(*) The doubt is the following: prior elicitation is generally hard, and Bayesian techniques are not robust to misspecification. This matters in standard supervised settings, but it may matter more in exploration settings where misspecification can imply data starvation.

NYAS ML 2012 and ICML 2013

The New York Machine Learning Symposium is October 19, with 2-page abstracts due September 13 via email with subject “Machine Learning Poster Submission” sent to physicalscience@nyas.org. Everyone is welcome to submit. Last year’s attendance was 246 and I expect more this year.

The primary experiment for ICML 2013 is multiple paper submission deadlines with rolling review cycles. The key dates are October 1, December 15, and February 15. This is an attempt to shift ICML further towards a journal style review process and reduce peak load. The “not for proceedings” experiment from this year’s ICML is not continuing.

Edit: Fixed second ICML deadline.

ICML survey and comments

Just about nothing could keep me from attending ICML, except for Dora who arrived on Monday. Consequently, I have only secondhand reports that the conference is going well.

For those who are remote (like me) or after the conference (like everyone), Mark Reid has set up the ICML discussion site where you can comment on any paper or subscribe to papers. Authors are automatically subscribed to their own papers, so it should be possible to have a discussion significantly after the fact, as people desire.

We also conducted a survey before the conference and have the survey results now. This can be compared with the ICML 2010 survey results. Looking at the comparable questions, we can sometimes order the answers to have scores ranging from 0 to 3 or 0 to 4 with 3 or 4 being best and 0 worst, then compute the average difference between 2012 and 2010.
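The score comparison described above can be sketched in a few lines. This is only an illustration: the response counts below are hypothetical (not the actual survey data), the helper name `mean_score` is made up here, and it assumes "average difference" means the 2012 mean score minus the 2010 mean score.

```python
def mean_score(counts):
    """Average score when counts[s] respondents chose the answer scored s."""
    total = sum(counts)
    return sum(score * n for score, n in enumerate(counts)) / total


# Hypothetical response counts for one question, ordered worst (0) to best (3).
counts_2010 = [5, 20, 50, 25]
counts_2012 = [4, 18, 52, 26]

# Positive means the 2012 answers were better on average than the 2010 ones.
difference = mean_score(counts_2012) - mean_score(counts_2010)
print(f"2012 minus 2010: {difference:+.3f}")
```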

Glancing through them, I see:

  1. Most people found the papers they reviewed a good fit for their expertise (-.037 w.r.t. 2010). Achieving this was one of our subgoals in the pursuit of high quality decisions.
  2. Most people had sufficient time for doing reviews. This was something that we worried about significantly in shifting the paper deadline and otherwise massaging the schedule. Most people also thought the review period was sufficiently long and most reviews were high quality (+.023 w.r.t. 2010).
  3. About 1/4 of reviewers say that author response changed their mind on a paper and 2/3 of reviewers say discussion changed their mind on a paper. The expectation of decision impact from author response is reduced from 2010 (-.135). The existence of author response is overwhelmingly preferred.
  4. People generally found ICML reviewing the same or better than previous ICMLs (+.35 w.r.t. 2010) and other similar conferences (+.198 w.r.t. 2010) at the cost of being somewhat more work. A substantial bump in reviewing quality was a primary goal.
  5. The ACs spent substantially more time (43 hours on average) than PC members (28 hours on average). This agrees with our expectation—the set of ACs didn’t change even after we had a 50% increase in submissions. The AC load we had this year was probably too high and will need to be reduced somewhat for next year.
  6. 2/3 of authors prefer the option to revise a paper during author response.
  7. The choice of how to deal with increased submissions is deeply undecided, with a slight preference for short talk+poster as we did.
  8. Most people like having two workshop days or don’t care.
  9. There is a strong preference for COLT and UAI colocation with the next tier of preference for IJCAI, KDD, AAAI, and CVPR.

ICML acceptance statistics

People are naturally interested in slicing the ICML acceptance statistics in various ways. Here’s a rundown for the top categories.

18/66 = 0.27 in (0.18, 0.36) Reinforcement Learning
10/52 = 0.19 in (0.17, 0.37) Supervised Learning
9/51 = 0.18 not in (0.18, 0.37) Clustering
12/46 = 0.26 in (0.17, 0.37) Kernel Methods
11/40 = 0.28 in (0.15, 0.4) Optimization Algorithms
8/33 = 0.24 in (0.15, 0.39) Learning Theory
14/33 = 0.42 not in (0.15, 0.39) Graphical Models
10/32 = 0.31 in (0.15, 0.41) Applications (+5 invited)
8/29 = 0.28 in (0.14, 0.41) Probabilistic Models
13/29 = 0.45 not in (0.14, 0.41) NN & Deep Learning
8/26 = 0.31 in (0.12, 0.42) Transfer and Multi-Task Learning
13/25 = 0.52 not in (0.12, 0.44) Online Learning
5/25 = 0.20 in (0.12, 0.44) Active Learning
6/22 = 0.27 in (0.14, 0.41) Semi-Supervised Learning
7/20 = 0.35 in (0.1, 0.45) Statistical Methods
4/20 = 0.20 in (0.1, 0.45) Sparsity and Compressed Sensing
1/19 = 0.05 not in (0.11, 0.42) Ensemble Methods
5/18 = 0.28 in (0.11, 0.44) Structured Output Prediction
4/18 = 0.22 in (0.11, 0.44) Recommendation and Matrix Factorization
7/18 = 0.39 in (0.11, 0.44) Latent-Variable Models and Topic Models
1/17 = 0.06 not in (0.12, 0.47) Graph-Based Learning Methods
5/16 = 0.31 in (0.13, 0.44) Nonparametric Bayesian Inference
3/15 = 0.20 in (0.07, 0.47) Unsupervised Learning and Outlier Detection
7/12 = 0.58 not in (0.08, 0.50) Gaussian Processes
5/11 = 0.45 not in (0.09, 0.45) Ranking and Preference Learning
2/11 = 0.18 in (0.09, 0.45) Large-Scale Learning
0/9 = 0.00 in [0, 0.56) Vision
3/9 = 0.33 in [0, 0.56) Social Network Analysis
0/9 = 0.00 in [0, 0.56) Multi-agent & Cooperative Learning
2/9 = 0.22 in [0, 0.56) Manifold Learning
4/8 = 0.50 not in [0, 0.5) Time-Series Analysis
2/8 = 0.25 in [0, 0.5) Large-Margin Methods
2/8 = 0.25 in [0, 0.5) Cost Sensitive Learning
2/7 = 0.29 in [0, 0.57) Recommender Systems
3/7 = 0.43 in [0, 0.57) Privacy, Anonymity, and Security
0/7 = 0.00 in [0, 0.57) Neural Networks
0/7 = 0.00 in [0, 0.57) Empirical Insights
0/7 = 0.00 in [0, 0.57) Bioinformatics
1/6 = 0.17 in [0, 0.5) Information Retrieval
2/6 = 0.33 in [0, 0.5) Evaluation Methodology

Update: See Brendan’s graph for a visualization.

I usually find these numbers hard to interpret. At the grossest level, all areas have significant selection. At a finer level, one way to add further interpretation is to pretend that the acceptance rate of all papers is 0.27, then compute a 5% lower tail and a 5% upper tail for each category. With 40 categories and 10% total tail probability, we expect to have about 4 violations of tail inequalities. Instead, we have 9, so there is some evidence that individual areas are particularly hot or cold. In particular, the hot topics are Graphical Models, NN & Deep Learning, Online Learning, Gaussian Processes, Ranking and Preference Learning, and Time-Series Analysis. The cold topics are Clustering, Ensemble Methods, and Graph-Based Learning Methods.
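For anyone who wants to redo this kind of check, here is a minimal sketch using exact binomial tails. The function names are made up for illustration, and the endpoint convention (which counts fall inside versus outside the interval) is an assumption, so boundary cases may not match the table above exactly.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k acceptances out of n at acceptance rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def tail_bounds(n, p=0.27, alpha=0.05):
    """Return counts (lo, hi) with P[X < lo] <= alpha and P[X > hi] <= alpha."""
    pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
    lo, cum = 0, 0.0
    for k in range(n + 1):  # walk up from 0 while the lower tail stays small
        if cum + pmf[k] > alpha:
            lo = k
            break
        cum += pmf[k]
    hi, cum = n, 0.0
    for k in range(n, -1, -1):  # walk down from n while the upper tail stays small
        if cum + pmf[k] > alpha:
            hi = k
            break
        cum += pmf[k]
    return lo, hi


def classify(accepted, submitted, p=0.27, alpha=0.05):
    """Label a category as 'cold', 'hot', or 'typical' under the null rate p."""
    lo, hi = tail_bounds(submitted, p, alpha)
    if accepted < lo:
        return "cold"
    if accepted > hi:
        return "hot"
    return "typical"


print(classify(1, 19))   # Ensemble Methods
print(classify(18, 66))  # Reinforcement Learning
```

Each category has at most a 10% chance of landing in a tail under the null hypothesis, which is where the expectation of about 4 violations out of 40 comes from.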

We also experimented with AIStats resubmits (3/4 accepted) and NFP papers (4/7 accepted) but the numbers were too small to read anything significant.

One thing that surprised me was how uniform decisions were as a function of average score in reviews. All reviews included a decision from {Strong Reject, Weak Reject, Weak Accept, Strong Accept}. These were mapped to numbers in the range {1,2,3,4}. In essence, average review score < 2.2 meant 0% chance of acceptance, and average review score > 3.1 meant acceptance. Due to discretization in the number of reviewers and review scores there were only 3 typical uncertain outcomes:

  1. 2.33. This was either 2 Weak Rejects+Weak Accept or Strong Reject+2 Weak Accepts or (rarely) Strong Reject+Weak Reject+Strong Accept. About 8% of these papers were accepted.
  2. 2.67. This was either Weak Reject+2 Weak Accepts or Strong Accept+2 Weak Rejects or (rarely) Strong Reject+Weak Accept+Strong Accept. About 48% of these papers were accepted.
  3. 3.0. This was commonly 3 Weak Accepts or Strong Accept+Weak Accept+Weak Reject or (rarely) 2 Strong Accepts + Strong Reject. About 90% of these papers were accepted.
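The three uncertain averages and their underlying review combinations can be checked by brute force. This is a small sketch assuming exactly 3 reviewers per paper and the {1,2,3,4} mapping described above; the function name is made up for illustration.

```python
from itertools import combinations_with_replacement

SCORES = {"Strong Reject": 1, "Weak Reject": 2, "Weak Accept": 3, "Strong Accept": 4}


def combos_with_total(target_total, reviewers=3):
    """All unordered sets of review scores that sum to target_total."""
    values = sorted(SCORES.values())
    return [
        combo
        for combo in combinations_with_replacement(values, reviewers)
        if sum(combo) == target_total
    ]


# Totals of 7, 8, and 9 correspond to averages of 2.33, 2.67, and 3.0.
for total in (7, 8, 9):
    print(round(total / 3, 2), combos_with_total(total))
```

With 3 reviewers, averages are multiples of 1/3, so these are the only values that fall between 2.2 and 3.1.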

One question I’ve always wondered about is: How much variance is there in the accept/reject decision? In general, correlated assignment of reviewers can greatly increase the amount of variance, so one of our goals this year was doing as independent an assignment as possible. If you accept that as independence, we essentially get 3 samples for each paper where the average standard deviation of reviewer scores before author feedback and discussion is 0.64. After author feedback and discussion the standard deviation drops to 0.51. If we pretend that papers have an intrinsic value between 1 and 4, then think of reviews as discretized Gaussian measurements fed through the above decision criteria, we get the following:

There are great caveats to this picture. For example, treating the AC’s decision as random conditioned on the reviewer average is a worst-case analysis. The reality is that ACs are removing noise from the few events that I monitored carefully, although it is difficult to quantify this. Similarly, treating the reviews observed after discussion as independent is clearly flawed. A reasonable way to look at it is: author feedback and discussion get us about 1/3 or 1/4 of the way to the final decision from the initial reviews.
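The thought experiment (intrinsic value, discretized Gaussian reviews, decision by average score) can be approximated with a small Monte Carlo simulation. This is a rough sketch under stated assumptions, not the computation used for this post: it assumes exactly 3 reviews per paper, uses the 0.51 standard deviation directly as the Gaussian noise level, clips scores to {1,2,3,4}, and treats the decision as random given the review total, which is the worst-case view mentioned above. All names are made up for illustration.

```python
import random

# Acceptance rates for the three uncertain review totals (averages 2.33, 2.67, 3.0).
ACCEPT_PROB = {7: 0.08, 8: 0.48, 9: 0.90}


def review(value, sigma, rng):
    """One discretized Gaussian measurement of a paper's value, clipped to 1..4."""
    return min(4, max(1, round(rng.gauss(value, sigma))))


def accept_probability(value, sigma=0.51, trials=20000, seed=0):
    """Estimate the chance a paper of the given intrinsic value is accepted."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        total = sum(review(value, sigma, rng) for _ in range(3))
        if total >= 10:      # average above 3.1: accept
            p = 1.0
        elif total <= 6:     # average below 2.2: reject
            p = 0.0
        else:
            p = ACCEPT_PROB[total]
        accepted += rng.random() < p
    return accepted / trials


for value in (1.5, 2.0, 2.5, 3.0, 3.5):
    print(value, accept_probability(value))
```

Swapping in `sigma=0.64` gives the corresponding curve for scores before author feedback and discussion, under the same assumptions.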

Conditioned on the papers, discussion, author feedback and reviews, ACs are pretty uniform in their decisions with ~30 papers where ACs disagreed on the accept/reject decision. For half of those, the ACs discussed further and agreed, leaving Joelle and me a feasible quantity of cases to look at (plus several other exceptions).

At the outset, we promised a zero-SPOF (single point of failure) reviewing process. We actually aimed higher: at least 3 people needed to make a wrong decision for the ICML 2012 reviewing process to kick out a wrong decision. I expect this happened a few times given the overall level of quality disagreement and quantities involved, but hopefully we managed to reduce the noise appreciably.