Continuizing Solutions

This post is about a general technique for problem solving which I’ve never seen taught (in full generality), but which I’ve found very useful.

Many problems in computer science turn out to be discretely difficult. The best known version of such problems are NP-hard problems, but I mean ‘discretely difficult’ in a much more general way, which I only know how to capture by examples.

  1. ERM In empirical risk minimization, you choose a minimum error rate classifier from a set of classifiers. This is NP hard for common sets, but it can be much harder, depending on the set.
  2. Experts In the online learning with experts setting, you try to predict well so as to compete with a set of (adversarial) experts. Here the alternating quantifiers of you and an adversary playing out a game can yield a dynamic programming problem that grows exponentially.
  3. Policy Iteration The problem with policy iteration is that you learn a new policy with respect to an old policy, which implies that simply adopting the new policy can go very wrong.

For each of these problems, there are “continuized” solutions which can yield smaller computation, more elegant mathematics, or both.

  1. ERM By shifting from choosing a single classifier to choosing a stochastic classifier we can prove a new style of bound which is significantly tighter, easier to state, and easier to understand than traditional bounds in the traditional setting. This is the PAC-Bayes bound idea.
  2. Experts By giving the adversary slightly more power—the ability to split experts and have them fractionally predict one way vs. another, the optimal policy becomes much easier to compute (quadratic in the horizon, or maybe less). This is the continuous experts idea.
  3. Policy Iteration For policy iteration, by stochastically mixing the old and the new policy, we can find a new policy better than the old policy. This is the conservative policy iteration idea.

There is some danger to continuizing. The first and second examples both involve a setting shift, which may not be valid—in general your setting should reflect your real problem rather than the thing which is easy to solve. However, even with the setting shift, the solutions appear so compellingly more elegant that it is hard to not hope to use them in a solution to the original setting.

I have not seen a good formulation of the general approach of continuizing. Nevertheless, I expect to see continuizing in more places and to use it in the future. By making it explicit, perhaps this can be made eaesier.

Explicit Randomization in Learning algorithms

There are a number of learning algorithms which explicitly incorporate randomness into their execution. This includes at amongst others:

  1. Neural Networks. Neural networks use randomization to assign initial weights.
  2. Boltzmann Machines/Deep Belief Networks. Boltzmann machines are something like a stochastic version of multinode logistic regression. The use of randomness is more essential in Boltzmann machines, because the predicted value at test time also uses randomness.
  3. Bagging. Bagging is a process where a learning algorithm is run several different times on several different datasets, creating a final predictor which makes a majority vote.
  4. Policy descent. Several algorithms in reinforcement learning such as Conservative Policy Iteration use random bits to create stochastic policies.
  5. Experts algorithms. Randomized weighted majority use random bits as a part of the prediction process to achieve better theoretical guarantees.

A basic question is: “Should there be explicit randomization in learning algorithms?” It seems perverse to feed extra random bits into your prediction process since they don’t contain any information about the problem itself. Can we avoid using random numbers? This question is not just philosophy—we might hope that deterministic version of learning algorithms are both more accurate and faster.

There seem to be several distinct uses for randomization.

  1. Symmetry breaking. In the case of a neural network, if every weight started as 0, the gradient of the loss with respect to every weight would be the same, implying that after updating, all weights remain the same. Using random numbers to initialize weights breaks this symmetry. It is easy to believe that there are good deterministic methods for symmetry breaking.
  2. Overfit avoidance. A basic observation is that deterministic learning algorithms tend to overfit. Bagging avoids this by randomizing the input of these learning algorithms in the hope that directions of overfit for individual predictions cancel out. Similarly, using random bits internally as in a deep belief network avoids overfitting by forcing the algorithm to learn a robust-to-noise set of internal weights, which are then robust-to-overfit. Large margin learning algorithms and maximum entropy learning algorithms can be understood as deterministic operations attempting to achieve the same goal. A significant gap remains between randomized and deterministic learning algorithms: the deterministic versions just deal with linear predictions while the randomized techniques seem to yield improvements in general.
  3. Continuizing. In reinforcement learning, it’s hard to optimize a policy over multiple timesteps because the optimal decision at timestep 2 is dependent on the decision at timestep 1 and vice versa. Randomized interpolation of policies offers a method to remove this cyclic dependency. PSDP can be understood as a derandomization of CPI which trades off increased computation (learning a new predictor for each timestep individually). Whether or not we can avoid a tradeoff in general is unclear.
  4. Adversary defeating. Some algorithms, such as randomized weighted majority are designed to work against adversaries who know your algorithm, except for random bits. The randomization here is provably essential, but the setting is often far more adversarial than the real world.

The current state-of-the-art is that random bits provide performance (computational and predictive) which we don’t know (or at least can’t prove we know) how to achieve without randomization. Can randomization be removed or is it essential to good learning algorithms?

Incompatibilities between classical confidence intervals and learning.

Classical confidence intervals satisfy a theorem of the form: For some data sources D,

PrS ~ D(f(D) > g(S)) > 1-d

where f is some function of the distribution (such as the mean) and g is some function of the observed sample S. The constraints on D can vary between “Independent and identically distributed (IID) samples from a gaussian with an unknown mean” to “IID samples from an arbitrary distribution D“. There are even some confidence intervals which do not require IID samples.

Classical confidence intervals often confuse people. They do not say “with high probability, for my observed sample, the bounds holds”. Instead, they tell you that if you reason according to the confidence interval in the future (and the constraints on D are satisfied), then you are not often wrong. Restated, they tell you something about what a safe procedure is in a stochastic world where d is the safety parameter.

There are a number of results in theoretical machine learning which use confidence intervals. For example,

  1. The E3 algorithm uses confidence intervals to learn a near optimal policy for any MDP with high probability.
  2. Set Covering Machines minimize a confidence interval upper bound on the true error rate of a learned classifier.
  3. The A2 uses confidence intervals to safely deal with arbitrary noise while taking advantage of active learning.

Suppose that we want to generalize thse algorithms in a reductive style. The goal would be to train a regressor to predict the output of g(S) for new situations. For example, a good regression prediction of g(S) might allow E3 to be applied to much larger state spaces. Unfortunately, this approach seems to fail badly due to a mismatch between the semantics of learning and the semantics of a classical confidence interval.

  1. It’s difficult to imagine a constructive sampling mechanism. In a large state space, we may never encounter the same state twice, so we can not form meaningful examples of the form “for this state-action, the correct confidence interval is y“.
  2. When we think of succesful learning, we typically think of it in an l1 sense—the expected error rate over the data generating distribution is small. Confidence intervals have a much stronger meaning as we would like to apply them: with high probability, in all applications, the confidence interval holds. This mismatch appears unaddressable.

It is tempting to start plugging in other notions such as Bayesian confidence intervals or quantile regression systems. Making these approaches work at a theoretical level on even simple systems is an open problem, but there is plenty of motivation to do so.

Explorations of Exploration

Exploration is one of the big unsolved problems in machine learning. This isn’t for lack of trying—there are many models of exploration which have been analyzed in many different ways by many different groups of people. At some point, it is worthwhile to sit back and see what has been done across these many models.

  • Reinforcement Learning (1). Reinforcement learning has traditionally focused on Markov Decision Processes where the next state s’ is given by a conditional distribution P(s’|s,a) given the current state s and action a. The typical result here is that certain specific algorithms controlling an agent can behave within e of optimal for horizon T except for poly(1/e,T,S,A) “wasted” experiences (with high probability). This started with E3 by Satinder Singh and Michael Kearns. Sham Kakade’s thesis has significant discussion. Extensions have typically been of the form “under extra assumptions, we can prove more”, for example Factored-E3 and Metric-E3. (It turns out that the number of wasted samples can be less than the number of bits required to describe an MDP.) A weakness of all these results is that they rely upon (a) assumptions which are often false for real applications, (b) state spaces are too large, and (c) make a gurantee that is rather weak. Good performance is only guaranteed after suffering the possibly catastrophic consequences of exploration.
  • Reinforcement Learning (2). Several recent papers have been attempting to analyze reinforcement learning via reduction. To date, all results are either nonconstructive or involve the use of various hints (oracle access to an optimal policy, the distribution over states of an optimal policy etc…) which short-circuit the need to explore. Obviously, these hints are not always available for real-world problems.
  • Reinforcement Learning (3). Much of the rest of reinforcement learning has something to do with exploration, but it’s difficult to summarize succinctly.
  • Online Learning. The n-armed bandit setting can be thought of as an MDP with one state and many actions. In some variants, there is even an adversary who chooses the payoffs of the arms in a non-stochastic manner. The typical result here says that you can compete well with the best constant action after some wasted actions. The exact number of wasted actions varies with the precise setting, but it is typically linear in the number of actions. This work can be traced back to (at least) Gittins indices which (unfortunately) don’t seem to have a good description available on the internet.
  • Active Learning(1) The common current use of this term has to do with “selective sampling”=choosing unlabeled samples to label so as to minimize the number of labels required to learn a good predictor (typically a classifier). A typical result has the form: Given that your classifier comes from restricted class C and the labeled data distribution obeys some constraint, the number of adaptively labeled samples required O(log (1/e)) where e is the error rate. (It turns out that the even noisy distributions are allowed.) The constraints on distributions and hypothesis spaces required to achieve these speedups are often severe.
  • Active Learning(2) Membership query learning is another example. The distinguishing difference with respect to selective sampling is that the a labeled can be requested for any unlabeled point (not just those drawn according to some natural distribution). Several relatively strong results hold for membership query learning, but there is a significant drawback: it seems that the ability to query for a label on an arbitrary point is not very natural. For example, imagine query whether a text document is about sports or politics when the text is generated at random.
  • Active Learning(3) Experimental design (which is mostly based in statistics), is often about finding the extrema of some function rather than generalization. Often, the data generating distribution is assumed to come from some specific parametric family. Unfortunately, my knowledge is sketchy here.

The striking thing about all of these models is that they fail to apply to typical real world problems. This failure is either by design (making assumptions which are simply rarely met), by failure to prove interesting results, or both.

And yet, many of the pieces are here. Active learning deals with generalization, online learning can deal with adversarial situations, and reinforcement learning deals with the situation where your choices influence what you can later learn. At a high level, there is much room for research here by design of a new model of exploration, new theoretical statements, or both.

I’ve been told “exploration is too hard”, and that’s a good warning to bear in mind, but I’m still hopeful for progress.

What is state?

In reinforcement learning (and sometimes other settings), there is a notion of “state”. Based upon the state various predictions are made such as “Which action should be taken next?” or “How much cumulative reward do I expect if I take some action from this state?” Given the importance of state, it is important to examine the meaning. There are actually several distinct options and it turns out the definition variation is very important in motivating different pieces of work.

  1. Newtonian State. State is the physical pose of the world. Under this definition, there are very many states, often too many for explicit representation. This is also the definition typically used in games.
  2. Abstracted State. State is an abstracted physical state of the world. “Is the door open or closed?” “Are you in room A or not?” The number of states is much smaller here. A basic issue here is: “How do you compute the state from observations?”
  3. Mathematical State. State is a sufficient statistic of observations for making necessary predictions.
  4. Internal State. State is the internal belief/understanding/etc… which changes an agent’s actions in different circumstances. A natural question is: “How do you learn a state?” This is like the mathematical version of state, except that portions of the statistic which can not be learned are irrelevant.
  5. There are no states. There are only observations (one of which might be a reward) and actions. This is more reasonable than it might sound because state is a derived quantity and the necessity of that derivation is unclear. PSRs are an example.

The different questions these notions of state motivate can have large practical consequences on the algorithms used. It is not clear yet what the “right” notion of state is—we just don’t know what works well. I am most interested in “internal state” at the moment.