Embeddings: what are they good for?

I’ve been looking at some recent embeddings work, and am struck by how beautiful the theory and algorithms are. It also makes me wonder, what are embeddings good for?

A few things immediately come to mind:

(1) For visualization of high-dimensional data sets.

In this case, one would like good algorithms for embedding specifically into 2- and 3-dimensional Euclidean spaces.

(2) For nonparametric modeling.

The usual nonparametric models (histograms, nearest neighbor) often require resources which are exponential in the dimension. So if the data actually lie close to some low-dimensional
surface, it might be a good idea to first identify this surface and embed the data before applying the model.

Incidentally, for applications like these, it’s important to have a functional mapping from high to low dimension, which some techniques do not yield up.

(3) As a prelude to classifier learning.

The hope here is presumably that learning will be easier in the low-dimensional space, because of (i) better generalization and (ii) a more “natural” layout of the data.

I’d be curious to know of other uses for embeddings.

Advantages and Disadvantages of Bayesian Learning

I don’t consider myself a “Bayesian”, but I do try hard to understand why Bayesian learning works. For the purposes of this post, Bayesian learning is a simple process of:

  1. Specify a prior over world models.
  2. Integrate using Bayes law with respect to all observed information to compute a posterior over world models.
  3. Predict according to the posterior.

Bayesian learning has many advantages over other learning programs:

  1. Interpolation Bayesian learning methods interpolate all the way to pure engineering. When faced with any learning problem, there is a choice of how much time and effort a human vs. a computer puts in. (For example, the mars rover pathfinding algorithms are almost entirely engineered.) When creating an engineered system, you build a model of the world and then find a good controller in that model. Bayesian methods interpolate to this extreme because the Bayesian prior can be a delta function on one model of the world. What this means is that a recipe of “think harder” (about specifying a prior over world models) and “compute harder” (to calculate a posterior) will eventually succeed. Many other machine learning approaches don’t have this guarantee.
  2. Language Bayesian and near-Bayesian methods have an associated language for specifying priors and posteriors. This is significantly helpful when working on the “think harder” part of a solution.
  3. Intuitions Bayesian learning involves specifying a prior and integration, two activities which seem to be universally useful. (see intuitions).

With all of these advantages, Bayesian learning is a strong program. However, there are also some very significant disadvantages.

  1. Information theoretically infeasible It turns out that specifying a prior is extremely difficult. Roughly speaking, we must specify a real number for every setting of the world model parameters. Many people well-versed in Bayesian learning don’t notice this difficulty for two reasons:
    1. They know languages allowing more compact specification of priors. Acquiring this knowledge takes some signficant effort.
    2. They lie. They don’t specify their actual prior, but rather one which is convenient. (This shouldn’t be taken too badly, because it often works.)
  2. Computationally infeasible Let’s suppose I could accurately specify a prior over every air molecule in a room. Even then, computing a posterior may be extremely difficult. This difficulty implies that computational approximation is required.
  3. Unautomatic The “think harder” part of the Bayesian research program is (in some sense) a “Bayesian employment” act. It guarantees that as long as new learning problems exist, there will be a need for Bayesian engineers to solve them. (Zoubin likes to counter that a superprior over all priors can be employed for automation, but this seems to add to the other disadvantages.)

Overall, if a learning problem must be solved a Bayesian should probably be working on it and has a good chance of solving it.
I wish I knew whether or not the drawbacks can be convincingly addressed. My impression so far is “not always”.

Dynamic Programming Generalizations and Their Use

David Mcallester gave a talk about this paper (with Pedro Felzenszwalb). I’ll try to give a high level summary of why it’s interesting.

Dynamic programming is most familiar as instantiated by Viterbi decoding in a hidden markov model. It is a general paradigm for problem solving where subproblems are solved and used to solve larger problems. In the Viterbi decoding example, the subproblem is “What is the most probable path ending at each state at timestep t?”, and the larger problem is the same except at timestep t+1. There are a few optimizations you can do here:

  1. Dynamic Programming -> queued Dynamic Programming. Keep track of the “cost so far” (or “most probable path”) and (carefully) only look at extensions to paths likely to yield the shortest path. “Carefully” here is defined by Dijkstra’s shortest path algorithm.
  2. queued Dynamic programming -> A*Add a lower bound on the cost to complete a path (or an upper bound on the probability of a completion) for the priority queue of Dijkstra’s shortest path. This can yield computational speedups varying between negligible and outstanding.
  3. A* -> Hierarchical A* The efficiency of A* search is dependent on the tightness of it’s lower bound, which brings up the question: “Where do you get the lower bound?” One appealing answer is from A* applied to a simplified problem equivalent to the original problem, but with states aliased (many states in original problem = 1 state in new problem). This technique can be applied recursively until the problem is trivial.

Each of these steps has been noted previously (although perhaps not in the generality of this paper). What seems new and interesting is that the entire hierarchy of A* searches can be done simultaneously on one priority queue.

The resulting algorithm can use low level information to optimize high level search as well as high level information to optimize low level search in a holistic process. It’s not clear yet how far this approach can be pushed, but this quality is quite appealing. Naturally, there are many plausible learning-related applications.

Which Assumptions are Reasonable?

One of the most confusing things about understanding learning theory is the vast array of differing assumptions. Some critical thought about which of these assumptions are reasonable for real-world problems may be useful.

Before we even start thinking about assumptions, it’s important to realize that the word has multiple meanings. The meaning used here is “assumption = axiom” (i.e. something you can not verify).

Assumption Reasonable? Which analysis? Example/notes
Independent and Identically Distributed Data Sometimes PAC,ERM,Prediction bounds,statistics The KDD cup 2004 physics dataset is plausibly IID data. There are a number of situations which are “almost IID” in the sense that IID analysis results in correct intuitions. Unreasonable in adversarial situations (stock market, war, etc…)
Independently Distributed Data More than IID, but still only sometimes online->batch conversion Losing “identical” can be helpful in situations where you have a cyclic process generating data.
Finite exchangeability (FEX) Sometimes reasonable as for IID There are a good number of situations where there is a population we wish to classify, pay someone to classify a random subset, and then try to learn.
Input space uniform on a sphere No. PAC, active learning I’ve never observed this in practice.
Functional form: “or” of variables, decision list, “and” of variables Sometimes reasonable PAC analysis There are often at least OK functions of this form that make good predictions
No Noise Rarely reasonable PAC, ERM Most learning problems appear to be of the form where the correct prediction given the inputs is fundamentally ambiguous.
Functional form: Monotonic on variables Often PAC-style Many natural problems seem to have behavior monotonic in their input variables.
Functional form: xor Occasionally PAC I was suprised to observe this.
Fast Mixing Sometimes RL Interactive processes often fail to mix, ever, because entropy always increases.
Known optimal state distribution Sometimes RL Sometimes humans know what is going on, and sometimes not.
Small approximation error everywhere Rarely RL Approximate policy iteration is known for sometimes behaving oddly.

If anyone particularly agrees or disagrees with the reasonableness of these assumptions, I’m quite interested.