Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds.

To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values).

On the other hand, there are a number of successful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to a convolutional neural network is essentially the raw camera image as a *matrix*. In learning for natural language, several people have had success on problems like part-of-speech tagging using predictors restricted to a window surrounding the word to be predicted.

A vector window and a matrix both imply a notion of locality which is being actively and effectively used by these algorithms. In contrast, common algorithms like support vector machines, neural networks, and decision trees do not use or take advantage of locality in the input representation. For example, many of these algorithms are (nearly) invariant under permutation of input vector features.
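As a concrete (hypothetical) illustration of this invariance, here is a small numpy sketch: a 1-nearest-neighbor classifier produces identical predictions when the same feature permutation is applied to both training and test data, since Euclidean distance sums over features without regard to their order. All names and data below are invented for illustration.

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbor prediction: label of the closest training point."""
    # Pairwise squared Euclidean distances between test and train points.
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 8))
y_train = rng.integers(0, 2, size=50)
X_test = rng.normal(size=(10, 8))

# Apply the *same* permutation of feature columns to train and test data.
perm = rng.permutation(8)
preds_original = nn_predict(X_train, y_train, X_test)
preds_permuted = nn_predict(X_train[:, perm], y_train, X_test[:, perm])

# Euclidean distance is a sum over features, so column order is irrelevant.
assert (preds_original == preds_permuted).all()
```

Any locality structure among the eight columns is invisible to such an algorithm.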

A basic question we should ask is: “Does it really matter whether or not a learning algorithm knows the locality structure of the input?” Consider a simplistic example where we have *n* input bits as features and the function *f* to learn uses *k* of the *n* bits. Suppose we also know that the function is one of two candidates, *f_1* and *f_2*, but that these are two otherwise complicated functions. Then, finding which *k* of the *n* bits to use might require an *O(n choose k) = O(n!/(k!(n-k)!))* computation and *O(k log(n/k))* samples when *k* is much smaller than *n*. On the other hand, when we know which *k* are relevant, “learning” is trivial. This suggests that telling the algorithm which *k* features are “near” to other features can be very helpful in solving the problem, at least computationally.
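The combinatorics above can be sketched in a few lines of Python (the function, the choice of relevant bits, and all names here are invented for illustration):

```python
from itertools import combinations
from math import comb

import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 3
relevant = (2, 5, 9)                      # the k bits f depends on, hidden from the learner

X = rng.integers(0, 2, size=(200, n))
y = X[:, list(relevant)].sum(axis=1) % 2  # some function of the k relevant bits

def consistent(subset):
    """True if identical restricted inputs always receive identical labels."""
    seen = {}
    return all(seen.setdefault(row, label) == label
               for row, label in zip(map(tuple, X[:, list(subset)]), y))

# Without locality information: try k-subsets until one explains the labels.
checked = 0
for subset in combinations(range(n), k):
    checked += 1
    if consistent(subset):
        break

assert subset == relevant                 # the search does find the right bits,
assert checked <= comb(n, k)              # but may examine up to C(12,3) = 220 subsets
```

Once the relevant *k* are known, by contrast, “learning” reduces to filling in a lookup table with 2^k entries.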

There are several natural questions which arise:

- How do we specify locality? Using a neighborhood graph? A weighted neighborhood graph?
- What are natural algorithms subject to locality? It seems most practical existing learning algorithms using locality have some form of sweeping operation over local neighborhoods. Is that it, or is there more?
- Can a general purpose locality aware algorithm perform well on a broad variety of different tasks?
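As a minimal illustration of the first question, one hypothetical way to specify locality for image features is an explicit weighted neighborhood graph over the pixel grid (the construction below is a toy sketch, not a proposal from the post):

```python
import numpy as np

def grid_neighborhood(height, width):
    """Weighted neighborhood graph for an image grid: weight 1 between
    4-connected pixels, 0 elsewhere (one simple way to specify locality)."""
    n = height * width
    W = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            i = r * width + c
            for dr, dc in ((0, 1), (1, 0)):      # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < height and cc < width:
                    j = rr * width + cc
                    W[i, j] = W[j, i] = 1.0
    return W

W = grid_neighborhood(3, 3)
assert W.sum() / 2 == 12          # a 3x3 grid has 12 four-connected edges
```

A weighted (rather than binary) version of `W` could encode graded notions of nearness.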

Aren’t Markov Networks (and Bayesian Networks) already a fairly general way to specify the locality of the input features? This can be done simply by restricting the graph structure such that locally influenced features have dependencies.

1. I think we should be able to learn locality. For example, using the correlation between pixels of an image allows us to recover the spatial arrangement of the pixels. You can then use the recovered coordinates of the pixels as additional inputs. This would give a weighted neighborhood graph, with weights depending on your training set and not only on the task.

2. This might not exactly be what you are talking about, but all neural networks with a “memory”, such as recurrent neural networks, induce a kind of ordering of the inputs (in 1D, the dimension being the time). But I might not have understood your question.

3. The notion of locality is highly dependent on the task. You can think of two correlated variables as neighbors (as in images or music), but there must be tasks (even though I cannot think of any now) where locality has nothing to do with correlation. In that case, my answer to your question would be no.
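The idea in (1) can be sketched numerically. The following numpy example (synthetic 1-D “images” and invented names, not from the post) estimates pixel correlations and applies classical multidimensional scaling to correlation-derived distances; the recovered coordinates track the true pixel positions:

```python
import numpy as np

rng = np.random.default_rng(2)
n_pixels, n_images = 20, 4000

# Synthetic "images" on a 1-D line of pixels: smoothing white noise with a
# wide moving average makes nearby pixels strongly correlated.
noise = rng.normal(size=(n_images, n_pixels + 20))
kernel = np.ones(21) / 21.0
images = np.array([np.convolve(row, kernel, mode="valid") for row in noise])

# Correlation decays (roughly linearly here) with pixel distance, so
# (1 - correlation) serves as a distance estimate between pixels.
corr = np.corrcoef(images, rowvar=False)
d2 = (1.0 - corr) ** 2                       # squared-distance estimates

# Classical multidimensional scaling: double-center, take the top eigenvector.
n = n_pixels
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ d2 @ J
vals, vecs = np.linalg.eigh(B)
coords = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))

# The recovered 1-D coordinates track the true pixel positions
# (up to an overall scale and reflection).
r = np.corrcoef(coords, np.arange(n))[0, 1]
assert abs(r) > 0.95
```

The same recipe would recover a consistently scrambled pixel order, since the correlations survive any fixed permutation.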

We are not trying to predict features from other features, so I expect you mean some specification of hidden nodes in a Markov/Bayes Net graphical structure. For example, if we think of Mnist, we might specify some internal nodes dependent on local neighborhoods which, in turn, depend on other internal nodes until a final output node is learned.

There are sort of two difficulties I have with this solution:

I like observation (1). To take advantage of observation (2) in practice, it seems you are essentially specifying the locality by choice of learning algorithm. That’s fine, but if we want “one” learning algorithm, we want to avoid this. For (3), I was thinking of locality entirely in terms of “physical” locality rather than statistical correlation. Words are next to each other. Pixels are next to each other. Moments in time are next to each other.

For (3), what I meant was: how can we know that including this notion of locality will be relevant? How can we even know it will not harm the algorithm? Mix the pixels of any image in the MNIST database; it would then be useless to use this spatial locality. My question would then be: how do you know when the physical locality is useful? How do you know how to take advantage of it?

Your original question (3) seems to be more like the second question, but the first one seems as important a problem to me. Is your question (2) related to it ?

As a general principle, it is possible to make extra information available to a learning algorithm without significantly harming the performance of a well-designed learning algorithm if the information is irrelevant. There are several analyses which show this, but the obvious one is to just run two learning algorithms: one with the extra information and one without, and keep the one that works best.
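A minimal sketch of the “run both, keep the best” idea (the learner classes and data here are toy stand-ins, invented for illustration):

```python
import numpy as np

class MajorityLearner:
    """Ignores the features entirely (the "without extra information" extreme)."""
    def fit(self, X, y):
        vals, counts = np.unique(y, return_counts=True)
        self.label = vals[counts.argmax()]
    def predict(self, X):
        return np.full(len(X), self.label)

class NearestNeighborLearner:
    """Uses the features (standing in for the learner given extra information)."""
    def fit(self, X, y):
        self.X, self.y = X, y
    def predict(self, X):
        d = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
        return self.y[d.argmin(axis=1)]

def pick_by_validation(models, X_tr, y_tr, X_va, y_va):
    """Fit every candidate and keep whichever scores best on held-out data."""
    scored = []
    for m in models:
        m.fit(X_tr, y_tr)
        scored.append(((m.predict(X_va) == y_va).mean(), m))
    return max(scored, key=lambda t: t[0])

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)             # labels genuinely depend on the features
acc, best = pick_by_validation(
    [MajorityLearner(), NearestNeighborLearner()],
    X[:200], y[:200], X[200:], y[200:],
)
assert isinstance(best, NearestNeighborLearner)   # the informed learner wins here
```

If the extra information had been irrelevant, the validation step would simply have fallen back to the other learner, at the modest cost of a held-out set.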

hi John,

As you said, there is an implicit invariance in most vector representations of the input data used by most algorithms, and this is somewhat of a problem, since some coordinates may be more related to others, i.e. there might be some sense of proximity between them.

We may imagine, for instance, a huge feature vector for each object, where the first n coordinates each describe specific quantifications of the object under very similar conditions (colors in a very small portion of the image, histograms of words in the very last chapters of the text). Instead of considering them separately (a fine viewpoint), one may consider the coarser viewpoint of averaging them.
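The fine-versus-coarse viewpoint above amounts to block averaging over groups of similar coordinates; a tiny sketch (values and names invented):

```python
import numpy as np

def coarsen(x, group_size):
    """Replace each block of `group_size` consecutive coordinates by its mean:
    the coarse(r) viewpoint on a vector of fine-grained measurements."""
    return x.reshape(-1, group_size).mean(axis=1)

# Six fine-grained measurements taken under two sets of similar conditions.
v = np.array([0.9, 1.1, 1.0, 0.2, 0.0, 0.1])
coarse = coarsen(v, 3)
assert np.allclose(coarse, [1.0, 0.1])
```

A hierarchy of such groupings, applied at several scales, gives the kind of structure a hierarchical kernel can mix over.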

My work has recently focused on using a hierarchical structure between these coordinates and proposing a global mixture (in the framework of kernel design) to induce some kind of locality in the final kernel. The computation is very much inspired by the context tree weighting algorithm of Tjalkens et al., proposed for universal coding. If you are interested, I can send you a preprint.

Best,

Marco

> Mix the pixels of any image in the MNIST database

But this is exactly the point: in general, our algorithms are insensitive to a permutation of the input vectors, but the *data* may not be! In particular, by scrambling the images, you have thrown away valuable information that *could* have been leveraged by an *appropriate* algorithm. It’s just that many current algorithms don’t.

Of course, I think learned locality would recover the permutation, as long as you mix the pixels in the same way across each image in your data set.

I’d be careful here. I think that one way to get easily stuck is with an incorrect thought experiment. We can always imagine some variation on current problems so pathological that it defeats our best attempts at solving it. So to say, “Hey, if we throw away the locality, then our attempts at modeling it will be useless!” is less useful than “There *is* locality here; how can we learn/use it?”

I think it’s a big mistake to generalize from MNIST to all learning problems. The MNIST database is very large, large enough that, e.g., k-nearest neighbors in a learned distance metric performs very well, as do all of the other latest, greatest methods. If you only want your learning algorithms to work in that kind of situation, then, arguably, the learning problem has been solved and we’re just fine-tuning performance. (And, even in this case, Hinton and Nair showed improvements in performance using an explicit generative model to represent MNIST digits.) If you want to be able to analyze smaller datasets or more complex phenomena, then representing domain knowledge is very important. Suppose, for example, you want to extract scene information from a single photograph of the natural world. Modeling the entire image as a single vector (or even a vector of features, except in simplistic cases) will not work.

As for your questions:

1. I think that a simple neighborhood structure is very weak, since it doesn’t represent large-scale dependencies. For example, images have coarse-scale structure which can be represented in terms of low-resolution basis functions, or with segmentation labels (as in layer-based vision methods). Representing these long-range dependencies is important. Energy-based models of images are an example of learning these representations (e.g., Fields of Experts). Layer-based image representations and segmentation are another way. More generally, long-range dependencies can be represented by hierarchical models (e.g., topic models).

2. I think that, once you specify a network structure/model and pick your favorite learning paradigm, the relevant algorithm more or less falls out — inference in a Bayes net, or maximum margin Markov nets, or inference in a CRF, etc.

3. Good question. I think that building the structure into the model amounts to specifying a sort of domain knowledge about the problem, and it makes sense to specify this domain knowledge. (For images, to ignore the spatial locations of the pixels seems crazy in many situations). Certainly, locality and multiresolution structure are common to many image modeling problems. The question you’re asking is, I think: can you learn the structure of the problem? Obviously, there’s a Bayesian answer to this. A margin-based approach might be to find the max-margin structure. But, of course, if you want it to learn all relationships from scratch, you may need a ton of data.

I think that there’s an inherent problem: the structure of the data. I really struggle when the theory just assumes that I have an input vector in Euclidean space. While it may sound generic, when I feed the algorithms this data, if each feature is taken as an isolated measurement (of a real value), I’m disregarding part of the structure of the data in its original domain.

Take music represented symbolically, for instance: the interrelationships between the features are quite complex and, most of the time, discrete. There’s beautiful research in the field of mathematical music theory that explores elaborate properties of musical objects: topologies of chords, algebraic definitions of motifs, etc. With the current theory, if I model my data as mere MEASUREMENTS of these features instead of the FEATURES THEMSELVES, my machine learning algorithms won’t leverage all the nice mathematical properties of my domain. It’s not OK to just assume that these measurements are input vectors in Euclidean space and do some inner product between them in some feature space. That’s brute force.

What I think we need to do is enhance the theory of machine learning with some more abstract algebra, topology, and category theory. That may generalize the machines enough to feed them more interesting data, giving more freedom of data representation. Until then, the burden is on the algorithms to do magic with some reading of the problem domain, instead of working on the data directly.