Let’s suppose that we are trying to create a general purpose machine learning box. The box is fed many examples of the function it is supposed to learn and (hopefully) succeeds.
To date, most such attempts to produce a box of this form take a vector as input. The elements of the vector might be bits, real numbers, or ‘categorical’ data (a discrete set of values).
On the other hand, there are a number of succesful applications of machine learning which do not seem to use a vector representation as input. For example, in vision, convolutional neural networks have been used to solve several vision problems. The input to the convolutional neural network is essentially the raw camera image as a matrix. In learning for natural languages, several people have had success on problems like parts-of-speech tagging using predictors restricted to a window surrounding the word to be predicted.
A vector window and a matrix both imply a notion of locality which is being actively and effectively used by these algorithms. In contrast, common algorithms like support vector machines, neural networks, and decision trees do not use or take advantage of locality in the input representation. For example, many of these algorithms are (nearly) invariant under permutation of input vector features.
A basic question we should ask is: “Does it really matter whether or not a learning algorithm knows the locality structure of the input?” Consider a simplistic example where we have n input bits as features and the function f to learn uses k of hte n bits. Suppose we also know that the function is one of two candidates: f1 and f2, but that these are two otherwise complicated functions. Then, finding which of the k to use might require an O(n choose k) = O(n!/k!(n-k)!) computation and O(k log (n/k)) samples when k is much smaller than n. On the other hand, when we know which k are relevant, “learning” is trivial. This suggests that telling the algorithm which k features are “near” to other features can be very helpful in solving the problem, at least computationally.
There are several natural questions which arise:
- How do we specify locality? Using a neighborhood graph? A weighted neighborhood graph?
- What are natural algorithms subject to locality? It seems most practical existing learning algorithms using locality have some form of sweeping operation over local neighborhoods. Is that it, or is there more?
- Can a general purpose locality aware algorithm perform well on a broad variety of different tasks?