Language – Machine Learning (Theory)

4/5/2023

An AI Miracle Malcontent

The stark success of OpenAI’s GPT4 model surprised me shifting my view from “really good autocomplete” (roughly inline with intuitions here) to a dialog agent exhibiting a significant scope of reasoning and intelligence. Some of the MSR folks did a fairly thorough study of capabilities which seems like a good reference. I think of GPT4 as an artificial savant: super-John capable in some language-centric tasks like style and summarization with impressive yet more limited abilities in other domains like spatial and reasoning intelligence.

And yet, I’m unhappy with mere acceptance because there is a feeling that a miracle happened. How is this not a miracle, at least with hindsight? And given this, it’s not surprising to see folks thinking about more miracles. The difficulty with miracle thinking is that it has no structure upon which to reason for anticipation of the future, prepare for it, and act rationally. Given that, I wanted to lay out my view in some detail and attempt to understand enough to de-miracle what’s happening and what may come next.

Deconstructing The Autocomplete to Dialog Miracle
One of the ironies of the current situation is that an organization called “OpenAI” created AI and isn’t really open about how they did it. That’s an interesting statement about economic incentives and focus. Nevertheless, back when they were publishing, the Instruct GPT paper suggested something interesting: that reinforcement learning on a generative model substrate was remarkably effective—good for 2 to 3 orders of magnitude improvement in the quality of response with a tiny (in comparison to language sources for next word prediction) amount of reinforcement learning. My best guess is that this was the first combination of 3 vital ingredients.

Learning to predict the next word based on vast amounts of language data from the internet. I have no idea how much, but wouldn’t be surprised if it’s a million lifetimes of reading generated by a billion people. That’s a vast amount of information there with deeply intermixed details about the world and language.
1. Why not other objectives? Well, they wanted something simple so they could maximize scaling. There may indeed be room for improvement in choice of objective.
2. Why language? Language is fairy unique amongst information in that it’s the best expression of conscious thought. There is thought without language (yes, I believe animals think in various ways), but you can’t really do language without thought.
The use of a large deep transformer model (pseudocode here) to absorb all of this information. Large here presumably implies training on many GPUs with both data and model parallelism. I’m sure there are many fine engineering tricks here. I’m unclear on the scale, but expect the answer is more than thousands and less than millions.
1. Why transformer models? At a functional level, they embed ‘soft attention’ (=ability to look up a value with a key in a gradient friendly way). At an optimization level, they are GPU-friendly.
2. Why deep? The drive to minimize word prediction error in the context of differentiable depth creates a pressure to develop useful internal abstractions.
Reinforcement learning on a small amount of data which ‘awakens’ a dialog agent. With the right prompt (=prefix language) engineering a vanilla large language model can address many tasks as the information is there, but it’s awkward and clearly not a general purpose dialog agent. At the same time, the learned substrate is an excellent representation upon which to apply RL creating a more active agent while curbing an inherited tendency to mimic internet flamebait.
1. Why reinforcement learning? One of the oddities of language is that there is more than one way of saying things. Hence, the supervised learning view that there is a right answer and everything else is wrong sets up inherent conflicts in the optimization. Hence, “reinforcement learning from human feedback” pairs inverse reinforcement learning to discover a reward function and basic reinforcement learning to achieve better performance. What’s remarkable about this is that the two-step approach is counter to the information processing inequality.

The overall impression that I’m left with is something like the “ghost of the internet”. If you ask the internet for the answer to a question on the best forum available and get an answer, it might be in the ballpark of as useful and as correct as that which GPT4 provides (notably, in seconds). Peter Lee’s book on the application to medicine is pretty convincing. There are pluses and minuses here—GPT4’s abstraction of language tasks like summarization and style appear super-human, or at least better than I can manage. For commonly discussed content (e.g. medicine) it’s fairly solid, but for less commonly discussed content (say, Battletech fan designs) it becomes sketchy as the internet gives out. There are obviously times when it errs (often egregiously in a fully confident way), but that’s also true in internet forums. I specifically don’t trust GPT4 with math and often find it’s reasoning and abstraction abilities shaky, although it’s deeply impressive that they exist at all. And driving a car is out because it’s a task that you can’t really describe.

What about the future?
There’s been a great deal about the danger of AI discussed recently, and quite a mess of misexpectations about where we are.

Is GPT4 and future variants the answer to [insert intelligence-requiring problem here]? GPT4 seems most interesting as a language intelligence. It’s clearly useful as an advisor or a brainstormer. The meaning of “GPT5” isn’t clear, but I would expect substantial shifts in core algorithms/representations are necessary for mastering other forms of intelligence like memory, skill formation, information gathering, and optimized decision making.
Are generative models the end of consensual reality? Human societies seem to have a systematic weakness in that people often prefer a consistent viewpoint even at the expense of fairly extreme rationalization. That behavior in large language models is just looking at our collective behavior through a mirror. Generative model development (both language and video) do have a real potential to worsen this. I believe we should be making real efforts as a society to harden and defend objective reality in a multiple ways. This is not specifically about AI, but it would address a class of AI-related concerns and improve society generally.
Is AI about to kill everyone? Yudkowski’s editorial gives the impression that a Terminator style apocalypse is just around the corner. I’m skeptical about the short term (the next several years), but the longer term requires thought.
1. In the short term there are so many limitations of even GPT4 (even though it’s a giant advance) that I both lack the imagination to see a path to “everyone dies” and I expect it would be suicidal for an AI as well. GPT4, as an AI, is using the borrowed intelligence of the internet. Without that source it’s just an amalgamation of parameters of no interesting capabilities.
2. For the medium term, I think there’s a credible possibility that drone warfare becomes ultralethal inline with this imagined future. You can already see drone warfare in the Ukraine-Russia war significantly increasing the lethality of a battlefield. This requires some significant advances, but nothing seems outlandish. Counterdrone technology development and limits on usage inline with other war machines seems prudent.
3. For the longer term, Vinge’s classical singularity essay is telling here as he lays out the inevitability of developing intelligence for competitive reasons. Economists are often fond of pointing out how job creation has accompanied previous mechanization induced job losses and yet my daughter points out how we keep increasing the amount of schooling children must absorb to be capable members of society. It’s not hard to imagine a desolation of jobs in a decade or two where AIs can simply handle almost all present-day jobs and most humans can’t skill-up to be economically meaningful. Our society is not prepared for this situation—it seems like a quite serious and possibly inevitable possibility. Positive models for a nearly-fully-automated society are provided by Star Trek and Iain Banks although science fiction is very far from a working proposal for a working society.
4. I’m skeptical about a Lawnmower Man like scenario where a superintelligence suddenly takes over the world. In essence, cryptographic barriers are plausibly real, even to a superintelligence. As long as that’s so, the thing to watch out for is excessive concentrations of power without oversight. We already have a functioning notion of super-human intelligence in organizational intelligence and are familiar with techniques for restraining organizational intelligence into useful-for-society channels. Starting with this and improving seems reasonable.

12/2/201112/4/2011

Hadoop AllReduce and Terascale Learning

Suppose you have a dataset with 2 terafeatures (we only count nonzero entries in a datamatrix), and want to learn a good linear predictor in a reasonable amount of time. How do you do it? As a learning theorist, the first thing you do is pray that this is too much data for the number of parameters—but that’s not the case, there are around 16 billion examples, 16 million parameters, and people really care about a high quality predictor, so subsampling is not a good strategy.

Alekh visited us last summer, and we had a breakthrough (see here for details), coming up with the first learning algorithm I’ve seen that is provably faster than any future single machine learning algorithm. The proof of this is simple: We can output a optimal-up-to-precision linear predictor faster than the data can be streamed through the network interface of any single machine involved in the computation.

It is necessary but not sufficient to have an effective communication infrastructure. It is necessary but not sufficient to have a decent programming language, because parallel programming is hard. It is necessary but not sufficient to have a good optimization approach. The combination says “yikes”, because you need to know many things to design an effective new system.

For communication infrastructures, the two most prevalent approaches are MPI and MapReduce, both of which have substantial drawbacks for machine learning with lots of data.

MPI suffers because it has no fault tolerance by default and because it has a poor understanding of where data is, implying that data must be either manually placed on local nodes, or the first step in every computation is “partition the data across the cluster” which is very undesirable from a communication complexity and programming complexity standpoint. These significantly limit the scale that you can work at to ~100 nodes in practice, because the economics of clusters make sharing unavoidable at larger scales. When the cluster is shared, preshuffling the data is awkward to impossible and you must expect that some nodes will run slower than others because they will be executing other jobs. This limitation on reliability kicks in much sooner than disk read failures or node failures.
MapReduce suffers because the setup and teardown costs are significant. Measured directly, this is often on the order of a minute, associated with interacting with the scheduler and communicating the program to a large number of nodes. But indirectly, this can be radically worse, as any map-reduce job can be held in limbo while waiting for free nodes to work on. And commonly we need to execute many MapReduce iterations to achieve high quality prediction.
MapReduce has another more subtle flaw: using it requires refactoring your code into a sequence of map and reduce operations. This is significantly annoying, because right good learning algorithms is pretty difficult in the first place. MapReduce has a third flaw: it encourages inefficient optimization paradigm. In particular, while you can phrase many learning algorithms as statistical query learning algorithms, doing so is energy inefficient, up to O(examples) in extreme cases.

Since the drawbacks of MPI and MapReduce differ, we can try to create a solution which eliminates all of drawbacks, which a Hadoop-compatible AllReduce does. Cherry picking from each we get:

MPI: The Allreduce function. The starting state for AllReduce is n nodes each with a number, and the end state is all nodes having the sum of all numbers.
MapReduce: Conceptual simplicity. One easy to understand function is enough.
MPI: No need to refactor code. You just sprinkle allreduce in a few locations in your single machine code.
MapReduce: Data locality. We just hijack the MapReduce infrastructure to execute a map-only job where each process executes on the node with the data.
MPI: Ability to use local storage (or RAM). Hadoop itself gobbles large amounts of RAM by default because it uses Java. And, in any case, you don’t have an effective large scale learning algorithm if it dies every time the data on a single node exceeds available RAM. Instead, you want to create a temporary file on the local disk and allow it to be cached in RAM by the OS, if that’s possible.
MapReduce: Automatic cleanup of local resources. Temporary files are automatically nuked.
MPI: Fast optimization approaches remain within the conceptual scope. Allreduce, because it’s a function call, does not conceptually limit online learning approaches as discussed below. MapReduce conceptually forces statistical query style algorithms. In practice, this can be walked around, but that’s annoying.
MapReduce: Robustness. We don’t captures all the robustness of MapReduce which can succeed even during a gunfight in the datacenter. But we don’t generally need that: it’s easy to use Hadoop’s speculative execution approach to deal with the slow node problem and use delayed initialization to get around all startup failures giving you something with >99% success rate on a running time reliable to within a factor of 2.

One function (all_reduce) is not a programming language. But since it’s written in C, it is easily encapsulated and added to any existing programming language giving you a complete language. To test this hypothesis, I visited Clement for a day, where we connected things to make Allreduce work in Lua twice—once with an online approach and once with an LBFGS optimization approach for convolutional neural networks. As a parallel programming paradigm, it’s amazingly easier than many other approaches, because you take your existing code and figure out which pieces of state to synchronize. It’s superior enough that I’ve now eliminated the multithreaded and parallel online learning approaches within Vowpal Wabbit. This approach is also great in terms of the amount of incremental learning required—you just need to learn one function to be able to create useful parallel machine learning algorithms. The only thing easier than learning one function is learning none, which you can do for linear prediction by just using VW. Incidentally, we designed the AllReduce code so that Hadoop is not a requirement—you just need to do a bit of extra scripting and lose some of the benefits discussed above when running this on a workstation cluster or a single machine.

You also need to get optimization approaches right. Two canonical but very different optimization algorithms are stochastic gradient descent and LBFGS. Understanding the weaknesses of these algorithm is critical even though often not discussed by their proponents. SGD approaches tend to have two drawbacks: the right choice of various hyperparameters can be annoying. We’ve mostly eliminated this drawback in VW using a learning rate that is tuned to automatically work in various ways. The other drawback is that they generally aren’t great at dealing with noise. This is tricky to deal with in general, because the algorithms only see one example at a time. Leon Bottou is working to eliminate this last drawback, but my impression is that we’re not quite there yet. LBFGS on the other hand is great at dealing with noise but suffers significantly in it’s early convergence rate where SGD is extremely effective. Again, we can combine these approaches in an obvious way: use online learning at the beginning to warmstart LBFGS to integrate out the noise. In practice, the online learning gets you 95%-99% of the way there and then LBFGS nails the last bit of performance.

For the problem I mentioned at the beginning, we can learn in about an hour using a kilonode, implying an overall throughput of 500 megafeatures/s, which is about a factor of 5 faster than any single network interface (1 gigabit/s). This is substantially greater scaling than any of the other algorithms in the Scaling up Machine Learning book (see here for a comparison).

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.
For smaller datasets with computationally intense models, GPU approaches seem very compelling.
For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug.
The model needs to fit into memory. If that’s not the case, then other approaches are required.

I also expect Hadoop Allreduce is useful across many more tasks than just machine learning. Optimization problems are an easy example, but I suspect there are a number of iterative computation problems where allreduce can be very effective. While it might appear a limited operation, you can easily do average, weighted average, max, etc… And, the scope of allreduce is also easily broadened with an arbitrary reduce function, as per MPI’s version. The Allreduce code itself is not yet native in Hadoop, so you’ll need to grab it from the VW source code which has a BSD license. I’ve been encouraged by discussions with Milind suggesting it may become native soon.

Update: CACM Crosspost.

7/11/2011

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, Andrew Y. Ng (AISTATS 2011)
The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, Adam Coates, Andrew Y. Ng (ICML 2011)

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

On Random Weights and Unsupervised Feature Learning, Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Y. Ng

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

On optimization methods for deep learning, Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:

Learning Recurrent Neural Networks with Hessian-Free Optimization, James Martens, Ilya Sutskever

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Large-Scale Learning of Embeddings with Reconstruction Sampling, Yann Dauphin, Xavier Glorot, Yoshua Bengio

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me:

Sparse Additive Generative Models of Text, Jacob Eisenstein, Amr Ahmed, Eric Xing – the idea is to model each topic only in terms of how it differs from a background distribution.
Message Passing Algorithms for Dirichlet Diffusion Trees, David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani – I’ve been intrigued by Dirichlet Diffusion Trees but inference always seemed too slow. This paper might solve that problem.

10/22/200610/22/2006

Exemplar programming

There are many different abstractions for problem definition and solution. Here are a few examples:

Functional programming: a set of functions are defined. The composed execution of these functions yields the solution.
Linear programming: a set of constraints and a linear objective function are defined. An LP solver finds the constrained optimum.
Quadratic programming: Like linear programming, but the language is a little more flexible (and the solution slower).
Convex programming: like quadratic programming, but the language is more flexible (and the solutions even slower).
Dynamic programming: a recursive definition of the problem is defined and then solved efficiently via caching tricks.
SAT programming: A problem is specified as a satisfiability involving a conjunction of a disjunction of boolean variables. A general engine attempts to find a good satisfying assignment. For example Kautz’s blackbox planner.

These abstractions have different tradeoffs between ease of use, generality, and the effectiveness of existing solutions.

Machine learning can be thought of as exemplar programming. Exemplar programming is creating examples of a (input,output) pairs which are used by an algorithm to predict a function from input to output. Basic questions about this abstraction are:

How easy to use is it? An honest answer here is “not very”. There are several subtle issues related to overfitting, independence, and representativeness of the samples which take significant effort to describe to an unfamiliar person. Making this abstraction easier to use via careful language design is an area where effort may pay off.
How effectve are the exemplar programming solvers (aka learning algorithms)? There is huge variance with respect to the problem under consideration. For some problems, some learning algorithm is very effective while for others you might as well have not tried.
How general is exemplar programming? Very general. A great many problems can be phrased as “given this, I want that”.
How effective has examplar programming been? Increasingly effective.

Exemplar programming is a viewpoint of machine learning which (mostly) ignores statistics, prior information, and the possibility of overfitting. That’s a great deal to ignore, but there are gains as well.

Exemplar programming creates a split between problem solution and problem formation. This is important because the problem solver can heavily optimized (for speed, scalibility, effectiveness on common problems, etc…) making the process of apply machine learning simply a matter of specifying the exemplars.
The formation/solution split helps us focus on problem formation independent of solution. The big gains in machine learning in the last decade seem to be discovering how to apply to new areas. A significant step in any application to a new area is discovering the right way to formulate the problem.

Exemplar programming seems to be a useful viewpoint for machine learning in “big data” problems with many examples where significant prior information is lacking. When either of these conditions are not met, other viewpoints/approaches to machine learning may be more succesful.

9/28/20069/30/2006

Programming Languages for Machine Learning Implementations

Machine learning algorithms have a much better chance of being widely adopted if they are implemented in some easy-to-use code. There are several important concerns associated with machine learning which stress programming languages on the ease-of-use vs. speed frontier.

Speed The rate at which data sources are growing seems to be outstripping the rate at which computational power is growing, so it is important that we be able to eak out every bit of computational power. Garbage collected languages (java, ocaml, perl and python) often have several issues here.
1. Garbage collection often implies that floating point numbers are “boxed”: every float is represented by a pointer to a float. Boxing can cause an order of magnitude slowdown because an extra nonlocalized memory reference is made, and accesses to main memory can are many CPU cycles long.
2. Garbage collection often implies that considerably more memory is used than is necessary. This has a variable effect. In some circumstances it results in no slowdown while in others it can cause a 4-order of magnitude slowdown. The basic rule here is that you never want to run out of physical memory and use swap.
3. Some of these languages are interpreted rather than executed. As a rule of thumb, interpreted languages are an order of magnitude slower than an executed languages.
4. Even when these languages are compiled, there are often issues with how well they are compiled. Compiling to a modern desktop processor is a tricky business, and only a few compilers do this well.
Programming Ease Ease of use of a language is very subjective because it is always easiest to use the language you are most familiar with. Nevertheless, significant differences exist.
1. Syntax Syntax is often overlooked, but it can make a huge difference in the ease of both learning to program and using the language. A good syntax allows you study and improve the algorithm rather than the program implementing it. (Algorithmic improvements often yield the greatest speedups while optimizing.) The syntax needs to be concise (so that you can view the entire algorithm) and simple (so that it can become second nature).
2. Library Support Languages vary dramatically in terms of library support, and having the right linear algebre/graphics/IO library can make a task dramatically easier. Perl has a huge number of associated libraries which greatly ease use.
3. Built in Functionality Languages differ in terms of the primitives that are available. Which of these primiitives are useful is a subject of significant debate.
  1. Some form of generic programming (templates, polymorphism, etc…) seems important.
  2. Functions as first class objects is often very convenient while programming.
  3. Builting lists and hash tables are often extremely useful primitives. One caveat here is that when you make a speed optimization pass, you often have to avoid these primitives.
  4. Support for modularity is important. Objects (as in object oriented programming) is one example of this, but there are many others. The essential quantity here seems to be an interface creation mechanism.
4. Scalability Scalability is where otherwise higher level languages often break down. A simple example of this is a language with file I/O built in that fails to perform correctly when the file has size 2³¹ or 2³². I am particularly familiar with Ocaml which has the following scalability issues:
  1. List operations often end up consuming the stack and segfaulting. The Unison crew were annoyed enough by this that they created their own “safelist” library with all the same interfaces as the list type.
  2. Arrays on a 32 bit machine can have only 2²²-1 elements due to dynamic type information set aside for the garbage collector. As a substitute, there is a “big array” library. However, having big arrays, arrays, and strings, often becomes annoying because they have different interfaces for objects which are semantically the same.
5. Familiarity This isn’t just your familiarity, but also the familiarity of other people who might use the code. At one extreme, you can invent your own language (as Yann LeCun has done with Lush). At the other extreme, you can use a language which many people are familiar with such as C or Java.

The existing significantly used machine learning code bases seem to favor lower level languages.

Weka is one of the best known general purpose machine learning toolkits. It is implemented in Java and has far more algorithmic coverage than any other system I know of.
LibSVM is implemented in C++ and Java.
SVMlight is implemented in C.
C4.5 is implemented in C.
SNNS is written in C.

Both R and Matlab are also useful languages, although I have not used them.

None of the language choices seem anywhere near ideal. The higher level languages often can’t execute fast and the lower level ones which can are often quite clumsy. The approach I’ve taken is to first implement in a higher level language (my choice was ocaml) to test ideas and then reimplement in a lower level language (C or C++) for speed where the ideas work out. This reimplementation step is quite clumsy and I would love to find a way to avoid it.