Deep – Machine Learning (Theory)

4/5/2023

An AI Miracle Malcontent

The stark success of OpenAI’s GPT4 model surprised me shifting my view from “really good autocomplete” (roughly inline with intuitions here) to a dialog agent exhibiting a significant scope of reasoning and intelligence. Some of the MSR folks did a fairly thorough study of capabilities which seems like a good reference. I think of GPT4 as an artificial savant: super-John capable in some language-centric tasks like style and summarization with impressive yet more limited abilities in other domains like spatial and reasoning intelligence.

And yet, I’m unhappy with mere acceptance because there is a feeling that a miracle happened. How is this not a miracle, at least with hindsight? And given this, it’s not surprising to see folks thinking about more miracles. The difficulty with miracle thinking is that it has no structure upon which to reason for anticipation of the future, prepare for it, and act rationally. Given that, I wanted to lay out my view in some detail and attempt to understand enough to de-miracle what’s happening and what may come next.

Deconstructing The Autocomplete to Dialog Miracle
One of the ironies of the current situation is that an organization called “OpenAI” created AI and isn’t really open about how they did it. That’s an interesting statement about economic incentives and focus. Nevertheless, back when they were publishing, the Instruct GPT paper suggested something interesting: that reinforcement learning on a generative model substrate was remarkably effective—good for 2 to 3 orders of magnitude improvement in the quality of response with a tiny (in comparison to language sources for next word prediction) amount of reinforcement learning. My best guess is that this was the first combination of 3 vital ingredients.

Learning to predict the next word based on vast amounts of language data from the internet. I have no idea how much, but wouldn’t be surprised if it’s a million lifetimes of reading generated by a billion people. That’s a vast amount of information there with deeply intermixed details about the world and language.
1. Why not other objectives? Well, they wanted something simple so they could maximize scaling. There may indeed be room for improvement in choice of objective.
2. Why language? Language is fairy unique amongst information in that it’s the best expression of conscious thought. There is thought without language (yes, I believe animals think in various ways), but you can’t really do language without thought.
The use of a large deep transformer model (pseudocode here) to absorb all of this information. Large here presumably implies training on many GPUs with both data and model parallelism. I’m sure there are many fine engineering tricks here. I’m unclear on the scale, but expect the answer is more than thousands and less than millions.
1. Why transformer models? At a functional level, they embed ‘soft attention’ (=ability to look up a value with a key in a gradient friendly way). At an optimization level, they are GPU-friendly.
2. Why deep? The drive to minimize word prediction error in the context of differentiable depth creates a pressure to develop useful internal abstractions.
Reinforcement learning on a small amount of data which ‘awakens’ a dialog agent. With the right prompt (=prefix language) engineering a vanilla large language model can address many tasks as the information is there, but it’s awkward and clearly not a general purpose dialog agent. At the same time, the learned substrate is an excellent representation upon which to apply RL creating a more active agent while curbing an inherited tendency to mimic internet flamebait.
1. Why reinforcement learning? One of the oddities of language is that there is more than one way of saying things. Hence, the supervised learning view that there is a right answer and everything else is wrong sets up inherent conflicts in the optimization. Hence, “reinforcement learning from human feedback” pairs inverse reinforcement learning to discover a reward function and basic reinforcement learning to achieve better performance. What’s remarkable about this is that the two-step approach is counter to the information processing inequality.

The overall impression that I’m left with is something like the “ghost of the internet”. If you ask the internet for the answer to a question on the best forum available and get an answer, it might be in the ballpark of as useful and as correct as that which GPT4 provides (notably, in seconds). Peter Lee’s book on the application to medicine is pretty convincing. There are pluses and minuses here—GPT4’s abstraction of language tasks like summarization and style appear super-human, or at least better than I can manage. For commonly discussed content (e.g. medicine) it’s fairly solid, but for less commonly discussed content (say, Battletech fan designs) it becomes sketchy as the internet gives out. There are obviously times when it errs (often egregiously in a fully confident way), but that’s also true in internet forums. I specifically don’t trust GPT4 with math and often find it’s reasoning and abstraction abilities shaky, although it’s deeply impressive that they exist at all. And driving a car is out because it’s a task that you can’t really describe.

What about the future?
There’s been a great deal about the danger of AI discussed recently, and quite a mess of misexpectations about where we are.

Is GPT4 and future variants the answer to [insert intelligence-requiring problem here]? GPT4 seems most interesting as a language intelligence. It’s clearly useful as an advisor or a brainstormer. The meaning of “GPT5” isn’t clear, but I would expect substantial shifts in core algorithms/representations are necessary for mastering other forms of intelligence like memory, skill formation, information gathering, and optimized decision making.
Are generative models the end of consensual reality? Human societies seem to have a systematic weakness in that people often prefer a consistent viewpoint even at the expense of fairly extreme rationalization. That behavior in large language models is just looking at our collective behavior through a mirror. Generative model development (both language and video) do have a real potential to worsen this. I believe we should be making real efforts as a society to harden and defend objective reality in a multiple ways. This is not specifically about AI, but it would address a class of AI-related concerns and improve society generally.
Is AI about to kill everyone? Yudkowski’s editorial gives the impression that a Terminator style apocalypse is just around the corner. I’m skeptical about the short term (the next several years), but the longer term requires thought.
1. In the short term there are so many limitations of even GPT4 (even though it’s a giant advance) that I both lack the imagination to see a path to “everyone dies” and I expect it would be suicidal for an AI as well. GPT4, as an AI, is using the borrowed intelligence of the internet. Without that source it’s just an amalgamation of parameters of no interesting capabilities.
2. For the medium term, I think there’s a credible possibility that drone warfare becomes ultralethal inline with this imagined future. You can already see drone warfare in the Ukraine-Russia war significantly increasing the lethality of a battlefield. This requires some significant advances, but nothing seems outlandish. Counterdrone technology development and limits on usage inline with other war machines seems prudent.
3. For the longer term, Vinge’s classical singularity essay is telling here as he lays out the inevitability of developing intelligence for competitive reasons. Economists are often fond of pointing out how job creation has accompanied previous mechanization induced job losses and yet my daughter points out how we keep increasing the amount of schooling children must absorb to be capable members of society. It’s not hard to imagine a desolation of jobs in a decade or two where AIs can simply handle almost all present-day jobs and most humans can’t skill-up to be economically meaningful. Our society is not prepared for this situation—it seems like a quite serious and possibly inevitable possibility. Positive models for a nearly-fully-automated society are provided by Star Trek and Iain Banks although science fiction is very far from a working proposal for a working society.
4. I’m skeptical about a Lawnmower Man like scenario where a superintelligence suddenly takes over the world. In essence, cryptographic barriers are plausibly real, even to a superintelligence. As long as that’s so, the thing to watch out for is excessive concentrations of power without oversight. We already have a functioning notion of super-human intelligence in organizational intelligence and are familiar with techniques for restraining organizational intelligence into useful-for-society channels. Starting with this and improving seems reasonable.

1/7/2013

NYU Large Scale Machine Learning Class

Yann LeCun and I are coteaching a class on Large Scale Machine Learning starting late January at NYU. This class will cover many tricks to get machine learning working well on datasets with many features, examples, and classes, along with several elements of deep learning and support systems enabling the previous.

This is not a beginning class—you really need to have taken a basic machine learning class previously to follow along. Students will be able to run and experiment with large scale learning algorithms since Yahoo! has donated servers which are being configured into a small scale Hadoop cluster. We are planning to cover the frontier of research in scalable learning algorithms, so good class projects could easily lead to papers.

For me, this is a chance to teach on many topics of past research. In general, it seems like researchers should engage in at least occasional teaching of research, both as a proof of teachability and to see their own research through that lens. More generally, I expect there is quite a bit of interest: figuring out how to use data to make predictions well is a topic of growing interest to many fields. In 2007, this was true, and demand is much stronger now. Yann and I also come from quite different viewpoints, so I’m looking forward to learning from him as well.

We plan to videotape lectures and put them (as well as slides) online, but this is not a MOOC in the sense of online grading and class certificates. I’d prefer that it was, but there are two obstacles: NYU is still figuring out what to do as a University here, and this is not a class that has ever been taught before. Turning previous tutorials and class fragments into coherent subject matter for the 50 students we can support at NYU will be pretty challenging as is. My preference, however, is to enable external participation where it’s easily possible.

Suggestions or thoughts on the class are welcome 🙂

1/1/20131/2/2013

Deep Learning 2012

2012 was a tumultuous year for me, but it was undeniably a great year for deep learning efforts. Signs of this include:

Winning a Kaggle competition.
Wide adoption of deep learning for speech recognition.
Significant industry support.
Gains in image recognition.

This is a rare event in research: a significant capability breakout. Congratulations are definitely in order for those who managed to achieve it. At this point, deep learning algorithms seem like a choice undeniably worth investigating for real applications with significant data.

7/11/2011

Interesting Neural Network Papers at ICML 2011

Maybe it’s too early to call, but with four separate Neural Network sessions at this year’s ICML, it looks like Neural Networks are making a comeback. Here are my highlights of these sessions. In general, my feeling is that these papers both demystify deep learning and show its broader applicability.

The first observation I made is that the once disreputable “Neural” nomenclature is being used again in lieu of “deep learning”. Maybe it’s because Adam Coates et al. showed that single layer networks can work surprisingly well.

An Analysis of Single-Layer Networks in Unsupervised Feature Learning, Adam Coates, Honglak Lee, Andrew Y. Ng (AISTATS 2011)
The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, Adam Coates, Andrew Y. Ng (ICML 2011)

Another surprising result out of Andrew Ng’s group comes from Andrew Saxe et al. who show that certain convolutional pooling architectures can obtain close to state-of-the-art performance with random weights (that is, without actually learning).

On Random Weights and Unsupervised Feature Learning, Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, Andrew Y. Ng

Of course, in most cases we do want to train these models eventually. There were two interesting papers on the topic of training neural networks. In the first, Quoc Le et al. show that a simple, off-the-shelf L-BFGS optimizer is often preferable to stochastic gradient descent.

On optimization methods for deep learning, Quoc V. Le, Jiquan Ngiam, Adam Coates, Abhik Lahiri, Bobby Prochnow, Andrew Y. Ng

Secondly, Martens and Sutskever from Geoff Hinton’s group show how to train recurrent neural networks for sequence tasks that exhibit very long range dependencies:

Learning Recurrent Neural Networks with Hessian-Free Optimization, James Martens, Ilya Sutskever

It will be interesting to see whether this type of training will allow recurrent neural networks to outperform CRFs on some standard sequence tasks and data sets. It certainly seems possible since even with standard L-BFGS our recursive neural network (see previous post) can outperform CRF-type models on several challenging computer vision tasks such as semantic segmentation of scene images. This common vision task of labeling each pixel with an object class has not received much attention from the deep learning community.
Apart from the vision experiments, this paper further solidifies the trend that neural networks are being used more and more in natural language processing. In our case, the RNN-based model was used for structure prediction. Another neat example of this trend comes from Yann Dauphin et al. in Yoshua Bengio’s group. They present an interesting solution for learning with sparse bag-of-word representations.

Large-Scale Learning of Embeddings with Reconstruction Sampling, Yann Dauphin, Xavier Glorot, Yoshua Bengio

Such sparse representations had previously been problematic for neural architectures.

In summary, these papers have helped us understand a bit better which “deep” or “neural” architectures work, why they work and how we should train them. Furthermore, the scope of problems that these architectures can handle has been widened to harder and more real-life problems.

Of the non-neural papers, these two papers stood out for me:

Sparse Additive Generative Models of Text, Jacob Eisenstein, Amr Ahmed, Eric Xing – the idea is to model each topic only in terms of how it differs from a background distribution.
Message Passing Algorithms for Dirichlet Diffusion Trees, David A. Knowles, Jurgen Van Gael, Zoubin Ghahramani – I’ve been intrigued by Dirichlet Diffusion Trees but inference always seemed too slow. This paper might solve that problem.

8/23/20108/23/2010

Boosted Decision Trees for Deep Learning

About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct, empirically at UAI by showing that boosted decision trees can beat deep belief networks on versions of Mnist which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.