Machine Learning to AI

I recently had fun discussions with both Vikash Mansinghka and Thomas Breuel about approaching AI with machine learning. The general interest in taking a crack at AI with machine learning seems to be rising on many fronts, including at DARPA.

As a matter of history, there was a great deal of interest in AI which died down before I began research. There remain many projects and conferences spawned in this earlier AI wave, as well as a good bit of experience about what did not work, or at least did not work yet. Here are a few examples of failure modes that people seem to run into:

  1. Supply/Product confusion. Sometimes we think “Intelligences use X, so I’ll create X and have an Intelligence.” An example of this is the Cyc Project which inspires some people as “intelligences use ontologies, so I’ll create an ontology and a system using it to have an Intelligence.” The flaw here is that Intelligences create ontologies, which they use, and without the ability to create ontologies you don’t have an Intelligence. If we are lucky, the substantial effort invested in Cyc won’t be wasted, as it has a large quantity of information stored in a plausibly useful format. If we are unlucky, it fails to even be partially useful, because the format is unnatural for the internal representations of an Intelligence.
  2. Uncertainty second. Many of the older AI programs had no role for uncertainty. If you asked the people working on them, they might agree that uncertainty was an important but secondary concern to be solved after the main problem. Unfortunately, it seems that uncertainty is a primary concern in practice. One example of this is blocks world, where a system for planning how to rearrange blocks on a table might easily fail in practice because the robot fails to grab a block properly. Many people think of uncertainty as a second-order concern, because they don’t experience uncertainty in their daily lives. I believe this is incorrect: a mental illusion due to the effect that focusing attention on a specific subject implies reducing uncertainty on that subject. More generally, because any Intelligence is a small part of the world, the ability of any intelligence to perceive, understand, and manipulate the world is inherently limited, requiring the ability to deal with uncertainty. For statistics & ML people, it’s important to not breathe a sigh of relief too easily, as the problem is pernicious. For example, many ML techniques based around conditional independence routinely suffer from excess certainty.
  3. Computation second. Some people try to create an intelligence without reference to efficient computation. AIXI is an extreme example of this sort. The algorithm is very difficult to deploy in practice because there were no computational constraints other than computability designed into its creation. It’s important to understand that computational constraints and uncertainty go together: because there are computational constraints, an intelligence is forced to deal with uncertainty, since not everything which might follow at a mathematical level can be inferred in the available computational budget.
  4. AI-Hard problems. There was a time when some people thought, “If we could just get a program that mastered chess so well it could beat the best humans, we will learn enough about AI to create an AI.” Deep Blue put that theory to rest. Current efforts on Poker and Go seem more promising, but no one believes they are “AI-Hard” for good reason. It’s not even clear that the Turing Test is a reliable indicator, because (for example) we might imagine that there is Intelligence which cannot imitate a human, or that there are programs that can imitate humans well enough to fool humans without being able to achieve everything that an Intelligence could. Perhaps the best evidence is something singularity-style: AI exists when it can substantially improve its own abilities.
  5. Asymptopia. In machine learning there are many theorems of the form “learning algorithm A can solve any learning problem in the limit of infinite data”. Here A might be nearest neighbors, decision trees, two-layer neural networks, support vector machines, nonparametric statistics, nonparametric Bayes, or something else. These theorems are ok, but insufficient. Often the algorithms are not computationally acceptable, and even if so, they are not sufficiently efficient with respect to the amount of experience required to learn.
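
The remark in item 2 about excess certainty under conditional independence can be made concrete with a small sketch. It is an illustration under stated assumptions, not anything from the projects discussed above: the function name `naive_bayes_posterior` and the numbers are hypothetical. The same weak cue is counted several times, as a naive-Bayes-style model would do if it treated correlated copies of one feature as independent evidence.

```python
def naive_bayes_posterior(likelihood_ratio, copies, prior=0.5):
    """Posterior P(class 1 | evidence) when one cue is counted `copies` times.

    The cue carries a fixed likelihood ratio P(x | class 1) / P(x | class 0).
    Treating each copy as conditionally independent multiplies the odds by
    that ratio once per copy, even though no new information has arrived.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** copies
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical weak cue: P(x | class 1) = 0.6, P(x | class 0) = 0.4.
ratio = 0.6 / 0.4

for copies in (1, 5, 20):
    # The honest answer stays at the single-copy posterior; the model's
    # confidence climbs toward 1 as the duplicated cue is recounted.
    print(copies, naive_bayes_posterior(ratio, copies))
```

The evidence is identical in every row of the output, yet the reported confidence grows with the number of copies, which is the sense in which such models are overconfident.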

Solving AI is undeniably hard, as evidenced by the amount of time spent on it and the set of approaches which haven’t succeeded. There are a couple of reasons for hope this time. The first is that there is, or soon will be, sufficient computation available, unlike the last time. The second is that the machine learning approach fails well, because there are industrial uses for machine learning. Consequently, even work that falls short of AI can be expected to see substantial use in practice. This might sound like “a good downside”, but it’s actually an upside, because it implies that incremental progress has the potential for ultimate success.

Restated at an abstract level: a hard problem can generally be decomposed in many ways into subproblems. Amongst all such decompositions, a good decomposition is one with the property that solutions to the subproblems are immediately useful. The machine learning approach to AI has this goodness property, unlike many other approaches, which partially explains why the ML approach is successful despite “failing” so far to achieve AI.

One reason why AI is hard is that tackling general problems in the world requires a substantial number of different strategies, including learning, searching, and chunking (= constructing macros), all while respecting constraints of computation and robustness to uncertainty. Given this, a fair strategy seems to be first mastering one strategy and then incorporating others, always checking that the incorporation properly addresses real-world problems. In doing this, considering the constraint-ignoring approaches as limiting cases of the real system may be helpful.

15 Replies to “Machine Learning to AI”

  1. Interesting post, John. I’ve often been curious about what you label “asymptopia”. As you describe it, it is a sort of universality result: that the algorithm has the capability to solve a large class of problems. When I think about intelligence, one thing that strikes me immediately is that it’s not clear that it is universal at all. At least not in any efficient manner for large problem sizes. I’m not sure what this means for approaches to AI, but I might guess that universality leads to throwing away perfectly valid algorithms because they are not “strong enough.” Do you know of examples that fit this mold?

  2. LOL!
    (excuse me)
    What makes you think that Machine Learning is the magic wand which will solve the “AI problem”?
    Though it has supposedly already been solved by one of Hutter’s students, I do not see any possible breakthrough before the core goal is better defined:
    “If AI has made little obvious progress it may be because we are too busy
    trying to produce useful systems before we know how they should work.”

    – Marcel Schoppers (1983!)
    As far as I know, all that machine learning is about is picking the “right stuff” within a hypothesis space, whereas the hard part is to FIND the proper hypothesis space, not to mention how to just REPRESENT arbitrarily complex hypothesis spaces, like a bunch of competing physics theories for a given blob or stream of “bare phenomena”.
    Yeah! Yeah! Grant money is good to have but the bubble will fizzle again like in the eighties…

  3. @Kevembuangga

    Yes, I admit it. I solved the AI problem. You can all go home now.

    Haha! I presume you’re talking about the claim that AIXI (which I didn’t invent) would solve the AI problem if it weren’t for the fact that it’s uncomputable. Your statement that I claim to have solved the AI problem is thus very misleading in at least two important ways.


    Plain AIXI is not just hard to compute, it’s provably uncomputable. Variants exist which are computable, but still, you really wouldn’t want to try in anything but very small toy examples.

    Anyway, the main thing I wanted to say is that I’d put a different “spin” on AIXI than you seem to put on it here. The idea was that general AI is a really hard problem. Ok, so what if we simplify things by ignoring computational cost… can we then solve the problem, and if we can, does this teach us anything useful? The fact that AIXI proved to be very hard, perhaps even impossible to convert into a practical solution to the AI problem certainly didn’t come as a surprise.

  4. A relevant concept that is clear here is that good science requires constant testing of ideas, which old AI essentially ignored: too many toy problems and proofs-of-concept, not enough evaluation. Serious testing requires data collection, which naturally leads to machine learning. In this sense, machine learning becomes unavoidable.

    Testing and the design of useful subproblems are also strongly related. In order to do large scale testing, the evaluating measure can’t be too complicated. Since we are at it, why not focus on small tasks that people can readily benefit from? Topic models for text are a good example: while people keep working on several ways of introducing more “realistic” language features into such models, they leave a trail of nice, ready-to-use models that solve relatively modest but relevant text analysis questions. That’s the path to creating a successful line of research. What earlier AI researchers perhaps lacked was the insight that producing “useful systems” was not enough: they needed to be easy to evaluate, and easy to use.

  5. It’s a trickier question than might be clear. Is a computer a Turing machine? The reflexive answer is yes, but the mathematical answer is no, because you can’t simulate an infinite tape in finite RAM. Similarly with learning algorithms, I could answer your question either way. There are many algorithms which aren’t technically asymptotically universal for one reason or another, but most reasonable learning algorithms can be made asymptotically universal with a bit of tweaking. For example, even a linear predictor can be made asymptotically universal with a simple nonlearning nonlinear preprocessor. I’m not aware of a significantly used learning algorithm for which some similar trick doesn’t apply.

  6. I generally believe that solving a hard problem under relaxed constraints can be useful to inform further research on a hard problem, and AIXI fits that description. It’s also an example of more adventurous research than is typical in the field.

    I think what makes some people cringe a bit is the description of it as an “optimal reinforcement learning agent”. This is correct, but for a particular kind of optimal which (a) doesn’t match the notion of optimal that almost all readers have and (b) isn’t a notion of optimal achievable by any real Intelligence.

    I can understand both viewpoints.

  7. @Shane

    OK, I recant, you didn’t solve anything and AIXI is useless 🙂
    Is that better?
    That was in jest to emphasize my argument which is that no matter how “promising” and clever current AI research is (and has been) it still misses the point in that we don’t really know what we are looking for.
    I vaguely suspect that intelligence is about probing not proving or solving, building on the fly some sort of ontologies to organize the surrounding world into manageable chunks.
    Unfortunately the Western scientific tradition is firmly entrenched in Platonism, which supposes that objects and concepts are “a given” and pays little attention to the epistemology of concept formation (as opposed to just “discovery”).
    I am pretty sure a rat doesn’t use Solomonoff Induction to find its way into a maze, doesn’t meet uncomputable questions in doing so and beats any current AI.
    And I don’t buy the mystique of “the neurons” which have replaced “soul” and “spirit”; looking more closely at the neurons will likely be as useful as analyzing the structure of feathers was in the bid for heavier-than-air flight.

  8. Another problem that I have experienced in Machine Learning work is the Cartesian Error of splitting mind and body. A learning system needs to arise from its environment as part of it, so that its self is fundamentally akin to its environment, and therefore its understanding of its environment can reflect on itself, giving it the potential for self-awareness. I caught myself falling into the trap of mind-body, actor-environment splitting, and this was behind many problems the prototype was having in testing.

  9. The safest strategy for an AI researcher is to never utter the term “intelligence”. “Artificial” is OK.

    There’s been a whole lot of ink spilled on the topic in philosophical circles, ranging from Plato’s shadows on the wall to Putnam’s brains in vats. We can argue behaviorism versus functionalism w.r.t. whether the Turing test makes sense. We can argue about metaphor and conceptual schemes and the existence of truth about terms like “intelligence”. We can argue Descartes’ mind/body duality and about the existence of a soul.

    I’m a simple country practitioner. I just want a good predictor for held out data with as little fiddling and waiting on my part (counted in person and machine hours) as possible. Ideally one that has good knowledge of its own uncertainty. As John mentioned under point 2, “techniques based around conditional independence routinely suffer from excess certainty.” Amen to that, brother. Of course, people also routinely suffer from excess certainty.

  10. Kevembuangga,

    you said

    “it still misses the point in that we don’t really know what we are looking for.”

    And this is precisely what AIXI is trying to define *formally*: an optimal agent given one has infinite resources. The idea is that, if we have a good idea of what the solution is in principle, we can then move forward from that and try to figure out what the best method is in practice. If we do not even know what the ‘objective function’ for AI is,

    “I am pretty sure a rat doesn’t use Solomonoff Induction to find its way into a maze, doesn’t meet uncomputable questions in doing so and beats any current AI.”

    And no one is claiming it does. The point you are missing is that SI does provide, in a very *formal/mathematical* sense, the optimal solution to the sequence prediction problem. It is meant to guide development of more practical variants (like MDL/MML, and PEA, CTWs in the online setting, and so on).

    “I vaguely suspect that intelligence is about probing not proving or solving, building on the fly some sort of ontologies to organize the surrounding world into manageable chunks.”

    Which is compression; so you are saying the more you compress, the more intelligent you are; which is exactly what AIXI says in a mathematically precise manner.

  11. “And this is precisely what AIXI is trying to define *formally*: an optimal agent given one has infinite resources.”

    I fully agree that the key to any progress is by going from informal to formal definitions of intelligence, unfortunately the AIXI attempt is yet another flop.

    “The point you are missing is that SI does provide, in a very *formal/mathematical* sense, the optimal solution to the sequence prediction problem.”

    Thank you, since I referenced Solomonoff Induction in this thread I am certainly aware that it is the “perfect” solution to the sequence prediction problem, however the “AI problem” is NOT the sequence prediction problem.

    “Which is compression…”

    I know what you are alluding to, again this is NOT what I mean.
    The reason AIXI, Solomonoff Induction and the Hutter Prize (etc…) are widely off the mark is that they all work within an already established “coding scheme” whereas the REAL difficulty is to come up in the first place with some coding of events/objects/concepts from the world.
    To reframe this (metaphorically) in the machine learning realm think feature selection, before you can do anything sensible with your data you have to choose which kind of measurements are likely relevant to the question at hand, and more often than not, massage the raw measurements (PCA, SVD, dimension reduction, etc) BEFORE you can look for a match of hypotheses.
    Where do all those “clues” come from?
    From the experimenter, not from the algorithm!
    Just like the rat knows very well what properties of the maze are relevant to a possible escape route.
    Escaping the maze, tuning the algorithm, etc… ARE of course important matters too but NOT the “missing link” toward AI.

  12. A practical algorithm for scaling machine learning to AI size problems using only a few lines of code
    (discussed in the metaoptimize thread below)

    Basic concept is to use the random subspace method on a hierarchical network of dimensionality reducers, and program it simply using bulk synchronous processing (an implementation of which is the graph processing system pregel).
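
Reply 5’s remark that a linear predictor can be extended with a fixed, non-learned nonlinear preprocessor can be illustrated on a toy problem. The sketch below is a hedged illustration only: the helper names (`train_perceptron`, `accuracy`, `expand`) and the choice of a single product feature are assumptions made for this example, and it shows the representational point on XOR rather than asymptotic universality itself.

```python
def train_perceptron(rows, labels, epochs=100):
    """Classic perceptron; returns (weights, bias) after at most `epochs` passes."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(rows, labels):  # labels are +1 or -1
            score = sum(w * v for w, v in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary): update
                weights = [w + y * v for w, v in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # converged: every example classified correctly
            break
    return weights, bias


def accuracy(rows, labels, weights, bias):
    """Fraction of examples whose predicted sign matches the label."""
    correct = 0
    for x, y in zip(rows, labels):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        correct += (1 if score > 0 else -1) == y
    return correct / len(rows)


def expand(x):
    """Fixed, non-learned preprocessor: append the product of the two inputs."""
    return [x[0], x[1], x[0] * x[1]]


# XOR with inputs and labels encoded as +1 / -1.
xor_inputs = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
xor_labels = [-1, 1, 1, -1]

# Raw inputs: XOR is not linearly separable, so the perceptron cannot fit it.
raw_w, raw_b = train_perceptron(xor_inputs, xor_labels)
print("raw accuracy:", accuracy(xor_inputs, xor_labels, raw_w, raw_b))

# Same learner, same data, but passed through the fixed preprocessor.
expanded = [expand(x) for x in xor_inputs]
exp_w, exp_b = train_perceptron(expanded, xor_labels)
print("expanded accuracy:", accuracy(expanded, xor_labels, exp_w, exp_b))
```

The learner is unchanged between the two runs; only the fixed feature map differs. Richer fixed expansions follow the same pattern, at a computational cost that grows with the number of features.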
