Inappropriate Mathematics for Machine Learning

Reviewers and students are sometimes greatly concerned by the distinction between:

  1. An open set and a closed set.
  2. A Supremum and a Maximum.
  3. An event which happens with probability 1 and an event that always happens.

I don’t appreciate this distinction in machine learning & learning theory. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.

The fundamental issue here is that substantial parts of mathematics don’t appear well-matched to computation in the physical world, because the mathematics has concerns which are unphysical. This mismatched mathematics makes irrelevant distinctions. We can ask “what mathematics is appropriate to computation?” Andrej has convinced me that a pretty good answer to this question is constructive mathematics.

So, here’s a basic challenge: Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?

26 Replies to “Inappropriate Mathematics for Machine Learning”

  1. All machine learning takes place (by definition) on a machine where every parameter has finite precision. Consequently, every set is closed, a maximal element always exists, and probability 1 events always happen.

    Can anyone name a situation where any of the distinctions above (or similar distinctions) matter in machine learning?

    John, I do not know the context of your post, but your question is a bit of a red herring.

    Computers have finite precision and therefore can operate only with discrete sets. For example a single general real number cannot be accurately represented. Does it mean that real numbers are not appropriate for describing various phenomena? The success of various real-number based algorithms and models would indicate otherwise.
    Once we agree that real numbers are a key mathematical abstraction for
    many types modeling, we see that various processes and their attributes are continuous functions of parameters and their behavior needs to be analyzed from that point of view (e.g., by taking derivatives). In that setting the difference between an open set and a closed set is considerable. To give just one common example, you may have an energy barrier (a physical consideration) preventing a process from crossing a certain threshold. This is naturally defined on an open but not closed interval.

    There are some mathematical objects that may be irrelevant to computation and “unphysical” (the Banach-Tarski paradox on comes to mind), but the difference between an open and a closed set is not one of them.

    Moreover, in the context of learning theory, it is often useful to know that a certain object (e.g., an optimal classifier or a Nash equilibrium) exists. This is the difference between max and sup, as sup does not guarantee existence.

    misha belkin

  2. Even if machines always have finite precision, it is sometimes very convenient to pretend otherwise. For example, the standard definition of a derivative involves a quantity which converges to zero – a meaningless operation in a finite precision setting. However, I don’t think anyone will argue that derivatives are meaningless for machine learning, or will insist on using a different definition of a derivative which fits a finite precision setting.

    Also, there are mathematical settings which has proven useful for learning theory and where mathematical “niceties” can sometimes be important (such as RKHS theory). So while agreeing that these formal distinctions are meaningless when one talks about a concrete algorithm, it might be convenient in a theoretical analysis to ignore the finite precision issue, and then these distinctions may become important.

  3. Hm, following your logic it would turn out that calculus doesn’t make any sense, yet I’m pretty sure that’s useful in machine learning…

  4. Well, I don’t know any machine learning, but your question is quite general: given that the numbers computers work with are are always finite, why bother with real numbers, and with the machinery necessary to deal with them?

    One answer is that some things are nicer when you deal with real numbers. If you avoided real numbers, you could never say that your data was “corrupted by Gaussian noise” for example – you’d need to replace a Gaussian with some kind of discrete approximation to it every time it appeared, which I imagine would be cumbersome and inelegant.

  5. For the distinction between supremum and maximum, it might very well make a difference; if you chose to ignore it you cannot properly handle convergence of some algorithms. For example, AdaBoost never converges finitely to a maximum hence no usual optimality conditions can be guaranteed to hold. Making the distiction, one can resort to more appropriate stopping conditions. So its not merly a technicality.

  6. Sorry, I think my post was confusing. I am personally comfortable with real numbers, derivatives, and integration all being useful abstractions, as approximatable by a computer. There are several discussion of how to approximate these concepts on a finite precision machine, including some of Andrej’s thesis work.

  7. The standard analysis of adaboost is that a weak learner can be turned into a strong learner by repeated application on altered distributions. To the best of my knowledge, no sup or max occurs in this analysis, so I’m unclear on what you mean. Can you point out a proof somewhere?

  8. This is naturally defined on an open but not closed interval.

    I’m unclear on this. Energy barriers are not generally infinite, and hence not perfectly confining.

  9. If you are “comfortable with real numbers” as you claim, then surely you see the need to make sure arguments with real numbers are not contradictory? If you define x to be the smallest positive real number, for example, you might do some manipulations involving x and derive incorrect statements. Being careful about sups and max will make sure this doesn’t happen.

  10. I think that the barriers (at least in optimization) are usually infinite (wiki seems to agree). More to the point the fact that infinite barriers are frequently used shows that the distinction between closed an open interval can be important.
    In constrained optimization, a field of mathematics, a barrier function is a continuous function whose value on a point increases to infinity as the point approaches the boundary of the feasible region (Nocedal and Wright 1999). It is used as a penalizing term for violations of constraints. The two most common types of barrier functions are inverse barrier functions and logarithmic barrier functions.

    misha b

  11. A lot of “probability 1” events only have probability 1 if you have infinite precision and a source of truly random numbers. When working with finite precision and pseudorandom numbers, probability 1 events tend to have probability less than 1 and knowing how much less actually becomes important.

  12. Ok, I agree that barrier functions are something like an open set definition in optimization. The next question is: does the distinction between an open set and a closed set matter in machine learning? I know some people with an optimization viewpoint on machine learning like to have a very precise solution to a global optima which is found to near-infinite precision. In practice however, it seems that early termination is sufficient for test set performance.

    Do you know of an example where it matters in practice?

  13. John, you sound antimathematical in your post. Surely you are not advocating carelesness? Perhaps you meant to say that a lot of mathematics that is imposed on you by reviewers feels unnecessary in your context. This may be the case, but then it is your task to present your work in a correct context that will not invite complaints from referees. Apparently your teachers of mathematics and your community did not tell you what the appropriate mathematical setup might be.

    John says that distinctions such as “open vs. closed” are irrelevant because “everything is finite and decidable in practice”. I doubt everything you do is finite and decidable. Your readers have already pointed out real numbers. Just because in finite time your computer produces only a finite approximation to a real number this is no reason to think that the ideal, infinitely precise reals are unneccessary.

    One of the comments purpetrates the misbelief that “computers cannot represent reals exactly”. That is complete nonsense and I have to object. A computer can represent exactly at least all computable real numbers, and depending on your model of what a computer does, it might be possible to represent every real number exactly.

    And another thing. Even in finite structures it makes sense to talk about open and closed sets. For example, in a finite triangulation (of a polytope) there is a dinstinction between open and closed starts (they are kind of like balls in a metric space), and confusing them leads to complete nonsense.

    As far as “maximum” and “supremum” are concerned, just say “supremum” all the time and you should be fine…

    So what are you saying, John?

  14. When someone asks for the time, do you reply “six hours fourty four minutes and six seconds AM”? Or “A quarter to seven”? I prefer option (2), even though in some precise sense, it is false.

    Similarly, I believe that mathematics can at times obscure by precision, and I’m trying to point out a few ways in which it can happen in machine learning. (In contrast, I also strongly believe in the power of mathematics to illuminate a subject when used well.) In every instance where I-as-a-fellow-reviewer have seen other reviewers complain about some of the above issues, there was a sane default method (along the lines discussed in the post) for resolving the imprecision. I don’t believe that papers should be held up for publishing because of such issues, because they appear to be immaterial to the core content of papers in Machine Learning.

    I don’t believe this is a controversial statement in a CS theory context where many papers use probability but (for example) very few of them formally define the sigma algebra behind it. The place where this seems to become controversial is where CS-style theory and the mathematics of an older precomputer tradition meet, such as in machine learning theory.

    (A basic question is: “Am I right?”, which is why I’m interested in any counterexamples people have come across.)

    One of the things which is clear from this post, is that estimating sufficient precision is hard. Some of our anonymous friends seem to have lept into the fallacy that I’m always against the real number abstraction or calculus. I believe a fair reading of the post doesn’t imply that, but perhaps stating things differently would have been more clear.

  15. The only restriction to finite-precision arithmetic on a computer is the representation we choose. If we choose 32 or 64 bits to represent real numbers, we wind up with a discrete set. With unbounded variable-length strings of bits, we get a countably infinite, dense, but not continuous set. If we then look at sequences of these unbounded strings, we approach real numbers. Of course, we can only compute with the finite prefixes of the sequences, but it lets us get arbitrarily close. (Floating point representations are part of denotational semantics; there are very nice connections to topology, logic and set theory, especially constructive mathematics.)

    Is it worth being more precise than 64 (or even 32) bits for machine learning? Only if you need your coefficients estimated to more than a dozen decimal places, something we don’t typically have the data to motivate.

    Do real arithmetic precision problems creep in all over the place when trying to implement an algorithm? In my experience, yes. What we need to know is how the machine arithemtic interacts with our algorithms. And it’s not just basic arithmetic, it’s how we implement transcendental functions, not to mention sampling, cumulative densities, etc.

  16. This is an interesting issue and if anything illustrates the dichotomy between those who have a mathematics undergraduate education and those that have a computer science undergraduate education. I’ve found that those who have had much formal training in upper division or first year graduate analysis tend to use concepts and notation that is standard there, but those who have been trained mostly in CS-style machine learning courses do the opposite. I think this issue comes up in other areas as well, including information theory. Take the cover and Thomas book which is much less mathematically precise than many previous information theory books (e.g., Kullback’s book), but at the same time it is much more cited because it is more accessible. To help resolve this, I personally think that both are needed. I.e., when first introducing a subject, it is fine to be a little imprecise if it makes things a little simpler. Once the basic idea is presented, it is better to be more precise. Why should these truly continuous aspects of mathematics have a place in computer science? One possible reason is that as computers get better over time, ideally it would be the case that software written for today’s computers will still work 100 (or even 1000) years from now at which point being cognizant of an open vs. closed set might allow the algorithm to scale into the future.

  17. From the AdaBoost Wikipedia article,
    “Boosting can be seen as minimization of a convex loss function over a convex set of functions.”
    giving a reference to (T. Zhang, “Convex Risk Minimization”, Annals of Statistics, 2004).

    Although I do not know this reference, it is clear that if you normalize, you can write the problem solved by AdaBoost simply as,

    min_{\alpha} \sum_{i=1}^N exp(-y_i (\int_{\Omega} \omega f(x_i ; \omega) d\omega)),
    sb.t. \int_{\Omega} \omega d\omega = 1,
    \omega \geq 0, \forall \omega \in \Omega.

    (As done with TotalBoost. If you consider the Hinge-loss instead of the exponential one you exactly get LPBoost.)

    Then AdaBoost is just one way of ignoring/treating the normalization constraint and performing column generation for the above problem. However, as the loss function used in AdaBoost has a non-zero strictly monotone derivative and previous coefficients are never corrected (as in totally-corrective Boosting algorithms), you cannot have finite convergence. That was what I meant in the previous post.

  18. Also, if you start with a continuous model of a discrete problem, you often consider the continuous model as the “limit of a sequence of ever finer discrete approximations”, and your real problem as being one of those approximations.

    If in the continuous setting you then end up with a probability 1 property via a limit process, you often only know that the discrete correspondents to that property (in the sequence) would have “ever higher” probability.

  19. I can imagine this happening, but I haven’t run across it personally. Do you have an example?

  20. I haven’t seen ‘the smallest postive real number’ play a role in Machine Learning.

    It’s perhaps worth pointing out that this post is about Machine Learning, and not about all places that mathematics might be applied.

  21. I don’t quite understand the formula, because the \alpha are unconstrained. Perhaps there is an \alpha to \omega confusion?

    As I understand it, your point is that adaboost is nonconvergent in the sense that the limit point is never reached by iteration, which is something of an open vs. closed set distinction. Given this knowledge, it becomes clear that some early termination criteria is required.

    I don’t find this argument compelling because I believe the issue is convergence rate rather than an open vs. a closed set. Consider an algorithm adaboost’ which acts exactly like adaboost for the first 10^6 iterations, and then artificially imposes convergence for all future iterations. A basic claim is that adaboost’ is equivalent to adaboost for all practical purposes since nobody wants to waste the computation for 10^6 rounds of boosting. The same computational constraint implies that some form of early stopping is necessary. Since there is a sound alternative rational (computational considerations) which covers both adaboost and adaboost’, the argument above doesn’t seem compelling.

  22. Given that I talked to John last week about this post and still didn’t have a counter-example, I would like to revisit the question by giving a meta-algorithm for someone to find a counter-example which will surely satisfy John. I tend to side with the mathematicians in term of the needs for rigor; my intuition is that we often use powerful mathematical abstractions in which we carry an analysis to carry results back to something concrete. If the analysis is not coherent (i.e. not only rigorous, but also *wrong*), something bad could happen when you transfer it back to the concrete example, even though the concrete example seems to be simpler than the abstraction (e.g. finite vs. infinite). My intuition is thus that there exists such situations that John is asking for (just that I haven’t pinpointed one yet).

    So here is the meta-algorithm:
    1) Think of a concrete machine learning setup (e.g. binary classification).
    2) Suggest an algorithm and ‘prove’ some properties of the algorithm using ‘sloppy maths’ (where ‘sloppy’ here means to not bother with the distinctions 1-3 mentioned in John’s original post). The property of the algorithm should useful enough for John to care about it.
    3) Do your analysis in such a way that the result is actually wrong (and so depends on distinctions 1-3). Make sure that the result is still wrong when applied on computer with finite precision.

    For example, one could ‘prove’ that their algorithm will always terminate in finite time, but then show a finite precision example for which the algorithm would actually not terminate in finite time. Or one could ‘prove’ that the generalization error of their algorithm is smaller than some quantity w.h.p, and then give a distribution on a finite set for which the generalization error of their algorithm is actually bigger than this quantity w.h.p.

    The wrong ‘proofs’ should be simple enough so that John accepts the need to actually be careful with those distinctions to avoid the obtained contradictions in the future…

    Actually, I am starting to think of a few ideas… I will try to work them out during some free time (scarce resource these times!). But hopefully somebody else can pinch in…

  23. Let’s take the first example. By and large, the processes in machine learning that interest us need to be robust against perturbation. So whether or not points along the boundaries of our sets lie inside or outside the set should have no impact on our outcomes. This would appear to support (1).

    But in order to come to this conclusion I reasoned using topological ideas. The distinction between open and closed sets is crucial for this. So (1) seems important after all, paradoxically precisely so you can come to the conclusion that (1) isn’t important.

    1. If the point is that you need an inappropriate mathematics in order to reject the inappropriate mathematics, then I would claim “no”. An inappropriate mathematics can be rejected because it provides no useful advice about the design of learning algorithms.

Comments are closed.