This post is about a confusion of mine with respect to many commonly used machine learning algorithms.

A simple example where this comes up is Bayes net prediction. A Bayes net where a directed acyclic graph over a set of nodes where each node is associated with a variable and the edges indicate dependence. The joint probability distribution over the variables is given by a set of conditional probabilities. For example, a very simple Bayes net might express:

*P(A,B,C) = P(A | B,C)P(B)P(C)*

What I don’t understand is the mechanism commonly used to estimate *P(A | B, C)*. If we let *N(A,B,C)* be the number of instances of *A,B,C* then people sometimes form an estimate according to:

*P'(A | B,C) = N(A,B,C) / N /[N(B)/N * N(C)/N] = N(A,B,C) N /[N(B) N(C)]*

… in other words, people just estimate

*P'(A | B,C)*according to observed relative frequencies. This is a reasonable technique when you have a large number of samples compared to the size space

*A x B x C*, but it (naturally) falls apart when this is not the case as typically happens with “big” learning problems such as machine translation, vision, etc…

To compensate, people often try to pick some prior (such as Dirichlet prior with one “virtual count” per joint parameter setting) to provide a reasonable default value for the count. Naturally, in the “big learning” situations where this applies, the precise choice of prior can greatly effect the system performance leading to finicky tuning of various sorts. It’s also fairly common to fit some parametric model (such as a Gaussian) in an attempt to predict *A* given *B* and *C*.

Stepping back a bit, we can think of the estimation of *P(A | B, C)* as a simple self-contained prediction (sub)problem. Why don’t we use existing technology for doing this prediction? Viewed as a learning algorithm “counting with a Dirichlet prior” is exactly memorizing the training set and then predicting according to either (precisely) matching training set elements or using a default. It’s hard to imagine a more primitive learning algorithm.

There seems to be little downside to trying this approach. In low count situations, a general purpose prediction algorithm has a reasonable hope of performing well. In a high count situation, any reasonable general purpose algorithm converges to the same estimate as above. In either case something reasonable happens.

Using a general purpose probabilistic prediction algorithm isn’t a new idea, (for example, see page 57), but it appears greatly underutilized. This is a very small modification of existing systems with a real hope of dealing with low counts in {speech recognition, machine translation, vision}. It seems that using a general purpose probabilistic prediction algorithm should be the default rather than the exception.

I think the reason why such embedding of a learning algorithm in a Bayes net (or any other) is not

that popular is because rather than solving a problem it pushes it to another level. Indeed, why not use a Bayes net in a Bayes net to estimate

P(A|B,C) from the example in the post? Imho a good Bayes net is somewhat of divide-and-conquer:

it breaks the learning problem to a number of pieces each of which can be solved trivially. I agree, however, that often a clever combination of learning

algorithms can do better than any of them separately.

This “counting” is usually justified by wanting to maximize the likelihood of the data. IE, if we apply Empirical Risk Minimization with log loss as risk to Bayes Nets, this “counting” method would come out. It may seem wrong that people maximize log likelihood when they really care about classification, but some people have argued that optimizing this incorrect loss has a regularization effect

http://www.cs.iastate.edu/~honavar/Bouchard-compstat04.pdf

Also, in the last few years a number of papers came out about training Bayes Nets to optimize conditional log loss. For simple structures this becomes equivalent to logistic regression, which, as we know, overfits compared to Naive Bayes for small sample sizes. (Ng’s “On Discriminative vs. Generative classifiers” paper)

I don’t consider fourr’s solution viable by default. It is important to understand that conditional independence is rare. We should take advantage of it where we find it, but we should not expect it.

I certainly agree with Yaroslav that the process is justified as maximum likelihood. However, the model within which the likelihood is maximized seems very implausible and unreasonable. We can and should hope for stronger forms of generalization.

There has been a considerable amount of research into structure learning, either for joint models or for classification. There have been about 4 or 5 papers on this topic at ICML-05. The naive Bayes model is just one particular structure you may either assume or learn.

As for counting, yes, multinomial models are very popular because of their simplicity, but they are definitely not the only ones: you mention Gaussian, but there are others.

As for using conditional distributions: yes, you can use a discriminative model for them too. In fact, rules have been used in past to model CPT’s. Although there can be problems with inconsistent partial CPTs, one can surely squeeze some extra predictive performance out of this.

Friedman and Goldszmidt have a more fine-grained approach to learning Bayesian networks [UAI-96] where they use decision trees to learn conditional probabilities instead of just ‘counting’.

Perhaps a relevant paper: “Why is diagnosis using belief networks insensitive to imprecision in probabilities?”

by Henrion , Pradhan , Huang , Provan , O’Rorke. The suggestion is that even when individual parameter estimates are bad, the whole inference procedure could still be robust