Idempotent-capable Predictors

One way to distinguish learning algorithms is by whether they can easily use an input variable as the predicted output. Being able to do so is desirable for at least two reasons:

  1. Modularity: If we want to build complex learning systems by reusing subsystems, it’s important to have compatible I/O.
  2. “Prior” knowledge: Machine learning is often applied in situations where we already have some knowledge of what the right solution is, often in the form of an existing system. In such situations, it’s good to start with a learning algorithm that can be at least as good as any existing system.

When doing classification, most learning algorithms can do this. For example, a decision tree can split on a feature, and then classify. The real differences come up when we attempt regression. Many of the algorithms we know and commonly use are not idempotent predictors.

  1. Logistic regressors cannot be idempotent, because every input feature is mapped through a nonlinearity before it reaches the output.
  2. Linear regressors can be idempotent: they just set the weight on one input feature to 1 and the weights on all other features to 0 (see the sketch after this list).
  3. Regression trees are not idempotent, or at least not easily so. To predict the same value as a real-valued input feature, the tree must split on that feature many times, and even then it only yields a piecewise-constant approximation.
  4. Bayesian approaches may or may not be easily idempotent, depending on the structure of the Bayesian prior.
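To make the contrast concrete, here is a minimal sketch (assuming numpy and scikit-learn are available; the data and the choice of feature index are made up for illustration) in which the regression target is simply a copy of one input feature. The linear regressor recovers it exactly, while a depth-limited regression tree only produces a piecewise-constant approximation:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 5))   # five hypothetical input features
    y = X[:, 2]                             # the target is just a copy of feature 2

    linear = LinearRegression().fit(X, y)
    print(np.round(linear.coef_, 3))              # ~[0, 0, 1, 0, 0]: weight 1 on feature 2
    print(np.max(np.abs(linear.predict(X) - y)))  # ~0, exact up to floating point

    tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
    print(np.max(np.abs(tree.predict(X) - y)))    # clearly nonzero: only a piecewise-constant
                                                  # approximation, even on the training data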

It isn’t clear how important the idempotent-capable property is. Successive approximation approaches such as boosting can approximate it in a fairly automatic manner (sketched below). It may be of substantial importance for large modular systems where efficiency is important.
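As a rough illustration of the successive-approximation point (reusing the made-up setup above and assuming scikit-learn’s gradient boosting), each boosting stage fits a small tree to the remaining residual, so the error on the “copy feature 2” target shrinks as stages accumulate, even though no single weak tree can reproduce the feature:

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 5))
    y = X[:, 2]                             # again, target = input feature 2

    boost = GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                      learning_rate=0.1).fit(X, y)
    for i, pred in enumerate(boost.staged_predict(X)):
        if (i + 1) % 50 == 0:
            # training error shrinks toward zero as more stages are added
            print(i + 1, np.max(np.abs(pred - y)))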

6 Replies to “Idempotent-capable Predictors”

  1. I’m trying to decide whether stacking (taken generally) is an idempotent “operator” on predictors. Seems as if it is, but I’m not in a position to know state of the art for general stacking methods.

  2. I could be wrong, but polynomial cascades (Grudic & Lawrence, IJCAI ’97) seem to be idempotent regression learners. A polynomial cascade is an ensemble of simple learners in which each new learner is at least as good as the previous models in the ensemble.

  3. Who’s to say that input data is completely reliable and noise free? Shouldn’t we look at inputs as _measurements_ of the true data, in which case we’d like to be able to ignore some of the information contained in the input?

  4. All learning algorithms that I’m aware of are capable of ignoring some piece of input information.

  5. Several commenters seem to be missing the point. This isn’t a question of ignoring inputs, but of being able to accurately reproduce a perfect one. In practice, even being able to approximate an input could be vital, especially when high-quality features are constructed by the analyst.

  6. Rich Caruana and Virginia de Sa do interesting things with input-output juggling, akin to this. This CiteSeer link also points to two related later papers on a similar theme:

    http://citeseer.ist.psu.edu/60317.html

    A lot of things can be offered in the name of “robustness” that add computational cost. Any thoughts?

    As a practical matter, if you include some inputs as outputs in any heuristic approach, train on a training set, and test on a test set, you get a seat-of-the-pants answer as to whether your classifier or regressor has obvious problems beyond a simple correct/incorrect classification error.

    If you cannot get your system to give you back a good picture of the inputs from using the inputs, what do you have? And if you do too much of it, what do you have besides an autoassociator?

    What about the standard technique of taking inputs, corrupting them with noise, and testing an associative memory for noise sensitivity? It is not the same, but it is similar. Any thoughts?

    Am I missing something subtle here?
