Ockham’s Razor is Dull

Ockham’s razor is the principle that entities must not be multiplied beyond necessity. There are many different interpretations of Ockham’s razor. For me, the idea that simplicity is a guide to truth is the core of Ockham’s razor.

For any set of observations, there are an infinite number of theories that can fit the observations, with varying degrees of accuracy. For any set of data points, there are an infinite number of curves that fit the points, exactly or approximately. We don’t usually pick the theory with the highest accuracy (in general, there are infinitely many theories that are perfectly accurate); instead, we pick the theory that best balances accuracy and simplicity. Why? Because simplicity is a guide to truth. That is, the theory that best balances accuracy and simplicity is the theory that will make the most reliable predictions about future observations.

My PhD thesis was, in essence, about Ockham’s razor. For a period of about a decade, extending from my late undergraduate years to my early postdoctoral years, it would be fair to say that I was obsessed with Ockham’s razor. I was convinced that it was the key to understanding how we acquire knowledge about the world. I no longer believe in Ockham’s razor.

There are various ways to make Ockham’s razor more precise, such as Bayesian inference, minimum description length, minimum message length, Kolmogorov complexity, and data compression. My own approach was to reduce simplicity to stability. That is, we should prefer simpler theories because they are more stable, more resistant to noise in the data. This approach is closely related to the Akaike information criterion and the bias/variance dilemma.

For some time, I was satisfied with bias/variance as a justification for Ockham’s razor. The papers that awoke me from my dogmatic slumber were Cullen Schaffer’s “A conservation law for generalization performance” and “Overfitting avoidance as bias“. The subsequent No Free Lunch theorems and Webb’s “Further experimental evidence against the utility of Occam’s razor” further persuaded me that Ockham’s razor is dull.

It is notoriously difficult to explain the No Free Lunch theorems informally. I think of the theorems in terms of Hume and Kant. The NFL theorems show that there is no a priori reason to prefer any one theory to another. If you don’t make any assumptions about the world (i.e., the universe, the environment, the particular dataset that you are analyzing), then you can’t make any claims about the reliability of your predictions.

In machine learning, the NFL theorems are explained in terms of inductive bias. Note that inductive bias is distinct from statistical bias, which is the kind of bias involved in the bias/variance dilemma. Inductive bias is anything that is used to select a unique theory from the set of all theories, other than fit of the theory to the training data. In the words of Tom Mitchell, inductive bias is “a means of selecting between competing hypotheses that utilizes criteria beyond those strictly encapsulated in the training data”. The NFL theorems show that there is no way to avoid inductive bias, and there is no way to justify one inductive bias over another, without making assumptions about the world.

The lesson I draw from this is that, instead of asking how to measure the complexity/simplicity of a theory, we should ask which inductive biases tend to work well in this particular universe in which we find ourselves. Simplicity is just one more example of an inductive bias, and there is no reason to prefer it to any other inductive bias, unless empirical evidence supports its value. Webb’s experiments suggest that simplicity is not a particularly good bias for our world. The success of support vector machines, on the other hand, suggests that maximum margin is a good bias for our world.

In my experience with machine learning algorithms, on a wide variety of problems, the choice of features in the feature vectors is far more important than the choice of learning algorithm. That is, the most important issue is representation, and the inductive bias due to a given representation. When applying decision tree induction to chess endgame classification, Quinlan reported that the most difficult part of the problem was finding the right features to represent the data. The initial attempt was a feature vector with 64 elements, one for each square of the board. This low-level representation was useless. The final features were much higher-level. This kind of experience is rarely written down, but it is often discussed informally, in hallways at machine learning conferences, for example. I believe that it is not fruitful to think of the representation problem in terms of Ockham’s razor.

I have nothing against Bayesian inference, for example. It is often a useful tool. But there are those who seem to believe that all problems can be solved using a Bayesian approach. I disagree. You can’t apply Bayesian inference until you figure out how to represent the problem. But figuring out how to represent the problem is 95% of the work. By the time you have the representation right, the tool that you use to finish the remaining 5% is not terribly important.

17 Responses to “Ockham’s Razor is Dull”

  1. Support Vector Machines are really explained by simplicity, not maximum margin. Think about the general framework of Tikhonov regularization:

    min_{f \in H} \sum_{x,y} V(f(x_i), y_i) + \lambda ||f||^2

    This is merely saying that a good way to do supervised learning is to minimize a tradeoff between your training error (V measures how much you pay when you see x_i, predict f(x_i), but the true value is y_i), and the simplicity of your function, ||f||^2, measured in whatever norm you want. In a Reproducing Kernel Hilbert Space (the usual choice), functions with smaller norms are smoother, which to me at least corresponds well to an intuitive notion of simplicity. Lambda controls the tradeoff between smoothness and fit.

    SVM’s are often derived in a linearly separable setting, and then you see how the optimal solution has the “maximum margin.” The problem is that SVMs are nearly always used in the non-separable case, and then the geometric concept of margin is pretty broken. I really feel the “correct” way to view SVMs is as an instance of Tikhonov regularization, with the loss function V taken to be the hinge loss.

    Overall, you are absolutely correct that you cannot get anywhere without assuming a bias in the world (and being right about that assumption). Pretty much all the biases I’ve seen work in the real world are towards simplicity and smoothness, so I’m still a believer in the razor. But I haven’t (yet) read the references you’ve cited…

  2. functions with smaller norms are smoother, which to me at least corresponds well to an intuitive notion of simplicity

    The way I read this, you’re saying that smoothness is a good inductive bias for typical real-world learning problems, and then you’re justifying simplicity by equating it with smoothness. I’m willing to agree that smoothness is a reasonable bias, although I would like to see a rigorous test of this hypothesis. I’m not so willing to agree that the intuitive concept of simplicity is captured by smoothness. It might be argued that minimum description length is a more intuitive approach to simplicity, but a smooth curve could encode many bits of information in gentle bumps, and a jagged curve might be encoded with very few bits. Why do you feel it necessary to equate simplicity and smoothness? What does this equation really give you?

    There’s still the point that worrying about Ockham’s razor (or, more generally, inductive bias) is a distraction from the more important issue of representation.

  3. Dear Peter,

    Thank you for this interesting post and the many references, I will follow them up in due time.

    My question is this: you say the problem is representation, and I agree strongly; but it is humans who choose the representations, and then we are immediately left with the question how humans do this.

    In the brain, there is some algorithm at work which tries to find out which representation works best: an inductive bias.

    Cognitive Science shows that humans are pretty bad at reasoning - we have to make an effort to do well (that is, our inductive biases are evolved for surviving in the wild, and do not scale well to the modern world which requires more abstract reasoning).

    What the people at overcomingbias.com are saying IMHO is that you will do better if you reflect on your own thought processes when thinking of hypotheses and using Bayes Theorem as a guide; Bayes should override your built-in (by evolution) biases.

    The essence of my argument is this: if I think of a human (say, myself) as a learning machine, what biases should I employ to choose the best representations? Don’t we get back to Ockham’s razor (or some formalization) - now working in my brain?

  4. The essence of my argument is this: if I think of a human (say, myself) as a learning machine, what biases should I employ to choose the best representations? Don’t we get back to Ockham’s razor (or some formalization) - now working in my brain?

    We know so little about how humans form representations that any answer to this question must be very speculative. However, I’ll give you my opinion. It seems to me that the process of creating hypotheses (including the process of constructing a good representation) is distinct from the process of evaluating hypotheses. Bayesian inference is useful for evaluation, but I don’t believe it is useful for creation (i.e., discovery, invention). I agree that evaluation is important, and we should use the best tools available to us (e.g., Bayesian inference) for evaluation, but I believe that the creation process is much more important to cognition.

  5. Bayesian inference is useful for evaluation, but I don’t believe it is useful for creation (i.e., discovery, invention).

    I gave this some more thought. The most creative algorithms I know about are John Koza’s genetic programming algorithms, which have created patentable inventions. I admit I have no idea how to apply genetic programming to the problems that interest me, in computational linguistics. However, for what it’s worth, Koza appears to believe that parsimony (simplicity) is not important for genetic programming. Koza writes, in Genetic Programming:

    These seven principles of correctness, consistency, justifiability, certainty, orderliness, parsimony, and decisiveness have played such valuable roles in the successful solution of so many problems in science, mathematics, and engineering that they are virtually integral to our training and thinking. It is hard to imagine that these seven guiding principles should not be used in solving every problem. Since computer science is founded on logic, it is especially difficult for practitioners of computer science to imagine that these seven guiding principles should not be used in solving every problem. As a result, it is easy to overlook the possibility that there may be an entirely different set of guiding principles that are appropriate for a problem such as getting computers to solve problems without being explicitly programmed. Since genetic programming runs afoul of all seven of these guiding principles, I will take a moment to examine them.

    Parsimony: Copernicus argued in favor of his simpler (although not otherwise better) explanation for the motion of the planets (as opposed to the then established complicated Aristotelian explanation of planetary motion in terms of epicycles). Since then, there has been a strong preference in the sciences for parsimonious explanations. Occam’s Razor (which is, of course, merely a preference of humans) is a guiding principle of science.

    Fitness, not parsimony, is the dominant factor in natural evolution. Once nature finds a solution to a problem, it commonly enshrines that solution. Thus, we often observe seemingly indirect and complex (but successful) ways of solving problems in nature. When closely examined, these non-parsimonious approaches are often due to both evolutionary history and a fitness advantage. Parsimony seems to play a role only when it interferes with fitness (e.g., when the price paid for an excessively indirect and complex solution interferes with performance). Genetic programming does not generally produce parsimonious results (unless parsimony is explicitly incorporated into the fitness measure). Like the genome of living things, the results of genetic programming are rarely the minimal structure for performing the task at hand. Instead, the results of genetic programming are replete with totally unused substructures (not unlike the introns of deoxyribonucleic acid) and inefficient substructures that reflect evolutionary history rather than current functionality. Humans shape their conscious thoughts using Occam’s Razor so as to maximize parsimony; however, there is no evidence that nature favors parsimony in the mechanisms that it uses to implement conscious human behavior and thought (e.g., neural connections in the brain, the human genome, the structure of organic molecules in living cells).

  6. Dear Peter,

    I do not think that “It is notoriously difficult to explain the No Free Lunch theorems informally”. On the contrary, the NFL theorems are cute observations that have a very simple interpretation, and once you realize this, NFL becomes obviously true but irrelevant. Wolpert’s classical NFL theorem essentially just says that all algorithms that try to find the maximum of a white noise function are equally bad. (With probability 1 a uniformly randomly sampled function is white noise).

    So despite NFL being the holy grail in some research communities, this theorem has little to no practical implication, since nobody cares about the maximum of white noise functions.

    If you just assume that the world has ANY effective structure, or in your words, assume an inductive bias towards environments that have SOME structure, then NFL breaks down.

    Not only does NFL break down. Solomonoff’s theory then shows you that inductive inference works, at least for prediction. Actually there is a gap to NFL here. Optimization is (much) more tricky than prediction as explained in Section 6.4 of my book.

    The reason is that optimization is an active learning problem. The core of the book is actually devoted to generalize Solomonoff beyond passive prediction to the general RL setup.

    The assumption that the world has SOME structure is as safe as (or IMO even weaker than) the assumption that e.g. classical logic is good for reasoning about the world (and the latter you have to assume to make this conversation meaningful ;-).

    Regarding your comments about the importance of the representation of the problem, in particular the choice of the model class and Bayesian prior, etc, the paper “On Universal Prediction and Bayesian Confirmation”, TCS 384:1 (2007) 33-48, shows how all this is solved, at least conceptually. It discusses a wide range of philosophical and statistical problems around induction and shows how they are all solved by Solomonoff’s “Universal Bayesian” approach.

    In practice, you have to approximate Kolmogorov complexity (see Schmidhuber’s OOPS and related work), which I agree is usually a highly difficult and time-consuming task (but see Cilibrasi and Vitanyi (2005) for amazing successes with “general purpose” approximations). Let me give an analogy: Exact minimax is infeasible for nearly all interesting zero-sum games, 99% of the work is in approximating it. That’s true, but that doesn’t weaken the fundamental status of minimax.

    So experimental evidence against Ockham’s razor usually indicates that the used approximation was too crude. Anyway, never trust an experiment if it is not supported by a theory ;-) and (un)fortunately NFL is not a useful theory.

    On the contrary, Ockham’s razor is well supported theoretically and experimentally, and I am not aware of any other similarly general and powerful principle which could replace or augment Ockham. In a certain sense I discuss in my book, Ockham’s razor is actually the core of (all) science.

  7. Dear Peter,

    I beg to differ.

    The idea that simplicity (however formalized) is an inductive bias like any other, is, in my view, simply not true. I expand on this in the first and final chapters of my book, The Minimum Description Length Principle (MIT Press, 2007). These chapters are freely available on my webpage; see in particular Section 1.8.

    Of course, inductive bias is of crucial importance when designing learning algorithms; I’m certainly not claiming the opposite. I’m also not claiming that ‘our world is simple’ - as I argue in the book, adopting forms of Occam’s razor makes sense even if the world is complex.

    I won’t repeat the arguments made in the book here, but let me say something about the relationship to Shafer’s papers and the No Free Lunch Theorems. These theorems are of course true, but I don’t think they have all that much to say about the issue.

    The theorems rely on two strange assumptions: first, ‘no assumptions about the world’ is translated into a uniform prior over all possible probability distributions on data. It is totally unclear whether this is a good translation. Instead, in the COLT (computational learning theory community), one usually works in the agnostic framework where one makes *no assumptions at all* about the distribution from which data are sampled, except that it is i.i.d. So there is not even a uniform prior. And, lo and behold, within *this* framework, which makes even less assumptions, adopting some forms of Occam’s razor (such as in the so-called PAC-Bayesian and PAC-MDL approaches) does lead to good predictions, with high probability, for small sample sizes. (for those interested, I elaborate on how this is possible at the end of the mail).

    This agnostic framework is sometimes criticized because it measures generalization error in a different way from the way it is done in the NFL theorems, which focus on “off-training set generalization error”. But in fact, one can adjust the COLT/agnostic framework to this setting as well, as we (= Teemu Roos, Petri Myllymaki and me) showed in our 2007 NIPS paper “generalization to unseen cases”.

    Best wishes,

    Peter

    PS How can it be that we can ‘learn’ while making no assumptions about the truth at all, except that data are iid according to some arbitrary distribution? Well, given a ‘model’, say, a set of classifiers, let h* be the best classifier within your model, i.e. the one which is optimal for prediction relative to the unknown distribution from which data are sampled. Then the theorems says that some learning algorithms (which take as input a model and data of size n), wll start outputting hypotheses h which predict nearly as well as h* even for relatively small n, no matter what the underlying distribution is (thus, if the inductive bias goes the wrong way,even the best h* does not predict much better than random guessing, so it is not very helpful that we can predict nearly as well as h*. But if we are lucky, and the inductive bias goes the right way, then h* performs well, and with our learning algorithm, we will perform well as well). Now it turns out that if choose the set of hypotheses h too complex (measured in terms of VC dimension or e.g. description length), then even if we have our inductive bias right, i.e. the best h* within the model is very good, we will not learn a good approximation of h* unless we have an enormous sample. This is a first, crude sense in which adopting Occam’s razor makes sense.

  8. In practice, you have to approximate Kolmogorov complexity (see Schmidhuber’s OOPS and related work), which I agree is usually a highly difficult and time-consuming task (but see Cilibrasi and Vitanyi (2005) for amazing successes with “general purpose” approximations). Let me give an analogy: Exact minimax is infeasible for nearly all interesting zero-sum games, 99% of the work is in approximating it. That’s true, but that doesn’t weaken the fundamental status of minimax.

    This is the weak spot in your argument, as you know. Kolmogorov complexity is not merely difficult to compute; it is impossible, because it would require solving the halting problem. Thus your analogy with zero-sum games is inappropriate, because the minimax theorem does not require assuming that you can solve an unsolvable problem.

    In the real world, even polynomial-time algorithms are inadequate (for larger-order polynomials) for large quantities of data:

    More data usually beats better algorithms
    The Power of Efficient Algorithms on Big Data
    Change the algorithm, not the dataset
    The Seductive Power of Mathematics

    Regarding Cilibrasi and Vitanyi, I wrote to Cilibrasi in 2005 to point out that their NGD (Normalized Google Distance) is similar to my PMI-IR (Pointwise Mutual Information - Information Retrieval), although PMI-IR is much simpler than NGD. I challenged Cilibrasi to apply NGD to the TOEFL synonyms, so that we could compare the performance of NGD and PMI-IR. He wrote back that his results with the TOEFL synonyms were very disappointing. In their published paper, they neglect to mention this experiment with the TOEFL synonyms, and they neglect to cite PMI-IR as related work.

    Joshua Goodman has an interesting critique of Language Trees and Zipping, Extended Comment on Language Trees and Zipping.

  9. The idea that simplicity (however formalized) is an inductive bias like any other, is, in my view, simply not true. I expand on this in the first and final chapters of my book, The Minimum Description Length Principle (MIT Press, 2007). These chapters are freely available on my webpage; see in particular Section 1.8.

    I’ve now read the first chapter of your book. Your Example 1.7 assumes that h* is polynomial. Your Example 1.8 assumes that g is continuous. Given these assumptions about the world, your MDL inductive bias is reasonable. If you make no assumptions about the world, you cannot show MDL is a reasonable bias. Furthermore, it seems to me that Examples 1.7 and 1.8 would work equally well for any arbitrary ordering on the set of hypotheses; that is, they would work for any inductive bias; they are not specific to MDL.

    I did not see anything in the first chapter of your book that shows simplicity is not merely an inductive bias, like any other.

  10. I will apply the razor to my own post: I agree.

    Cheers!

  11. Kidding aside, I believe that in mathematical analysis and in machine learning, there is a hidden (or not-so-hidden) assumption that the data lies in some sort of smooth manifold. My own feeling is that this makes theoretical study much more interesting and easier, especially if you do not have a computer around.

    That is, I believe, where the bias comes from. Simplicity has pedagogical value.

  12. assumption that the data lies in some sort of smooth manifold

    I agree that smoothness is often a good inductive bias. But why do you equate smoothness with simplicity? A smooth curve (e.g., a cubic spline) can be highly complex, in terms of MDL.

  13. The way universal Turing machines escape the critique of Ockham’s Razor is precisely their universality of representation, including their representations of other Turing machines! The minimum size of a UTM’s description is in the hundreds of bits (see work on UTMs in terms of SK combinators) — vastly smaller than the practical models people work with all the time. In other words its virtually _all_ about “representation”. After all the Kolmogorov program for a string of bits is simply its shortest representation. The theorems showing you can’t beat the shortest representation for future accuracy are, in that sense, independent of the representation you choose. While it may be true that virtually all of the work in AI may be in choosing the right representation, it is also true that virtually all of the encoding in a Kolmogorov program is representation.

    But since the shortest encoding isn’t computable, we are left with mere relative comparisons. This is where approximations to Kolmogorov compexity _are_ to the point.

    Insofar as evolution is concerned, it is notoriously difficult to evaluate Ockham’s Razor in terms of natural selection because living systems aren’t mathematical abstractions that are attempting to model the world. They are replicators attempting to propagate themselves. Viruses, for example, aren’t going to pass a Turing Test any time soon but they are highly effective replicators.

    As far as maximum margin goes, it seems obvious it is inferior to Ockham’s razor for classification. Let’s say you have a bunch of distributions which you have modeled by taking , say, the first few major moments — mean, variance, skewness, kurtosis. If the only information you have about a point is its location, you can choose which distribution it is most likely to be a part of from that, and the emergent critical boundaries for those choices end up being better than those chosen by naive maximum margin.

  14. The way universal Turing machines escape the critique of Ockham’s Razor is precisely their universality of representation, including their representations of other Turing machines!

    Two separate things are being mixed up here: (1) the space of hypotheses that is being explored (e.g., programs for a universal Turing machine) and (2) a preference ordering on the space of hypotheses (i.e., an inductive bias, such as minimum description length).

    While it may be true that virtually all of the work in AI may be in choosing the right representation, it is also true that virtually all of the encoding in a Kolmogorov program is representation.

    Let’s consider Quinlan’s chess endgame dataset as an example. My claim is that the core problem is how to go from a low-level representation of the chess board to a high-level representation that facilitates classification. Your reply is, in effect, that the mapping from low-level to high-level can be done by a program for a universal Turing machine, so all we need to do is search through the space of programs for universal Turing machines. This answer may be true, in some abstract way, but it is exceedingly unhelpful for practical applications.

    Insofar as evolution is concerned, it is notoriously difficult to evaluate Ockham’s Razor in terms of natural selection because living systems aren’t mathematical abstractions that are attempting to model the world.

    Koza’s genetic programming is an evolutionary algorithm for exploring the space of Lisp programs. It is easy to impose a bias for simplicity in genetic programming; there are many papers on this topic. However, the interesting question is whether such a bias is helpful. The answer seems to be that sometimes it is helpful, other times it is not.

    As far as maximum margin goes, it seems obvious it is inferior to Ockham’s razor for classification.

    The only way to decide between two inductive biases is to test them both on a wide variety of problems. I don’t buy “it seems obvious”. Many falsities seem obvious.

  15. The only way to decide between two inductive biases is to test them both on a wide variety of problems. I don’t buy “it seems obvious”. Many falsities seem obvious.

    Quite!

    The main value of your blog entry is to shed light on a fundamental rift in schools of thought about science itself… and since schools of thought within science agree that experiment trumps argument, this blog entry may point the way to some important experiments.

    Of course, as with work on representation, experimental design is the most difficult part of science since it operationalizes our arguments in a most ruthlessly rigorous way.

  16. There is a paper by Pedro Domingos which makes some of the same criticisms against Ockham’s razor (he also quotes some of the same sources): The Role of Occam’s Razor in Knowledge Discovery.

    I have to agree with the first commenter that maximum margin arguments are a poor explanation for SVM performance on non-separable cases. I am still undecided on the relation between smoothness and simplicity, though.

  17. Related:
    (1) Simplicity (Stanford Encyclopedia of Philosophy)
    (2) David Dowe’s Occam’s razor links

Leave a Reply