June 22, 23, 24, 2012

Adamson Wing, Baker Hall 136A, Carnegie Mellon University

Scientific theory choice is guided by judgments of simplicity, a bias frequently referred to as "Ockham's Razor". But what is simplicity and how, if at all, does it help science find the truth? Should we view simple theories as means for obtaining accurate predictions, as classical statisticians recommend? Or should we believe the theories themselves, as Bayesian methods seem to justify? The aim of this workshop is to re-examine the foundations of Ockham's razor, with a firm focus on the connections, if any, between simplicity and truth.

The workshop generated some lively discussion, both during the event (particularly at the final round-table) and in subsequent blogs and email exchanges. It appears that the time is indeed ripe to re-open debate about Ockham's razor across philosophy, statistics, and machine learning and that the dialogue should continue.

**Slides**, **streaming videos** and** blog reviews** are all linked below.

The workshop generated a lively, interdisciplinary blog discussion. Thanks to the bloggers!!

Three-Toed Sloth play-by-play: Day 1, Day 2, Day 3.

**Vladimir Cherkassky**(University of Minnesota, Computer and Electrical Engineering),**Peter Gruenwald**(Leiden and Amsterdam, Machine Learning),**Malcolm Forster**(University of Wisconsin, Philosophy),**Kevin Kelly**(Carnegie Mellon, Philosophy),**Hannes Leeb**(University of Vienna, Statistics),**Deborah Mayo**(Virginia Tech, Philosophy),**Oliver Schulte**(Simon Fraser, Computer Science),**Cosma Shalizi**(Carnegie Mellon, Statistics),**Elliott Sober**(University of Wisconsin, Philosophy),**Vladimir Vapnik**(Columbia University, Center for Computational Learning Systems),-
**Larry Wasserman**(Carnegie Mellon, Statistics and Machine Learning)

**8:50** Introduction

**9:00** **Vladimir Vapnik** (Columbia University, Center for Computational Learning Systems)

Empirical Inference Problems in Machine Learning and Philosophy of Science.

**10:30** **Vladimir Cherkassky** (University of Minnesota, Computer and Electrical Engineering),

Data-Driven Knowledge Discovery and Philosophy of Science

**1:30**** Peter Gruenwald** (Leiden and CWI, Machine Learning),

Occam's Razor and Luckiness

**3:00 ****Larry Wasserman** (Carnegie Mellon, Statistics and Machine Learning).

Filaments and Manifolds: Simple Structure Hiding in Point Clouds

**4:30 Elliott Sober** (University of Wisconsin, Philosophy)

Ockham's razors.

**9:00** **Oliver Schulte** (Simon Fraser, Computer Science)

Simplicity, Induction, and Scientific DiscoveryStreaming video, Slides

**10:30**** Kevin Kelly**** and Hanti Lin** (Carnegie Mellon, Philosophy),

Ockham's Razor and Truth

**1:30** **Cosma Shalizi **(Carnegie Mellon, Statistics)

Complexity, Prediction, and Inference

**3:00 Hannes Leeb ** (University of Vienna, Statistics)

Inference Post-model-selection, Simplicity, and the role of Sample Size

**4:30** **Malcolm Forster** (University of Wisconsin, Philosophy)

So Many Theories of Simplicity! So Which One is Right?

**9:00 Deborah Mayo** (Virginia Tech, Philosophy)

Error Correction, Severity, and Truth: What's Simplicity Got to Do With it?

**10:30** Round table discussion. Streaming video

**12:00** Acknowledgments

**Vladimir Cherkassky** (University of Minnesota, Computer and Electrical Engineering),

**Data-Driven Knowledge Discovery and Philosophy of Science**

**Abstract:** Human knowledge reflects a combination of hypotheses/ideas and empirical or experimental data. In the modern digital age, the balance between ideas (mental constructs) and observed data has completely shifted. So the philosophical treatment of classical scientific knowledge may no longer be appropriate for data-driven knowledge discovery. Such data-driven knowledge can be properly described following the methodology of predictive learning developed in Vapnik-Chervonenkis(VC) theory and widely used in machine learning. This methodology leads to the Instrumentalist view of data-analytic knowledge, where the utility of a model is defined via its prediction capability. This view is quite different from the classical philosophical interpretation of scientific knowledge. In particular, instrumentalist view of data-driven knowledge: "rejects the notion of a single best (true) theory or model; "emphasizes the importance of a method of questioning, i.e., the idealistic component of knowledge discovery;" allows for multiple good scientific models describing the same empirical data. My talk will discuss and illustrate these points using several real-life application examples. Further, I will discuss challenging issues related to the interpretation of predictive data-analytic models. Classical science has both explanatory and predictive value. Likewise, classical parametric statistics is methodologically biased towards simple interpretable models. In contrast, modern machine learning methods are designed to achieve good prediction but are not interpretable. Interpretation of black-box data-analytic models is intrinsically difficult, especially for high-dimensional settings. For example, it may be possible to estimate several good models (using different parameterizations) from the same data set. These fundamental challenges will be discussed and illustrated via real-life application studies.

**Malcolm Forster** (University of Wisconsin, Philosophy),

**So Many Theories of Simplicity! So Which One is Right? **

Streaming video, Slides

**Abstract:** For a long time, most of us assumed that simplicity is a sign of truth because the world is simple. Popper complicated the story by saying that simplicity is falsifiability, so a simpler theory can be more severely tested (and if it passes those tests, it is well corroborated). Statisticians have recently pointed out that simpler models have fewer parameters, in which case each can be more accurately estimated, which leads to more accurate predictions. Now Kevin Kelly has taught us how we more quickly converge towards the truth if we examine simpler theories first. Which theory of simplicity is right? I want to argue that all of them are mutually compatible, and most of them are right.

**Peter Gruenwald** (Leiden and CWI, Machine Learning),

**Occam's Razor and Luckiness **

Streaming video, Slides

**Abstract:** I highlight the fundamental role of Occam's Razor in statistics and learning theory, even in settings like "prediction with expert advice" in which no stochastic assumptions are made at all, and in some forms of cross-validation, where Occam is deeply hidden but present nevertheless. I turn the familiar claim that ``Bayes implies a form of Occam" on its head, by showing that in nonparametric inference, Bayesian inference can only be expected to work well if the priors implement a form of Occam's Razor. The classic Bayesian inconsistency results of Diaconis and Freedman can be understood in these terms: they are based on non-Occam priors. Finally, I address the claim that inductive methods like MDL that embody Occam's Razor by means of data compression suffer from arbitrariness. This claim is both false and true. It is false in the sense that "not nearly everything goes": demanding that an inductive method can be understood in terms of data-compression imposes very strong constraints and rules out many a silly method. It is true in the sense that such methods do have a subjective component. This subjectivity is, however, different from the Bayesian idea of subjectivity and should rather be understood in terms of "luckiness", an idea going back to Kiefer's (1977) conditionalist frequentist inference. Although the term "luckiness" is hardly ever used explicitly, the idea is all over modern statistics. I argue that it should receive more attention by philosophers.

**Kevin Kelly**** and Hanti Lin** (Carnegie Mellon, Philosophy),

**Ockham's Razor and Truth**

**Abstract:** What does simplicity have to do with scientific truth? Bayesian conditioning explains Ockham's razor---if one starts out with prior probabilities biased toward possibilities compatible with simpler models. Frequentist appoaches to model selection do not invoke a prior bias toward simple possibilities, but neither do they connect simplicity with the truth of the models selected; rather, models are viewed as a means toward accurate prediction, understood in the sense of "probably getting close". But getting close in that sense does not imply that one is close in terms of the counterfactual predictions we invariably want to make from our models---relativistic and quantum corrections to classical physics are small in the ordinary course of experience, but have dramatic practical consequences when they are technologically amplified.

What is lacking is a *non-question-begging * justification of Ockham's razor as a reason for trusting the *counterfactual implications* of models, which requires that one get the "sharp edges" of the model exactly right. Getting the sharp edges right means that one must take the problem of induction seriously---there is no hope of bounding the chance of error by taking a large enough sample. Nonetheless, we will present a non-circular (frequentist) argument to the effect that Ockham's razor is the uniquely best method for finding true models.

Regarding the nature of simplicity, we assume that there is an empirical problem to be solved. An empirical problem consists of an empirical question (which model is correct?) and possible information states that the scientist may occupy in the future. The information states induce a topology. The problem is coarse-grained into "possibilities", and topological relations among the possibilities induce a natural, partial order that looks like different versions of empirical simplicity (number of parameters, number of causes, number of entities, etc) depending on the nature of the problem one begins with.

Then, in order to justify Ockham's razor, we consider losses for convergent success (e.g., convergence vs. non-converngence, time to avoidance of error, time to convergence to a true disjunction of possibilities, total number of retractions) and order these losses in various ways in terms of priority. For each such order, we solve for the collection of all methods that is optimal (or admissible) in that sense, in terms of worst-case bounds over the possibilities induced by the theory of simplicity just described. Then we axiomatize the various collections of methods that emerge in terms of methodological properties. The result is that all of the resulting axiomatizations look like plausible candidates for "Ockham's razor", but they are all also slightly different in interesting ways. This work is supported by a generous grant from the John Templeton Foundation.

**Hannes Leeb** (University of Vienna, Statistics),

**Model selection, simplicity, and the role of sample size**

In statistics, inference procedures post-model-selection can be viewed as a particular instance of Occam's razor. We show that such procedures can perform poorly when the models under consideration are very simple in relation to the size of the training sample. But when this relation is reversed, i.e., when the candidate models under consideration are not very simple in relation to the training sample size, then we find that such procedures can perform rather well in certain situations. We establish this last finding for regression with Gaussian data, i.e., under the assumption that both the response and the explanatory variables are jointly normal. When extending this result beyond Gaussian case, we need to deal with a property that, in a way, defines the Gaussian law: In regression with Gaussian data, all submodels are `correct' (in a certain sense). We show that a relaxed version of this property holds also in non-Gaussian settings, provided that the response variable is driven by a large number of independent components.

Parts of this work are joint with Benedikt Poetscher, Lukas Steinberger, and Ivana Milovic.

**Deborah Mayo** (Virginia Tech, Philosophy),

**Error Correction, Severity, and Truth: What's Simplicity Got to Do With it?**

Criteria of simplicity and truth, or well-testedness, may lead to opposing recommendations regarding issues of evidence and inference. I consider how such tension arises in evaluating philosophically: accounts of inductive statistical inference, principles of method or of relevant evidence, and the evidential warrant of a theory or model. Can/should the tension be resolved? By way of an answer, I propose to

- disentangle simplicity from efficiency and applicability, and
- elucidate the role of simplicity in a (deliberately) complex, piecemeal account of error correction.

I consider notions from Cox, Fisher, Peirce, Popper, and Neyman and Pearson.

**Oliver Schulte** (Simon Fraser, Computer Science)

**Simplicity, Induction, and Scientific Discovery**

Streaming video, Slides

**Abstract:** An inductive version of Ockham's Razor directs us to prefer simpler models for given observations. A basic question about this advice is what performance guarantees can be given for methods that follow it. Kevin Kelly has suggested that Ockham's Razor helps learning converge to a correct model efficiently. In this talk I develop theory and applications for the efficiency justification for Ockham's Razor, where the operative notion of efficient learning is minimizing the number of theory changes or mind changes before convergence.

Theory: I describe a simplicity concept from algorithmic learning theory: a given hypothesis space is viewed as a topological space, and the simplicity rank of a hypothesis is defined to be its Cantor-Bendixson rank in that space. Two basic theorems connect topological simplicity and mind change efficient learning. 1) A learning problem can be solved with a finite mind change bound n if and only if the Cantor-Bendixson rank of each hypothesis is at most n. 2) If a learning problem permits a mind change bound, there is an essentially unique mind-change optimal method: Adopt the topologically simplest hypothesis consistent with a given data set.

Applications: Most of the talk presents illustrative applications of mind-change optimal learning. These include

- A solution to a Goodman-style Riddle of Induction.

- Constraint-based learning of Bayes nets from independence tests.

- Learning conservation laws, particle families, and hidden particles in particle families. (No knowledge of particle physics is presupposed).

**Cosma Shalizi **(Carnegie Mellon, Statistics)

**Complexity, Prediction, and Inference**

**Abstract:** Scientists have proposed multiple ways of measuring the complexity of a system or process. In this talk, I will look at one such measure, explicitly grounded on razor-like considerations, the _statistical forecasting complexity_ of Grassberger, Crutchfield and Young, defined as the minimum amount of information about the system's past needed to predict its future. Working through this definition reveals that this is an objective, description-invariant quantity, related to minimal sufficient statistics, to Markovian representations of non-Markov processes, and to Salmon's "statistical relevance basis". There are several ways of recovering it from empirical data; one of these, itself, uses a razor-like strategy.

**Elliott Sober** (University of Wisconsin, Philosophy),

**Likelihood, Parsimony, and Common Causes**

**Abstract:** In model selection criteria like AIC and BIC, likelihood (as measured by fit to data) conflicts with parsimony (as measured by number of adjustable parameters). But there is a class of cases in which parsimony contributes to likelihood and therefore parsimony can be said to have a likelihood justification. I'll discuss this idea in the context of everyday Reichenbachian examples and then with reference to work in evolutionary biology on phylogenetic inference.

**Vladimir Vapnik** (Columbia University, Center for Computational Learning Systems)

**Empirical Inference Problems in Machine Learning and Philosophy of Science.**

**Abstract:** The classical statistical approach to empirical inference is based on the idea of estimating the true, underlying density function that generates observations. Therefore it reflects the point of view of philosophical realism.

The Machine Learning paradigm (introduced by Rosenblatt's Perceptron) does not require estimation of the true underlying density. Instead, it searches for the best function in a given collection of functions. Therefore, it reflects the point of view of philosophical instrumentalism.

The Theory of the Machine Learning paradigm, called Statistical Learning Theory or VC theory (Vapnik-Chervonenkis theory), was developed in 1960 -- 1970. It is based on the idea of "shattering" the collection of data by a set of functions. A set of functions "shatters" a collection of data if, using functions from this set, one can separate the data in all possible ways. The VC dimension of a set of functions is the size of the largest sample it shatters. Using the VC dimension of the set of functions, the following results can be proved: a) the necessary and sufficient conditions of learnability and b) the upper and lower bounds of the error for the estimated function.

The idea of shattering reflects something like what Karl Popper had in mind by falsifiability. Popper used the idea of falsifiability to construct a concept of degree of falsifiability (or Popper dimension) which, in contrast to VC dimension, does not lead to a characterization of the learning processes.

In the 1970s, using the VC upper bound on error rate of the estimated function, the general inductive principle called Structural Risk Minimization (SRM) principle was introduced. According to this inductive principle: a) one creates a structure on the set of functions, by introducing nested subsets of functions with monotonically increasing VC dimensions and then b) one chooses the function that minimizes the upper bounds of error along the elements of the structure.

In the 1980s, it was shown that the SRM principle is universally consistent. In other words, the consistency of the learning processes using SRM principle does not require having a special structure on the set of functions which, for example, takes into account simplicity of the functions in the elements of the structure (whatever that simplicity means). It just requires any structure with monotonically increasing VC dimension of its elements.

In the 1990s, constructive algorithms that implement SRM principle were developed (SVM, Boosting). These state of the art universally consistent algorithms were constructed in direct contradiction to Occam razor principle ("entities should not be multiplied beyond necessity").

The main part of my talk will be devoted to a theory of generalization for the Complex World. According to Einstein, the existing scientific methodology can be applied only to the Simple World ("When solution is simple, God is answering" and "When the number of factors coming into play in phenomenological concepts is too large the scientific methods in most cases fail"). Despite of the fact that the SRM principle (and the corresponding algorithms) is universally consistent, its rate of convergence can be slow (the number of examples required for function estimation in the Complex World can be huge).

Therefore, the main problem of empirical inference in the Complex World is to speed up the rate of convergence of learning processes. To do this, one has to change the learning model. The existing machine learning paradigm does not involve a teacher in the learning process. To speed up the rate of convergence of the learning process I will include an abstract teacher in the learning model.

In the talk, I will introduce a new learning paradigm called Learning Using Privileged Information (LUPI), in which a teacher, along with examples (defined in the space X), will supply students

with additional (privileged) information about examples such as comments, metaphoric images, comparison and so on (defined in the space X*). Privileged information is available only for training example. It will not be available for test examples. Privileged information (in the space X*) can significantly speed up the rate of convergence to the desired function (in the space X). The LUPI paradigm tries to employ existing general mechanisms that use teachers to increase effectiveness of learning.

I will give examples of learning in the framework of the LUPI paradigm where different type of privileged information are used (advanced technical models, future events, holistic descriptions) and will show advantages of the new paradigm with respect to the classical machine learning paradigm.

**Larry Wasserman** (Carnegie Mellon, Statistics and Machine Learning.

**Simplicity and Inference**

**Abstract : **I will discuss four statistical problems: high dimensional regression, graphical models, density estimation and topological data analysis. In each case I will show that choosing the appropriate complexity of the model is very difficult.

This workshop is supported by a generous grant from the John Templeton Foundation.