Statistical Retraction Animations

Konstantin Genin and Kevin T. Kelly

Department of Philosophy

Carnegie Mellon University

One hears that Bayesian conditioning and other methods like the Bayesian information criterion (BIC) or the PC algorithm for causal network search converge in probability to true models (rather than to accurate estimated models). But one should not imagine that they converge monotonically to the truth like a compass---nature can force every convergent method to perform dramatic retractions of opinion. Theory choice methods can be forced to choose one theory with arbitrarily high chance and then another theory with arbitrarily high chance, etc. Bayesians can be forced to put a high expected posterior on one theory, followed by a high expected posterior on another theory, etc., where the expectations are taken with respect to chance. But how do these retractions in chance happen and, more importantly, how can they be minimized so that we are at least not subjected to needless ones?

We examine a toy problem well adapted to graphical representation, with the idea that it can serve as a proxy for more interesting problems that are difficult to visualize, like causal network search. Suppose that there are two variables X and Y with known covariance, and the question is exactly which components of the mean vector (muX, muY) are nonzero. The possible theories are none, both, just muX, or just muY. As propositions they are all mutually exclusive, but since nonzero means are free parameters, a theory with fewer nonzero means is simpler than one with more. Geometrically speaking, "none" is true precisely at the origin of the Cartesian coordinates, "just muX" is the X axis minus the origin, "just muY" is the Y axis minus the origin, and "both" is everything but the coordinate axes.
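To keep the geometry straight, here is a tiny Python sketch of which theory is true at a given mean vector; the function name and string labels are our own illustrative choices, and nothing here is part of any inference method:

    def true_theory(muX, muY):
        # Which theory is true at the mean vector (muX, muY)?
        if muX == 0 and muY == 0:
            return "none"          # the origin
        if muY == 0:
            return "just muX"      # on the X axis, off the origin
        if muX == 0:
            return "just muY"      # on the Y axis, off the origin
        return "both"              # off both coordinate axes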

The BIC score of a model H is -2 ln(L(H)) + k ln(n), where L(H) is the maximized likelihood of H, n is the sample size, and k is the number of free parameters in H. A proposed rule is to draw a sample of size n and to choose the model H whose BIC score is minimum. That method chooses a model for each point in (X-bar, Y-bar) space, so one can plot the regions of the plane in which each model is chosen. In our simulations, each such region is a different color. The zone for the origin model is blue, for the X axis model is yellow, for the Y axis model is red, and for the background model is green. The unfilled blue ellipses are the 95% and 99% contours of the sampling Gaussian, respectively. As the covariance increases, so does the eccentricity of the ellipses.
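For concreteness, here is a minimal Python (numpy) sketch of the minimum-BIC rule for this toy problem. The function names and model labels are ours, and the scores are computed only up to an additive constant shared by all four models, which is all that the comparison needs (possible here because the covariance is known); whether the posted simulations are implemented exactly this way is not claimed.

    import numpy as np

    def bic_scores(xbar, Sigma, n):
        """BIC scores of the four models, up to an additive constant shared by all of
        them. Because the covariance Sigma is known, -2 ln(maximized likelihood)
        differs across models only through n * (xbar - m)' Sigma^(-1) (xbar - m),
        where m is the constrained maximum-likelihood mean under the model."""
        xbar = np.asarray(xbar, dtype=float)
        P = np.linalg.inv(Sigma)                  # precision matrix
        a, b, c = P[0, 0], P[0, 1], P[1, 1]

        def maha(m):                              # squared Mahalanobis distance of xbar from m
            d = xbar - m
            return float(d @ P @ d)

        # constrained maximum-likelihood means and free-parameter counts k
        models = {
            "none":     (np.zeros(2), 0),
            "just muX": (np.array([xbar[0] + (b / a) * xbar[1], 0.0]), 1),
            "just muY": (np.array([0.0, xbar[1] + (b / c) * xbar[0]]), 1),
            "both":     (xbar, 2),
        }
        return {H: n * maha(m) + k * np.log(n) for H, (m, k) in models.items()}

    def bic_choice(xbar, Sigma, n):
        # the proposed rule: choose the model whose BIC score is minimum
        scores = bic_scores(xbar, Sigma, n)
        return min(scores, key=scores.get)

Coloring a grid of (X-bar, Y-bar) values by the output of bic_choice reproduces acceptance zones of the kind shown in the figures below.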

BIC when X and Y are independent.

BIC when X and Y are strongly correlated.

We choose a true mean vector in which both components are non-zero but very small, with muY much smaller than muX. The sampling density for (X-bar, Y-bar) shrinks as sample size increases but we zoom in on it to keep it centered in the picture at a fixed size, as though we are watching an airplane through powerful binoculars. The result is that the acceptance zones in the background are magnified and shift as we maintain our focus on the sampling density. The effect is reminiscent of a heavy 747 taking off. The acceleration effect is artificial---it reflects our use of a log time scale to speed things up. Actually, the convergence is extremely slow. The slowness is easy to understand from the picture---the sampling distribution follows a straight-line path away from the origin, so the shallower the angle of departure, the longer it takes to escape the yellow band.

During the simulation one sees the sampling density filled first with blue, then with yellow, and finally with green. Those are retractions in chance---momentous drops in the chance of producing the first two hypotheses. The graph to the right plots the chance of producing each answer in the color that corresponds to its acceptance zone. Total retractions are tallied at the top of the frame, and the retractions of each answer are to the right, in matching colors. The graphs should be smooth. The choppiness is due to sampling error in the Monte Carlo procedure that estimates the chances; it can be reduced by taking larger Monte Carlo samples and by computing retractions over larger intervals. For that reason, the total retraction estimate in the posted simulations is currently far too high. We will re-run them to get better estimates. But the comparisons between methods are realistic.
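Here is a minimal sketch of how such chances and retraction tallies could be estimated, reusing bic_choice from the sketch above; the trial count and the decision not to count withdrawals of "I don't know" as retractions are our illustrative assumptions:

    import numpy as np

    def answer_chances(mu, Sigma, n, choose, trials=2000, rng=None):
        """Monte Carlo estimate of the chance that each answer is produced at sample
        size n; `choose` is any rule mapping (xbar, Sigma, n) to an answer."""
        rng = np.random.default_rng(rng)
        Sigma = np.asarray(Sigma, dtype=float)
        chances = {}
        for _ in range(trials):
            # draw the sample mean directly from its sampling distribution N(mu, Sigma / n)
            xbar = rng.multivariate_normal(mu, Sigma / n)
            answer = choose(xbar, Sigma, n)
            chances[answer] = chances.get(answer, 0.0) + 1.0 / trials
        return chances

    def total_retractions(curves):
        """Sum, over answers and successive sample sizes, of every drop in the chance
        of producing that answer; `curves` is a list of answer -> chance dicts
        ordered by increasing sample size."""
        total = 0.0
        for prev, curr in zip(curves, curves[1:]):
            for H, p in prev.items():
                if H == "I don't know":           # withdrawing suspension of judgment is not counted
                    continue
                total += max(p - curr.get(H, 0.0), 0.0)
        return total

Evaluating answer_chances at an increasing sequence of sample sizes and passing the list of results to total_retractions gives a rough counterpart of the tally at the top of each frame; larger trial counts and coarser grids of sample sizes give smoother and lower estimates, as noted above.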

Bayesian conditioning looks very similar when it is interpreted as choosing the theory whose posterior is highest (the modal hypothesis):

Bayes a la mode.
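How the posted Bayesian animations compute posteriors is not specified here; as an illustrative proxy, one can approximate each model's marginal likelihood by exp(-BIC / 2) (the standard large-sample approximation) under equal priors, reusing bic_scores from the earlier sketch:

    import numpy as np

    def approx_posteriors(xbar, Sigma, n):
        """Approximate posterior probabilities of the four models under equal priors,
        using the large-sample approximation marginal likelihood ~ exp(-BIC / 2).
        Only a proxy for exact Bayesian conditioning."""
        scores = bic_scores(xbar, Sigma, n)        # from the sketch above
        best = min(scores.values())                # subtract the best score for numerical stability
        weights = {H: np.exp(-(s - best) / 2.0) for H, s in scores.items()}
        z = sum(weights.values())
        return {H: w / z for H, w in weights.items()}

    def modal_bayes_choice(xbar, Sigma, n):
        # "Bayes a la mode": choose the theory with the highest (approximate) posterior
        post = approx_posteriors(xbar, Sigma, n)
        return max(post, key=post.get)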

One point of interest is that the sampling density drags across the corner of the red axis zone as it leaves the origin (blue) zone, which generates extra retractions (look for the red bump) that could have been avoided by returning "I don't know", rather than an informative theory, in a region around the origin zone. The extra retractions are large when the correlation is high. For example, suppose that we view Bayesian conditioning as accepting a hypothesis only when its posterior probability is at least .95. The white zone is "I don't know". Notice the reduction in the red bump.

.95 Bayes.
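A sketch of that thresholded rule, built on the approximate posteriors above; the threshold value comes from the figure, and everything else is our illustrative choice:

    def thresholded_bayes_choice(xbar, Sigma, n, threshold=0.95):
        # accept the modal theory only when its (approximate) posterior clears the
        # threshold; otherwise suspend judgment (the white zone in the figure)
        post = approx_posteriors(xbar, Sigma, n)   # from the sketch above
        H = max(post, key=post.get)
        return H if post[H] >= threshold else "I don't know"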

The same effect can be obtained by thresholding the BIC score. That raises the intriguing possibility that the apparent need to "wait for data to confirm the simplest hypothesis" is based on minimizing retractions rather than on reliability (which nobody can promise for questions of this sort).
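One way to threshold the BIC score, again reusing bic_scores from the first sketch; the margin of 2 is a hypothetical choice (under the BIC approximation a difference of 2 corresponds to a Bayes factor of about e):

    def thresholded_bic_choice(xbar, Sigma, n, margin=2.0):
        # accept the minimum-BIC model only when it beats the runner-up by at least
        # `margin`; otherwise suspend judgment
        scores = bic_scores(xbar, Sigma, n)        # from the first sketch
        ranked = sorted(scores, key=scores.get)
        best, runner_up = ranked[0], ranked[1]
        return best if scores[runner_up] - scores[best] >= margin else "I don't know"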

Another finding (which the reader has probably noticed already) is that the BIC acceptance zone for the origin model has a strange, pinched-in shape when X and Y are correlated and is square (!) when they are independent. That happens because minimization of the BIC score accepts the origin model only when its score beats the scores of both axis models. The strange shape results in extra retractions. Both the shape and the extra retractions can be corrected by comparing the origin model only with the background model and each axis model only with the background model. That rule is actually easier to compute and results in a more sensible method.

Improved BIC when X and Y are independent.

Improved BIC when X and Y are strongly correlated.
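One possible reading of the improved rule, again reusing bic_scores from the first sketch; the exact tie-breaking between the two axis models is our own illustrative choice, since the description above does not settle it:

    def improved_bic_choice(xbar, Sigma, n):
        # compare the origin model and each axis model only against the background
        # model, never against one another
        scores = bic_scores(xbar, Sigma, n)        # from the first sketch
        if scores["none"] <= scores["both"]:
            return "none"
        axis_winners = [H for H in ("just muX", "just muY") if scores[H] <= scores["both"]]
        if axis_winners:
            return min(axis_winners, key=lambda H: scores[H])
        return "both"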

Bayesian conditioning itself yields the oddly shaped zones if we measure retractions in expected posterior values.

Grant Support

Pending: John Templeton Foundation grant 24145, Simplicity, Truth, and Ockham's Razor.

2009-2011: NSF grant 0740681, Ockham's Razor: A New Justification, Division of Social and Economic Sciences, Program for History and Philosophy of Science, Engineering and Technology.