§11.5: The Berry–Esseen Theorem

Now that we’ve built up some results concerning Gaussian space, we’re motivated to try reducing problems involving Boolean functions to problems involving Gaussian functions. The key tool for this is the Invariance Principle, discussed at the beginning of the chapter. As a warmup, this section is devoted to proving (a form of) the Berry–Esseen Theorem.

As discussed in Chapter 5.2, the Berry–Esseen Theorem is a quantitative form of the Central Limit Theorem for finite sums of independent random variables. We restate it here:

Berry–Esseen Theorem Let $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n$ be independent random variables with $\mathop{\bf E}[\boldsymbol{X}_i] = 0$ and $\mathop{\bf Var}[\boldsymbol{X}_i] = \sigma_i^2$, and assume $\sum_{i=1}^n \sigma_i^2 = 1$. Let $\boldsymbol{S} = \sum_{i=1}^n \boldsymbol{X}_i$ and let $\boldsymbol{Z} \sim \mathrm{N}(0,1)$ be a standard Gaussian. Then for all $u \in {\mathbb R}$, \[ |\mathop{\bf Pr}[\boldsymbol{S} \leq u] - \mathop{\bf Pr}[\boldsymbol{Z} \leq u]| \leq c \gamma, \] where \[ \gamma = \sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3 \] and $c$ is a universal constant. (For definiteness, $c = .56$ is acceptable.)
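To make the statement concrete, here is a minimal numerical sketch in Python (the function name, grid, and sample of $n$'s are our own illustrative choices, not from the text). It takes each $\boldsymbol{X}_i$ uniform on $\{-1/\sqrt{n}, +1/\sqrt{n}\}$, so that $\sigma_i^2 = 1/n$ and $\gamma = n \cdot n^{-3/2} = 1/\sqrt{n}$, and compares the exact cdf-distance with the bound $.56\gamma$:

```python
# A minimal sketch (illustrative choices are ours): for X_i uniform on
# {-1/sqrt(n), +1/sqrt(n)} we have sigma_i^2 = 1/n and gamma = 1/sqrt(n),
# and S is a rescaled Binomial, so Pr[S <= u] can be computed exactly.
import numpy as np
from scipy.stats import binom, norm

def cdf_distance(n, grid=np.linspace(-3, 3, 1201)):
    """max over the grid of |Pr[S <= u] - Pr[Z <= u]|, with S = (2B - n)/sqrt(n),
    B ~ Binomial(n, 1/2), so Pr[S <= u] = Pr[B <= (u*sqrt(n) + n)/2]."""
    S_cdf = binom.cdf(np.floor((grid * np.sqrt(n) + n) / 2), n, 0.5)
    return np.max(np.abs(S_cdf - norm.cdf(grid)))

for n in (10, 100, 1000):
    print(n, round(cdf_distance(n), 4), round(0.56 / np.sqrt(n), 4))  # distance vs. .56*gamma
```

In each case the observed distance should come out below $.56\gamma = .56/\sqrt{n}$, as the theorem promises.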

In this traditional statement of Berry–Esseen, the error term $\gamma$ is a little opaque. To say that $\gamma$ is small is to simultaneously say two things: the random variables $\boldsymbol{X}_i$ are all “reasonable” (as in Chapter 9.1); and, none is too dominant in terms of variance. In Chapter 9.1 we discussed several related notions of “reasonableness” for a random variable $\boldsymbol{X}$. It was convenient there to use the definition that $\|\boldsymbol{X}\|_4^4$ is not much larger than $\|\boldsymbol{X}\|_2^4$. For the Berry–Esseen Theorem it’s more convenient (and slightly stronger) to use the analogous condition for the $3$rd moment. (For the Invariance Principle it will be more convenient to use $(2,3,\rho)$- or $(2,4,\rho)$-hypercontractivity.) The implication for Berry–Esseen is the following:

Remark 56 In the Berry–Esseen Theorem, if all of the $\boldsymbol{X}_i$’s are “reasonable” in the sense that $\|\boldsymbol{X}_i\|_3^3 \leq B \|\boldsymbol{X}_i\|_2^3 = B \sigma_i^{3}$, then we can use the bound \begin{equation} \label{eqn:be-no-variance-dom} \gamma \leq B \cdot \max_i\{\sigma_i\}, \end{equation} as this is a consequence of \[ \gamma =\sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3 \leq B \sum_{i=1}^n \sigma_i^{3} \leq B \cdot \max_i \{\sigma_i\} \cdot \sum_{i=1}^n \sigma_i^2 = B \cdot \max_i \{\sigma_i\}. \] (Cf. Remark 5.12.) Note that some “reasonableness” condition must hold if $\boldsymbol{S} = \sum_i \boldsymbol{X}_i$ is to behave like a Gaussian. For example, if each $\boldsymbol{X}_i$ is the “unreasonable” random variable which is $\pm \sqrt{n}$ with probability $\frac{1}{2n^2}$ each and $0$ otherwise, then $\boldsymbol{S} = 0$ except with probability at most $\frac{1}{n}$ — quite unlike a Gaussian. Further, even assuming reasonableness we still need a condition like \eqref{eqn:be-no-variance-dom} ensuring that no $\boldsymbol{X}_i$ is too dominant (“influential”) in terms of variance. For example, if $\boldsymbol{X}_1 \sim \{-1,1\}$ is a uniformly random bit and $\boldsymbol{X}_2, \dots, \boldsymbol{X}_n \equiv 0$, then $\boldsymbol{S} \equiv \boldsymbol{X}_1$, which is again quite unlike a Gaussian.
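To spell out the first example in the remark: for $\boldsymbol{X}_i$ equal to $\pm\sqrt{n}$ with probability $\frac{1}{2n^2}$ each and $0$ otherwise, we indeed have $\sigma_i^2 = n \cdot \frac{1}{n^2} = \frac1n$ (so the variances sum to $1$), but \[ \|\boldsymbol{X}_i\|_3^3 = n^{3/2} \cdot \tfrac{1}{n^2} = n^{-1/2} = n \cdot \|\boldsymbol{X}_i\|_2^3, \] so the smallest admissible “reasonableness” constant is $B = n$, and $\gamma = \sum_i \|\boldsymbol{X}_i\|_3^3 = \sqrt{n}$ is enormous. Correspondingly, a union bound gives $\mathop{\bf Pr}[\boldsymbol{S} \neq 0] \leq n \cdot \frac{1}{n^2} = \frac1n$, so the cdf-distance from a Gaussian really is of constant size here.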

There are several known ways to prove the Berry–Esseen Theorem; for example, using characteristic functions (i.e., “real” Fourier analysis), or Stein’s Method. We’ll use the “Replacement Method” (also known as the Lindeberg Method, and similar to the “Hybrid Method” in theoretical cryptography). Although it doesn’t always give the sharpest results, it’s a very flexible technique which generalizes easily to higher-degree polynomials of random variables (as in the Invariance Principle) and random vectors. The Replacement Method suggests itself as soon as the Berry–Esseen Theorem is written in a slightly different form: Instead of trying to show \begin{equation} \label{eqn:replacement-no-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}, \end{equation} where $\boldsymbol{Z} \sim \mathrm{N}(0,1)$, we’ll instead try to show the equivalent statement \begin{equation} \label{eqn:replacement-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}_1 + \boldsymbol{Z}_2 + \cdots + \boldsymbol{Z}_n, \end{equation} where the $\boldsymbol{Z}_i$’s are independent Gaussians with $\boldsymbol{Z}_i \sim \mathrm{N}(0,\sigma_i^2)$. The statements \eqref{eqn:replacement-no-suggest} and \eqref{eqn:replacement-suggest} really are identical, since the sum of independent Gaussians is Gaussian, with the variances adding. The Replacement Method proves \eqref{eqn:replacement-suggest} by replacing the $\boldsymbol{X}_i$’s with $\boldsymbol{Z}_i$’s one by one. Roughly speaking, we introduce the “hybrid” random variables \[ \boldsymbol{H}_t = \boldsymbol{Z}_1 + \cdots + \boldsymbol{Z}_t + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n, \] show that $\boldsymbol{H}_{t-1} \approx \boldsymbol{H}_t$ for each $t \in [n]$, and then simply add up the $n$ errors.
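The following short Python sketch (purely illustrative; the choices of $n$, sample size, and test $\psi = \cos$ are ours) runs this hybrid argument numerically, estimating each consecutive error $|\mathop{\bf E}[\psi(\boldsymbol{H}_{t-1})] - \mathop{\bf E}[\psi(\boldsymbol{H}_t)]|$ by Monte Carlo; each single step contributes very little, and the triangle inequality just adds the steps up.

```python
# Illustrative Monte Carlo sketch of the Replacement Method (all names and
# parameters are ours): replace uniform +-1/sqrt(n) summands by N(0, 1/n)
# Gaussians one at a time and watch how little E[psi(H_t)] moves per step.
import numpy as np

rng = np.random.default_rng(0)
n, samples = 20, 200_000
psi = np.cos                                   # a smooth test with |psi'''| <= 1

X = rng.choice([-1.0, 1.0], size=(samples, n)) / np.sqrt(n)   # the X_i's
Z = rng.normal(0.0, 1.0 / np.sqrt(n), size=(samples, n))      # Z_i ~ N(0, 1/n)

def hybrid(t):
    """Samples of H_t = Z_1 + ... + Z_t + X_{t+1} + ... + X_n."""
    return Z[:, :t].sum(axis=1) + X[:, t:].sum(axis=1)

diffs = [abs(psi(hybrid(t - 1)).mean() - psi(hybrid(t)).mean())
         for t in range(1, n + 1)]
print(max(diffs), sum(diffs))                  # largest single step, and the total
```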

As a matter of fact, the Replacement Method doesn’t really have anything to do with Gaussian random variables. It actually seeks to show that \[ \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Y}_1 + \boldsymbol{Y}_2 + \cdots + \boldsymbol{Y}_n, \] whenever $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \boldsymbol{Y}_1, \dots, \boldsymbol{Y}_n$ are independent random variables with “matching first and second moments”, meaning $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ and $\mathop{\bf E}[\boldsymbol{X}_i^2] = \mathop{\bf E}[\boldsymbol{Y}_i^2]$ for each $i \in [n]$. (The error will be proportional to $\sum_i (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3)$.) Another way of putting it (roughly speaking) is that the linear form $x_1 + \cdots + x_n$ is invariant to what independent random variables you substitute in for $x_1, \dots, x_n$, so long as you always use the same first and second moments. The fact that we can take the $\boldsymbol{Y}_i$’s to be Gaussians (with $\boldsymbol{Y}_i \sim \mathrm{N}(\mathop{\bf E}[\boldsymbol{X}_i], \mathop{\bf Var}[\boldsymbol{X}_i])$) and then in the end use the fact that the sum of independent Gaussians is Gaussian to derive the simpler-looking \[ \boldsymbol{S} = \sum_{i=1}^n \boldsymbol{X}_i \approx \mathrm{N}(\mathop{\bf E}[\boldsymbol{S}], \mathop{\bf Var}[\boldsymbol{S}]) \] is just a pleasant bonus (and one that we’ll no longer get once we look at nonlinear polynomials of random variables in Section 11.6). Indeed, the remainder of this section will be devoted to showing that \[ \boldsymbol{S}_X = \boldsymbol{X}_1 + \cdots + \boldsymbol{X}_n \quad \text{is ``close'' to} \quad \boldsymbol{S}_Y = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_n \] whenever the $\boldsymbol{X}_i$’s and $\boldsymbol{Y}_i$’s are independent, “reasonable” random variables with matching first and second moments.

To do this, we’ll first have to discuss in more detail what it means for two random variables to be “close”. A traditional measure of closeness between two random variables $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ is the “cdf-distance” used in the Berry–Esseen Theorem: $\mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \approx \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u]$ for every $u \in {\mathbb R}$. But there are other natural measures of closeness too. We might want to know that the absolute moments of $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ are close; for example, that $\|\boldsymbol{S}_X\|_1 \approx \|\boldsymbol{S}_Y\|_1$. Or, we might like to know that $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ stray from the interval $[-1,1]$ by about the same amount: $\mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_Y)]$. Here we are using:

Definition 57 For any interval $\emptyset \neq I \subsetneq {\mathbb R}$ the function $\mathrm{dist}_I : {\mathbb R} \to {\mathbb R}^{\geq 0}$ is defined to measure the distance of a point from $I$; i.e., $\mathrm{dist}_I(s) = \inf_{u \in I}\{|s-u|\}$.
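In code this is a one-liner; for a closed bounded interval $I = [a,b]$ (the only case we'll need here, e.g. $[-1,1]$), a direct Python transcription might look like:

```python
# dist_I for a closed interval I = [a, b]: zero inside I, distance to the
# nearer endpoint outside it.  (Function and argument names are ours.)
def dist_interval(s, a=-1.0, b=1.0):
    return max(a - s, 0.0, s - b)

print(dist_interval(0.3), dist_interval(1.7), dist_interval(-2.5))   # 0.0 0.7 1.5
```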

All of the closeness measures just described can be put in a common framework: they are requiring $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for various “test functions” (or “distinguishers”) $\psi : {\mathbb R} \to {\mathbb R}$.

Figure 11.1: The test functions $\psi$ used for judging $\mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \approx \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u]$, $\|\boldsymbol{S}_X\|_1 \approx \|\boldsymbol{S}_Y\|_1$, and $\mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_Y)]$, respectively

It would be nice to prove a version of the Berry–Esseen Theorem that showed closeness for all the test functions $\psi$ depicted in Figure 11.1, and more. What class of tests might we be able to handle? On one hand, we can’t be too ambitious. For example, suppose each $\boldsymbol{X}_i \sim \{-1,1\}$, each $\boldsymbol{Y}_i \sim \mathrm{N}(0,1)$, and $\psi(s) = 1_{s \in {\mathbb Z}}$. Then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = 1$ because $\boldsymbol{S}_X$ is supported on the integers, but $\mathop{\bf E}[\psi(\boldsymbol{S}_Y)] = 0$ because $\boldsymbol{S}_Y \sim \mathrm{N}(0,n)$ is a continuous random variable. On the other hand, there are some simple kinds of tests $\psi$ for which we have exact equality. For example, if $\psi(s) = s$, then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$; this is by the assumption of matching first moments, $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ for all $i$. Similarly, if $\psi(s) = s^2$, then \begin{align} \mathop{\bf E}[\psi(\boldsymbol{S}_X)] &= \mathop{\bf E}\Bigl[\bigl(\sum_i \boldsymbol{X}_i\bigr)^2\Bigr] = \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i\boldsymbol{X}_j] \nonumber\\ &= \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i]\mathop{\bf E}[\boldsymbol{X}_j] \label{eqn:be-quadratic1}
\end{align}
(using independence of the $\boldsymbol{X}_i$’s); and also,
\begin{align}
\label{eqn:be-quadratic2} \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] &= \sum_{i} \mathop{\bf E}[\boldsymbol{Y}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{Y}_i]\mathop{\bf E}[\boldsymbol{Y}_j] \end{align} The quantities \eqref{eqn:be-quadratic1} and \eqref{eqn:be-quadratic2} are equal because of the matching first and second moment conditions.
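As a quick sanity check of this exact equality (the distributions, test, and sample sizes below are our own illustrative choices), one can compare a sum of skewed two-point random variables against a sum of Gaussians with the same first and second moments under a quadratic test; both expectations are pinned to the same value, here $3 - 2\cdot 0 + (0^2 + 1) = 4$:

```python
# Monte Carlo sanity check (distributions, test, and sample sizes are ours):
# a quadratic test cannot distinguish summands with matching 1st/2nd moments.
# X_i = -2/sqrt(n) w.p. 0.2 and +1/(2*sqrt(n)) w.p. 0.8 has mean 0, variance 1/n,
# exactly like Y_i ~ N(0, 1/n); for psi(s) = 3 - 2s + s^2 both answers equal 4.
import numpy as np

rng = np.random.default_rng(1)
n, samples = 10, 500_000
psi = lambda s: 3 - 2 * s + s ** 2

X = rng.choice([-2 / np.sqrt(n), 0.5 / np.sqrt(n)], p=[0.2, 0.8], size=(samples, n))
Y = rng.normal(0.0, 1 / np.sqrt(n), size=(samples, n))

print(psi(X.sum(axis=1)).mean(), psi(Y.sum(axis=1)).mean())   # both approx 4
```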

As a consequence of these observations we have $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for any quadratic polynomial $\psi(s) = a + bs + cs^2$. This suggests that to handle a general test $\psi$ we try to approximate it by a quadratic polynomial up to some error; in other words, consider its $2$nd-order Taylor expansion. For this to make sense the function $\psi$ must have a continuous $3$rd derivative, and the error we incur will involve the magnitude of this derivative. Indeed, we will now prove a variant of the Berry–Esseen Theorem for the class of $\mathcal{C}^3$ test functions $\psi$ with $\psi^{\prime\prime\prime}$ uniformly bounded. You might be concerned that this class doesn’t contain any of the interesting test functions depicted in Figure 11.1. But we’ll be able to handle even those test functions with some loss in the parameters by using a simple “hack” — approximating them by smooth functions, as suggested in Figure 11.2.

Figure 11.2: The step function $\psi(s) = 1_{s \leq u}$ can be smoothed out on the interval $[u-\eta, u+\eta]$ so that the resulting function $\widetilde{\psi}_\eta$ satisfies $\|\widetilde{\psi}_\eta^{\prime\prime\prime}\|_\infty \leq O(1/\eta^3)$. Similarly, we can smooth out $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ to a function $\widetilde{\psi}_\eta$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \leq \eta$ and $\|\widetilde{\psi}_\eta^{\prime\prime\prime}\|_\infty \leq O(1/\eta^2)$.

Invariance Principle for Sums of Random Variables Let $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n$, $\boldsymbol{Y}_1, \dots, \boldsymbol{Y}_n$ be independent random variables with matching $1$st and $2$nd moments; i.e., $\mathop{\bf E}[\boldsymbol{X}_i^k] = \mathop{\bf E}[\boldsymbol{Y}_i^k]$ for $i \in [n]$, $k \in \{1, 2\}$. Write $\boldsymbol{S}_X = \sum_i \boldsymbol{X}_i$ and $\boldsymbol{S}_Y = \sum_i \boldsymbol{Y}_i$. Then for any $\psi : {\mathbb R} \to {\mathbb R}$ with continuous third derivative, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma_{XY}, \] where $\gamma_{XY} = \sum_i (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3)$.


Proof: The proof is by the Replacement Method. For $0 \leq t \leq n$, define the “hybrid” random variable \[ \boldsymbol{H}_t = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_t + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n, \] so $\boldsymbol{S}_X = \boldsymbol{H}_0$ and $\boldsymbol{S}_Y = \boldsymbol{H}_n$. Thus by the triangle inequality, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq \sum_{t=1}^n \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1})] - \mathop{\bf E}[\psi(\boldsymbol{H}_t)] \right|. \] Given the definition of $\gamma_{XY}$, we can complete the proof by showing that for each $t \in [n]$, \begin{align} \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot (\mathop{\bf E}[|\boldsymbol{X}_t|^3] + \mathop{\bf E}[|\boldsymbol{Y}_t|^3]) &\geq \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1})] - \mathop{\bf E}[\psi(\boldsymbol{H}_t)] \right| \nonumber\\ &= \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1}) - \psi(\boldsymbol{H}_t)] \right| \nonumber\\ &= \left| \mathop{\bf E}[\psi(\boldsymbol{U}_t + \boldsymbol{X}_t) -\psi(\boldsymbol{U}_t + \boldsymbol{Y}_t)] \right|, \label{eqn:var-be-ineq} \end{align} where \[ \boldsymbol{U}_t = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_{t-1} + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n. \] Note that $\boldsymbol{U}_t$ is independent of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$. We are now comparing $\psi$’s values at $\boldsymbol{U}_t + \boldsymbol{X}_t$ and $\boldsymbol{U}_t + \boldsymbol{Y}_t$, with the presumption that $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$ are rather small compared to $\boldsymbol{U}_t$. This clearly suggests the use of Taylor’s theorem: For all $u, \delta \in {\mathbb R}$, \[ \psi(u + \delta) = \psi(u) + \psi'(u) \delta + \tfrac12 \psi''(u) \delta^2 + \tfrac16 \psi^{\prime\prime\prime}(u^*) \delta^3, \] for some $u^* = u^*(u,\delta)$ between $u$ and $u + \delta$. Applying this pointwise with $u = \boldsymbol{U}_t$, $\delta = \boldsymbol{X}_t, \boldsymbol{Y}_t$ yields \begin{align*} \psi(\boldsymbol{U}_t + \boldsymbol{X}_t) &= \psi(\boldsymbol{U}_t) + \psi'(\boldsymbol{U}_t) \boldsymbol{X}_t + \tfrac12 \psi''(\boldsymbol{U}_t) \boldsymbol{X}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t^*) \boldsymbol{X}_t^3\\ \psi(\boldsymbol{U}_t + \boldsymbol{Y}_t) &= \psi(\boldsymbol{U}_t) + \psi'(\boldsymbol{U}_t) \boldsymbol{Y}_t + \tfrac12 \psi''(\boldsymbol{U}_t) \boldsymbol{Y}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t^{**}) \boldsymbol{Y}_t^3 \end{align*} for some random variables $\boldsymbol{U}_t^*, \boldsymbol{U}_t^{**}$. Referring back to our goal of \eqref{eqn:var-be-ineq}, what happens when we subtract these two identities and take expectations? The $\psi(\boldsymbol{U}_t)$ terms cancel. The next difference is \[ \mathop{\bf E}[\psi'(\boldsymbol{U}_t) (\boldsymbol{X}_t - \boldsymbol{Y}_t)] = \mathop{\bf E}[\psi'(\boldsymbol{U}_t)]\cdot\mathop{\bf E}[\boldsymbol{X}_t - \boldsymbol{Y}_t] = \mathop{\bf E}[\psi'(\boldsymbol{U}_t)]\cdot 0 = 0, \] where the first equality used that $\boldsymbol{U}_t$ is independent of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$, and the second equality used the matching $1$st moments of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$. An identical argument, using matching $2$nd moments, shows that the difference of the quadratic terms disappears in expectation.
Thus we’re left only with the “error term”: \begin{align*} \left| \mathop{\bf E}[\psi(\boldsymbol{U}_t + \boldsymbol{X}_t) -\psi(\boldsymbol{U}_t + \boldsymbol{Y}_t)] \right| &= \tfrac16 \left| \mathop{\bf E}[\psi^{\prime\prime\prime}(\boldsymbol{U}_t^*)\boldsymbol{X}_t^3 -\psi^{\prime\prime\prime}(\boldsymbol{U}_t^{**})\boldsymbol{Y}_t^3] \right| \\ &\leq \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot (\mathop{\bf E}[|\boldsymbol{X}_t|^3] + \mathop{\bf E}[|\boldsymbol{Y}_t|^3]), \end{align*} where the last step used the triangle inequality. This confirms \eqref{eqn:var-be-ineq} and completes the proof. $\Box$

We can now give a Berry–Esseen-type corollary by taking the $\boldsymbol{Y}_i$’s to be Gaussians:

Variant Berry–Esseen Theorem In the setting of the Berry–Esseen Theorem, for all $\mathcal{C}^3$ functions $\psi : {\mathbb R} \to {\mathbb R}$, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S})] - \mathop{\bf E}[\psi(\boldsymbol{Z})] \right| \leq \tfrac16(1+2\sqrt{\tfrac{2}{\pi}}) \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma \leq .433 \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma. \]


Proof: Applying the preceding theorem with $\boldsymbol{Y}_i \sim \mathrm{N}(0, \sigma_i^2)$ (and hence $\boldsymbol{S}_Y \sim \mathrm{N}(0, 1)$), it suffices to show that \begin{equation} \label{eqn:be-bound} \gamma_{XY} = \sum_{i=1}^n (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3) \leq (1+2\sqrt{\tfrac{2}{\pi}}) \cdot \gamma = (1+2\sqrt{\tfrac{2}{\pi}}) \cdot \sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3. \end{equation} In particular, we just need to show that $\|\boldsymbol{Y}_i\|_3^3 \leq 2\sqrt{\tfrac{2}{\pi}}\|\boldsymbol{X}_i\|_3^3$ for each $i$. This holds because Gaussians are extremely reasonable; by explicitly computing $3$rd absolute moments we indeed obtain \[ \|\boldsymbol{Y}_i\|_3^3 = \sigma_i^3 \|\mathrm{N}(0,1)\|_3^3 = 2\sqrt{\tfrac{2}{\pi}} \sigma_i^3 = 2\sqrt{\tfrac{2}{\pi}} \|\boldsymbol{X}_i\|_2^3 \leq 2\sqrt{\tfrac{2}{\pi}} \|\boldsymbol{X}_i\|_3^3. \qquad \Box \]
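For completeness, the third absolute moment used here can be computed directly from the Gaussian density (substituting $w = z^2/2$, so $z\,dz = dw$ and $z^2 = 2w$): \[ \|\mathrm{N}(0,1)\|_3^3 = \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)}[|\boldsymbol{g}|^3] = \frac{2}{\sqrt{2\pi}}\int_0^\infty z^3 e^{-z^2/2}\,dz = \frac{2}{\sqrt{2\pi}}\int_0^\infty 2w e^{-w}\,dw = \frac{4}{\sqrt{2\pi}} = 2\sqrt{\tfrac{2}{\pi}}. \]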

This version of the Berry–Esseen Theorem is incomparable with the standard version. Sometimes it can be stronger; for example, if for some reason we wanted to show $\mathop{\bf E}[\cos \boldsymbol{S}] \approx \mathop{\bf E}[\cos \boldsymbol{Z}]$ then the Variant Berry–Esseen Theorem gives this with error $.433\gamma$, whereas it can’t be directly deduced from the standard Berry–Esseen at all. On the other hand, as we’ll see shortly, we can only obtain the standard Berry–Esseen conclusion from the Variant version with an error bound of $O(\gamma^{1/4})$ rather than $O(\gamma)$.
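To make the $\cos$ example concrete (the choice of $\boldsymbol{X}_i$ below is ours): $(\cos)''' = \sin$ has $\|\sin\|_\infty = 1$, so for, say, $\boldsymbol{X}_i$ uniform on $\{-1/\sqrt n, +1/\sqrt n\}$ (where $\gamma = 1/\sqrt n$) the Variant theorem gives \[ \bigl| \mathop{\bf E}[\cos \boldsymbol{S}] - \mathop{\bf E}[\cos \boldsymbol{Z}] \bigr| \leq \frac{.433}{\sqrt{n}}, \] which can be checked against the exact values $\mathop{\bf E}[\cos \boldsymbol{S}] = \cos(1/\sqrt{n})^n$ and $\mathop{\bf E}[\cos \boldsymbol{Z}] = e^{-1/2}$ (the real parts of the respective characteristic functions).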

We end this section by describing the “hacks” which let us extend the Variant Berry–Esseen Theorem to cover certain non-$\mathcal{C}^3$ tests $\psi$. As mentioned, the idea is to smooth them out, or “mollify” them:

Proposition 58 Let $\psi : {\mathbb R} \to {\mathbb R}$ be $c$-Lipschitz. Then for any $\eta > 0$ there exists $\widetilde{\psi}_\eta : {\mathbb R} \to {\mathbb R}$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \leq c \eta$ and $\|\widetilde{\psi}_\eta^{(k)}\|_\infty \leq C_k c/\eta^{k-1}$ for each $k \in {\mathbb N}^+$. Here $C_k$ is a constant depending only on $k$, and $\widetilde{\psi}_\eta^{(k)}$ denotes the $k$th derivative of $\widetilde{\psi}_\eta$.

The proof is straightforward, taking $\widetilde{\psi}_\eta(s) = \displaystyle \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)}[\psi(s+\eta \boldsymbol{g})]$; see the exercises.
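Here is a rough numerical illustration of this mollification (the quadrature rule, grid, and names are our own choices): for the $1$-Lipschitz test $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$, the Gaussian smoothing $\widetilde{\psi}_\eta(s) = \mathop{\bf E}_{\boldsymbol{g}}[\psi(s + \eta\boldsymbol{g})]$ stays uniformly within $\eta$ of $\psi$:

```python
# Rough check (all choices are ours) that the Gaussian mollification
# psi_tilde_eta(s) = E_{g ~ N(0,1)}[psi(s + eta*g)] of the 1-Lipschitz test
# psi(s) = dist_{[-1,1]}(s) satisfies sup_s |psi - psi_tilde_eta| <= eta.
import numpy as np

# Probabilists' Gauss-Hermite nodes/weights: sum(w * f(x)) ~ E_{g~N(0,1)}[f(g)].
nodes, weights = np.polynomial.hermite_e.hermegauss(60)
weights = weights / np.sqrt(2 * np.pi)

def psi(s):                                        # dist_{[-1,1]}(s)
    return np.maximum(np.abs(s) - 1.0, 0.0)

def psi_tilde(s, eta):
    """E_g[psi(s + eta*g)] by quadrature, for an array of points s."""
    return np.sum(weights * psi(s[:, None] + eta * nodes[None, :]), axis=1)

grid = np.linspace(-4.0, 4.0, 2001)
for eta in (0.5, 0.1, 0.02):
    print(eta, np.max(np.abs(psi(grid) - psi_tilde(grid, eta))))   # should be <= eta
```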

As $\eta \to 0$ this gives a better and better smooth approximation to $\psi$, but also a larger and larger value of $\|\widetilde{\psi}_\eta^{\prime\prime\prime}\|_\infty$. Trading these off gives the following:

Corollary 59 In the setting of the Invariance Principle for Sums of Random Variables, if we merely have that $\psi : {\mathbb R} \to {\mathbb R}$ is $c$-Lipschitz, then \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq O(c) \cdot \gamma_{XY}^{1/3}. \]


Proof: Applying the Invariance Principle for Sums of Random Variables with the test $\widetilde{\psi}_\eta$ from Proposition 58 we get \[ \left| \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_X)] - \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_Y)] \right| \leq O(c/\eta^2) \cdot \gamma_{XY}. \] But $\|\widetilde{\psi}_\eta - \psi\|_\infty \leq c\eta$ implies \[ \left| \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_X)] \right| \leq \mathop{\bf E}[|\widetilde{\psi}_\eta(\boldsymbol{S}_X) - \psi(\boldsymbol{S}_X)|] \leq c\eta \] and similarly for $\boldsymbol{S}_Y$. Thus we get \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq O(c) \cdot (\eta + \gamma_{XY}/ \eta^2) \] which yields the desired bound by taking $\eta = \gamma_{XY}^{1/3}$. $\Box$

Remark 60 It’s obvious that the dependence on $c$ in this theorem should be linear in $c$; in fact, since we can always divide $\psi$ by $c$ it would have sufficed to prove the theorem assuming $c = 1$.

This corollary covers all Lipschitz tests, which suffices for the functions $\psi(s) = |s|$ and $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ from Figure 11.1. However, it still isn’t enough for the test $\psi(s) = 1_{s \leq u}$ — i.e., for establishing cdf-closeness as in the usual Berry–Esseen Theorem. Of course, we can’t hope for a smooth approximator $\widetilde{\psi}_\eta$ satisfying $|\widetilde{\psi}_\eta(s) - 1_{s \leq u}| \leq \eta$ for all $s$ because of the discontinuity at $u$. However, as suggested in Figure 11.2, if we’re willing to exclude $s \in [u - \eta, u+\eta]$ we can get an approximator with third derivative bound $O(1/\eta^3)$, and thereby obtain (exercise):

Corollary 61 In the setting of the Invariance Principle for Sums of Random Variables, for all $u \in {\mathbb R}$ we have \[ \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u - \epsilon] - \epsilon \leq \mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \leq \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u + \epsilon] + \epsilon \] for $\epsilon = O(\gamma_{XY}^{1/4})$; i.e., $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ have Lévy distance $d_{L}(\boldsymbol{S}_X,\boldsymbol{S}_Y) \leq O(\gamma_{XY}^{1/4})$.
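In brief (a sketch of the exercise, with constants suppressed): take $\widetilde{\psi}$ smooth with $1_{s \leq u} \leq \widetilde{\psi}(s) \leq 1_{s \leq u + 2\eta}$ and $\|\widetilde{\psi}^{\prime\prime\prime}\|_\infty \leq O(1/\eta^3)$, i.e., a version of the smoothed step from Figure 11.2 placed on $[u, u+2\eta]$. Then \[ \mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \leq \mathop{\bf E}[\widetilde{\psi}(\boldsymbol{S}_X)] \leq \mathop{\bf E}[\widetilde{\psi}(\boldsymbol{S}_Y)] + O(\gamma_{XY}/\eta^3) \leq \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u + 2\eta] + O(\gamma_{XY}/\eta^3), \] and the symmetric construction on $[u - 2\eta, u]$ gives the matching lower bound; choosing $\eta \sim \gamma_{XY}^{1/4}$ makes both the window and the additive error $O(\gamma_{XY}^{1/4})$.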

Finally, in the Berry–Esseen setting where $\boldsymbol{S}_Y \sim \mathrm{N}(0,1)$, we can appeal to the “anticoncentration” of Gaussians: \[ \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u + \epsilon] = \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u] + \mathop{\bf Pr}[u < \mathrm{N}(0,1) \leq u+\epsilon] \leq \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u] + \tfrac{1}{\sqrt{2\pi}} \epsilon, \] and similarly for $\mathop{\bf Pr}[\mathrm{N}(0,1) \leq u - \epsilon]$. This lets us convert the Lévy distance bound into a cdf-distance bound. Recalling \eqref{eqn:be-bound}, we immediately deduce the following weaker version of the classical Berry–Esseen Theorem:

Corollary 62 In the setting of the Berry–Esseen Theorem, for all $u \in {\mathbb R}$, \[ \left| \mathop{\bf Pr}[\boldsymbol{S} \leq u] - \mathop{\bf Pr}[\boldsymbol{Z} \leq u] \right| \leq O(\gamma^{1/4}), \] where the $O(\cdot)$ hides a universal constant.
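Spelled out, the deduction is just the chain \[ \mathop{\bf Pr}[\boldsymbol{S} \leq u] \leq \mathop{\bf Pr}[\boldsymbol{Z} \leq u + \epsilon] + \epsilon \leq \mathop{\bf Pr}[\boldsymbol{Z} \leq u] + \bigl(1 + \tfrac{1}{\sqrt{2\pi}}\bigr)\epsilon, \] together with its mirror image for the lower bound, where $\epsilon = O(\gamma_{XY}^{1/4}) = O(\gamma^{1/4})$ by Corollary 61 and \eqref{eqn:be-bound}.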

Although the error bound here is weaker than necessary by a power of $1/4$, this weakness will be more than made up for by the ease with which the Replacement Method generalizes to other settings. In the next section we’ll see it applied to nonlinear polynomials of independent random variables. Furthermore, an exercise outlines how to use it to give a Berry–Esseen theorem for sums of independent random vectors; as you’ll see, other than replacing Taylor’s theorem with its multivariate form, hardly a symbol in the proof changes.
