One small note: In the relatively near future I will post here an electronic version of the book. It will actually be version 1.01, with version 1.00 being the printed book. The electronic version will include fixes for typos found subsequent to the book’s printing.
Due to laziness, however, I will stop incorporating fixes into the old blog posts. So those are now basically deprecated. Feel free to take a look at them till the PDF comes out, of course. But this means there’s a chance that when you find a typo or mistake on the blog pages, I might respond with “Thanks — someone else found that too, and it’s already been corrected.”
Once again, thanks so much to all of you for the corrections!
There will be at least one more post on this blog; I will release a free PDF of the book in its entirety some time around September. Watch this space!
As mentioned in the notes on Chapter 9, the Gaussian Hypercontractivity Theorem is originally due to Nelson [Nel66] and now has many known proofs. The idea behind the proof we presented — first proving the Boolean hypercontractivity result and then deducing the Gaussian case by the Central Limit Theorem — is due to Gross [Gro75] (see also Trotter [Tro58]). Gross actually used the idea to prove his Gaussian Log-Sobolev Inequality, and thereby deduced the Gaussian Hypercontractivity Theorem. Direct proofs of the Gaussian Hypercontractivity Theorem have been given by Neveu [Nev76] (using stochastic calculus), Brascamp and Lieb [BL76a] (using rearrangement), and Ledoux [Led13] (using a variation on Exercises 26–29); direct proofs of the Gaussian Log-Sobolev Inequality have been given by Adams and Clarke [AC79], by Bakry and Emery [BE85a], and by Ledoux [Led92], the latter two using semigroup techniques. Bakry’s survey [Bak94] on these topics is also recommended.
The Gaussian Isoperimetric Inequality was first proved independently by Borell [Bor75] and by Sudakov and Tsirel’son [ST78]. Both works derived the result by taking the isoperimetric inequality on the sphere (due to Lévy [Lev22] and Schmidt [Sch48], see also Figiel, Lindenstrauss, and Milman [FLM77]) and then taking “Poincaré’s limit” — i.e., viewing Gaussian space as a projection of the sphere of radius $\sqrt{n}$ in $n$ dimensions, with $n \to \infty$ (see Lévy [Lev22], McKean [McK73], and Diaconis and Freedman [DF87]). Ehrhard [Ehr83] gave a different proof using a symmetrization argument intrinsic to Gaussian space. This may be compared to the alternate proof of the spherical isoperimetric inequality [Ben84] based on the “two-point symmetrization” of Baernstein and Taylor [BT76] (analogous to Riesz rearrangement in Euclidean space and to the polarization operation from Exercise 2.45).
To carefully define Gaussian surface area for a broad class of sets requires venturing into the study of geometric measure theory and functions of bounded variation. For a clear and comprehensive development in the Euclidean setting (including the remark in Exercise 15), see the book by Ambrosio, Fusco, and Pallara [AFP00]. There’s not much difference between the Euclidean and finite-dimensional Gaussian settings; research on Gaussian perimeter tends to focus on the trickier infinite-dimensional case. For a thorough development of surface area in this latter setting (which of course includes finite-dimensional Gaussian space as a special case) see the work of Ambrosio, Miranda, Maniglia, and Pallara [AMMP10]; in particular, Theorem 4.1 in that work gives several additional equivalent definitions for $\text{surf}_\gamma$ besides those in Definition 48. Regarding the fact that $\mathbf{RS}_A'(0^+)$ is an equivalent definition, the Euclidean analogue of this statement was proven in Miranda et al. [MPPP07] and the statement itself follows similarly [Mir13] using Ambrosio et al. [AFR13]. (Our heuristic justification of 11.4.(2) is similar to the one given by Kane [Kan11a].) Additional related results can be found in Hino [Hin10] (which includes the remark about convex sets at the end of Definition 48), Ambrosio and Figalli [AF11], Miranda et al. [MNP12], and Ambrosio et al. [AFR13].
The inequality of Theorem 51 is explicit in Ledoux [Led94] (see also the excellent survey [Led96]); he used it to deduce the Gaussian Isoperimetric Inequality. He also noted that it’s essentially deducible from an earlier inequality of Pisier and Maurey [Pis86] (Theorem 2.2). Theorem 43, which expresses the subadditivity of rotation sensitivity, can be viewed as a discretization of the Pisier–Maurey inequality. This theorem appeared in work of Kindler and O’Donnell [KO12], which also made the observations about the volume-$\tfrac{1}{2}$ case of Borell’s Isoperimetric Theorem at the end of Section 3 and in Remark 75.
Bobkov’s Inequality [Bob97] in the special case of Gaussian space had already been implicitly established by Ehrhard [Ehr84]; the striking novelty of Bobkov’s work (partially inspired by Talagrand [Tal93]) was his reduction to the two-point Boolean inequality. The proof of this inequality which we presented is, as mentioned, a discretization of the stochastic calculus proof of Barthe and Maurey [BM00a]. (In turn, they were extending the stochastic calculus proof of Bobkov’s Inequality in the Gaussian setting due to Capitaine, Hsu, and Ledoux [CHL97].) The idea that it’s enough to show that Claim 54 is “nearly true” by computing two derivatives — as opposed to showing it’s exactly true by computing four derivatives — was communicated to the author by Yuval Peres. Following Bobkov’s paper, Bakry and Ledoux [BL96b] established Theorem 55 in very general infinite-dimensional settings including Gaussian space; Ledoux [Led98] further pointed out that the Gaussian version of Bobkov’s Inequality has a very short and direct semigroup-based proof. See also Bobkov and Götze [BG99] and Tillich and Zémor [TZ00] for results similar to Bobkov’s Inequality in other discrete settings. Borell’s Isoperimetric Theorem is from Borell [Bor85]. Borell’s proof used “Ehrhard symmetrization” and actually gave much stronger results — e.g., that if $f, g \in L^2({\mathbb R}^n, \gamma)$ are nonnegative and $q \geq 1$, then $\langle (\mathrm{U}_\rho f)^q, g\rangle$ can only increase under simultaneous Ehrhard symmetrization of $f$ and $g$. There are at least four other known proofs of the basic Borell Isoperimetric Theorem. Beckner [Bec92] observed that the analogous isoperimetric theorem on the sphere follows from two-point symmetrization; this yields the Gaussian result via Poincaré’s limit (for details, see Carlen and Loss [CL90]). (This proof is perhaps the conceptually simplest one, though carrying out all the technical details is a chore.)
Mossel and Neeman [MN12] gave the proof based on semigroup methods outlined in Exercises 26–29, and later together with De [DMN12] gave a “Bobkov-style” Boolean proof (see Exercise 30). Finally, Eldan [Eld13] gave a proof using stochastic calculus.
As mentioned in Section 5 there are several known ways to prove the Berry–Esseen Theorem. Aside from the original method (characteristic functions), there is also Stein’s Method [Ste72,Ste86a]; see also, e.g., [Bol84,BH84,CGS11]. The Replacement Method approach we presented originates in the work of Lindeberg [Lin22]. The mollification techniques used (e.g., those in Exercise 40) are standard. The Invariance Principle as presented in Section 6 is from Mossel, O’Donnell, and Oleszkiewicz [MOO10]. Further extensions (e.g., Exercise 48) appear in the work of Mossel [Mos10]. In fact the Invariance Principle dates back to the 1971 work of Rotar’ [Rot73,Rot74]; therein he essentially proved the Invariance Principle for degree-$2$ multilinear polynomials (even employing the term “influence” as we do for the quantity in Definition 63). Earlier work on extending the Central Limit Theorem to higher-degree polynomials had focused on obtaining sufficient conditions for polynomials (especially quadratics) to have a Gaussian limit distribution; this is the subject of U-statistics. Rotar’ emphasized the idea of invariance and of allowing any (quadratic) polynomial with low influences. Rotar’ also credited Girko [Gir73] with related results in the case of positive definite quadratic forms. In 1975, Rotar’ [Rot75] generalized his results to handle multilinear polynomials of any constant degree, and also random vectors (as in Exercise 48). (Rotar’ also gave further refinements in 1979 [Rot79].)
The difference between the results of Rotar’ [Rot75] and Mossel et al. [MOO10] comes in the treatment of the error bounds. It’s somewhat difficult to extract simple-to-state error bounds from Rotar’ [Rot75], as the error there is presented as a sum over $i \in [n]$ of expressions $\mathop{\bf E}[F({\boldsymbol{x}}) \boldsymbol{1}_{|F({\boldsymbol{x}})| > u_i}]$, where $u_i$ involves $\mathbf{Inf}_i[F]$. (Partly this is so as to generalize the statement of the Lindeberg CLT.) Nevertheless, the work of Rotar’ implies a Lévy distance bound as in Corollary 69, with some inexplicit function $o_\epsilon(1)$ in place of $(1/\rho)^{O(k)} \epsilon^{1/8}$. By contrast, the work of Mossel et al. [MOO10] shows that a straightforward combination of the Replacement Method and hypercontractivity yields good, explicit error bounds. Regarding the Carbery–Wright Theorem [CW01], an alternative exposition appears in Nazarov, Sodin, and Vol’berg [NSV02].
Regarding the Majority Is Stablest Theorem (conjectured in Khot, Kindler, Mossel, and O’Donnell [KKMO04] and proved originally in Mossel, O’Donnell, and Oleszkiewicz [MOO05a]), it can be added that additional motivation for the conjecture came from Kalai [Kal02]. The fact that (SDP) is an efficiently computable relaxation for the Max-Cut problem dates back to the 1990 work of Delorme and Poljak [DP93]; however, they were unable to give an analysis relating its value to the optimum cut value. In fact, they conjectured that the case of the $5$-cycle from Example 72 had the worst ratio of $\mathrm{Opt}(G)$ to $\mathrm{SDPOpt}(G)$. Goemans and Williamson [GW94] were the first to give a sharp analysis of the SDP (Theorem 71), at least for $\theta \geq \theta^*$. Feige and Schechtman [FS02] showed an optimal integrality gap for the SDP for all values $\theta \geq \theta^*$ (in particular, showing an integrality gap ratio of $c_{\text{GW}}$); interestingly, their construction essentially involved proving Borell’s Isoperimetric Inequality (though they did it on the sphere rather than in Gaussian space). Both before and after the Khot et al. [KKMO04] UG-hardness result for Max-Cut there was a long line of work [Kar99,Zwi99,AS00,ASZ02,CW04,KV05,FL06,KO06] devoted to improving the known approximation algorithms and UG-hardness results, in particular for $\theta < \theta^*$. This culminated in the results from O’Donnell and Wu [OW08] (mentioned in Remark 74), which showed explicit matching $(\alpha,\beta)$-approximation algorithms, integrality gaps, and UG-hardness results for all $\frac12 < \beta < 1$. The fact that the best integrality gaps matched the best UG-hardness results proved not to be a coincidence; in contemporaneous work, Raghavendra [Rag08] showed that for any CSP, any SDP integrality gap could be turned into a matching Dictator-vs.-No-Notables test. 
This implies the existence of matching efficient $(\alpha,\beta)$-approximation algorithms and UG-hardness results for every CSP and every $\beta$. See Raghavendra’s thesis [Rag09] for full details of his earlier publication [Rag08] (including some Invariance Principle extensions building further on Mossel [Mos10]); see also Austrin’s work [Aus07,Aus07a] for precursors to the Raghavendra theory.
Exercise 31 concerns a problem introduced by Talagrand [Tal89]. Talagrand offers a \$1000 prize [Tal06a] for a solution to the following Boolean version of the problem: Show that for any fixed $0 < \rho < 1$ and for $f : \{-1,1\}^n \to {\mathbb R}^{\geq 0}$ with $\mathop{\bf E}[f] = 1$ it holds that $\mathop{\bf Pr}[\mathrm{T}_\rho f > t] = o(1/t)$ as $t \to \infty$. (The rate of decay may depend on $\rho$ but not, of course, on $n$; in fact, a bound of the form $O(\frac{1}{t \sqrt{\log t}})$ is expected.) The result outlined in Exercise 31 (obtained together with James Lee) is for the very special case of $1$-dimensional Gaussian space; Ball, Barthe, Bednorz, Oleszkiewicz, and Wolff [BBB+13] obtained the same result and also showed a bound of $O(\frac{\log \log t}{t \sqrt{\log t}})$ for $d$-dimensional Gaussian space (with the constant in the $O(\cdot)$ depending on $d$).
The Multifunction Invariance Principle (Exercise 48 and its special case Exercise 46) are from Mossel [Mos10]; the version for general product spaces (Exercise 49) is from Mossel, O’Donnell, and Oleszkiewicz [MOO10].
Theorem 77 For each $\rho \in (0,1)$ and $\epsilon > 0$, there exists a constant $C_{\rho, \epsilon}$ such that the following holds: If $f, g : \{-1,1\}^n \to [\epsilon, 1-\epsilon]$, then \[ \mathop{\bf E}_{\substack{({\boldsymbol{x}}, {\boldsymbol{x}}') \\ \text{$\rho$-correlated}}}[\Lambda_\rho(f({\boldsymbol{x}}), g({\boldsymbol{x}}'))] \leq \Lambda_\rho(\mathop{\bf E}[f], \mathop{\bf E}[g]) + C_{\rho, \epsilon}\cdot (\Delta_n[f] + \Delta_n[g]). \] Here we are using the following inductive notation: $\Delta_1[f] = \mathop{\bf E}[|f - \mathop{\bf E}[f]|^3]$, and \[ \Delta_n[f] = \mathop{\bf E}_{{\boldsymbol{x}}_n \sim \{-1,1\}}\left[\Delta_{n-1}[f_{ \mid {\boldsymbol{x}}_n}]\right] + \Delta_1[f^{\subseteq \{n\}}]. \]
Proposition 78 Let $\psi : {\mathbb R}^d \to {\mathbb R}$ be $c$-Lipschitz. Then for any $\eta > 0$ there exists $\widetilde{\psi}_\eta : {\mathbb R}^d \to {\mathbb R}$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \leq c \sqrt{d} \eta$ and $\|\partial^\beta \widetilde{\psi}_\eta\|_\infty \leq C_{|\beta|} c \sqrt{d}/\eta^{|\beta|-1}$ for each multi-index $\beta \in {\mathbb N}^{d}$ with $|\beta| = \sum_i \beta_i \geq 1$, where $C_k$ is a constant depending only on $k$.
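To make Proposition 78 concrete, here is a small numerical sketch (illustrative only, not from the text) of the $d = 1$, $c = 1$ case: mollifying $\psi(x) = |x|$ by a Gaussian of width $\eta$ gives $\widetilde{\psi}_\eta(x) = \mathop{\bf E}[|x + \eta \boldsymbol{g}|]$, which has a closed form via the error function, so we can check the sup-distance bound and the second-derivative bound directly.

```python
import math

def psi(x):
    # 1-Lipschitz test function (c = 1, d = 1)
    return abs(x)

def psi_smooth(x, eta):
    # Closed form for E[|x + eta*g|], g ~ N(0,1): Gaussian mollification of |x|
    return x * math.erf(x / (eta * math.sqrt(2))) + \
        eta * math.sqrt(2 / math.pi) * math.exp(-x**2 / (2 * eta**2))

eta = 0.1
grid = [i / 1000 for i in range(-5000, 5001)]
sup_diff = max(abs(psi(x) - psi_smooth(x, eta)) for x in grid)
# Proposition 78 with c = 1, d = 1 predicts sup-distance at most eta;
# in fact the maximum is at x = 0, where it equals eta*sqrt(2/pi) ~ 0.8*eta.
assert sup_diff <= eta

# The mollification's second derivative is 2*phi(x/eta)/eta, maximized at 0,
# consistent with the bound C_2 * c / eta; check by finite differences:
h = 1e-4
second = (psi_smooth(h, eta) - 2 * psi_smooth(0, eta) + psi_smooth(-h, eta)) / h**2
assert abs(second - 2 / (eta * math.sqrt(2 * math.pi))) < 1e-3
```

The sup-distance is attained at $x = 0$, where it equals $\eta\sqrt{2/\pi} < \eta$.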
Invariance Principle for Sums of Random Vectors Let $\vec{\boldsymbol{X}}_1, \dots, \vec{\boldsymbol{X}}_n$, $\vec{\boldsymbol{Y}}_1, \dots, \vec{\boldsymbol{Y}}_n$ be independent ${\mathbb R}^d$-valued random variables with matching means and covariance matrices; i.e., $\mathop{\bf E}[\vec{\boldsymbol{X}}_t] = \mathop{\bf E}[\vec{\boldsymbol{Y}}_t]$ and $\mathop{\bf Cov}[\vec{\boldsymbol{X}}_t] = \mathop{\bf Cov}[\vec{\boldsymbol{Y}}_t]$ for all $t \in [n]$. (Note that the $d$ individual components of a particular $\vec{\boldsymbol{X}}_t$ or $\vec{\boldsymbol{Y}}_t$ are not required to be independent.) Write $\vec{\boldsymbol{S}}_X = \sum_{t=1}^n \vec{\boldsymbol{X}}_t$ and $\vec{\boldsymbol{S}}_Y = \sum_{t=1}^n \vec{\boldsymbol{Y}}_t$. Then for any $\mathcal{C}^3$ function $\psi : {\mathbb R}^d \to {\mathbb R}$ satisfying $\|\partial^\beta \psi\|_\infty \leq C$ for all $|\beta| = 3$, \[ \left| \mathop{\bf E}[\psi(\vec{\boldsymbol{S}}_X)] - \mathop{\bf E}[\psi(\vec{\boldsymbol{S}}_Y)] \right| \leq C \gamma_{\vec{X}\vec{Y}}, \] where \[ \gamma_{\vec{X}\vec{Y}} = \sum_{\substack{\beta \in {\mathbb N}^d\\|\beta| = 3}} \frac{1}{\beta!} \sum_{t=1}^n \bigl(\mathop{\bf E}[|\vec{\boldsymbol{X}}_t^\beta|] + \mathop{\bf E}[|\vec{\boldsymbol{Y}}_t^\beta|]\bigr). \]
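As a sanity check on the error term (an illustrative computation, not from the text), take $d = 1$ and $\psi = \cos$ (so $C = 1$), with $\vec{\boldsymbol{X}}_t = \boldsymbol{\varepsilon}_t/\sqrt{n}$ Rademacher and $\vec{\boldsymbol{Y}}_t = \boldsymbol{g}_t/\sqrt{n}$ Gaussian; both expectations are computable in closed form, and their gap is well within $C\gamma_{\vec{X}\vec{Y}}$.

```python
import math

# d = 1 check of the invariance principle for sums, with psi = cos
# (so |psi'''| <= C = 1). X_t = eps_t/sqrt(n) (Rademacher), Y_t = g_t/sqrt(n)
# (Gaussian): matching means (0) and variances (1/n).
n = 100

# E[cos(sum eps_t / sqrt(n))] = cos(1/sqrt(n))^n, since the characteristic
# function of the sum factors and each factor E[exp(i*eps/sqrt(n))] is real.
E_psi_X = math.cos(1 / math.sqrt(n)) ** n
# The sum of the Y_t is exactly N(0,1), and E[cos(g)] = exp(-1/2).
E_psi_Y = math.exp(-0.5)

# Third-moment error term gamma (d = 1, beta = (3), beta! = 6):
E_abs_g_cubed = 2 * math.sqrt(2 / math.pi)      # E|g|^3 for g ~ N(0,1)
gamma = (1 / 6) * n * n ** -1.5 * (1 + E_abs_g_cubed)

diff = abs(E_psi_X - E_psi_Y)
assert diff <= gamma      # the theorem's bound C * gamma with C = 1
assert diff < 1e-3        # the true gap here is O(1/n), far smaller
```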
Multifunction Invariance Principle Let $F^{(1)}$, \dots, $F^{(d)}$ be formal $n$-variate multilinear polynomials each of degree at most $k \in {\mathbb N}$. Let $\vec{{\boldsymbol{x}}}_1, \dots, \vec{{\boldsymbol{x}}}_n$ and $\vec{\boldsymbol{y}}_1, \dots, \vec{\boldsymbol{y}}_n$ be independent ${\mathbb R}^d$-valued random variables such that $\mathop{\bf E}[\vec{{\boldsymbol{x}}}_t] = \mathop{\bf E}[\vec{\boldsymbol{y}}_t] = 0$ and $M_t = \mathop{\bf Cov}[\vec{{\boldsymbol{x}}}_t] = \mathop{\bf Cov}[\vec{\boldsymbol{y}}_t]$ for each $t \in [n]$. Assume each $M_t$ has all its diagonal entries equal to $1$ (i.e., each of the $d$ components of $\vec{{\boldsymbol{x}}}_t$ has variance $1$, and similarly for $\vec{\boldsymbol{y}}_t$). Further assume each component random variable $\vec{{\boldsymbol{x}}}_t^{(j)}$ and $\vec{\boldsymbol{y}}_t^{(j)}$ is $(2,3,\rho)$-hypercontractive ($t \in [n]$, $j \in [d]$). Then for any $\mathcal{C}^3$ function $\psi : {\mathbb R}^d \to {\mathbb R}$ satisfying $\|\partial^\beta \psi\|_\infty \leq C$ for all $|\beta| = 3$, \[ \left|\mathop{\bf E}[\psi(\vec{F}(\vec{{\boldsymbol{x}}}))] - \mathop{\bf E}[\psi(\vec{F}(\vec{\boldsymbol{y}}))]\right| \leq \tfrac{Cd^2}{3} \cdot (1/\rho)^{3k} \cdot \sum_{t=1}^n \sum_{j=1}^d \mathbf{Inf}_t[F^{(j)}]^{3/2}. \] Here we are using the following notation: If $\vec{\boldsymbol{z}} = (\vec{\boldsymbol{z}}_1, \dots, \vec{\boldsymbol{z}}_n)$ is a sequence of ${\mathbb R}^d$-valued random variables, $\vec{F}(\vec{\boldsymbol{z}})$ denotes the vector in ${\mathbb R}^d$ whose $j$th component is $F^{(j)}(\vec{\boldsymbol{z}}_1^{(j)}, \dots, \vec{\boldsymbol{z}}_n^{(j)})$.
(Hint: Combine the proofs of the Basic Invariance Principle and the Invariance Principle for Sums of Random Vectors, Exercise 46. The only challenging part should be notation.)
Invariance Principle in general product spaces
Let $(\Omega, \pi)$ be a finite probability space, $|\Omega| = m \geq 2$, in which every outcome has probability at least $\lambda$. Suppose $f \in L^2(\Omega^n, \pi^{\otimes n})$ has degree at most $k$; thus, fixing some Fourier basis $\phi_0, \dots, \phi_{m-1}$ for $L^2(\Omega, \pi)$, we have \[ f = \sum_{\substack{\alpha \in {\mathbb N}^n_{<m} \\ \#\alpha \leq k}} \widehat{f}(\alpha)\phi_\alpha. \] Introduce indeterminates $x = (x_{i,j})_{i \in [n], j \in [m-1]}$ and let $F$ be the formal $(m-1)n$-variate polynomial of degree at most $k$ defined by \[ F(x) = \sum_{\#\alpha \leq k} \widehat{f}(\alpha) \prod_{i \in \mathrm{supp}(\alpha)} x_{i, \alpha_i}. \] Then for any $\psi : {\mathbb R} \to {\mathbb R}$ that is $\mathcal{C}^3$ and satisfies $\|\psi^{\prime \prime \prime}\|_\infty \leq C$ we have \[ \left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^{(m-1)n}}[\psi(F({\boldsymbol{x}}))] - \mathop{\bf E}_{\boldsymbol{\omega} \sim \pi^{\otimes n}}[\psi(f(\boldsymbol{\omega}))]\right| \leq \tfrac{C}{3} \cdot (2\sqrt{2/\lambda})^k \cdot \sum_{i=1}^n \mathbf{Inf}_i[f]^{3/2}. \]
(Hint: For $0 \leq t \leq n$, define the function $h_t \in L^2(\Omega^t \times \{-1,1\}^{(m-1)(n-t)}, \pi^{\otimes t} \otimes \pi_{1/2}^{\otimes (m-1)(n-t)})$ via \[ h_t(\boldsymbol{\omega}_1, \dots, \boldsymbol{\omega}_t, {\boldsymbol{x}}_{t+1,1}, \dots, {\boldsymbol{x}}_{n,m-1}) = \sum_{\#\alpha \leq k} \widehat{f}(\alpha) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i \leq t}} \phi_{\alpha_i}(\boldsymbol{\omega}_i) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i > t}} {\boldsymbol{x}}_{i,\alpha_i}. \] Express \[ h_{t} = \mathrm{E}_t h_{t} + \mathrm{L}_t h_{t} = \mathrm{E}_t h_t + \sum_{j = 1}^{m-1} D_j \cdot \phi_j(\boldsymbol{\omega}_t) \] where \[ D_j = \sum_{\alpha : \alpha_t = j} \widehat{f}(\alpha) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i < t}} \phi_{\alpha_i}(\boldsymbol{\omega}_i) \prod_{\substack{i \in \mathrm{supp}(\alpha) \\ i > t}} {\boldsymbol{x}}_{i,\alpha_i}, \] and note that $h_{t-1} = \mathrm{E}_t h_t + \sum_{j = 1}^{m-1} D_j \cdot {\boldsymbol{x}}_{t,j}$.)
Theorem 79 Let $(\Omega, \pi)$ be a finite probability space in which each outcome has probability at least $\lambda$. Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $[0,1]$. Suppose that $f$ has no $(\epsilon, \frac{1}{\log(1/\epsilon)})$-notable coordinates. Then for any ${0 \leq \rho < 1}$, \[ \mathbf{Stab}_\rho[f] \leq \Lambda_\rho(\mathop{\bf E}[f]) + O\bigl(\tfrac{\log \log (1/\epsilon)}{\log(1/\epsilon)}\bigr) \cdot \tfrac{\log(1/\lambda)}{1-\rho}. \]
(Hint: Naturally, you’ll need Exercise 49(b).)
Remark: It can be shown that the $\mathcal{C}^3$ hypothesis in Propositions 26 and 28 is not necessary (provided the derivatives are interpreted in the distributional sense); see, e.g., Bogachev [Bog98] (Chapter 1) for more details.
Definition 76 For $\rho \in [-1,1]$ we define the Gaussian quadrant probability function $\Lambda_\rho : [0,1]^2 \to [0,1]$ by \[ \Lambda_\rho(\alpha,\beta) = \mathop{\bf Pr}_{\substack{(\boldsymbol{z}, \boldsymbol{z}') \text{ $\rho$-correlated} \\ \text{standard Gaussians}}}[\boldsymbol{z} \leq t, \boldsymbol{z}' \leq t'], \] where $t$ and $t'$ are defined by $\Phi(t) = \alpha$, $\Phi(t') = \beta$. This is a slight reparametrization of the bivariate Gaussian cdf. We also use the shorthand notation \[ \Lambda_\rho(\alpha) = \Lambda_\rho(\alpha, \alpha), \] which we encountered in Borell’s Isoperimetric Theorem (and also in Exercises 5.33 and 9.24, with a different, but equivalent, definition).
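For intuition, $\Lambda_\rho$ is easy to compute numerically; the sketch below (illustrative, not from the text) evaluates $\Lambda_\rho(\tfrac12, \tfrac12)$, i.e., $t = t' = 0$, by one-dimensional integration and checks it against the Sheppard-type identity $\Lambda_\rho(\tfrac12) = \tfrac14 + \frac{\arcsin \rho}{2\pi}$.

```python
import math

def Phi(x):
    # standard Gaussian cdf
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Lambda(rho, t, tprime):
    # Pr[z <= t, z' <= tprime] for rho-correlated standard Gaussians,
    # via z' = rho*z + sqrt(1-rho^2)*w with w independent of z:
    # integrate phi(z) * Phi((tprime - rho*z)/sqrt(1-rho^2)) over z <= t.
    s = math.sqrt(1 - rho * rho)
    lo, step = -10.0, 1e-3
    n = int((t - lo) / step)
    total = 0.0
    for i in range(n + 1):
        z = lo + i * step
        w = 0.5 if i in (0, n) else 1.0      # trapezoid-rule weights
        total += w * phi(z) * Phi((tprime - rho * z) / s)
    return total * step

# Lambda_rho(1/2, 1/2) = 1/4 + arcsin(rho)/(2*pi)   (t = t' = 0)
for rho in (0.2, 0.5, 0.9):
    exact = 0.25 + math.asin(rho) / (2 * math.pi)
    assert abs(Lambda(rho, 0.0, 0.0) - exact) < 1e-4
```

For general $(\alpha, \beta)$ one would first set $t = \Phi^{-1}(\alpha)$, $t' = \Phi^{-1}(\beta)$.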
Two-Set Borell Isoperimetric Theorem Fix $\rho \in (0,1)$ and $\alpha, \beta \in [0,1]$. Then for any $A, B \subseteq {\mathbb R}^n$ with $\mathrm{vol}_\gamma(A) = \alpha$, $\mathrm{vol}_\gamma(B) = \beta$, \begin{equation} \label{eqn:two-set-borell1} \mathop{\bf Pr}_{\substack{(\boldsymbol{z}, \boldsymbol{z}') \text{ $\rho$-correlated} \\ \text{$n$-dimensional Gaussians}}}[\boldsymbol{z} \in A, \boldsymbol{z}' \in B] \leq \Lambda_\rho(\alpha,\beta). \end{equation}
By definition of $\Lambda_\rho(\alpha,\beta)$, equality holds if $A$ and $B$ are parallel halfspaces. Taking $\beta = \alpha$ and $B = A$ in this theorem gives Borell’s Isoperimetric Theorem as stated in Section 3 (in the case of range $\{0,1\}$, at least, which is equivalent by Exercise 25). It’s quite natural to guess that parallel halfspaces should maximize the “joint Gaussian noise stability” quantity on the left of \eqref{eqn:two-set-borell1}, especially in light of Remark 2 from Chapter 10.1 concerning the analogous Generalized Small-Set Expansion Theorem. Just as our proof of the Small-Set Expansion Theorem passed through the Two-Function Hypercontractivity Theorem to facilitate induction, so too does the Mossel–Neeman proof pass through the following “two-function version” of Borell’s Isoperimetric Theorem:
Two-Function Borell Isoperimetric Theorem Fix $\rho \in (0,1)$ and let $f, g \in L^2({\mathbb R}^n, \gamma)$ have range $[0,1]$. Then \[ \mathop{\bf E}_{\substack{(\boldsymbol{z}, \boldsymbol{z}') \text{ $\rho$-correlated} \\ \text{$n$-dimensional Gaussians}}}[\Lambda_\rho(f(\boldsymbol{z}),g(\boldsymbol{z}'))] \leq \Lambda_\rho\left(\mathop{\bf E}[f], \mathop{\bf E}[g]\right). \]
Turning to hardness of approximation, we know from Theorem 7.41 (developed in [KKMO04]) that to prove UG-hardness of $(\alpha+\delta, \beta-\delta)$-approximating Max-Cut, it suffices to construct an $(\alpha,\beta)$-Dictator-vs.-No-Notables test which uses the predicate $\neq$. As we’ll see in this section, the quality of the most natural such test can be easily inferred from the Majority Is Stablest Theorem. Assuming that theorem (as Khot et al. [KKMO04] did), we get a surprising conclusion: It’s UG-hard to approximate the Max-Cut CSP any better than the Goemans–Williamson Algorithm does. In other words, the peculiar approximation guarantee of Goemans and Williamson on the very simple Max-Cut problem is optimal (assuming the Unique Games Conjecture). Let’s demystify this somewhat, starting with a description of the Goemans–Williamson Algorithm. Let $G = (V,E)$ be an $n$-vertex input graph for the algorithm; we’ll write $(\boldsymbol{v}, \boldsymbol{w}) \sim E$ to denote that $(\boldsymbol{v}, \boldsymbol{w})$ is a uniformly random edge (i.e., $\neq$-constraint) in the graph. The first step of the Goemans–Williamson Algorithm is to solve the following optimization problem: \begin{equation} \label{eqn:gw-sdp} \tag{SDP} \begin{aligned} \text{maximize}& \quad \mathop{\bf E}_{(\boldsymbol{v},\boldsymbol{w}) \sim E}\left[\tfrac{1}{2} - \tfrac{1}{2} \langle \vec{U}(\boldsymbol{v}), \vec{U}(\boldsymbol{w}) \rangle\right] \\ \text{subject to}& \quad \vec{U} : V \to S^{n-1}. \end{aligned} \end{equation} Here $S^{n-1}$ denotes the set of all unit vectors in ${\mathbb R}^n$. Somewhat surprisingly, since this optimization problem is a “semidefinite program” it can be solved in polynomial time using the Ellipsoid Algorithm. (Technically, it can only be solved up to any desired additive tolerance $\epsilon > 0$, but we’ll ignore this point.) Let’s write $\mathrm{SDPOpt}(G)$ for the optimum value of \eqref{eqn:gw-sdp}, and $\mathrm{Opt}(G)$ for the optimum Max-Cut value for $G$.
We claim that \eqref{eqn:gw-sdp} is a relaxation of the Max-Cut CSP on input $G$, and therefore \[ \mathrm{SDPOpt}(G) \geq \mathrm{Opt}(G). \] To see this, simply note that if $F^* : V \to \{-1,1\}$ is an optimal assignment (“cut”) for $G$ then we can define $\vec{U}(v) = (F^*(v), 0, \dots, 0) \in S^{n-1}$ for each $v \in V$ and achieve the optimal cut value $\mathrm{Val}_G(F^*)$ in \eqref{eqn:gw-sdp}.
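This relaxation claim can be verified mechanically on a small instance (an illustrative sketch, not from the text): on the $5$-cycle, embedding every cut $F$ as $\vec{U}(v) = (F(v), 0, 0, 0, 0)$ makes the SDP objective equal $\mathrm{Val}_G(F)$, so $\mathrm{SDPOpt}(G) \geq \mathrm{Opt}(G)$.

```python
import itertools

# Z_5 cycle: edges (v, v+1 mod 5). For every cut F : V -> {-1,1}, embed
# U(v) = (F(v), 0, 0, 0, 0) and check the SDP objective equals the cut value.
edges = [(v, (v + 1) % 5) for v in range(5)]

def cut_val(F):
    # fraction of edges cut by the assignment F
    return sum(F[v] != F[w] for v, w in edges) / len(edges)

def sdp_obj(U):
    # E over edges of 1/2 - 1/2 <U(v), U(w)>
    return sum(0.5 - 0.5 * sum(a * b for a, b in zip(U[v], U[w]))
               for v, w in edges) / len(edges)

best = 0.0
for F in itertools.product([-1, 1], repeat=5):
    U = [(F[v], 0, 0, 0, 0) for v in range(5)]
    assert abs(sdp_obj(U) - cut_val(F)) < 1e-12   # relaxation property
    best = max(best, cut_val(F))

assert abs(best - 0.8) < 1e-12   # Opt(Z_5): best bipartition cuts 4 of 5 edges
```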
The second step of the Goemans–Williamson Algorithm might look familiar from Fact 7 and Remark 8. Let $\vec{U}^* : V \to S^{n-1}$ be the optimal solution for \eqref{eqn:gw-sdp}, achieving $\mathrm{SDPOpt}(G)$; abusing notation we’ll write $\vec{U}^*(v) = \vec{v}$. The algorithm now chooses $\vec{\boldsymbol{g}} \sim \mathrm{N}(0,1)^n$ at random and outputs the assignment (cut) $\boldsymbol{F} : V \to \{-1,1\}$ defined by $\boldsymbol{F}(v) = \mathrm{sgn}(\langle \vec{v}, \vec{\boldsymbol{g}} \rangle)$. Let’s analyze the (expected) quality of this assignment. The probability the algorithm’s assignment $\boldsymbol{F}$ cuts a particular edge $(v,w) \in E$ is \[ \mathop{\bf Pr}_{\vec{\boldsymbol{g}} \sim \mathrm{N}(0,1)^n}[\mathrm{sgn}(\langle \vec{v}, \vec{\boldsymbol{g}} \rangle) \neq \mathrm{sgn}(\langle \vec{w}, \vec{\boldsymbol{g}} \rangle)]. \] This is precisely the probability that $\mathrm{sgn}(\boldsymbol{z}) \neq \mathrm{sgn}(\boldsymbol{z}')$ when $(\boldsymbol{z}, \boldsymbol{z}')$ is a pair of $\langle \vec{v}, \vec{w}\rangle$-correlated $1$-dimensional Gaussians. Writing $\angle(\vec{v},\vec{w}) \in [0,\pi]$ for the angle between the unit vectors $\vec{v}, \vec{w}$, we conclude from Sheppard’s Formula (see equation 11.1.(2)) that \[ \mathop{\bf Pr}_{\vec{\boldsymbol{g}}}[\boldsymbol{F} \text{ cuts edge $(v,w)$}] = \frac{\angle(\vec{v}, \vec{w})}{\pi}. \] By linearity of expectation we can compute the expected value of the algorithm’s assignment $\boldsymbol{F}$: \begin{equation} \mathop{\bf E}_{\vec{\boldsymbol{g}}}[\mathrm{Val}_G(\boldsymbol{F})] = \mathop{\bf E}_{(\boldsymbol{v},\boldsymbol{w}) \sim E}\left[\angle(\vec{\boldsymbol{v}}, \vec{\boldsymbol{w}})/\pi\right]. \label{eqn:gw-gets}
\end{equation}
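The key probabilistic step, that a random hyperplane separates $\vec{v}$ and $\vec{w}$ with probability exactly $\angle(\vec{v},\vec{w})/\pi$, can be spot-checked by Monte Carlo (an illustrative sketch, not from the text):

```python
import math, random

random.seed(0)

# For unit vectors v, w at angle theta, Sheppard's Formula says
# Pr_g[sgn(<v,g>) != sgn(<w,g>)] = theta/pi for g ~ N(0,1)^2.
theta = 2.0
v = (1.0, 0.0)
w = (math.cos(theta), math.sin(theta))

trials, cuts = 200_000, 0
for _ in range(trials):
    g = (random.gauss(0, 1), random.gauss(0, 1))
    s1 = v[0] * g[0] + v[1] * g[1] > 0
    s2 = w[0] * g[0] + w[1] * g[1] > 0
    cuts += (s1 != s2)

assert abs(cuts / trials - theta / math.pi) < 0.005
```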
On the other hand, by definition we have \begin{equation} \mathrm{SDPOpt}(G) = \mathop{\bf E}_{(\boldsymbol{v},\boldsymbol{w}) \sim E}\left[\tfrac{1}{2}-\tfrac{1}{2}\cos \angle(\vec{\boldsymbol{v}}, \vec{\boldsymbol{w}})\right]. \label{eqn:gw-sdpopt} \end{equation} It remains to compare \eqref{eqn:gw-gets} and \eqref{eqn:gw-sdpopt}. Define \begin{equation} \label{eqn:cgw-def} c_{\text{GW}} = \min_{\theta \in [0,\pi]} \left\{\frac{\theta/\pi}{\tfrac{1}{2}-\tfrac{1}{2} \cos \theta}\right\} \approx .8786. \end{equation} Then from \eqref{eqn:gw-gets} and \eqref{eqn:gw-sdpopt} we immediately get \[ \mathop{\bf E}_{\vec{\boldsymbol{g}}}[\mathrm{Val}_G(\boldsymbol{F})] \geq c_{\text{GW}} \cdot \mathrm{SDPOpt}(G) \geq c_{\text{GW}} \cdot \mathrm{Opt}(G); \] i.e., in expectation the Goemans–Williamson Algorithm delivers a cut of value at least $c_{\text{GW}}$ times the Max-Cut. In other words, it’s a $(c_{\text{GW}} \beta, \beta)$-approximation algorithm, as claimed. By being a little bit more careful about this analysis (exercise) you can show the following additional result:
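The constant in \eqref{eqn:cgw-def} is easy to reproduce numerically (an illustrative sketch, not from the text); a grid search also confirms that the minimizing $\theta$ satisfies $\tan(\theta/2) = \theta$ and sits near $.74\pi$:

```python
import math

# grid search for c_GW = min over theta of (theta/pi) / (1/2 - 1/2 cos theta)
best_c, best_theta = float("inf"), 0.0
N = 200_000
for k in range(1, N):
    theta = k * math.pi / N
    ratio = (theta / math.pi) / (0.5 - 0.5 * math.cos(theta))
    if ratio < best_c:
        best_c, best_theta = ratio, theta

assert abs(best_c - 0.8786) < 1e-3
assert abs(best_theta / math.pi - 0.74) < 0.005
# the minimizer theta* solves tan(theta/2) = theta:
assert abs(math.tan(best_theta / 2) - best_theta) < 1e-3
```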
Theorem 71 [GW95]. Let $\theta \in [\theta^*, \pi]$, where $\theta^* \approx .74\pi$ is the minimizing $\theta$ in \eqref{eqn:cgw-def} (also definable as the positive solution of $\tan(\theta/2) = \theta$). Then on any graph $G$ with $\mathrm{SDPOpt}(G) \geq \tfrac{1}{2} - \tfrac{1}{2} \cos \theta$, the Goemans–Williamson Algorithm produces a cut of (expected) value at least $\theta/\pi$. In particular, the algorithm is a $(\theta/\pi, \tfrac{1}{2} - \tfrac{1}{2} \cos \theta)$-approximation algorithm for Max-Cut.
Example 72 Consider the Max-Cut problem on the $5$-vertex cycle graph ${\mathbb Z}_5$. The best bipartition of this graph cuts $4$ out of the $5$ edges; hence $\mathrm{Opt}({\mathbb Z}_5) = \frac45$. In the exercises you are asked to show that taking \[ \vec{U}(v) = (\cos \tfrac{4\pi v}{5}, \sin \tfrac{4\pi v}{5}), \quad v \in {\mathbb Z}_5, \] in the semidefinite program \eqref{eqn:gw-sdp} establishes that $\mathrm{SDPOpt}({\mathbb Z}_5) \geq \tfrac{1}{2} - \tfrac{1}{2} \cos \frac{4\pi}{5}$. (These are actually unit vectors in ${\mathbb R}^2$ rather than in ${\mathbb R}^5$ as \eqref{eqn:gw-sdp} requires, but we can pad out the last three coordinates with zeroes.) This example shows that the Goemans–Williamson analysis in Theorem 71 lower-bounding $\mathrm{Opt}(G)$ in terms of $\mathrm{SDPOpt}(G)$ cannot be improved (at least when $\mathrm{SDPOpt}(G) = \tfrac{1}{2} - \tfrac{1}{2} \cos \frac{4\pi}{5}$). This is termed an optimal integrality gap. In fact, Theorem 71 also implies that $\mathrm{SDPOpt}({\mathbb Z}_5)$ must equal $\tfrac{1}{2} - \tfrac{1}{2} \cos \frac{4\pi}{5}$, for if it were greater, the theorem would falsely imply that $\mathrm{Opt}({\mathbb Z}_5) > \frac45$. Note that the Goemans–Williamson Algorithm actually finds the maximum cut when run on the cycle graph ${\mathbb Z}_5$. For a related example, see the exercises.
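The claimed SDP value for ${\mathbb Z}_5$ can be checked directly (an illustrative sketch, not from the text): every edge of the $2$-dimensional embedding has inner product $\cos\frac{4\pi}{5}$, so the objective is $\tfrac12 - \tfrac12\cos\frac{4\pi}{5} \approx .9045$, strictly larger than $\mathrm{Opt}({\mathbb Z}_5) = .8$.

```python
import math

# 2-d embedding U(v) = (cos(4*pi*v/5), sin(4*pi*v/5)) of the 5-cycle:
# each edge (v, v+1) has <U(v), U(w)> = cos(4*pi/5), so the SDP objective
# is 1/2 - 1/2*cos(4*pi/5) ~ 0.9045.
U = [(math.cos(4 * math.pi * v / 5), math.sin(4 * math.pi * v / 5))
     for v in range(5)]
edges = [(v, (v + 1) % 5) for v in range(5)]
obj = sum(0.5 - 0.5 * (U[v][0] * U[w][0] + U[v][1] * U[w][1])
          for v, w in edges) / 5

assert abs(obj - (0.5 - 0.5 * math.cos(4 * math.pi / 5))) < 1e-12
assert obj > 0.8        # strictly above the integral optimum Opt(Z_5) = 4/5
```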
Now we explain the result of Khot et al. [KKMO04], that the Majority Is Stablest Theorem implies it’s UG-hard to approximate Max-Cut better than the Goemans–Williamson Algorithm does:
Theorem 73 [KKMO04]. Let $\theta \in (\frac{\pi}{2}, \pi)$. Then for any $\delta > 0$ it’s UG-hard to $(\theta/\pi + \delta, \tfrac{1}{2} - \tfrac{1}{2} \cos\theta)$-approximate Max-Cut.
Proof: It follows from Theorem 7.41 that we just need to construct a $(\theta/\pi, \tfrac{1}{2} - \tfrac{1}{2} \cos\theta)$-Dictator-vs.-No-Notables test using the predicate $\neq$. (See the exercises for an extremely minor technical point.) It’s very natural to try the following, with $\beta = \tfrac{1}{2} - \tfrac{1}{2} \cos \theta \in (\frac12, 1)$:
$\boldsymbol{\beta}$-Noise Sensitivity Test Given query access to $f : \{-1,1\}^n \to \{-1,1\}$:
- Choose ${\boldsymbol{x}} \sim \{-1,1\}^n$ and form ${\boldsymbol{x}}'$ by reversing each bit of ${\boldsymbol{x}}$ independently with probability $\beta = \tfrac{1}{2} - \tfrac{1}{2} \cos\theta$. In other words let $({\boldsymbol{x}}, {\boldsymbol{x}}')$ be a pair of $\cos\theta$-correlated strings. (Note that $\cos\theta < 0$.)
- Query $f$ at ${\boldsymbol{x}}$, ${\boldsymbol{x}}'$.
- Accept if $f({\boldsymbol{x}}) \neq f({\boldsymbol{x}}')$.
By design, \begin{equation} \label{eqn:kkmo-acceptance} \mathop{\bf Pr}[\text{the test accepts } f] = \mathbf{NS}_\beta[f] = \tfrac{1}{2} - \tfrac{1}{2} \mathbf{Stab}_{\cos \theta}[f]. \end{equation} (We might also express this as “$\mathbf{RS}_f(\theta)$”.) In particular, if $f$ is a dictator, it’s accepted with probability exactly $\beta = \tfrac{1}{2} - \tfrac{1}{2} \cos \theta$. To complete the proof that this is a $(\theta/\pi, \tfrac{1}{2} - \tfrac{1}{2} \cos\theta)$-Dictator-vs.-No-Notables test, let’s suppose $f : \{-1,1\}^n \to [-1,1]$ has no $(\epsilon,\epsilon)$-notable coordinates and show that \eqref{eqn:kkmo-acceptance} is at most $\theta/\pi + o_\epsilon(1)$. (Regarding $f$ having range $[-1,1]$, recall Remark 7.39.)
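Both facts about \eqref{eqn:kkmo-acceptance}, that a dictator is accepted with probability exactly $\beta$ and that acceptance equals $\tfrac12 - \tfrac12\mathbf{Stab}_{\cos\theta}[f]$ in general, can be verified by exact enumeration on small functions (an illustrative sketch, not from the text; for $\mathrm{Maj}_3$ we use the known formula $\mathbf{Stab}_\rho[\mathrm{Maj}_3] = \tfrac34\rho + \tfrac14\rho^3$):

```python
import itertools, math

theta = 0.75 * math.pi
beta = 0.5 - 0.5 * math.cos(theta)      # noise rate; note cos(theta) < 0
rho = math.cos(theta)

def accept_prob(f, n):
    # exact acceptance probability of the beta-noise sensitivity test:
    # average over x and flip patterns of the indicator f(x) != f(x')
    total = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        for flips in itertools.product([0, 1], repeat=n):
            p = math.prod(beta if b else 1 - beta for b in flips)
            xp = tuple(-xi if b else xi for xi, b in zip(x, flips))
            total += p * (f(x) != f(xp)) / 2 ** n
    return total

dictator = lambda x: x[0]
maj3 = lambda x: 1 if sum(x) > 0 else -1

# A dictator is accepted with probability exactly beta ...
assert abs(accept_prob(dictator, 3) - beta) < 1e-9
# ... while Maj_3 is accepted w.p. 1/2 - 1/2 * Stab_rho[Maj_3]:
stab = 0.75 * rho + 0.25 * rho ** 3
assert abs(accept_prob(maj3, 3) - (0.5 - 0.5 * stab)) < 1e-9
```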
At first it might look like we can immediately apply the Majority Is Stablest Theorem; however, the theorem’s inequality goes the “wrong way” and the correlation parameter $\rho = \cos \theta$ is negative. These two difficulties actually cancel each other out. Note that \begin{align} \mathop{\bf Pr}[\text{the test accepts } f] &= \tfrac{1}{2} - \tfrac{1}{2} \mathbf{Stab}_{\cos \theta}[f] \nonumber\\ &= \tfrac{1}{2} - \tfrac{1}{2} \sum_{k=0}^n (\cos \theta)^{k} \mathbf{W}^{k}[f] \nonumber\\ &\leq \tfrac{1}{2} + \tfrac{1}{2} \sum_{\text{$k$ odd}} (-\cos \theta)^{k} \mathbf{W}^{k}[f] \tag{since $\cos \theta < 0$} \nonumber\\ &= \tfrac{1}{2} + \tfrac{1}{2} \mathbf{Stab}_{-\cos\theta}[f^\mathrm{odd}], \label{eqn:finish-kkmo} \end{align} where $f^\mathrm{odd} : \{-1,1\}^n \to [-1,1]$ is the odd part of $f$ (see Exercise 1.8) defined by \[ f^\mathrm{odd}(x) = \tfrac{1}{2}(f(x) - f(-x)) = \sum_{|S|\text{ odd}} \widehat{f}(S)\,x^S. \] Now we’re really in a position to apply the Majority Is Stablest Theorem to $f^\mathrm{odd}$, because $-\cos\theta \in (0,1)$, $\mathop{\bf E}[f^\mathrm{odd}] = 0$, and $f^\mathrm{odd}$ has no $(\epsilon,\epsilon)$-notable coordinates (since it’s formed from $f$ by just dropping some terms in the Fourier expansion). Using $-\cos\theta = \cos(\pi - \theta)$, the result is that \[ \mathbf{Stab}_{-\cos\theta}[f^\mathrm{odd}] \leq 1 - \tfrac{2}{\pi} \arccos(\cos(\pi - \theta)) + o_\epsilon(1) = 2\theta/\pi - 1 + o_\epsilon(1). \] Putting this into \eqref{eqn:finish-kkmo} yields \[ \mathop{\bf Pr}[\text{the test accepts } f] \leq \tfrac{1}{2} + \tfrac{1}{2} (2\theta/\pi - 1 + o_\epsilon(1)) = \theta/\pi + o_\epsilon(1), \] as needed. $\Box$
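The inequality chain above holds for every $f$, not just low-influence ones; the sketch below (illustrative, not from the text) verifies $\tfrac12 - \tfrac12\mathbf{Stab}_{\rho}[f] \leq \tfrac12 + \tfrac12\mathbf{Stab}_{-\rho}[f^{\mathrm{odd}}]$ for a random $f : \{-1,1\}^4 \to [-1,1]$ by computing Fourier weights via direct enumeration:

```python
import itertools, math, random

random.seed(1)
n, rho = 4, -0.7                    # rho = cos(theta) < 0

pts = list(itertools.product([-1, 1], repeat=n))
f = {x: random.uniform(-1, 1) for x in pts}     # arbitrary f with range [-1,1]

def stab(weights, r):
    return sum(r ** k * w for k, w in weights.items())

# Fourier weights W^k[f] and W^k[f_odd] by direct enumeration over subsets S
W = {k: 0.0 for k in range(n + 1)}
W_odd = {k: 0.0 for k in range(n + 1)}
for S in itertools.product([0, 1], repeat=n):
    chi = lambda x: math.prod(xi for xi, s in zip(x, S) if s)
    coeff = sum(f[x] * chi(x) for x in pts) / len(pts)   # hat{f}(S)
    k = sum(S)
    W[k] += coeff ** 2
    if k % 2 == 1:
        W_odd[k] += coeff ** 2     # the odd part keeps only odd-degree terms

lhs = 0.5 - 0.5 * stab(W, rho)         # acceptance probability of the test
rhs = 0.5 + 0.5 * stab(W_odd, -rho)    # the bound via the odd part
assert lhs <= rhs + 1e-12
```

The gap $rhs - lhs$ equals $\tfrac12\sum_{k \text{ even}} \rho^k \mathbf{W}^k[f] \geq 0$, which is exactly the step justified by "$\cos\theta < 0$".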
Remark 74 There’s actually still a mismatch between the algorithmic guarantee of Theorem 71 and the UG-hardness result Theorem 73, concerning the case of $\theta \in (\frac{\pi}{2}, \theta^*)$. In fact, for these values of $\theta$ — i.e., $\tfrac{1}{2} \leq \beta \lessapprox .8446$ — neither result is sharp; see O’Donnell and Wu [OW08].
Remark 75 If we want to prove UG-hardness of $(\theta’/\pi + \delta, \tfrac{1}{2} – \tfrac{1}{2} \cos\theta’)$-approximating Max-Cut, we don’t need the full version of Borell’s Isoperimetric Theorem; we only need the volume-$\tfrac{1}{2}$ case with parameter $\theta = \pi – \theta’$. Corollary 44 gave a simple proof of this result for $\theta = \frac{\pi}{4}$, hence $\theta’ = \frac{3}{4} \pi$. This yields UG-hardness of $(\tfrac34 + \delta, \tfrac{1}{2} + \frac{1}{2\sqrt{2}})$-approximating Max-Cut. The ratio between $\alpha$ and $\beta$ here is approximately $.8787$, very close to the Goemans–Williamson constant $c_{\text{GW}} \approx .8786$.
Finally, we will prove the General-Volume Majority Is Stablest Theorem, by using the Invariance Principle to reduce it to Borell’s Isoperimetric Theorem.
General-Volume Majority Is Stablest Theorem Let $f : \{-1,1\}^n \to [0,1]$. Suppose that $\mathbf{MaxInf}[f] \leq \epsilon$, or more generally, that $f$ has no $(\epsilon, \frac{1}{\log(1/\epsilon)})$-notable coordinates. Then for any $0 \leq \rho < 1$, \begin{equation} \label{eqn:mist} \mathbf{Stab}_\rho[f] \leq \Lambda_\rho(\mathop{\bf E}[f]) + O\bigl(\tfrac{\log \log (1/\epsilon)}{\log(1/\epsilon)}\bigr) \cdot \tfrac{1}{1-\rho}. \end{equation} (Here the $O(\cdot)$ bound has no dependence on $\rho$.)
Proof: The proof involves using the Basic Invariance Principle twice (in the form of Corollary 68). To facilitate this we introduce $f’ = \mathrm{T}_{1-\delta} f$, where (with foresight) we choose \[ \delta = 3 \tfrac{\log \log(1/\epsilon)}{\log(1/\epsilon)}. \] (We may assume $\epsilon$ is sufficiently small so that $0 < \delta \leq \frac{1}{20}$.) Note that $\mathop{\bf E}[f'] = \mathop{\bf E}[f]$ and that \[ \mathbf{Stab}_\rho[f'] = \sum_{S \subseteq [n]} \rho^{|S|} (1-\delta)^{2|S|} \widehat{f}(S)^2 = \mathbf{Stab}_{\rho(1-\delta)^2}[f]. \] But \begin{equation} \label{eqn:mist-tradeoff} \bigl|\mathbf{Stab}_{\rho(1-\delta)^2}[f] – \mathbf{Stab}_{\rho}[f]\bigr| \leq (\rho – \rho(1-\delta)^2) \cdot \tfrac{1}{1-\rho} \cdot \mathop{\bf Var}[f] \leq 2\delta \cdot \tfrac{1}{1-\rho} \end{equation} by Exercise 2.40, and with our choice of $\delta$ this can be absorbed into the error of \eqref{eqn:mist}. Thus it suffices to prove \eqref{eqn:mist} with $f’$ in place of $f$.
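As a quick sanity check on this first step (ours, not from the text), the bound from Exercise 2.40 can be verified numerically for $f = \mathrm{Maj}_3$, whose stability has the closed form $\mathbf{Stab}_\rho[\mathrm{Maj}_3] = \tfrac34 \rho + \tfrac14 \rho^3$ and whose variance is $1$:

```python
from fractions import Fraction

def stab_maj3(rho):
    # Stab_rho[Maj_3]: Fourier weight 3/4 at level 1, 1/4 at level 3
    return Fraction(3, 4) * rho + Fraction(1, 4) * rho ** 3

# Check |Stab_{rho(1-delta)^2}[f] - Stab_rho[f]| <= 2 delta / (1 - rho) exactly,
# on a grid of rho in [0,1) and delta in (0, 1/20].
checked = 0
for rho in [Fraction(i, 10) for i in range(10)]:
    for delta in [Fraction(1, 100), Fraction(1, 20)]:
        diff = stab_maj3(rho) - stab_maj3(rho * (1 - delta) ** 2)
        assert 0 <= diff <= 2 * delta / (1 - rho)
        checked += 1
```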
Let $\text{Sq} : {\mathbb R} \to {\mathbb R}$ be the continuous function which agrees with $t \mapsto t^2$ for $t \in [0,1]$ and is constant outside $[0,1]$. Note that $\text{Sq}$ is $2$-Lipschitz. We will apply the second part of Corollary 68 with “$h$” set to $\mathrm{T}_{\sqrt{\rho}} f$ (and thus $\mathrm{T}_{1-\delta} h = \mathrm{T}_{\sqrt{\rho}} f’$). This is valid since the variance and $(1-\delta)$-stable influences of $h$ are only smaller than those of $f$. Thus \begin{equation} \label{eqn:mist-halfway} \left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\text{Sq}(\mathrm{T}_{\sqrt{\rho}}f'({\boldsymbol{x}}))] – \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\text{Sq}(\mathrm{T}_{\sqrt{\rho}} f'(\boldsymbol{g}))]\right| \leq O(\epsilon^{\delta/3}) = O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr), \end{equation} using our choice of $\delta$. (In fact, it’s trading off this error with \eqref{eqn:mist-tradeoff} that led to our choice of $\delta$.) Now $\mathrm{T}_{\sqrt{\rho}} f’({\boldsymbol{x}}) = \mathrm{T}_{(1-\delta)\sqrt{\rho}} f({\boldsymbol{x}})$ is always bounded in $[0,1]$, so \[ \text{Sq}(\mathrm{T}_{\sqrt{\rho}} f'({\boldsymbol{x}})) = (\mathrm{T}_{\sqrt{\rho}} f'({\boldsymbol{x}}))^2 \quad \implies \quad \mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\text{Sq}(\mathrm{T}_{\sqrt{\rho}}f'({\boldsymbol{x}}))] = \mathbf{Stab}_\rho[f']. \] Furthermore, $\mathrm{T}_{\sqrt{\rho}} f’(\boldsymbol{g})$ is the same as $\mathrm{U}_{\sqrt{\rho}} f’(\boldsymbol{g})$ because $f’$ is a multilinear polynomial. (Both are equal to $f’(\sqrt{\rho}\,\boldsymbol{g})$; see Fact 13.) Thus in light of \eqref{eqn:mist-halfway}, to complete the proof of \eqref{eqn:mist} it suffices to show \begin{equation} \label{eqn:mist-suffices} \left|\mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} f'(\boldsymbol{g}))] – \Lambda_\rho(\mathop{\bf E}[f'])\right| \leq O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr). \end{equation}
Define the function $F : {\mathbb R}^n \to [0,1]$ by \[ F(g) = \mathrm{trunc}_{[0,1]}(f’(g)) = \begin{cases} 0 & \text{if $f’(g) < 0$,} \\ f’(g) & \text{if $f’(g) \in [0,1]$,}\\ 1 & \text{if $f’(g) >1$.} \end{cases} \] We will establish the following two inequalities, which together imply \eqref{eqn:mist-suffices}: \begin{gather} \left|\mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} f'(\boldsymbol{g}))] – \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g}))]\right| \leq O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr), \label{eqn:mist1} \\ \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g}))] \leq \Lambda_\rho(\mathop{\bf E}[f']) + O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr). \label{eqn:mist2} \end{gather} Both of these inequalities will in turn follow from \begin{equation} \label{eqn:f’-to-F} \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[|f'(\boldsymbol{g}) - F(\boldsymbol{g})|] = \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\mathrm{dist}_{[0,1]}(f’(\boldsymbol{g}))] \leq O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr). \end{equation} Let’s show how \eqref{eqn:mist1} and \eqref{eqn:mist2} follow from \eqref{eqn:f’-to-F}, leaving the proof of \eqref{eqn:f’-to-F} to the end. 
For \eqref{eqn:mist1}, \begin{align*} \left|\mathop{\bf E}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} f'(\boldsymbol{g}))] – \mathop{\bf E}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g}))] \right| &\leq 2 \mathop{\bf E}[|\mathrm{U}_{\sqrt{\rho}} f'(\boldsymbol{g}) - \mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g})|]\\ &\leq 2 \mathop{\bf E}[|f'(\boldsymbol{g}) - F(\boldsymbol{g})|] \leq O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr), \end{align*} where the first inequality used that $\text{Sq}$ is $2$-Lipschitz, the second inequality used the fact that $\mathrm{U}_{\sqrt{\rho}}$ is a contraction on $L^1({\mathbb R}^n,\gamma)$, and the third inequality was \eqref{eqn:f’-to-F}. As for \eqref{eqn:mist2}, $\mathrm{U}_{\sqrt{\rho}} F$ is bounded in $[0,1]$ since $F$ is. Thus \[ \mathop{\bf E}[\text{Sq}(\mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g}))] = \mathop{\bf E}[(\mathrm{U}_{\sqrt{\rho}} F(\boldsymbol{g}))^2] = \mathbf{Stab}_\rho[F] \leq \Lambda_\rho(\mathop{\bf E}[F(\boldsymbol{g})]), \] where we used Borell’s Isoperimetric Theorem. But $|\mathop{\bf E}[F(\boldsymbol{g})] – \mathop{\bf E}[f'(\boldsymbol{g})]| \leq O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr)$ by \eqref{eqn:f’-to-F}, and $\Lambda_\rho$ is easily shown to be $2$-Lipschitz (exercise). This establishes \eqref{eqn:mist2}.
It therefore remains to show \eqref{eqn:f’-to-F}, which we do by applying the Invariance Principle one more time. Taking $\psi$ to be the $1$-Lipschitz function $\mathrm{dist}_{[0,1]}$ in Corollary 68 we deduce \begin{align*} \left|\mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\mathrm{dist}_{[0,1]}(f’(\boldsymbol{g}))] – \mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\mathrm{dist}_{[0,1]}(f’({\boldsymbol{x}}))]\right| \leq O(\epsilon^{\delta/3}) = O\bigl(\tfrac{1}{\log(1/\epsilon)}\bigr). \end{align*} But $\mathop{\bf E}[\mathrm{dist}_{[0,1]}(f’({\boldsymbol{x}}))] = 0$ since $f’({\boldsymbol{x}}) = \mathrm{T}_{1-\delta} f({\boldsymbol{x}}) \in [0,1]$ always. This establishes \eqref{eqn:f’-to-F} and completes the proof. $\Box$
We conclude with one more application of the Majority Is Stablest Theorem. Recall Kalai’s version of Arrow’s Theorem from Chapter 2.5, i.e., Theorem 2.55. It states that in a $3$-candidate Condorcet election using the voting rule $f : \{-1,1\}^n \to \{-1,1\}$, the probability of having a Condorcet winner — often called a rational outcome — is precisely $\tfrac34 – \tfrac34\mathbf{Stab}_{-1/3}[f]$. As we saw in the proof of Theorem 73 near \eqref{eqn:finish-kkmo}, this is in turn at most $\tfrac34 + \tfrac34\mathbf{Stab}_{1/3}[f^\mathrm{odd}]$, with equality if $f$ is already odd. It follows from the Majority Is Stablest Theorem that among all voting rules with $\epsilon$-small influences (a condition all reasonable voting rules should satisfy), majority rule is essentially the “most rational”, up to an additive $o_\epsilon(1)$. Thus we see that the principle of representative democracy can be derived using analysis of Boolean functions.
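For intuition, the $n = 3$ case can be verified exactly by enumerating all $6^3$ equally likely preference profiles. The following sketch (our own check, not from the text) recovers the classical fact that $\mathrm{Maj}_3$ yields a rational outcome with probability $\tfrac{17}{18}$, in agreement with Kalai's formula:

```python
import itertools
from fractions import Fraction

def maj(bits):
    return 1 if sum(bits) > 0 else -1

# Each voter's ranking of 3 candidates yields a not-all-equal (NAE) triple of
# pairwise votes; the outcome is rational iff (f(x), f(y), f(z)) is itself NAE.
nae = [t for t in itertools.product([-1, 1], repeat=3) if len(set(t)) > 1]
n = 3
profiles = list(itertools.product(nae, repeat=n))
rational = 0
for profile in profiles:
    x, y, z = zip(*profile)
    if len({maj(x), maj(y), maj(z)}) > 1:  # a Condorcet winner exists
        rational += 1
p_rational = Fraction(rational, len(profiles))

# Kalai's formula: 3/4 - (3/4) Stab_{-1/3}[f], with
# Stab_rho[Maj_3] = (3/4) rho + (1/4) rho^3.
rho = Fraction(-1, 3)
formula = Fraction(3, 4) - Fraction(3, 4) * (Fraction(3, 4) * rho + Fraction(1, 4) * rho ** 3)
assert p_rational == formula == Fraction(17, 18)
```

The complementary $\tfrac{1}{18}$ is the familiar probability of a Condorcet paradox with three voters and three candidates under the impartial culture assumption.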
In this section we’ll develop the most basic Invariance Principle, which involves replacing bits by Gaussians for a single Boolean function $f$. We’ll show that this doesn’t change the distribution of $f$ much provided $f$ has small influences and provided that $f$ is of “constant degree” — or at least, provided $f$ is uniformly noise-stable so that it’s “close to having constant degree”. Invariance Principles in much more general settings are possible — for example the exercises describe variants which handle several functions applied to correlated inputs, and functions on general product spaces. Here we’ll just focus on the simplest possible Invariance Principle, which is already sufficient for the proof of the Majority Is Stablest Theorem in Section 7.
Let’s begin with some notation.
Definition 63 Let $F$ be a formal multilinear polynomial over the sequence of indeterminates $x = (x_1, \dots, x_n)$: \[ F(x) = \sum_{S \subseteq [n]} \widehat{F}(S) \prod_{i \in S} x_i, \] where the coefficients $\widehat{F}(S)$ are real numbers. We introduce the notation \[ \mathop{\bf Var}[F] = \sum_{S \neq \emptyset} \widehat{F}(S)^2, \qquad \mathbf{Inf}_i[F] = \sum_{S \ni i} \widehat{F}(S)^2. \]
Remark 64 To justify this notation, we remark that we’ll always consider $F$ applied to a sequence $\boldsymbol{z} = (\boldsymbol{z}_1, \dots, \boldsymbol{z}_n)$ of independent random variables satisfying $\mathop{\bf E}[\boldsymbol{z}_i] = 0$, $\mathop{\bf E}[\boldsymbol{z}_i^2] = 1$. Under these circumstances the collection of monomial random variables $\prod_{i \in S} \boldsymbol{z}_i$ is orthonormal and so it’s easy to see (cf. Chapter 8.2) that \[ \mathop{\bf E}[F(\boldsymbol{z})] = \widehat{F}(\emptyset), \quad \mathop{\bf E}[F(\boldsymbol{z})^2] = \sum_{S \subseteq [n]} \widehat{F}(S)^2, \quad \mathop{\bf Var}[F(\boldsymbol{z})] = \mathop{\bf Var}[F] = \sum_{S \neq \emptyset} \widehat{F}(S)^2. \] We also have $\mathop{\bf E}[\mathop{\bf Var}_{\boldsymbol{z}_i}[F(\boldsymbol{z})]] = \mathbf{Inf}_i[F] = \sum_{S \ni i} \widehat{F}(S)^2$, though we won’t use this.
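These identities are easy to confirm by exact enumeration in the case of uniform random bits. Here is a small sketch of ours, with an arbitrarily chosen $F$:

```python
import itertools
from fractions import Fraction

# An arbitrary small multilinear F with rational coefficients (our choice).
coeffs = {frozenset(): Fraction(1, 2), frozenset({0}): Fraction(-1, 3),
          frozenset({1, 2}): Fraction(2, 5), frozenset({0, 1, 2}): Fraction(1, 7)}
n = 3

def F(z):
    total = Fraction(0)
    for S, c in coeffs.items():
        term = c
        for i in S:
            term *= z[i]
        total += term
    return total

# With z uniform on {-1,1}^n the monomials are orthonormal, so the moments
# reduce to sums of squared coefficients, exactly as in the remark.
pts = list(itertools.product([-1, 1], repeat=n))
mean = sum(F(z) for z in pts) / Fraction(len(pts))
second = sum(F(z) ** 2 for z in pts) / Fraction(len(pts))
assert mean == coeffs[frozenset()]
assert second == sum(c ** 2 for c in coeffs.values())
variance = second - mean ** 2
assert variance == sum(c ** 2 for S, c in coeffs.items() if S)
```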
As in the Berry–Esseen Theorem, to get good error bounds we’ll need our random variables $\boldsymbol{z}_i$ to be “reasonable”. Sacrificing generality for simplicity in this section, we’ll take the bounded $4$th-moment notion from Definition 9.1 which will allow us to use the basic Bonami Lemma (more precisely, Corollary 9.6):
Hypothesis The random variable $\boldsymbol{z}_i$ satisfies $\mathop{\bf E}[\boldsymbol{z}_i] = 0$, $\mathop{\bf E}[\boldsymbol{z}_i^2] = 1$, $\mathop{\bf E}[\boldsymbol{z}_i^3] = 0$, and is “$9$-reasonable” in the sense of Definition 9.1; i.e., $\mathop{\bf E}[\boldsymbol{z}_i^4] \leq 9$.
The main examples we have in mind are that each $\boldsymbol{z}_i$ is either a uniform $\pm 1$ random bit or a standard Gaussian. (There are other possibilities, though; e.g., $\boldsymbol{z}_i$ could be uniform on the interval $[-\sqrt{3}, \sqrt{3}]$.)
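As a numerical check of ours that the third example fits the Hypothesis, a midpoint-rule integration recovers the moments of the uniform distribution on $[-\sqrt{3}, \sqrt{3}]$: the odd moments vanish, $\mathop{\bf E}[\boldsymbol{z}^2] = 1$, and $\mathop{\bf E}[\boldsymbol{z}^4] = \tfrac95 \leq 9$:

```python
import math

# Midpoint-rule quadrature for E[z^k] with z uniform on [-sqrt(3), sqrt(3)].
a = math.sqrt(3)
N = 100_000
mids = [-a + (i + 0.5) * (2 * a / N) for i in range(N)]
moment = lambda k: sum(m ** k for m in mids) / N

assert abs(moment(1)) < 1e-6 and abs(moment(3)) < 1e-6  # mean and 3rd moment 0
assert abs(moment(2) - 1) < 1e-6                         # unit variance
assert abs(moment(4) - 9 / 5) < 1e-6                     # 9/5 <= 9: "9-reasonable"
```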
We can now prove the most basic Invariance Principle, for low-degree multilinear polynomials of random variables:
Basic Invariance Principle Let $F$ be a formal $n$-variate multilinear polynomial of degree at most $k \in {\mathbb N}$, \[ F(x) = \sum_{S \subseteq [n], |S| \leq k} \widehat{F}(S) \prod_{i \in S} x_i. \] Let ${\boldsymbol{x}} = ({\boldsymbol{x}}_1, \dots, {\boldsymbol{x}}_n)$ and $\boldsymbol{y} = (\boldsymbol{y}_1, \dots, \boldsymbol{y}_n)$ be sequences of independent random variables, each satisfying the above Hypothesis. Assume $\psi : {\mathbb R} \to {\mathbb R}$ is $\mathcal{C}^4$ with $\|\psi^{\prime\prime\prime\prime}\|_\infty \leq C$. Then \begin{equation} \label{eqn:basic-inv-goal} \left|\mathop{\bf E}[\psi(F({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(F(\boldsymbol{y}))]\right| \leq \tfrac{C}{12}\cdot 9^k \cdot \sum_{t=1}^n \mathbf{Inf}_t[F]^2. \end{equation}
Remark 65 The proof will be very similar to the one we used for Berry–Esseen except that we’ll take a $3$rd-order Taylor expansion rather than a $2$nd-order one (so that we can use the easy Bonami Lemma). As you are asked to show in the exercises, had we only required that $\psi$ be $\mathcal{C}^3$ and that the ${\boldsymbol{x}}_i$’s and $\boldsymbol{y}_i$’s be $(2,3,\rho)$-hypercontractive with $2$nd moment equal to $1$, then we could obtain \[ \left|\mathop{\bf E}[\psi(F({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(F(\boldsymbol{y}))]\right| \leq \tfrac{\|\psi^{\prime\prime\prime}\|_\infty}{3}\cdot (1/\rho)^{3k} \cdot \sum_{t=1}^n \mathbf{Inf}_t[F]^{3/2}. \]
Proof: The proof uses the Replacement Method. For $0 \leq t \leq n$ we define \[ \boldsymbol{H}_t = F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t}, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_{n}), \] so $F({\boldsymbol{x}}) = \boldsymbol{H}_0$ and $F(\boldsymbol{y}) = \boldsymbol{H}_n$. We will show that \begin{equation} \label{eqn:basic-inv-step} \left|\mathop{\bf E}[\psi(\boldsymbol{H}_{t-1}) - \psi(\boldsymbol{H}_t)]\right| \leq \tfrac{C}{12}\cdot 9^k \cdot \mathbf{Inf}_t[F]^2; \end{equation} as in our proof of the Berry–Esseen Theorem, this will complete the proof after summing over $t$ and using the triangle inequality. To analyze \eqref{eqn:basic-inv-step} we separate out the part of $F(x)$ that depends on $x_t$; i.e., we write $F(x) = \mathrm{E}_t F(x) + x_t \mathrm{D}_t F(x)$, where the formal polynomials $\mathrm{E}_t F$ and $\mathrm{D}_t F$ are defined by \[ \mathrm{E}_tF(x) = \sum_{S \not \ni t} \widehat{F}(S) \prod_{i \in S} x_i, \qquad \mathrm{D}_tF(x) = \sum_{S \ni t} \widehat{F}(S) \prod_{i \in S \setminus \{t\}} x_i. \] Note that neither $\mathrm{E}_t F$ nor $\mathrm{D}_t F$ depends on the indeterminate $x_t$; thus we can define \begin{align*} \boldsymbol{U}_t &= \mathrm{E}_t F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, \cdot, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_n), \\ \boldsymbol{\Delta}_t &= \mathrm{D}_t F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, \cdot, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_n), \end{align*} so that \[ \boldsymbol{H}_{t-1} = \boldsymbol{U}_t + \boldsymbol{\Delta}_t{\boldsymbol{x}}_t, \qquad \boldsymbol{H}_{t} = \boldsymbol{U}_t + \boldsymbol{\Delta}_t\boldsymbol{y}_t. 
\] We now use a $3$rd-order Taylor expansion to bound \eqref{eqn:basic-inv-step}: \begin{align*} \psi(\boldsymbol{H}_{t-1}) &= \psi(\boldsymbol{U}_t) + \psi’(\boldsymbol{U}_t) \boldsymbol{\Delta}_t {\boldsymbol{x}}_t + \tfrac{1}{2} \psi^{\prime\prime}(\boldsymbol{U}_t) \boldsymbol{\Delta}_t^2 {\boldsymbol{x}}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t) \boldsymbol{\Delta}_t^3 {\boldsymbol{x}}_t^3 + \tfrac{1}{24} \psi^{\prime\prime\prime\prime}(\boldsymbol{U}_t^*) \boldsymbol{\Delta}_t^4 {\boldsymbol{x}}_t^4 \\ \psi(\boldsymbol{H}_{t}) &= \psi(\boldsymbol{U}_t) + \psi’(\boldsymbol{U}_t) \boldsymbol{\Delta}_t \boldsymbol{y}_t + \tfrac{1}{2} \psi^{\prime\prime}(\boldsymbol{U}_t) \boldsymbol{\Delta}_t^2 \boldsymbol{y}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t) \boldsymbol{\Delta}_t^3 \boldsymbol{y}_t^3 + \tfrac{1}{24} \psi^{\prime\prime\prime\prime}(\boldsymbol{U}_t^{**}) \boldsymbol{\Delta}_t^4 \boldsymbol{y}_t^4 \end{align*} for some random variables $\boldsymbol{U}_t^*$ and $\boldsymbol{U}_t^{**}$. As in the proof of the Berry–Esseen Theorem, when we subtract these and take the expectation there are significant simplifications. The $0$th-order terms cancel. As for the $1$st-order terms, \[ \mathop{\bf E}[\psi'(\boldsymbol{U}_t) \boldsymbol{\Delta}_t {\boldsymbol{x}}_t - \psi'(\boldsymbol{U}_t) \boldsymbol{\Delta}_t \boldsymbol{y}_t] = \mathop{\bf E}[\psi'(\boldsymbol{U}_t) \boldsymbol{\Delta}_t \cdot ({\boldsymbol{x}}_t - \boldsymbol{y}_t)] = \mathop{\bf E}[\psi’(\boldsymbol{U}_t) \boldsymbol{\Delta}_t] \cdot \mathop{\bf E}[{\boldsymbol{x}}_t - \boldsymbol{y}_t] = 0. \] The second equality here crucially uses the fact that ${\boldsymbol{x}}_t,\ \boldsymbol{y}_t$ are independent of $\boldsymbol{U}_t,\ \boldsymbol{\Delta}_t$. The final equality only uses the fact that ${\boldsymbol{x}}_t$ and $\boldsymbol{y}_t$ have matching $1$st moments (and not the stronger assumption that both of these $1$st moments are $0$). 
The $2$nd- and $3$rd-order terms will similarly cancel, using the fact that ${\boldsymbol{x}}_t$ and $\boldsymbol{y}_t$ have matching $2$nd and $3$rd moments. Finally, for the “error” term we’ll just use $|\psi^{\prime\prime\prime\prime}(\boldsymbol{U}_t^*)|, |\psi^{\prime\prime\prime\prime}(\boldsymbol{U}_t^{**})| \leq C$ and the triangle inequality; we thus obtain \[ \left|\mathop{\bf E}[\psi(\boldsymbol{H}_{t-1}) - \psi(\boldsymbol{H}_t)]\right| \leq \tfrac{C}{24} \cdot (\mathop{\bf E}[(\boldsymbol{\Delta}_t{\boldsymbol{x}}_t)^4] + \mathop{\bf E}[(\boldsymbol{\Delta}_t\boldsymbol{y}_t)^4]). \] To complete the proof of \eqref{eqn:basic-inv-step} we now just need to bound \[ \mathop{\bf E}[(\boldsymbol{\Delta}_t{\boldsymbol{x}}_t)^4],\ \mathop{\bf E}[(\boldsymbol{\Delta}_t\boldsymbol{y}_t)^4] \leq 9^k \cdot \mathbf{Inf}_t[F]^2, \] which we’ll do using the Bonami Lemma. We’ll give the proof for $\mathop{\bf E}[(\boldsymbol{\Delta}_t{\boldsymbol{x}}_t)^4]$, the case of $\mathop{\bf E}[(\boldsymbol{\Delta}_t\boldsymbol{y}_t)^4]$ being identical. We have \[ \boldsymbol{\Delta}_t {\boldsymbol{x}}_t = \mathrm{L}_t F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, {\boldsymbol{x}}_t, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_n), \] where \[ \mathrm{L}_t F(x) = x_t \mathrm{D}_t F(x) = \sum_{S \ni t} \widehat{F}(S) \prod_{i \in S} x_i. \] Since $\mathrm{L}_t F$ has degree at most $k$ we can apply the Bonami Lemma (more precisely, Corollary 9.6) to obtain \[ \mathop{\bf E}[(\boldsymbol{\Delta}_t {\boldsymbol{x}}_t)^4] \leq 9^k \mathop{\bf E}[\mathrm{L}_t F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, {\boldsymbol{x}}_t, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_n)^2]^2. 
\] But since $\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, {\boldsymbol{x}}_{t}, \dots, {\boldsymbol{x}}_{n}$ are independent with mean $0$ and $2$nd moment $1$, we have (see Remark 64) \[ \mathop{\bf E}[\mathrm{L}_t F(\boldsymbol{y}_1, \dots, \boldsymbol{y}_{t-1}, {\boldsymbol{x}}_t, {\boldsymbol{x}}_{t+1}, \dots, {\boldsymbol{x}}_n)^2] = \sum_{S \subseteq [n]} \widehat{\mathrm{L}_t F}(S)^2 = \sum_{S \ni t} \widehat{F}(S)^2 = \mathbf{Inf}_t[F]. \] Thus we indeed have $\mathop{\bf E}[(\boldsymbol{\Delta}_t {\boldsymbol{x}}_t)^4] \leq 9^k \cdot \mathbf{Inf}_t[F]^2$, and the proof is complete. $\Box$
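Both facts used at the end of this proof (the identity $\mathop{\bf E}[(\mathrm{L}_t F)^2] = \mathbf{Inf}_t[F]$ and the Bonami bound) can be checked exactly on a small example. Here is a sketch of ours, with a degree-$2$ polynomial evaluated on uniform $\pm 1$ bits:

```python
import itertools
from fractions import Fraction

n, t, k = 4, 0, 2
# Our own small degree-2 example; only the sets containing t contribute to L_t F.
coeffs = {frozenset({0}): Fraction(1, 2), frozenset({0, 1}): Fraction(-1, 3),
          frozenset({0, 3}): Fraction(1, 4), frozenset({1, 2}): Fraction(1, 5)}

def LtF(x):
    # L_t F(x) = x_t D_t F(x) = sum over S containing t of \hat{F}(S) x^S
    total = Fraction(0)
    for S, c in coeffs.items():
        if t in S:
            term = c
            for i in S:
                term *= x[i]
            total += term
    return total

pts = list(itertools.product([-1, 1], repeat=n))
m2 = sum(LtF(x) ** 2 for x in pts) / Fraction(len(pts))
m4 = sum(LtF(x) ** 4 for x in pts) / Fraction(len(pts))
inf_t = sum(c ** 2 for S, c in coeffs.items() if t in S)
assert m2 == inf_t                       # E[(L_t F)^2] = Inf_t[F]
assert m4 <= Fraction(9) ** k * m2 ** 2  # Bonami: E[(L_t F)^4] <= 9^k E[(L_t F)^2]^2
```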
Corollary 66 In the setting of the preceding theorem, if we furthermore have $\mathop{\bf Var}[F] \leq 1$ and $\mathbf{Inf}_t[F] \leq \epsilon$ for all $t \in [n]$, then \[ \left|\mathop{\bf E}[\psi(F({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(F(\boldsymbol{y}))]\right| \leq \tfrac{C}{12}\cdot k9^k \cdot \epsilon. \]
Proof: We have $\sum_t \mathbf{Inf}_t[F]^2 \leq \epsilon \sum_t \mathbf{Inf}_t[F] = \epsilon \sum_{S} |S| \widehat{F}(S)^2 \leq \epsilon k \mathop{\bf Var}[F] \leq \epsilon k$. $\Box$
Corollary 67 In the setting of the preceding corollary, if we merely have that $\psi : {\mathbb R} \to {\mathbb R}$ is $c$-Lipschitz (rather than $\mathcal{C}^4$), then \[ \left|\mathop{\bf E}[\psi(F({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(F(\boldsymbol{y}))]\right| \leq O(c) \cdot 2^{k} \epsilon^{1/4}. \]
Proof: Just as in the proof of Corollary 59, by using $\widetilde{\psi}_\eta$ from Proposition 58 (which has $\|\widetilde{\psi}_\eta^{\prime\prime\prime\prime}\|_\infty \leq O(c/\eta^3)$) we obtain \[ \left|\mathop{\bf E}[\psi(F({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(F(\boldsymbol{y}))]\right| \leq O(c) \cdot (\eta + k 9^k \epsilon/\eta^3). \] The proof is completed by taking $\eta = \sqrt[4]{k 9^k \epsilon} \leq 2^k \epsilon^{1/4}$. $\Box$
Let’s connect this last corollary back to the study of Boolean functions. Suppose $f : \{-1,1\}^n \to {\mathbb R}$ has $\epsilon$-small influences (in the sense of Definition 6.9) and degree at most $k$. Letting $\boldsymbol{g} = (\boldsymbol{g}_1, \dots, \boldsymbol{g}_n)$ be a sequence of independent standard Gaussians, Corollary 67 tells us that for any Lipschitz $\psi$ we have \begin{equation} \label{eqn:cor-simple-inv} \left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\psi(f({\boldsymbol{x}}))] – \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\psi(f(\boldsymbol{g}))]\right| \leq O(2^{k} \epsilon^{1/4}). \end{equation} Here the expression “$f(\boldsymbol{g})$” is an abuse of notation indicating that the real numbers $\boldsymbol{g}_1, \dots, \boldsymbol{g}_n$ are substituted into $f$’s Fourier expansion (multilinear polynomial representation).
At first it may seem peculiar to substitute arbitrary real numbers into the Fourier expansion of a Boolean function. Actually, if all the numbers being substituted are in the range $[-1,1]$ then there’s a natural interpretation: as you were asked to show in Exercise 1.5, if $\mu \in [-1,1]^n$, then $f(\mu) = \mathop{\bf E}[f(\boldsymbol{y})]$ where $\boldsymbol{y} \sim \{-1,1\}^n$ is drawn from the product distribution in which $\mathop{\bf E}[\boldsymbol{y}_i] = \mu_i$. On the other hand, there doesn’t seem to be any obvious meaning when real numbers outside the range $[-1,1]$ are substituted into $f$’s Fourier expansion, as may certainly occur when we consider $f(\boldsymbol{g})$.
Nevertheless, \eqref{eqn:cor-simple-inv} says that when $f$ is a low-degree, small-influence function, the distribution of the random variable $f(\boldsymbol{g})$ will be close to that of $f({\boldsymbol{x}})$. Now suppose $f : \{-1,1\}^n \to \{-1,1\}$ is Boolean-valued and unbiased. Then \eqref{eqn:cor-simple-inv} might seem impossible; how could the continuous random variable $f(\boldsymbol{g})$ essentially be $-1$ with probability $1/2$ and $+1$ with probability $1/2$? The solution to this mystery is that there are no low-degree, small-influence, unbiased Boolean-valued functions. This is a consequence of the OSSS Inequality — more precisely, Exercise 40(b) — which shows that in this setting we will always have $\epsilon \geq 1/k^3$ in \eqref{eqn:cor-simple-inv}, rendering the bound very weak. If the Aaronson–Ambainis Conjecture holds (see the notes for Chapter 8), a similar statement is true even for functions with range $[-1,1]$.
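To make this tension concrete, here is an exact computation of ours (not from the text) for $\mathrm{Maj}_5$: it is unbiased and Boolean-valued, and correspondingly every coordinate has the large influence $\tfrac38$; its Fourier weight also sits mostly at level $1$, the kind of spectral information used below.

```python
import itertools
from fractions import Fraction

n = 5
pts = list(itertools.product([-1, 1], repeat=n))
maj = {x: (1 if sum(x) > 0 else -1) for x in pts}

def fhat(S):
    # \hat{f}(S) = E[f(x) x^S] by direct enumeration
    total = Fraction(0)
    for x in pts:
        term = Fraction(maj[x])
        for i in S:
            term *= x[i]
        total += term
    return total / len(pts)

W = [sum(fhat(S) ** 2 for S in itertools.combinations(range(n), j))
     for j in range(n + 1)]
assert sum(W) == 1               # Parseval, since Maj_5 is {-1,1}-valued
assert W[1] == Fraction(45, 64)  # each \hat{f}({i}) = 3/8 (monotone f)
tail = 1 - W[0] - W[1]           # Fourier weight above level 1
assert tail == Fraction(19, 64)

inf_0 = sum(fhat(S) ** 2 for j in range(1, n + 1)
            for S in itertools.combinations(range(n), j) if 0 in S)
assert inf_0 == Fraction(3, 8)   # far from "small influences"
```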
The reason \eqref{eqn:cor-simple-inv} is still useful is that we can apply it to small-influence, low-degree functions which are almost $\{-1,1\}$-valued, or $[-1,1]$-valued. Such functions can arise from truncating a very noise-stable Boolean-valued function to a large but constant degree. For example, we might profitably apply \eqref{eqn:cor-simple-inv} to $f = \mathrm{Maj}_n^{\leq k}$ and then deduce some consequences for $\mathrm{Maj}_n({\boldsymbol{x}})$ using the fact that $\mathop{\bf E}[(\mathrm{Maj}_n^{\leq k}({\boldsymbol{x}}) - \mathrm{Maj}_n({\boldsymbol{x}}))^2] = \mathbf{W}^{> k}[\mathrm{Maj}_n] \leq O(1/\sqrt{k})$ (Corollary 5.20). Let’s consider this sort of idea more generally:
Corollary 68 Let $f : \{-1,1\}^n \to {\mathbb R}$ have $\mathop{\bf Var}[f] \leq 1$. Let $k \geq 0$ and suppose $f^{\leq k}$ has $\epsilon$-small influences. Then for any $c$-Lipschitz $\psi : {\mathbb R} \to {\mathbb R}$ we have \begin{equation} \label{eqn:simple-inv-trunc1} \left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\psi(f({\boldsymbol{x}}))] – \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\psi(f(\boldsymbol{g}))]\right| \leq O(c) \cdot \bigl(2^{k} \epsilon^{1/4} + \|f^{> k}\|_2\bigr). \end{equation} In particular, suppose $h : \{-1,1\}^n \to {\mathbb R}$ has $\mathop{\bf Var}[h] \leq 1$ and no $(\epsilon,\delta)$-notable coordinates (we assume $\epsilon \leq 1$, $\delta \leq \frac{1}{20}$). Then \[ \left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \{-1,1\}^n}[\psi(\mathrm{T}_{1-\delta} h({\boldsymbol{x}}))] – \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)^n}[\psi(\mathrm{T}_{1-\delta} h(\boldsymbol{g}))]\right| \leq O(c) \cdot \epsilon^{\delta/3}. \]
Proof: For the first statement we simply decompose $f = f^{\leq k} + f^{> k}$. Then the left-hand side of \eqref{eqn:simple-inv-trunc1} can be written as \begin{multline*} \left|\mathop{\bf E}[\psi(f^{\leq k}({\boldsymbol{x}}) + f^{> k}({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(f^{\leq k}(\boldsymbol{g}) + f^{> k}(\boldsymbol{g}))]\right| \\ \leq \left|\mathop{\bf E}[\psi(f^{\leq k}({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(f^{\leq k}(\boldsymbol{g}))]\right| + c\mathop{\bf E}[|f^{>k}({\boldsymbol{x}})|] + c\mathop{\bf E}[|f^{>k}(\boldsymbol{g})|], \end{multline*} using the fact that $\psi$ is $c$-Lipschitz. The first quantity is at most $O(c) \cdot 2^{k} \epsilon^{1/4}$, by Corollary 67 (even if $k$ is not an integer). As for the other two quantities, Cauchy–Schwarz implies \[ \mathop{\bf E}[|f^{>k}({\boldsymbol{x}})|] \leq \sqrt{\mathop{\bf E}[f^{>k}({\boldsymbol{x}})^2]} = \sqrt{\sum_{|S| > k} \widehat{f}(S)^2} = \|f^{> k}\|_2, \] and the same bound also holds for $\mathop{\bf E}[|f^{>k}(\boldsymbol{g})|]$; this uses the fact that $\mathop{\bf E}[f^{>k}(\boldsymbol{g})^2] = \sum_{|S| > k} \widehat{f}(S)^2$ just as in Remark 64. This completes the proof of \eqref{eqn:simple-inv-trunc1}.
As for the second statement of the corollary, let $f = \mathrm{T}_{1-\delta} h$. The assumptions on $h$ imply that $\mathop{\bf Var}[f] \leq 1$ and that $f^{\leq k}$ has $\epsilon$-small influences for any $k$; the latter is true because \[ \mathbf{Inf}_i[f^{\leq k}] = \sum_{|S| \leq k, S \ni i} (1-\delta)^{2|S|} \widehat{h}(S)^2 \leq \sum_{S \ni i} (1-\delta)^{|S| – 1} \widehat{h}(S)^2 = \mathbf{Inf}_i^{(1-\delta)}[h] \leq \epsilon \] since $h$ has no $(\epsilon, \delta)$-notable coordinate. Furthermore, \[ \|f^{ > k}\|_2^2 = \sum_{|S| > k} (1-\delta)^{2|S|} \widehat{h}(S)^2 \leq (1-\delta)^{2k} \mathop{\bf Var}[h] \leq (1-\delta)^{2k} \leq \exp(-2k\delta) \] for any $k \geq 1$; i.e., $\|f^{ > k}\|_2 \leq \exp(-k\delta)$. So applying the first part of the corollary gives \begin{equation} \label{eqn:weird-balance} \left|\mathop{\bf E}[\psi(f({\boldsymbol{x}}))] – \mathop{\bf E}[\psi(f(\boldsymbol{g}))]\right| \leq O(c) \cdot \bigl(2^{k} \epsilon^{1/4} + \exp(-k\delta)\bigr) \end{equation} for any $k \geq 0$. Choosing $k = \frac13 \ln(1/\epsilon)$, the right-hand side of \eqref{eqn:weird-balance} becomes \[ O(c) \cdot \left(\epsilon^{-(1/3)\ln 2} \epsilon^{1/4} + \epsilon^{\delta/3}\right) \leq O(c) \cdot \epsilon^{\delta/3}, \] where the inequality uses the assumption $\delta \leq \frac{1}{20}$ (numerically, $\frac14 – \frac13 \ln 2 \approx \frac{1}{53}$). This completes the proof of the second statement of the corollary. $\Box$
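The numerical side condition in the last step can be double-checked directly (our arithmetic, matching the parenthetical above): with $k = \frac13 \ln(1/\epsilon)$ the first error term is $\epsilon^{1/4 - (\ln 2)/3}$, and the exponent slack $\frac14 - \frac13 \ln 2$ is indeed about $\frac{1}{53}$ and at least $\delta/3$ whenever $\delta \leq \frac{1}{20}$:

```python
import math

slack = 0.25 - math.log(2) / 3   # exponent of the first error term
assert 0.018 < slack < 0.020     # approximately 1/53
assert abs(1 / slack - 53) < 1
assert slack >= (1 / 20) / 3     # so eps^slack <= eps^(delta/3) for eps <= 1
```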
Finally, if we think of the Basic Invariance Principle as the nonlinear analogue of our Variant Berry–Esseen Theorem, it’s natural to ask for the nonlinear analogue of the Berry–Esseen Theorem itself, i.e., a statement showing cdf-closeness of $F({\boldsymbol{x}})$ and $F(\boldsymbol{g})$. It’s straightforward to obtain a Lévy distance bound just as in the degree-$1$ case, Corollary 61; in the exercises you are asked to show the following:
Corollary 69 In the setting of Corollary 66 we have the Lévy distance bound $d_{L}(F({\boldsymbol{x}}),F(\boldsymbol{y})) \leq O(2^k\epsilon^{1/5})$. In the setting of Remark 65 we have the bound $d_{L}(F({\boldsymbol{x}}),F(\boldsymbol{y})) \leq (1/\rho)^{O(k)} \epsilon^{1/8}$.
Suppose we now want actual cdf-closeness in the case that $\boldsymbol{y} \sim \mathrm{N}(0,1)^n$. In the degree-$1$ (Berry–Esseen) case we used the fact that degree-$1$ polynomials of independent Gaussians have good anticoncentration. The analogous statement for higher-degree polynomials of Gaussians is not so easy to prove; however, Carbery and Wright [CW01] have obtained the following essentially optimal result:
Carbery–Wright Theorem Let $p : {\mathbb R}^n \to {\mathbb R}$ be a polynomial (not necessarily multilinear) of degree at most $k$, let $\boldsymbol{g} \sim \mathrm{N}(0,1)^n$, and assume $\mathop{\bf E}[p(\boldsymbol{g})^2] = 1$. Then for all $\epsilon > 0$, \[ \mathop{\bf Pr}[|p(\boldsymbol{g})| \leq \epsilon] \leq O(k \epsilon^{1/k}), \] where the $O(\cdot)$ hides a universal constant.
Using this theorem it’s not hard (exercise) to obtain:
Theorem 70 Let $f : \{-1,1\}^n \to {\mathbb R}$ be of degree at most $k$, with $\epsilon$-small influences and $\mathop{\bf Var}[f] = 1$. Then for all $u \in {\mathbb R}$, \[ \left| \mathop{\bf Pr}[f({\boldsymbol{x}}) \leq u] – \mathop{\bf Pr}[f(\boldsymbol{g}) \leq u] \right| \leq O(k) \cdot \epsilon^{1/(4k+1)}, \] where the $O(\cdot)$ hides a universal constant.
Berry–Esseen Theorem Let $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n$ be independent random variables with $\mathop{\bf E}[\boldsymbol{X}_i] = 0$ and $\mathop{\bf Var}[\boldsymbol{X}_i] = \sigma_i^2$, and assume $\sum_{i=1}^n \sigma_i^2 = 1$. Let $\boldsymbol{S} = \sum_{i=1}^n \boldsymbol{X}_i$ and let $\boldsymbol{Z} \sim \mathrm{N}(0,1)$ be a standard Gaussian. Then for all $u \in {\mathbb R}$, \[ |\mathop{\bf Pr}[\boldsymbol{S} \leq u] – \mathop{\bf Pr}[\boldsymbol{Z} \leq u]| \leq c \gamma, \] where \[ \gamma = \sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3 \] and $c$ is a universal constant. (For definiteness, $c = .56$ is acceptable.)
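The theorem is easy to test numerically. In this sketch (our own example, not from the text) we take $n = 10$ uniform $\pm 1/\sqrt{n}$ bits, so that $\gamma = n \cdot n^{-3/2} = 1/\sqrt{n}$, and compute the exact sup-distance between the cdf of $\boldsymbol{S}$ and the Gaussian cdf; since $\boldsymbol{S}$'s cdf is a step function and $\Phi$ is increasing, the supremum is attained at the atoms:

```python
import math

n = 10
# Atoms of S = sum of n uniform +-1/sqrt(n) bits, with binomial probabilities.
atoms = {(2 * k - n) / math.sqrt(n): math.comb(n, k) / 2 ** n for k in range(n + 1)}

def Phi(u):  # standard Gaussian cdf
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

sup = 0.0
cdf = 0.0
for v in sorted(atoms):
    sup = max(sup, abs(cdf - Phi(v)))  # just below the atom at v
    cdf += atoms[v]
    sup = max(sup, abs(cdf - Phi(v)))  # at the atom
gamma = 1 / math.sqrt(n)
assert sup <= 0.56 * gamma  # the Berry-Esseen bound with c = .56
```

Here the actual distance is about $0.12$, comfortably below the bound $0.56/\sqrt{10} \approx 0.18$.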
In this traditional statement of Berry–Esseen, the error term $\gamma$ is a little opaque. To say that $\gamma$ is small is to simultaneously say two things: the random variables $\boldsymbol{X}_i$ are all “reasonable” (as in Chapter 9.1); and, none is too dominant in terms of variance. In Chapter 9.1 we discussed several related notions of “reasonableness” for a random variable $\boldsymbol{X}$. It was convenient there to use the definition that $\|\boldsymbol{X}\|_4^4$ is not much larger than $\|\boldsymbol{X}\|_2^4$. For the Berry–Esseen Theorem it’s more convenient (and slightly stronger) to use the analogous condition for the $3$rd moment. (For the Invariance Principle it will be more convenient to use $(2,3,\rho)$- or $(2,4,\rho)$-hypercontractivity.) The implication for Berry–Esseen is the following:
Remark 56 In the Berry–Esseen Theorem, if all of the $\boldsymbol{X}_i$’s are “reasonable” in the sense that $\|\boldsymbol{X}_i\|_3^3 \leq B \|\boldsymbol{X}_i\|_2^3 = B \sigma_i^{3}$, then we can use the bound \begin{equation} \label{eqn:be-no-variance-dom} \gamma \leq B \cdot \max_i\{\sigma_i\}, \end{equation} as this is a consequence of \[ \gamma =\sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3 \leq B \sum_{i=1}^n \sigma_i^{3} \leq B \cdot \max_i \{\sigma_i\} \cdot \sum_{i=1}^n \sigma_i^2 = B \cdot \max_i \{\sigma_i\}. \] (Cf. Remark 5.12.) Note that some “reasonableness” condition must hold if $\boldsymbol{S} = \sum_i \boldsymbol{X}_i$ is to behave like a Gaussian. For example, if each $\boldsymbol{X}_i$ is the “unreasonable” random variable which is $\pm \sqrt{n}$ with probability $\frac{1}{2n^2}$ each and $0$ otherwise, then $\boldsymbol{S} = 0$ except with probability at most $\frac{1}{n}$ — quite unlike a Gaussian. Further, even assuming reasonableness we still need a condition like \eqref{eqn:be-no-variance-dom} ensuring that no $\boldsymbol{X}_i$ is too dominant (“influential”) in terms of variance. For example, if $\boldsymbol{X}_1 \sim \{-1,1\}$ is a uniformly random bit and $\boldsymbol{X}_2, \dots, \boldsymbol{X}_n \equiv 0$, then $\boldsymbol{S} \equiv \boldsymbol{X}_1$, which is again quite unlike a Gaussian.
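Plugging numbers into the first example (our arithmetic, not from the text) makes the failure quantitative: the variances sum to $1$, yet $\gamma = \sqrt{n}$, so the Berry–Esseen bound is vacuous even though $\boldsymbol{S}$ is $0$ with probability at least $1 - 1/n$:

```python
import math

# The "unreasonable" example: X_i = +-sqrt(n) w.p. 1/(2n^2) each, else 0.
n = 100
p = 1 / (2 * n ** 2)
var_i = 2 * p * n                    # = 1/n, so the n variances sum to 1
third_i = 2 * p * n * math.sqrt(n)   # E|X_i|^3 = n^(-1/2)
gamma = n * third_i                  # = sqrt(n): the bound c*gamma is useless
assert abs(n * var_i - 1) < 1e-12
assert abs(gamma - math.sqrt(n)) < 1e-9
assert abs(n * 2 * p - 1 / n) < 1e-12  # union bound: Pr[S != 0] <= 1/n
```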
There are several known ways to prove the Berry–Esseen Theorem; for example, using characteristic functions (i.e., “real” Fourier analysis), or Stein’s Method. We’ll use the “Replacement Method” (also known as the Lindeberg Method, and similar to the “Hybrid Method” in theoretical cryptography). Although it doesn’t always give the sharpest results, it’s a very flexible technique which generalizes easily to higher-degree polynomials of random variables (as in the Invariance Principle) and random vectors. The Replacement Method suggests itself as soon as the Berry–Esseen Theorem is written in a slightly different form: Instead of trying to show \begin{equation} \label{eqn:replacement-no-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}, \end{equation} where $\boldsymbol{Z} \sim \mathrm{N}(0,1)$, we’ll instead try to show the equivalent statement \begin{equation} \label{eqn:replacement-suggest} \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Z}_1 + \boldsymbol{Z}_2 + \cdots + \boldsymbol{Z}_n, \end{equation} where the $\boldsymbol{Z}_i$’s are independent Gaussians with $\boldsymbol{Z}_i \sim \mathrm{N}(0,\sigma_i^2)$. The statements \eqref{eqn:replacement-no-suggest} and \eqref{eqn:replacement-suggest} really are identical, since the sum of independent Gaussians is Gaussian, with the variances adding. The Replacement Method proves \eqref{eqn:replacement-suggest} by replacing the $\boldsymbol{X}_i$’s with $\boldsymbol{Z}_i$’s one by one. Roughly speaking, we introduce the “hybrid” random variables \[ \boldsymbol{H}_t = \boldsymbol{Z}_1 + \cdots + \boldsymbol{Z}_t + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n, \] show that $\boldsymbol{H}_{t-1} \approx \boldsymbol{H}_t$ for each $t \in [n]$, and then simply add up the $n$ errors.
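To illustrate the hybrid idea on a toy instance of my own choosing (not one from the text): with the test $\psi(s) = \cos s$, $\boldsymbol{X}_i = \pm 1/\sqrt{n}$ uniform, and $\boldsymbol{Z}_i \sim \mathrm{N}(0, 1/n)$, independence makes every hybrid expectation exactly computable as a product of one-variable factors, so one can watch the $n$ small per-step errors accumulate:

```python
import math

# E[cos H_t] for the hybrids H_t = Z_1 + ... + Z_t + X_{t+1} + ... + X_n,
# with X_i = +/- 1/sqrt(n) uniform and Z_i ~ N(0, 1/n).  By independence,
# E[cos H_t] = Re prod_i E[exp(i * term)] = exp(-t/(2n)) * cos(1/sqrt(n))**(n-t).
def hybrid_expectations(n):
    return [math.exp(-t / (2 * n)) * math.cos(1 / math.sqrt(n)) ** (n - t)
            for t in range(n + 1)]

def step_errors(n):
    # |E[cos H_{t-1}] - E[cos H_t]| for each replacement step t = 1, ..., n
    h = hybrid_expectations(n)
    return [abs(h[t] - h[t - 1]) for t in range(1, n + 1)]
```

Each step contributes an error of order $n^{-3/2}$, and by the triangle inequality the $n$ steps together bound the total difference $|\mathop{\bf E}[\cos \boldsymbol{H}_0] - \mathop{\bf E}[\cos \boldsymbol{H}_n]|$.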
As a matter of fact, the Replacement Method doesn’t really have anything to do with Gaussian random variables. It actually seeks to show that \[ \boldsymbol{X}_1 + \boldsymbol{X}_2 + \cdots + \boldsymbol{X}_n \approx \boldsymbol{Y}_1 + \boldsymbol{Y}_2 + \cdots + \boldsymbol{Y}_n, \] whenever $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n, \boldsymbol{Y}_1, \dots, \boldsymbol{Y}_n$ are independent random variables with “matching first and second moments”, meaning $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ and $\mathop{\bf E}[\boldsymbol{X}_i^2] = \mathop{\bf E}[\boldsymbol{Y}_i^2]$ for each $i \in [n]$. (The error will be proportional to $\sum_i (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3)$.) Another way of putting it (roughly speaking) is that the linear form $x_1 + \cdots + x_n$ is invariant to what independent random variables you substitute in for $x_1, \dots, x_n$, so long as you always use the same first and second moments. The fact that we can take the $\boldsymbol{Y}_i$’s to be Gaussians (with $\boldsymbol{Y}_i \sim \mathrm{N}(\mathop{\bf E}[\boldsymbol{X}_i], \mathop{\bf Var}[\boldsymbol{X}_i])$) and then in the end use the fact that the sum of independent Gaussians is Gaussian to derive the simpler-looking \[ \boldsymbol{S} = \sum_{i=1}^n \boldsymbol{X}_i \approx \mathrm{N}(\mathop{\bf E}[\boldsymbol{S}], \mathop{\bf Var}[\boldsymbol{S}]) \] is just a pleasant bonus (and one that we’ll no longer get once we look at nonlinear polynomials of random variables in Section 6). Indeed, the remainder of this section will be devoted to showing that \[ \boldsymbol{S}_X = \boldsymbol{X}_1 + \cdots + \boldsymbol{X}_n \quad \text{is ``close'' to} \quad \boldsymbol{S}_Y = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_n \] whenever the $\boldsymbol{X}_i$’s and $\boldsymbol{Y}_i$’s are independent, “reasonable” random variables with matching first and second moments.
To do this, we’ll first have to discuss in more detail what it means for two random variables to be “close”. A traditional measure of closeness between two random variables $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ is the “cdf-distance” used in the Berry–Esseen Theorem: $\mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \approx \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u]$ for every $u \in {\mathbb R}$. But there are other natural measures of closeness too. We might want to know that the absolute moments of $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ are close; for example, that $\|\boldsymbol{S}_X\|_1 \approx \|\boldsymbol{S}_Y\|_1$. Or, we might like to know that $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ stray from the interval $[-1,1]$ by about the same amount: $\mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\mathrm{dist}_{[-1,1]}(\boldsymbol{S}_Y)]$. Here we are using:
Definition 57 For any interval $\emptyset \neq I \subsetneq {\mathbb R}$ the function $\mathrm{dist}_I : {\mathbb R} \to {\mathbb R}^{\geq 0}$ is defined to measure the distance of a point from $I$; i.e., $\mathrm{dist}_I(s) = \inf_{u \in I}\{|s-u|\}$.
All of the closeness measures just described can be put in a common framework: they are requiring $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] \approx \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for various “test functions” (or “distinguishers”) $\psi : {\mathbb R} \to {\mathbb R}$.
It would be nice to prove a version of the Berry–Esseen Theorem that showed closeness for all the test functions $\psi$ depicted in Figure 11.1, and more. What class of tests might we be able to handle? On one hand, we can’t be too ambitious. For example, suppose each $\boldsymbol{X}_i \sim \{-1,1\}$, each $\boldsymbol{Y}_i \sim \mathrm{N}(0,1)$, and $\psi(s) = 1_{s \in {\mathbb Z}}$. Then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = 1$ because $\boldsymbol{S}_X$ is supported on the integers, but $\mathop{\bf E}[\psi(\boldsymbol{S}_Y)] = 0$ because $\boldsymbol{S}_Y \sim \mathrm{N}(0,n)$ is a continuous random variable. On the other hand, there are some simple kinds of tests $\psi$ for which we have exact equality. For example, if $\psi(s) = s$, then $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$; this is by the assumption of matching first moments, $\mathop{\bf E}[\boldsymbol{X}_i] = \mathop{\bf E}[\boldsymbol{Y}_i]$ for all $i$. Similarly, if $\psi(s) = s^2$, then \begin{align} \mathop{\bf E}[\psi(\boldsymbol{S}_X)] &= \mathop{\bf E}\Bigl[\bigl(\sum_i \boldsymbol{X}_i\bigr)^2\Bigr] = \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i\boldsymbol{X}_j] \nonumber\\ &= \sum_{i} \mathop{\bf E}[\boldsymbol{X}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{X}_i]\mathop{\bf E}[\boldsymbol{X}_j] \label{eqn:be-quadratic1}
\end{align}
(using independence of the $\boldsymbol{X}_i$’s); and also,
\begin{align}
\label{eqn:be-quadratic2} \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] &= \sum_{i} \mathop{\bf E}[\boldsymbol{Y}_i^2] + \sum_{i \neq j} \mathop{\bf E}[\boldsymbol{Y}_i]\mathop{\bf E}[\boldsymbol{Y}_j]. \end{align} The quantities \eqref{eqn:be-quadratic1} and \eqref{eqn:be-quadratic2} are equal because of the matching first and second moment conditions.
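The exact cancellation for quadratic tests is easy to verify by brute-force enumeration. The toy distributions below are my own choices: $\boldsymbol{X}_i$ uniform on $\{-1,1\}$, and $\boldsymbol{Y}_i$ equal to $\pm 2$ with probability $1/8$ each (else $0$); both have mean $0$ and variance $1$, so any quadratic test gives identical expectations:

```python
import itertools
from fractions import Fraction

# X_i uniform on {-1, +1}; Y_i is +/-2 w.p. 1/8 each, 0 w.p. 3/4.
# Both distributions have matching first moments (0) and second moments (1).
X = [(Fraction(1, 2), -1), (Fraction(1, 2), 1)]
Y = [(Fraction(1, 8), -2), (Fraction(3, 4), 0), (Fraction(1, 8), 2)]

def expect_psi_of_sum(dist, n, psi):
    # Exact E[psi(S)] for S a sum of n i.i.d. copies of dist, by enumeration.
    total = Fraction(0)
    for outcome in itertools.product(dist, repeat=n):
        p = Fraction(1)
        s = 0
        for (pi, xi) in outcome:
            p *= pi
            s += xi
        total += p * psi(s)
    return total

def psi(s):
    return 7 + 3 * s + 5 * s * s   # an arbitrary quadratic test
```

With $n = 4$, both sides come out to exactly $7 + 3 \cdot 0 + 5 \cdot 4 = 27$, since $\mathop{\bf E}[\boldsymbol{S}^2] = n$ in each case.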
As a consequence of these observations we have $\mathop{\bf E}[\psi(\boldsymbol{S}_X)] = \mathop{\bf E}[\psi(\boldsymbol{S}_Y)]$ for any quadratic polynomial $\psi(s) = a + bs + cs^2$. This suggests that to handle a general test $\psi$ we try to approximate it by a quadratic polynomial up to some error; in other words, consider its $2$nd-order Taylor expansion. For this to make sense the function $\psi$ must have a continuous $3$rd derivative, and the error we incur will involve the magnitude of this derivative. Indeed, we will now prove a variant of the Berry–Esseen Theorem for the class of $\mathcal{C}^3$ test functions $\psi$ with $\psi^{\prime\prime\prime}$ uniformly bounded. You might be concerned that this class doesn’t contain any of the interesting test functions depicted in Figure 11.1. But we’ll be able to handle even those test functions with some loss in the parameters by using a simple “hack” — approximating them by smooth functions, as suggested in Figure 11.2.
Invariance Principle for Sums of Random Variables Let $\boldsymbol{X}_1, \dots, \boldsymbol{X}_n$, $\boldsymbol{Y}_1, \dots, \boldsymbol{Y}_n$ be independent random variables with matching $1$st and $2$nd moments; i.e., $\mathop{\bf E}[\boldsymbol{X}_i^k] = \mathop{\bf E}[\boldsymbol{Y}_i^k]$ for $i \in [n]$, $k \in \{1, 2\}$. Write $\boldsymbol{S}_X = \sum_i \boldsymbol{X}_i$ and $\boldsymbol{S}_Y = \sum_i \boldsymbol{Y}_i$. Then for any $\psi : {\mathbb R} \to {\mathbb R}$ with continuous third derivative, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma_{XY}, \] where $\gamma_{XY} = \sum_i (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3)$.
Proof: The proof is by the Replacement Method. For $0 \leq t \leq n$, define the “hybrid” random variable \[ \boldsymbol{H}_t = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_t + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n, \] so $\boldsymbol{S}_X = \boldsymbol{H}_0$ and $\boldsymbol{S}_Y = \boldsymbol{H}_n$. Thus by the triangle inequality, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq \sum_{t=1}^n \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1})] - \mathop{\bf E}[\psi(\boldsymbol{H}_t)] \right|. \] Given the definition of $\gamma_{XY}$, we can complete the proof by showing that for each $t \in [n]$, \begin{align} \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot (\mathop{\bf E}[|\boldsymbol{X}_t|^3] + \mathop{\bf E}[|\boldsymbol{Y}_t|^3]) &\geq \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1})] - \mathop{\bf E}[\psi(\boldsymbol{H}_t)] \right| \nonumber\\ &= \left| \mathop{\bf E}[\psi(\boldsymbol{H}_{t-1}) - \psi(\boldsymbol{H}_t)] \right| \nonumber\\ &= \left| \mathop{\bf E}[\psi(\boldsymbol{U}_t + \boldsymbol{X}_t) -\psi(\boldsymbol{U}_t + \boldsymbol{Y}_t)] \right|, \label{eqn:var-be-ineq} \end{align} where \[ \boldsymbol{U}_t = \boldsymbol{Y}_1 + \cdots + \boldsymbol{Y}_{t-1} + \boldsymbol{X}_{t+1} + \cdots + \boldsymbol{X}_n. \] Note that $\boldsymbol{U}_t$ is independent of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$. We are now comparing $\psi$’s values at $\boldsymbol{U}_t + \boldsymbol{X}_t$ and $\boldsymbol{U}_t + \boldsymbol{Y}_t$, with the presumption that $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$ are rather small compared to $\boldsymbol{U}_t$. This clearly suggests the use of Taylor’s theorem: For all $u, \delta \in {\mathbb R}$, \[ \psi(u + \delta) = \psi(u) + \psi'(u) \delta + \tfrac12 \psi''(u) \delta^2 + \tfrac16 \psi^{\prime\prime\prime}(u^*) \delta^3, \] for some $u^* = u^*(u,\delta)$ between $u$ and $u + \delta$.
Applying this pointwise with $u = \boldsymbol{U}_t$, $\delta = \boldsymbol{X}_t, \boldsymbol{Y}_t$ yields \begin{align*} \psi(\boldsymbol{U}_t + \boldsymbol{X}_t) &= \psi(\boldsymbol{U}_t) + \psi'(\boldsymbol{U}_t) \boldsymbol{X}_t + \tfrac12 \psi''(\boldsymbol{U}_t) \boldsymbol{X}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t^*) \boldsymbol{X}_t^3\\ \psi(\boldsymbol{U}_t + \boldsymbol{Y}_t) &= \psi(\boldsymbol{U}_t) + \psi'(\boldsymbol{U}_t) \boldsymbol{Y}_t + \tfrac12 \psi''(\boldsymbol{U}_t) \boldsymbol{Y}_t^2 + \tfrac16 \psi^{\prime\prime\prime}(\boldsymbol{U}_t^{**}) \boldsymbol{Y}_t^3 \end{align*} for some random variables $\boldsymbol{U}_t^*, \boldsymbol{U}_t^{**}$. Referring back to our goal of \eqref{eqn:var-be-ineq}, what happens when we subtract these two identities and take expectations? The $\psi(\boldsymbol{U}_t)$ terms cancel. The next difference is \[ \mathop{\bf E}[\psi'(\boldsymbol{U}_t) (\boldsymbol{X}_t - \boldsymbol{Y}_t)] = \mathop{\bf E}[\psi'(\boldsymbol{U}_t)]\cdot\mathop{\bf E}[\boldsymbol{X}_t - \boldsymbol{Y}_t] = \mathop{\bf E}[\psi'(\boldsymbol{U}_t)]\cdot 0 = 0, \] where the first equality used that $\boldsymbol{U}_t$ is independent of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$, and the second equality used the matching $1$st moments of $\boldsymbol{X}_t$ and $\boldsymbol{Y}_t$. An identical argument, using matching $2$nd moments, shows that the difference of the quadratic terms disappears in expectation.
Thus we’re left only with the “error term”: \begin{align*} \left| \mathop{\bf E}[\psi(\boldsymbol{U}_t + \boldsymbol{X}_t) -\psi(\boldsymbol{U}_t + \boldsymbol{Y}_t)] \right| &= \tfrac16 \left| \mathop{\bf E}[\psi^{\prime\prime\prime}(\boldsymbol{U}_t^*)\boldsymbol{X}_t^3 -\psi^{\prime\prime\prime}(\boldsymbol{U}_t^{**})\boldsymbol{Y}_t^3] \right| \\ &\leq \tfrac16 \|\psi^{\prime\prime\prime}\|_\infty \cdot (\mathop{\bf E}[|\boldsymbol{X}_t|^3] + \mathop{\bf E}[|\boldsymbol{Y}_t|^3]), \end{align*} where the last step used the triangle inequality. This confirms \eqref{eqn:var-be-ineq} and completes the proof. $\Box$
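A numerical sanity check of the theorem, with an assumed test function of my own (not one from the text): $\psi = \cos$ has $\|\psi^{\prime\prime\prime}\|_\infty = 1$, and for $\boldsymbol{X}_i = \pm 1/\sqrt{n}$ uniform versus $\boldsymbol{Y}_i \sim \mathrm{N}(0,1/n)$, both expectations are exactly computable via characteristic functions:

```python
import math

# By independence: E[cos S_X] = cos(1/sqrt(n))**n and E[cos S_Y] = exp(-1/2),
# while gamma_XY = n * (n**-1.5 + 2*sqrt(2/pi)*n**-1.5).  The theorem asserts
# that the gap is at most (1/6) * ||psi'''||_inf * gamma_XY, with ||cos'''|| = 1.
def invariance_gap_and_bound(n):
    gap = abs(math.cos(1 / math.sqrt(n)) ** n - math.exp(-0.5))
    gamma_xy = (1 + 2 * math.sqrt(2 / math.pi)) / math.sqrt(n)
    return gap, gamma_xy / 6
```

For $n = 100$ the actual gap is about $5 \times 10^{-4}$, comfortably inside the theorem's bound of about $0.043$.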
We can now give a Berry–Esseen-type corollary by taking the $\boldsymbol{Y}_i$’s to be Gaussians:
Variant Berry–Esseen Theorem In the setting of the Berry–Esseen Theorem, for all $\mathcal{C}^3$ functions $\psi : {\mathbb R} \to {\mathbb R}$, \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S})] - \mathop{\bf E}[\psi(\boldsymbol{Z})] \right| \leq \tfrac16(1+2\sqrt{\tfrac{2}{\pi}}) \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma \leq .433 \|\psi^{\prime\prime\prime}\|_\infty \cdot \gamma. \]
Proof: Applying the preceding theorem with $\boldsymbol{Y}_i \sim \mathrm{N}(0, \sigma_i^2)$ (and hence $\boldsymbol{S}_Y \sim \mathrm{N}(0, 1)$), it suffices to show that \begin{equation} \label{eqn:be-bound} \gamma_{XY} = \sum_{i=1}^n (\|\boldsymbol{X}_i\|_3^3 + \|\boldsymbol{Y}_i\|_3^3) \leq (1+2\sqrt{\tfrac{2}{\pi}}) \cdot \gamma = (1+2\sqrt{\tfrac{2}{\pi}}) \cdot \sum_{i=1}^n \|\boldsymbol{X}_i\|_3^3. \end{equation} In particular, we just need to show that $\|\boldsymbol{Y}_i\|_3^3 \leq 2\sqrt{\tfrac{2}{\pi}}\|\boldsymbol{X}_i\|_3^3$ for each $i$. This holds because Gaussians are extremely reasonable; by explicitly computing $3$rd absolute moments we indeed obtain \[ \|\boldsymbol{Y}_i\|_3^3 = \sigma_i^3 \|\mathrm{N}(0,1)\|_3^3 = 2\sqrt{\tfrac{2}{\pi}} \sigma_i^3 = 2\sqrt{\tfrac{2}{\pi}} \|\boldsymbol{X}_i\|_2^3 \leq 2\sqrt{\tfrac{2}{\pi}} \|\boldsymbol{X}_i\|_3^3. \qquad \Box \]
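The constant $2\sqrt{2/\pi} \approx 1.596$ used in this proof can be confirmed by directly integrating $|z|^3$ against the standard Gaussian density; a quick midpoint-rule sketch (the step count and cutoff are arbitrary choices of mine):

```python
import math

# Numerically compute E|Z|^3 for Z ~ N(0,1), i.e. 2 * int_0^inf z^3 phi(z) dz,
# by the midpoint rule on [0, cutoff]; the tail beyond cutoff is negligible.
def abs_third_moment_std_gaussian(steps=200000, cutoff=12.0):
    dz = cutoff / steps
    total = 0.0
    for k in range(steps):
        z = (k + 0.5) * dz
        total += z ** 3 * math.exp(-z * z / 2)
    return 2 * total * dz / math.sqrt(2 * math.pi)
```

The result agrees with the closed form $\|\mathrm{N}(0,1)\|_3^3 = 2\sqrt{2/\pi}$ to high precision.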
This version of the Berry–Esseen Theorem is incomparable with the standard version. Sometimes it can be stronger; for example, if for some reason we wanted to show $\mathop{\bf E}[\cos \boldsymbol{S}] \approx \mathop{\bf E}[\cos \boldsymbol{Z}]$ then the Variant Berry–Esseen Theorem gives this with error $.433\gamma$, whereas it can’t be directly deduced from the standard Berry–Esseen at all. On the other hand, as we’ll see shortly, we can only obtain the standard Berry–Esseen conclusion from the Variant version with an error bound of $O(\gamma^{1/4})$ rather than $O(\gamma)$.
We end this section by describing the “hacks” which let us extend the Variant Berry–Esseen Theorem to cover certain non-$\mathcal{C}^3$ tests $\psi$. As mentioned, the idea is to smooth them out, or “mollify” them:
Proposition 58 Let $\psi : {\mathbb R} \to {\mathbb R}$ be $c$-Lipschitz. Then for any $\eta > 0$ there exists $\widetilde{\psi}_\eta : {\mathbb R} \to {\mathbb R}$ satisfying $\|\psi - \widetilde{\psi}_\eta\|_\infty \leq c \eta$ and $\|\widetilde{\psi}_\eta^{(k)}\|_\infty \leq C_k c/\eta^{k-1}$ for each $k \in {\mathbb N}^+$. Here $C_k$ is a constant depending only on $k$, and $\widetilde{\psi}_\eta^{(k)}$ denotes the $k$th derivative of $\widetilde{\psi}_\eta$.
The proof is straightforward, taking $\widetilde{\psi}_\eta(s) = \displaystyle \mathop{\bf E}_{\boldsymbol{g} \sim \mathrm{N}(0,1)}[\psi(s+\eta \boldsymbol{g})]$; see the exercises.
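For the $1$-Lipschitz test $\psi(s) = |s|$, the mollification $\widetilde{\psi}_\eta(s) = \mathop{\bf E}[|s + \eta \boldsymbol{g}|]$ even has a closed form (the standard mean of a folded normal), which makes the guarantee $\|\psi - \widetilde{\psi}_\eta\|_\infty \leq c\eta$ easy to check numerically; a sketch:

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

# Closed form for E[|s + eta*g|] with g ~ N(0,1):
#   eta * sqrt(2/pi) * exp(-s^2/(2 eta^2)) + s * (2*Phi(s/eta) - 1).
# Since psi(s) = |s| is 1-Lipschitz, Proposition 58 promises the mollified
# function stays within c*eta = eta of psi everywhere.
def mollified_abs(s, eta):
    return (eta * math.sqrt(2 / math.pi) * math.exp(-s * s / (2 * eta * eta))
            + s * (2 * Phi(s / eta) - 1))
```

The worst deviation occurs at $s = 0$, where $\widetilde{\psi}_\eta(0) = \eta\sqrt{2/\pi} \approx 0.80\,\eta \leq \eta$, consistent with the proposition.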
As $\eta \to 0$ this gives a better and better smooth approximation to $\psi$, but also a larger and larger value of $\|\widetilde{\psi}_\eta^{\prime\prime\prime}\|_\infty$. Trading these off gives the following:
Corollary 59 In the setting of the Invariance Principle for Sums of Random Variables, if we merely have that $\psi : {\mathbb R} \to {\mathbb R}$ is $c$-Lipschitz, then \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq O(c) \cdot \gamma_{XY}^{1/3}. \]
Proof: Applying the Invariance Principle for Sums of Random Variables with the test $\widetilde{\psi}_\eta$ from Proposition 58 we get \[ \left| \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_X)] - \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_Y)] \right| \leq O(c/\eta^2) \cdot \gamma_{XY}. \] But $\|\widetilde{\psi}_\eta - \psi\|_\infty \leq c\eta$ implies \[ \left| \mathop{\bf E}[\widetilde{\psi}_\eta(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_X)] \right| \leq \mathop{\bf E}[|\widetilde{\psi}_\eta(\boldsymbol{S}_X) - \psi(\boldsymbol{S}_X)|] \leq c\eta \] and similarly for $\boldsymbol{S}_Y$. Thus we get \[ \left| \mathop{\bf E}[\psi(\boldsymbol{S}_X)] - \mathop{\bf E}[\psi(\boldsymbol{S}_Y)] \right| \leq O(c) \cdot (\eta + \gamma_{XY}/ \eta^2) \] which yields the desired bound by taking $\eta = \gamma_{XY}^{1/3}$. $\Box$
Remark 60 By scaling, the dependence on $c$ in this theorem must be linear; in fact, since we can always divide $\psi$ by $c$, it would have sufficed to prove the theorem assuming $c = 1$.
This corollary covers all Lipschitz tests, which suffices for the functions $\psi(s) = |s|$ and $\psi(s) = \mathrm{dist}_{[-1,1]}(s)$ from Figure 11.1. However, it still isn’t enough for the test $\psi(s) = 1_{s \leq u}$ — i.e., for establishing cdf-closeness as in the usual Berry–Esseen Theorem. Of course, we can’t hope for a smooth approximator $\widetilde{\psi}_\eta$ satisfying $|\widetilde{\psi}_\eta(s) - 1_{s \leq u}| \leq \eta$ for all $s$ because of the discontinuity at $u$. However, as suggested in Figure 11.2, if we’re willing to exclude $s \in [u - \eta, u+\eta]$ we can get an approximator with third derivative bound $O(1/\eta^3)$, and thereby obtain (exercise):
Corollary 61 In the setting of the Invariance Principle for Sums of Random Variables, for all $u \in {\mathbb R}$ we have \[ \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u - \epsilon] - \epsilon \leq \mathop{\bf Pr}[\boldsymbol{S}_X \leq u] \leq \mathop{\bf Pr}[\boldsymbol{S}_Y \leq u + \epsilon] + \epsilon \] for $\epsilon = O(\gamma_{XY}^{1/4})$; i.e., $\boldsymbol{S}_X$ and $\boldsymbol{S}_Y$ have Lévy distance $d_{L}(\boldsymbol{S}_X,\boldsymbol{S}_Y) \leq O(\gamma_{XY}^{1/4})$.
Finally, in the Berry–Esseen setting where $\boldsymbol{S}_Y \sim \mathrm{N}(0,1)$, we can appeal to the “anticoncentration” of Gaussians: \[ \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u + \epsilon] = \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u] + \mathop{\bf Pr}[u < \mathrm{N}(0,1) \leq u+\epsilon] \leq \mathop{\bf Pr}[\mathrm{N}(0,1) \leq u] + \tfrac{1}{\sqrt{2\pi}} \epsilon, \] and similarly for $\mathop{\bf Pr}[\mathrm{N}(0,1) \leq u - \epsilon]$. This lets us convert the Lévy distance bound into a cdf-distance bound. Recalling \eqref{eqn:be-bound}, we immediately deduce the following weaker version of the classical Berry–Esseen Theorem:
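The anticoncentration step only uses that the standard normal density is at most $1/\sqrt{2\pi}$ everywhere; a quick check via the error function (an illustration of mine, not part of the proof):

```python
import math

def Phi(x):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

# Pr[u < N(0,1) <= u + eps] = Phi(u + eps) - Phi(u), which is at most
# eps / sqrt(2*pi) because the density never exceeds 1/sqrt(2*pi).
def gaussian_interval_mass(u, eps):
    return Phi(u + eps) - Phi(u)
```

Sampling a few intervals confirms the bound, with near-equality for short intervals around $u = 0$ where the density peaks.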
Corollary 62 In the setting of the Berry–Esseen Theorem, for all $u \in {\mathbb R}$, \[ \left| \mathop{\bf Pr}[\boldsymbol{S} \leq u] - \mathop{\bf Pr}[\boldsymbol{Z} \leq u] \right| \leq O(\gamma^{1/4}), \] where the $O(\cdot)$ hides a universal constant.
Although the error bound here is weaker than necessary by a power of $1/4$, this weakness will be more than made up for by the ease with which the Replacement Method generalizes to other settings. In the next section we’ll see it applied to nonlinear polynomials of independent random variables. Furthermore, an exercise outlines how to use it to give a Berry–Esseen theorem for sums of independent random vectors; as you’ll see, other than replacing Taylor’s theorem with its multivariate form, hardly a symbol in the proof changes.
The 2014 Simons Symposium, “Discrete Analysis: Beyond the Boolean Cube”, has concluded. Many thanks to the Simons Foundation for sponsoring it, to Meghan Fazzi for administrative support, to super-scribe Li-Yang Tan for the outstanding 5 days of blog posts, and to the participants for making it a great success.