We close this chapter by briefly describing a topic which is in some sense the “opposite” of learning theory: *cryptography*. At the highest level, cryptography is concerned with constructing functions which are computationally easy to compute but computationally difficult to invert. Intuitively, think about the task of encrypting secret messages: you would like a scheme where it’s easy to take any message $x$ and produce an encrypted version $e(x)$, but where it’s hard for an adversary to compute $x$ given $e(x)$. Indeed, even with examples $e(x^{(1)}), \dots, e(x^{(m)})$ of several encryptions, it should be hard for an adversary to learn anything about the encrypted messages, or to predict (“forge”) the encryption of future messages.

A basic task in cryptography is building stronger cryptographic functions from weaker ones. Often the first example in “Cryptography $101$” is the *Goldreich–Levin Theorem*, which is used to build a “pseudorandom generator” from a “one-way permutation”. We sketch the meaning of these terms and the analysis of the construction in the exercises; for now, suffice it to say that the key to the analysis of Goldreich and Levin’s construction is a *learning algorithm*. Specifically, the Goldreich–Levin learning algorithm solves the following problem: Given *query* access to a target function $f : {\mathbb F}_2^n \to {\mathbb F}_2$, find all of the linear functions (in the sense of Chapter 1.6) with which $f$ is at least slightly correlated. Equivalently, find all of the noticeably large Fourier coefficients of $f$.

**Goldreich–Levin Theorem.** Given query access to a target $f : \{-1,1\}^n \to \{-1,1\}$ as well as input $0 < \tau \leq 1$, there is a $\mathrm{poly}(n, 1/\tau)$-time algorithm which with high probability outputs a list $L = \{U_1, \dots, U_\ell\}$ of subsets of $[n]$ such that:

- $|\widehat{f}(U)| \geq \tau\ \ \Rightarrow\ \ U \in L$;
- $U \in L\ \ \Rightarrow\ \ |\widehat{f}(U)| \geq \tau/2$.
(By Parseval’s Theorem, the second guarantee implies that $|L| \leq 4/\tau^2$: each $U \in L$ contributes $\widehat{f}(U)^2 \geq \tau^2/4$ to the total Fourier weight $\sum_U \widehat{f}(U)^2 = 1$.)

Although the Goldreich–Levin Theorem was originally developed for cryptography, it was soon put to use for learning theory. Recall that the “meta-algorithm” of Theorem 29 reduces learning an unknown target $f : \{-1,1\}^n \to \{-1,1\}$ to identifying a collection $\mathcal{F}$ of sets on which $f$’s Fourier spectrum is $\epsilon/2$-concentrated. Using the Goldreich–Levin Algorithm, a learner with query access to $f$ can “collect up” its largest Fourier coefficients until only $\epsilon/2$ Fourier weight remains unfound. This strategy straightforwardly yields the following result (see the exercises):

**Theorem 37.** Let $\mathcal{C}$ be a concept class such that every $f : \{-1,1\}^n \to \{-1,1\}$ in $\mathcal{C}$ has its Fourier spectrum $\epsilon/4$-concentrated on a collection of at most $M$ sets. Then $\mathcal{C}$ can be learned using queries with error $\epsilon$ in time $\mathrm{poly}(M, n, 1/\epsilon)$.

The algorithm of Theorem 37 is often called the *Kushilevitz–Mansour Algorithm*. Much like the Low-Degree Algorithm, it reduces the computational problem of learning $\mathcal{C}$ (using queries) to the analytic problem of proving that the functions in $\mathcal{C}$ have concentrated Fourier spectra. The advantage of the Kushilevitz–Mansour Algorithm is that it works so long as the Fourier spectrum of $f$ is concentrated on *some* small collection of sets; the Low-Degree Algorithm requires that the concentration specifically be on the low-degree characters. The disadvantage of the Kushilevitz–Mansour Algorithm is that it requires query access to $f$, rather than just random examples. An example concept class for which the Kushilevitz–Mansour Algorithm works well is the set of all $f$ for which $\hat{\lVert} f \hat{\rVert}_1$ is not too large:

**Theorem 38.** Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \hat{\lVert} f \hat{\rVert}_1 \leq s\}$. (E.g., $\mathcal{C}$ contains any $f$ computable by a decision tree of size at most $s$.) Then $\mathcal{C}$ is learnable from membership queries with error $\epsilon$ in time $\mathrm{poly}(n, s, 1/\epsilon)$.

This is proved in the exercises.

Let’s now return to the Goldreich–Levin Algorithm itself, which seeks the Fourier coefficients $\widehat{f}(U)$ with magnitude at least $\tau$. Given any candidate $U \subseteq [n]$, Proposition 30 lets us easily distinguish whether the associated coefficient is large, $|\widehat{f}(U)| \geq \tau$, or small, $|\widehat{f}(U)| \leq \tau/2$. The trouble is that there are $2^n$ potential candidates. The Goldreich–Levin Algorithm overcomes this difficulty using a divide-and-conquer strategy which measures the Fourier weight of $f$ on various collections of sets. Let’s make a definition:
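To make the "easy to distinguish" step concrete, here is a minimal Python sketch (my own illustration, not from the text; the function names are hypothetical) of the Proposition 30-style test: estimate $\widehat{f}(U) = \mathop{\bf E}[f(\boldsymbol{x})\chi_U(\boldsymbol{x})]$ by querying $f$ on uniformly random inputs and averaging.

```python
import random

def estimate_coefficient(f, U, n, samples=10000):
    """Estimate f^(U) = E[f(x) * chi_U(x)] by averaging f(x)*chi_U(x)
    over uniformly random queries x ~ {-1,1}^n.  A Chernoff bound says
    O(log(1/delta)/eps^2) samples give accuracy +-eps with probability
    at least 1-delta."""
    total = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        chi_U = 1
        for i in U:
            chi_U *= x[i]          # chi_U(x) is the product of x_i over i in U
        total += f(x) * chi_U
    return total / samples
```

Distinguishing $|\widehat{f}(U)| \geq \tau$ from $|\widehat{f}(U)| \leq \tau/2$ then amounts to estimating to accuracy better than $\pm\tau/4$ and thresholding at $3\tau/4$. The trouble, as noted above, is that running this test on all $2^n$ candidates is far too slow.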

**Definition 39.** Let $f : \{-1,1\}^n \to {\mathbb R}$ and $S \subseteq J \subseteq [n]$. We write \[ \mathbf{W}^{S\mid\overline{J}}[f] = \sum_{T \subseteq \overline{J}} \widehat{f}(S \cup T)^2 \] for the Fourier weight of $f$ on sets whose restriction to $J$ is $S$.

The crucial tool for the Goldreich–Levin Algorithm is Corollary 22, which says that \begin{equation} \label{eqn:GL-key} \mathbf{W}^{S\mid\overline{J}}[f] = \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid \boldsymbol{z}}}(S)^2]. \end{equation} This identity lets a learning algorithm with query access to $f$ efficiently estimate any $\mathbf{W}^{S\mid\overline{J}}[f]$ of its choosing. Intuitively, query access to $f$ allows query access to $f_{J \mid z}$ for any $z \in \{-1,1\}^{\overline{J}}$; with this one can estimate any $\widehat{f_{J \mid z}}(S)$ and hence \eqref{eqn:GL-key}. More precisely:

**Proposition 40.** For any $S \subseteq J \subseteq [n]$ an algorithm with query access to $f : \{-1,1\}^n \to \{-1,1\}$ can compute an estimate of $\mathbf{W}^{S\mid\overline{J}}[f]$ which is accurate to within $\pm \epsilon$ (except with probability at most $\delta$) in time $\mathrm{poly}(n, 1/\epsilon) \cdot \log(1/\delta)$.

*Proof:* From \eqref{eqn:GL-key}, \begin{align*} \mathbf{W}^{S\mid\overline{J}}[f] = \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid \boldsymbol{z}}}(S)^2] &= \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}\left[\mathop{\bf E}_{\boldsymbol{y} \sim \{-1,1\}^J\vphantom{\{-1,1\}^{\overline{J}}}}[f(\boldsymbol{y},\boldsymbol{z})\chi_S(\boldsymbol{y})]^2\right] \\ &= \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}\ \mathop{\bf E}_{\boldsymbol{y}, \boldsymbol{y}' \sim \{-1,1\}^J\vphantom{\{-1,1\}^{\overline{J}}}}[f(\boldsymbol{y}, \boldsymbol{z})\chi_S(\boldsymbol{y}) \cdot f(\boldsymbol{y}', \boldsymbol{z}) \chi_S(\boldsymbol{y}')], \end{align*} where $\boldsymbol{y}$ and $\boldsymbol{y}'$ are independent. As in Proposition 30, $f(\boldsymbol{y}, \boldsymbol{z})\chi_S(\boldsymbol{y}) \cdot f(\boldsymbol{y}', \boldsymbol{z}) \chi_S(\boldsymbol{y}')$ is a $\pm 1$-valued random variable which the algorithm can sample from using queries to $f$. A Chernoff bound implies that $O(\log(1/\delta)/\epsilon^2)$ samples are sufficient to estimate its mean with accuracy $\epsilon$ and confidence $1-\delta$. $\Box$
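The estimator in the proof can be sketched in a few lines of Python (my own illustration, with 0-indexed coordinates and hypothetical names): draw a shared $z$ outside $J$, two independent restrictions $y, y'$ on $J$, and average the resulting $\pm 1$-valued product.

```python
import random

def estimate_weight(f, S, J, n, samples=20000):
    """Estimate W^{S|J-bar}[f] as in Proposition 40: draw a shared
    z ~ {-1,1}^{J-bar} together with two independent restrictions
    y, y' ~ {-1,1}^J, and average f(y,z)*chi_S(y) * f(y',z)*chi_S(y')."""
    total = 0.0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]   # encodes (y, z)
        xp = list(x)                                     # same z outside J
        for i in J:
            xp[i] = random.choice([-1, 1])               # fresh, independent y'
        chi, chip = 1, 1
        for i in S:
            chi *= x[i]
            chip *= xp[i]
        total += f(x) * chi * f(xp) * chip
    return total / samples
```

Note that the two queries share the same $z$; this correlation is exactly what makes the product's expectation equal $\mathbf{W}^{S\mid\overline{J}}[f]$ rather than $\widehat{f}(S)^2$.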

We’re now ready to prove the Goldreich–Levin Theorem.

*Proof:* We begin with an overview of how the algorithm works. Initially, all $2^n$ sets $U$ are (implicitly) put in a single “bucket”. The algorithm then repeats the following loop:

- Select any bucket $\mathcal{B}$ containing $2^m$ sets, $m \geq 1$.
- Split it into two buckets $\mathcal{B}_1$, $\mathcal{B}_2$ of $2^{m-1}$ sets each.
- “Weigh” each $\mathcal{B}_i$, $i = 1, 2$; i.e., estimate $\sum_{U \in \mathcal{B}_i} \widehat{f}(U)^2$.
- Discard $\mathcal{B}_1$ or $\mathcal{B}_2$ if its weight estimate is at most $\tau^2/2$.

The algorithm stops once all buckets contain just $1$ set; it then outputs the list of these sets.

We now fill in the details. First we argue the correctness of the algorithm, assuming all weight estimates are accurate (this assumption is removed later). On one hand, any set $U$ with $|\widehat{f}(U)| \geq \tau$ will never be discarded, since it always contributes weight at least $\tau^2 \geq \tau^2/2$ to the bucket it’s in. On the other hand, no set $U$ with $|\widehat{f}(U)| \leq \tau/2$ can end up in a singleton bucket because such a bucket, when created, would have weight at most $\tau^2/4 \leq \tau^2/2$ and thus be discarded. Notice that this correctness proof does not rely on the weight estimates being exact; it suffices for them to be accurate to within $\pm \tau^2/4$.

The next detail concerns running time. Note that any “active” (undiscarded) bucket has weight at least $\tau^2/4$, even assuming the weight estimates are only accurate to within $\pm \tau^2/4$. Therefore Parseval tells us there can only ever be at most $4/\tau^2$ active buckets. Since a bucket can be split only $n$ times, it follows that the algorithm repeats its main loop at most $4n/\tau^2$ times. Thus as long as the buckets can be maintained and accurately weighed in $\mathrm{poly}(n, 1/\tau)$ time, the overall running time will be $\mathrm{poly}(n, 1/\tau)$ as claimed.

Finally, we describe the bucketing system. The buckets are indexed (and thus maintained implicitly) by an integer $0 \leq k \leq n$ and a subset $S \subseteq [k]$. The bucket $\mathcal{B}_{k,S}$ is defined by \[ \mathcal{B}_{k,S} = \Bigl\{S \cup T : T \subseteq \{k+1, k+2, \dots, n\}\Bigr\}. \] Note that $|\mathcal{B}_{k,S}| = 2^{n-k}$. The initial bucket is $\mathcal{B}_{0, \emptyset}$. The algorithm always splits a bucket $\mathcal{B}_{k,S}$ into the two buckets $\mathcal{B}_{k+1,S}$ and $\mathcal{B}_{k+1,S \cup \{k+1\}}$. The final singleton buckets are of the form $\mathcal{B}_{n,S} = \{S\}$. Finally, the weight of bucket $\mathcal{B}_{k,S}$ is precisely $\mathbf{W}^{S\mid\{k+1, \dots, n\}}[f]$. Thus it can be estimated to accuracy $\pm \tau^2/4$ with confidence $1 - \delta$ in time $\mathrm{poly}(n, 1/\tau)\cdot \log(1/\delta)$ using Proposition 40. Since the main loop is executed at most $4n/\tau^2$ times, the algorithm overall needs to make at most $8n/\tau^2$ weighings; by setting $\delta = \tau^2/(80n)$ we ensure that *all* weighings are accurate with high probability (at least $9/10$). The overall running time is therefore indeed $\mathrm{poly}(n, 1/\tau)$. $\Box$
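The whole loop fits in a short Python skeleton (my own sketch, with the weighing subroutine abstracted as a callback and coordinates 1-indexed to match the text; the names are hypothetical):

```python
def goldreich_levin(n, tau, estimate_weight):
    """Bucket-splitting skeleton of the Goldreich-Levin algorithm.
    Buckets are indexed by (k, S) with S a subset of {1,...,k}, and
    B_{k,S} = { S u T : T subset of {k+1,...,n} }.  The callback
    estimate_weight(S, k) should return an estimate of the bucket's
    Fourier weight W^{S|{k+1,...,n}}[f], accurate to within tau^2/4
    (e.g., via the estimator of Proposition 40)."""
    buckets = [(0, frozenset())]              # start from B_{0, empty}
    found = []
    while buckets:
        k, S = buckets.pop()
        if k == n:                            # singleton bucket {S}
            found.append(set(S))
            continue
        # Split B_{k,S} into B_{k+1,S} and B_{k+1, S u {k+1}},
        # discarding any half whose estimated weight is at most tau^2/2.
        for S_new in (S, S | {k + 1}):
            if estimate_weight(S_new, k + 1) > tau ** 2 / 2:
                buckets.append((k + 1, frozenset(S_new)))
    return found
```

Since each surviving bucket carries weight at least $\tau^2/4$, Parseval caps the number of live buckets at $4/\tau^2$ throughout, which is what keeps the loop count at $4n/\tau^2$.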

Hi Ryan,

Very minor, but the quotation of Corollary 22 here seems to be incomplete.

(Also, I’m not sure if relevant but there seems to be a mistake in the corresponding lecture notes, http://www.cs.cmu.edu/~odonnell/boolean-analysis/lecture7.pdf ; in the bottom of page 4 and top of page 5, |L| should be \leq, and not as stated).

Roei

Hi Roei, thanks for writing. Not quite sure what the issue is here — is it that Corollary 22 contains more info than what is stated in (1)? If so I guess that’s okay with me, but I wanted to check.

(You may be right about the lecture notes too; I never attempted to proofread those.)

No no, only a technical thing. The equation quoted from Corollary 22 appears incomplete, but apparently only in some setups (Chrome). Never mind then…

It seems to me that there is a typo: “\mathcal{B}_{k,S} into the two buckets \mathcal{B}_{k+1,S} and \mathcal{B}_{k+1,S\cup \{k\}}”.

\mathcal{B}_{k+1,S\cup \{k\}} should be replaced by \mathcal{B}_{k+1,S\cup \{k+1\}}.

You’re correct — thank you!

I’m afraid I don’t see why Chernoff’s inequality can be used in the proof of Proposition 40: Chernoff requires that the r.v.’s be independent, but the collection $f(y,z)\chi_S(y)f(y',z)\chi_S(y')$ indexed by $(y,y',z)$ is not independent, because different triples can share a factor $f(y,z)$.

If $X$ and $Y$ are independent, then for any function $\varphi$ the random variables $\varphi(X)$, $\varphi(Y)$ are also independent — whether they can take the same value or not does not change anything. ($\varphi$ could even be a constant function)

Here, “$X$” and “$Y$” are i.i.d. tuples of the form $(y,y^\prime,z)$, and $\varphi\colon(a,b,c)\mapsto f(a,c)\chi_S(a)f(b,c)\chi_S(b)\in\{-1,1\}$.

Yes, in other words, I am just using the fact that if you have the ability to get as many independent samples from some $\pm 1$-valued random variable as you like (even a weirdly complicated one like $f(y,z)\chi_S(y)f(y',z)\chi_S(y')$ or whatever, which involves making some correlated queries to $f$), then you can estimate its mean to an additive error of $\pm \epsilon$ (with probability $1-\delta$) just by taking $O(\log(1/\delta)/\epsilon^2)$ independent samples and averaging them.