# §6.4: Applications in learning and testing

In this section we describe some applications of our study of pseudorandomness.

We begin with a notorious open problem from learning theory, that of learning juntas. Let $\mathcal{C} = \{f : {\mathbb F}_2^n \to {\mathbb F}_2 \mid f \text{ is a } k\text{-junta}\}$; we will always assume that $k \leq O(\log n)$. In the query access model, it is quite easy to learn $\mathcal{C}$ exactly (i.e., with error $0$) in $\mathrm{poly}(n)$ time (Exercise 3.36(a)). However in the model of random examples, it’s not obvious how to learn $\mathcal{C}$ more efficiently than in the $n^{k} \cdot \mathrm{poly}(n)$ time required by the Low-Degree Algorithm (see Theorem 3.36). Unfortunately, this is superpolynomial as soon as $k = \omega(1)$. The state of affairs is the same in the case of depth-$k$ decision trees (a superclass of $\mathcal{C}$), and is similar in the case of $\mathrm{poly}(n)$-size DNFs and CNFs. Thus if we wish to learn, say, $\mathrm{poly}(n)$-size decision trees or DNFs from random examples only, a necessary prerequisite is doing the same for $O(\log n)$-juntas.

Whether or not $\omega(1)$-juntas can be learned from random examples in polynomial time is a longstanding open problem. Here we will show a modest improvement on the $n^{k}$-time algorithm:

Theorem 36 For $k \leq O(\log n)$, the class $\mathcal{C} = \{f : {\mathbb F}_2^n \to {\mathbb F}_2 \mid f \text{ is a } k\text{-junta}\}$ can be exactly learned from random examples in time $n^{(3/4)k} \cdot \mathrm{poly}(n)$.

(The $3/4$ in this theorem can in fact be replaced by $\omega/(\omega + 1)$, where $\omega$ is any number such that $n \times n$ matrices can be multiplied in time $O(n^{\omega})$.)

The first observation we will use to prove Theorem 36 is that to learn $k$-juntas, it suffices to be able to identify a single coordinate which is relevant. The proof of this is fairly simple and is left for the exercises:

Lemma 37 Theorem 36 follows from the existence of a learning algorithm which, given random examples from a nonconstant $k$-junta $f : {\mathbb F}_2^n \to {\mathbb F}_2$, finds at least one relevant coordinate for $f$ (with probability at least $1-\delta$) in time $n^{(3/4)k} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$.

Assume then that we have random example access to a (nonconstant) $k$-junta $f : {\mathbb F}_2^n \to {\mathbb F}_2$. As in the Low-Degree Algorithm we will estimate the Fourier coefficients $\widehat{f}(S)$ for all $1 \leq |S| \leq d$, where $d \leq k$ is a parameter to be chosen later. Using Proposition 3.30 we can ensure that all estimates are accurate to within $(1/3)2^{-k}$, except with probability at most $\delta/2$, in time $n^d \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. (Recall that $2^k \leq \mathrm{poly}(n)$.) Since $f$ is a $k$-junta, all of its Fourier coefficients are either $0$ or at least $2^{-k}$ in magnitude; hence we can exactly identify the sets $S$ for which $\widehat{f}(S) \neq 0$. For any such $S$, all of the coordinates $i \in S$ are relevant for $f$. So unless $\widehat{f}(S) = 0$ for all $1 \leq |S| \leq d$, we can find a relevant coordinate for $f$ in time $n^{d} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ (except with probability at most $\delta/2$).
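To make this step concrete, here is a minimal sketch (with hypothetical function names) of the coefficient-estimation procedure just described: it empirically estimates $\widehat{f}(S)$ for every $1 \leq |S| \leq d$ from random examples, declares $S$ nonzero when the estimate exceeds $\tfrac12 2^{-k}$ in magnitude (which is correct whenever the estimates are accurate to within $(1/3)2^{-k}$), and collects the coordinates of all such sets.

```python
import itertools

def estimate_low_degree_coefficients(examples, n, d, k):
    """Estimate the Fourier coefficients f^(S) of the +/-1 encoding of a
    k-junta f : F_2^n -> F_2 for all 1 <= |S| <= d, given random examples
    (x, f(x)) with x an n-bit tuple and f(x) in {0,1}.  A set S whose
    estimate exceeds (1/2) * 2^-k in magnitude is declared nonzero (a
    k-junta's nonzero coefficients have magnitude >= 2^-k), and all of
    its coordinates are reported as relevant."""
    threshold = 0.5 * 2.0 ** (-k)
    relevant = set()
    for size in range(1, d + 1):
        for S in itertools.combinations(range(n), size):
            # empirical estimate of E[(-1)^(f(x) + sum_{i in S} x_i)]
            est = sum((-1) ** (y + sum(x[i] for i in S))
                      for x, y in examples) / len(examples)
            if abs(est) > threshold:
                relevant.update(S)
    return relevant
```

If the junta has any nonzero coefficient of degree between $1$ and $d$, this already yields a relevant coordinate; the correlation-immune case is the subject of the next paragraph.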

To complete the proof of Theorem 36 it remains to handle the case that $\widehat{f}(S) = 0$ for all $1 \leq |S| \leq d$; i.e., $f$ is $d$th-order correlation immune. In this case, by Siegenthaler’s Theorem we know that $\deg_{{\mathbb F}_2}(f) \leq k-d$. (Note that $d < k$ since $f$ is not constant.) But there is a learning algorithm running in time $O(n)^{3\ell} \cdot \log(1/\delta)$ which exactly learns any ${\mathbb F}_2$-polynomial of degree at most $\ell$ (except with probability at most $\delta/2$). Roughly speaking, the algorithm draws $O(n)^\ell$ random examples and then solves an ${\mathbb F}_2$-linear system to determine the coefficients of the unknown polynomial; see the exercises for details. Thus in time $n^{3(k-d)} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$ this algorithm will exactly determine $f$, and in particular find a relevant coordinate.
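The linear-algebra step can be sketched as follows (a simplified toy illustration with hypothetical names, not the $O(n)^{3\ell}$-time algorithm itself): each example $(x, f(x))$ contributes one ${\mathbb F}_2$-linear equation whose unknowns are the coefficients $c_S$ of the monomials $\prod_{i \in S} x_i$ with $|S| \leq \ell$, and the system is solved by Gaussian elimination over $\mathrm{GF}(2)$.

```python
import itertools

def learn_f2_polynomial(examples, n, ell):
    """Toy sketch: recover the coefficients c_S of an unknown
    F_2-polynomial f(x) = sum_{|S| <= ell} c_S prod_{i in S} x_i from
    noiseless examples (x, f(x)), by Gaussian elimination over GF(2).
    Returns a dict mapping each monomial S (a tuple of indices) to c_S."""
    monomials = [S for size in range(ell + 1)
                 for S in itertools.combinations(range(n), size)]
    m = len(monomials)
    # One F_2-linear equation per example (last entry = right-hand side f(x)).
    rows = [[int(all(x[i] for i in S)) for S in monomials] + [y]
            for x, y in examples]
    # Forward elimination over GF(2).
    pivot_row = {}
    for row in rows:
        for j in range(m):
            if row[j]:
                if j in pivot_row:
                    row[:] = [a ^ b for a, b in zip(row, pivot_row[j])]
                else:
                    pivot_row[j] = row
                    break
    # Back-substitution; free (non-pivot) coefficients default to 0.
    coeffs = {}
    for j in sorted(pivot_row, reverse=True):
        row = pivot_row[j]
        val = row[m]
        for jj in range(j + 1, m):
            if row[jj]:
                val ^= coeffs.get(monomials[jj], 0)
        coeffs[monomials[j]] = val
    return coeffs
```

With enough examples the system has full rank on the degree-$\leq \ell$ monomials and the recovered coefficients are exactly those of $f$.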

By choosing $d = \left\lceil \frac34 k \right\rceil$ we balance the running time of the two algorithms. Regardless of whether $f$ is $d$th-order correlation immune, at least one of the two algorithms will find a relevant coordinate for $f$ (except with probability at most $\delta/2 + \delta/2 = \delta$) in time $n^{(3/4)k} \cdot \mathrm{poly}(n) \cdot \log(1/\delta)$. This completes the proof of Theorem 36.

Our next application of pseudorandomness involves using $\epsilon$-biased distributions to give a deterministic version of the Goldreich–Levin Algorithm (and hence the Kushilevitz–Mansour learning algorithm) for functions $f$ with small $\hat{\lVert} f \hat{\rVert}_1$. We begin with a basic lemma showing that you can get a good estimate for the mean of such functions using an $\epsilon$-biased distribution:

Lemma 38 If $f : \{-1,1\}^n \to {\mathbb R}$ and $\varphi : \{-1,1\}^n \to {\mathbb R}$ is an $\epsilon$-biased density, then $\left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \varphi}[f({\boldsymbol{x}})] – \mathop{\bf E}[f]\right| \leq \hat{\lVert} f \hat{\rVert}_1 \epsilon.$

This lemma follows from Proposition 13(1), but we provide a separate proof:
Proof: By Plancherel, $\mathop{\bf E}_{{\boldsymbol{x}} \sim \varphi}[f({\boldsymbol{x}})] = \langle \varphi, f\rangle = \widehat{f}(\emptyset) + \sum_{S \neq \emptyset} \widehat{\varphi}(S) \widehat{f}(S),$ and the difference of this from $\mathop{\bf E}[f] = \widehat{f}(\emptyset)$ is, in absolute value, at most $\sum_{S \neq \emptyset} |\widehat{\varphi}(S)| \cdot |\widehat{f}(S)| \leq \epsilon \cdot \sum_{S \neq \emptyset} |\widehat{f}(S)| \leq \hat{\lVert} f \hat{\rVert}_1 \epsilon. \qquad \Box$

Since $\hat{\lVert} f^2 \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1^2$ (exercise), we also have the following immediate corollary:

Corollary 39 If $f : \{-1,1\}^n \to {\mathbb R}$ and $\varphi : \{-1,1\}^n \to {\mathbb R}$ is an $\epsilon$-biased density, then $\left|\mathop{\bf E}_{{\boldsymbol{x}} \sim \varphi}[f({\boldsymbol{x}})^2] – \mathop{\bf E}[f^2]\right| \leq \hat{\lVert} f \hat{\rVert}_1^2 \epsilon.$
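One way to see the exercise used above: writing $\widehat{f^2}$ as a convolution of $\widehat{f}$ with itself and applying the triangle inequality,

```latex
\hat{\lVert} f^2 \hat{\rVert}_1
  = \sum_{S} \bigl|\widehat{f \cdot f}(S)\bigr|
  = \sum_{S} \Bigl|\sum_{T} \widehat{f}(T)\,\widehat{f}(S \triangle T)\Bigr|
  \leq \sum_{S} \sum_{T} \bigl|\widehat{f}(T)\bigr|\,\bigl|\widehat{f}(S \triangle T)\bigr|
  = \Bigl(\sum_{T} \bigl|\widehat{f}(T)\bigr|\Bigr)^2
  = \hat{\lVert} f \hat{\rVert}_1^2.
```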

We can use the first lemma to get a deterministic version of Proposition 3.30, the learning algorithm which estimates a specified Fourier coefficient.

Proposition 40 There is a deterministic algorithm which, given query access to a function $f : \{-1,1\}^n \to {\mathbb R}$ as well as $U \subseteq [n]$, $0 < \epsilon \leq 1/2$, and $s \geq 1$, outputs an estimate $\widetilde{f}(U)$ for $\widehat{f}(U)$ satisfying $|\widetilde{f}(U) - \widehat{f}(U)| \leq \epsilon,$ provided $\hat{\lVert} f \hat{\rVert}_1 \leq s$. The running time is $\mathrm{poly}(n, s, 1/\epsilon)$.

Proof: It suffices to handle the case $U = \emptyset$ because for general $U$, the algorithm can simulate query access to $f \cdot \chi_U$ with $\mathrm{poly}(n)$ overhead, and $\widehat{f \cdot \chi_U}(\emptyset) = \widehat{f}(U)$. The algorithm will use Theorem 30 to construct an $(\epsilon/s)$-biased density $\varphi$ which is uniform over a (multi-)set of cardinality $O(n^2 s^2/\epsilon^2)$. By enumerating over this set and using queries to $f$, it can deterministically output the estimate $\widetilde{f}(\emptyset) = \mathop{\bf E}_{{\boldsymbol{x}} \sim \varphi}[f({\boldsymbol{x}})]$ in time $\mathrm{poly}(n, s, 1/\epsilon)$. The error bound now follows from Lemma 38. $\Box$
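As a toy illustration of the enumeration step (with hypothetical helper names; for small $n$ the bias of a multiset is computed here by brute force rather than via the explicit construction of Theorem 30):

```python
import itertools

def bias(multiset, n):
    """Brute-force bias of a multiset of points in F_2^n:
    the max over nonempty S of |E_{x ~ multiset}[chi_S(x)]|.
    Feasible only for small n."""
    def chi(S, x):
        return (-1) ** sum(x[i] for i in S)
    return max(abs(sum(chi(S, x) for x in multiset)) / len(multiset)
               for size in range(1, n + 1)
               for S in itertools.combinations(range(n), size))

def estimate_coefficient(f, U, multiset):
    """Deterministic estimate of f^(U) for f : F_2^n -> {-1,1}: enumerate
    the eps-biased multiset and average f(x) * chi_U(x).  By Lemma 38
    applied to f * chi_U, the error is at most ||f^||_1 * eps."""
    return (sum(f(x) * (-1) ** sum(x[i] for i in U) for x in multiset)
            / len(multiset))
```

For instance, the full cube is a $0$-biased multiset (the estimate is then exact), while deleting a single point from the cube yields bias $1/(2^n - 1)$, and the estimation error stays within the Lemma 38 bound.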

The other key ingredient needed for the Goldreich–Levin Algorithm was Proposition 3.40, which let us estimate \begin{equation} \label{eqn:gl-key2} \mathbf{W}^{S\mid\overline{J}}[f] = \sum_{T \subseteq \overline{J}} \widehat{f}(S \cup T)^2 = \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[\widehat{f_{J \mid \boldsymbol{z}}}(S)^2] \end{equation} for any $S \subseteq J \subseteq [n]$. Observe that for any $z \in \{-1,1\}^{\overline{J}}$ we can use Proposition 40 to deterministically estimate $\widehat{f_{J \mid z}}(S)$ to accuracy $\pm \epsilon$. The reason is that we can simulate query access to the restricted function $f_{J \mid z}$, the $(\epsilon/s)$-biased density $\varphi$ remains $(\epsilon/s)$-biased on $\{-1,1\}^{J}$, and most importantly $\hat{\lVert} f_{J \mid z} \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1 \leq s$ by Exercise 3.6. It is not much more difficult to deterministically estimate \eqref{eqn:gl-key2}:

Proposition 41 There is a deterministic algorithm which, given query access to a function $f : \{-1,1\}^n \to \{-1,1\}$ as well as $S \subseteq J \subseteq [n]$, $0 < \epsilon \leq 1/2$, and $s \geq 1$, outputs an estimate $\beta$ for $\mathbf{W}^{S\mid\overline{J}}[f]$ which satisfies $|\mathbf{W}^{S\mid\overline{J}}[f] – \beta| \leq \epsilon,$ provided $\hat{\lVert} f \hat{\rVert}_1 \leq s$. The running time is $\mathrm{poly}(n, s, 1/\epsilon)$.

Proof: Recall the notation $\mathrm{F}_{S \mid \overline{J}} f$ from Definition 3.20; by \eqref{eqn:gl-key2}, the algorithm’s task is to estimate $\mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})]$. If $\varphi : \{-1,1\}^{\overline{J}} \to {\mathbb R}^{\geq 0}$ is an $\tfrac{\epsilon}{4s^2}$-biased density, Corollary 39 tells us that \begin{equation} \label{eqn:gl-det-est} \Bigl| \mathop{\bf E}_{\boldsymbol{z} \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})] – \mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})] \Bigr| \leq \hat{\lVert} \mathrm{F}_{S \mid \overline{J}} f \hat{\rVert}_1^2 \cdot \tfrac{\epsilon}{4s^2}\leq \hat{\lVert} f \hat{\rVert}_1^2 \cdot \tfrac{\epsilon}{4s^2} \leq \tfrac{\epsilon}{4}, \end{equation} where the second inequality is immediate from Proposition 3.21. We now show the algorithm can approximately compute $\mathop{\bf E}_{\boldsymbol{z} \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})]$. For each $z \in \{-1,1\}^{\overline{J}}$, the algorithm can use $\varphi$ to deterministically estimate $(\mathrm{F}_{S \mid \overline{J}} f)(z) = \widehat{f_{J \mid z}}(S)$ to within $\pm s \cdot \tfrac{\epsilon}{4s^2} \leq \tfrac{\epsilon}{4}$ in $\mathrm{poly}(n,s, 1/\epsilon)$ time, just as was described in the text following \eqref{eqn:gl-key2}. Since $|\widehat{f_{J \mid z}}(S)| \leq 1$, the square of this estimate is within, say, $\tfrac{3\epsilon}{4}$ of $(\mathrm{F}_{S \mid \overline{J}} f)^2(z)$. 
Hence by enumerating over the support of $\varphi$, the algorithm can in deterministic $\mathrm{poly}(n,s, 1/\epsilon)$ time estimate $\mathop{\bf E}_{\boldsymbol{z} \sim \varphi}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})]$ to within $\pm \tfrac{3\epsilon}{4}$, which by \eqref{eqn:gl-det-est} gives an estimate to within $\pm \epsilon$ of the desired quantity $\mathop{\bf E}_{\boldsymbol{z} \sim \{-1,1\}^{\overline{J}}}[(\mathrm{F}_{S \mid \overline{J}} f)^2(\boldsymbol{z})]$. $\Box$

Propositions 40 and 41 are the only two ingredients needed for a derandomization of the Goldreich–Levin Algorithm. We can therefore state a derandomized version of its corollary Theorem 3.38 on learning functions with small Fourier $1$-norm:

Theorem 42 Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \hat{\lVert} f \hat{\rVert}_1 \leq s\}$. Then $\mathcal{C}$ is deterministically learnable from queries with error $\epsilon$ in time $\mathrm{poly}(n, s, 1/\epsilon)$.

Since any $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{sparsity}(\widehat{f}) \leq s$ also has $\hat{\lVert} f \hat{\rVert}_1 \leq s$, we may also deduce from Exercise 3.36(c):

Theorem 43 Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\widehat{f}) \leq 2^{O(k)}\}$. Then $\mathcal{C}$ is deterministically learnable exactly ($0$ error) from queries in time $\mathrm{poly}(n, 2^k)$.

Example functions which fall into the concept classes of these theorems are decision trees of size at most $s$, and decision trees of depth at most $k$, respectively.

We conclude this section by discussing a derandomized version of the Blum–Luby–Rubinfeld linearity test from Chapter 1.5:

Derandomized BLR Test Given query access to $f : {\mathbb F}_2^n \to {\mathbb F}_2$:

1. Choose ${\boldsymbol{x}} \sim {\mathbb F}_2^n$ and $\boldsymbol{y} \sim \varphi$, where $\varphi$ is an $\epsilon$-biased density.
2. Query $f$ at ${\boldsymbol{x}}$, $\boldsymbol{y}$, and ${\boldsymbol{x}} + \boldsymbol{y}$.
3. “Accept” if $f({\boldsymbol{x}}) + f(\boldsymbol{y}) = f({\boldsymbol{x}} + \boldsymbol{y})$.
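The test can be sketched as follows (a minimal illustration with hypothetical names; the $\epsilon$-biased multiset is assumed to be supplied by an explicit construction such as Theorem 30, and is represented here simply as a list of $n$-bit tuples):

```python
import random

def derandomized_blr(f, n, biased_multiset, trials, seed=0):
    """Run `trials` rounds of the derandomized BLR test on f : F_2^n -> F_2:
    x is uniform on F_2^n, y is drawn from the eps-biased density phi
    (uniform over `biased_multiset`); accept iff f(x) + f(y) = f(x + y).
    Returns the empirical acceptance probability."""
    rng = random.Random(seed)
    accepts = 0
    for _ in range(trials):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        y = rng.choice(biased_multiset)
        x_plus_y = tuple((a + b) % 2 for a, b in zip(x, y))
        if (f(x) + f(y)) % 2 == f(x_plus_y):
            accepts += 1
    return accepts / trials
```

An ${\mathbb F}_2$-linear $f$ is accepted in every round, whatever the multiset; a function far from affine is rejected in a constant fraction of rounds.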

Whereas the original BLR Test required exactly $2n$ independent random bits, the above derandomized version needs only $n + O(\log(n/\epsilon))$. This is very close to the minimum possible: a test using only, say, $.99n$ random bits would only be able to inspect a $2^{-.01 n}$ fraction of $f$’s values.

If $f$ is ${\mathbb F}_2$-linear then it is still accepted by the Derandomized BLR Test with probability $1$. As for the approximate converse, we’ll have to make a slight concession: we’ll show that any function accepted with probability close to $1$ must be close to an affine function — i.e., satisfy $\deg_{{\mathbb F}_2}(f) \leq 1$. This concession is necessary: the function $f : {\mathbb F}_2^n \to {\mathbb F}_2$ might be $1$ everywhere except on the (tiny) support of $\varphi$. In that case the acceptance criterion $f({\boldsymbol{x}}) + f(\boldsymbol{y}) = f({\boldsymbol{x}} + \boldsymbol{y})$ will almost always be $1 + 0 = 1$; yet $f$ is very far from every linear function. It is, however, very close to the affine function $1$.

Theorem 44 Suppose the Derandomized BLR Test accepts $f : {\mathbb F}_2^n \to {\mathbb F}_2$ with probability $\tfrac{1}{2} + \tfrac{1}{2} \theta$. Then $f$ has correlation at least $\sqrt{\theta^2 – \epsilon}$ with some affine $g : {\mathbb F}_2^n \to {\mathbb F}_2$; i.e., $\mathrm{dist}(f,g) \leq \tfrac{1}{2} – \tfrac{1}{2} \sqrt{\theta^2 – \epsilon}$.

Remark 45 The bound in this theorem works well both when $\theta$ is close to $0$ and when $\theta$ is close to $1$. E.g., for $\theta = 1-2\delta$ we get that if $f$ is accepted with probability $1-\delta$ then $f$ is nearly $\delta$-close to an affine function, provided $\epsilon \ll \delta$.

Proof: As in the analysis of the BLR Test (Theorem 1.31) we encode $f$’s outputs by $\pm 1 \in {\mathbb R}$. Using the first few lines of that analysis we see that our hypothesis is equivalent to $\theta \leq \mathop{\bf E}_{{\substack{{\boldsymbol{x}} \sim {\mathbb F}_2^n \\ \boldsymbol{y} \sim \varphi}}}[f({\boldsymbol{x}})f(\boldsymbol{y})f({\boldsymbol{x}}+\boldsymbol{y})] = \mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[f(\boldsymbol{y}) \cdot (f * f)(\boldsymbol{y})].$ By Cauchy–Schwarz, $\mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[f(\boldsymbol{y}) \cdot (f * f)(\boldsymbol{y})] \leq \sqrt{\mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[f(\boldsymbol{y})^2]}\sqrt{\mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[(f * f)^2(\boldsymbol{y})]} = \sqrt{\mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[(f * f)^2(\boldsymbol{y})]}$ (the equality uses $\mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[f(\boldsymbol{y})^2] = 1$, which holds because $f$ is $\pm 1$-valued), and hence $\theta^2 \leq \mathop{\bf E}_{\boldsymbol{y} \sim \varphi}[(f * f)^2(\boldsymbol{y})] \leq \mathop{\bf E}[(f * f)^2] + \hat{\lVert} f * f \hat{\rVert}_1 \epsilon = \sum_{\gamma \in {\widehat{{\mathbb F}_2^n}}} \widehat{f}(\gamma)^4 + \epsilon,$ where the inequality is Corollary 39 and we used $\widehat{f * f}(\gamma) = \widehat{f}(\gamma)^2$. The conclusion of the proof is as in the original analysis (cf. Proposition 7, Exercise 1.28): $\theta^2 - \epsilon \leq \sum_{\gamma \in {\widehat{{\mathbb F}_2^n}}} \widehat{f}(\gamma)^4 \leq \max_{\gamma \in {\widehat{{\mathbb F}_2^n}}} \{\widehat{f}(\gamma)^2\} \cdot \sum_{\gamma \in {\widehat{{\mathbb F}_2^n}}} \widehat{f}(\gamma)^2 = \max_{\gamma \in {\widehat{{\mathbb F}_2^n}}} \{\widehat{f}(\gamma)^2\},$ and hence there exists $\gamma^*$ such that $|\widehat{f}(\gamma^*)| \geq \sqrt{\theta^2-\epsilon}$. $\Box$

### 8 comments to §6.4: Applications in learning and testing

• xiwu

Hi Ryan,
I think in the proof of Lemma 38, it shall be \widehat{\varphi}(S),
instead of \varphi(S), the Fourier coefficient of the density function at S
records the bias at S. Do I miss anything?

Best.

• Ryan O'Donnell

Nope, you’re right. Thanks!

• Noam Lifshitz

Minor typo:
There is a missing “]” after “By cauchy schewarz,” in the last proof of this section.

By the way, congratulations for the book!

• Ryan O'Donnell

Thanks! Sharp eyes. Missed by the book’s copyeditor, in fact.

• Noam Lifshitz

Why is $\mathbf E_{\boldsymbol{y} \sim \varphi}[f(\boldsymbol{y})^2]$ equal to $1$?

• Noam Lifshitz

Sorry it is trivial.

• Ohad Klein

“learning algorithm running in time in time”

• Ryan O'Donnell

Thanks!