# Chapter 3 exercises

1. Let $M : {\mathbb F}_2^n \to {\mathbb F}_2^n$ be an invertible linear transformation. Given $f : {\mathbb F}_2^n \to {\mathbb R}$, let $f \circ M : {\mathbb F}_2^n \to {\mathbb R}$ be defined by $f \circ M(x) = f(Mx)$. Show that $\widehat{f \circ M}(\gamma) = \widehat{f}(M^{-\top}\gamma)$. What if $M$ is an invertible affine transformation? What if $M$ is not invertible?
2. Show that $\frac{2}{1-e^{-2}}$ is smallest constant (not depending on $\delta$ or $n$) that can be taken in Proposition 3.
3. Generalize Proposition 3 by showing that any $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon$-concentrated on degree up to $1/\delta$ for $\epsilon = (\mathop{\bf E}[f^2] – \mathbf{Stab}_{1-\delta}[f])/(1-1/e)$.
4. Prove Lemma 5 by induction on $n$. (Hint: if one of the subfunctions $f(x_1, \dots, x_n, \pm 1)$ is identically $0$, show that the other has degree at most $k-1$.)
5. Verify for all $p \in [1, \infty]$ that $\hat{\lVert} \cdot \hat{\rVert}_p$ is a norm on the vector space of functions $f : {\mathbb F}_2^n \to {\mathbb R}$.
6. Let $f : {\mathbb F}_2^n \to {\mathbb R}$ and let $H \leq {\mathbb F}_2^n$, $z \in H^{\perp}$.

1. Show that restriction reduces spectral $1$-norm: $\hat{\lVert} f_{H \mid z} \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1$.
2. Show that it also reduces Fourier sparsity: $\mathrm{sparsity}(\widehat{f_{H \mid z}}) \leq \mathrm{sparsity}(\widehat{f})$.
7. Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $0 < p \leq q \leq \infty$. Show that $\hat{\lVert} f \hat{\rVert}_p \geq \hat{\lVert} f \hat{\rVert}_q$. (Cf. Exercise 1.14.)
8. Let $f : \{-1,1\}^n \to {\mathbb R}$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \leq \|f\|_1$ and $\|f\|_\infty \leq \hat{\lVert} f \hat{\rVert}_1$. (These are easy special cases of the Hausdorff–Young inequality.)
9. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is monotone. Show that $|\widehat{f}(S)| \leq \widehat{f}(i)$ whenever $i \in S \subseteq [n]$. Deduce that $\hat{\lVert} f \hat{\rVert}_\infty = \max_{S} \{|\widehat{f}(S)|\}$ is achieved by an $S$ of cardinality $0$ or $1$. (Hint: apply the previous exercise to $f$’s derivatives.)
10. Prove Proposition 12.
11. Verify Parseval’s Theorem for the Fourier expansion of subspaces given in Proposition 11.
12. Let $f : {\mathbb F}_2^n \to \{0,1\}$ be the indicator of $A \subseteq {\mathbb F}_2^n$. We know that $\hat{\lVert} f \hat{\rVert}_1 = 1$ if $A$ is an affine subspace. So assume that $A$ is not an affine subspace.

1. Show that there exists an affine subspace $B$ of dimension $2$ on which $f$ takes the value $1$ exactly $3$ times.
2. Let $b$ be the point in $B$ where $f$ is $0$ and let $\psi = \varphi_B – (1/2) \varphi_{b}$. Show that $\hat{\lVert} \psi \hat{\rVert}_\infty = 1/2$.
3. Show that $\langle \psi, f \rangle = 3/4$ and deduce $\hat{\lVert} f \hat{\rVert}_1 \geq 3/2$.
13. Suppose $f : \{-1,1\}^n \to {\mathbb R}$ satisfies $\mathop{\bf E}[f^2] \leq 1$. Show that $\hat{\lVert} f \hat{\rVert}_1 \leq 2^{n/2}$, and show that for any even $n$ the upper bound can be achieved by a function $f : \{-1,1\}^n \to \{-1,1\}$.
14. Given $f : {\mathbb F}_2^n \to {\mathbb R}$, define its (fractional) sparsity to be $\mathrm{sparsity}(f) = |\mathrm{supp}(f)|/2^n = \mathop{\bf Pr}_{{\boldsymbol{x}} \in {\mathbb F}_2^n}[f({\boldsymbol{x}}) \neq 0]$. In this exercise you will prove the uncertainty principle: if $f$ is nonzero then $\mathrm{sparsity}(f) \cdot \mathrm{sparsity}(\widehat{f}) \geq 1$.

1. Show we may assume $\|f\|_1 = 1$.
2. Suppose $\mathcal{F} = \{ \gamma : \widehat{f}(\gamma) \neq 0\}$. Show that $\hat{\lVert} f \hat{\rVert}^2_2 \leq |\mathcal{F}|$.
3. Suppose $\mathcal{G} = \{ x : f(x) \neq 0\}$. Show that $\|f\|_2^2 \geq 2^n/|\mathcal{G}|$, and deduce the uncertainty principle.
4. Identify all cases of equality.
15. Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $\epsilon > 0$. Show that $f$ is $\epsilon$-concentrated on a collection $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$.
16. Suppose the Fourier spectrum of $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon_1$-concentrated on $\mathcal{F}$ and that $g : \{-1,1\}^n \to {\mathbb R}$ satisfies $\|f – g\|_2^2 \leq \epsilon_2$. Show that the Fourier spectrum of $g$ is $2(\epsilon_1+\epsilon_2)$-concentrated on $\mathcal{F}$.
17. Show that every function $f : {\mathbb F}_2^n \to {\mathbb R}$ is computed by a decision tree with depth at most $n$ and size at most $2^n$.
18. Let $f : {\mathbb F}_2^n \to {\mathbb R}$ be computable by a decision tree of size $s$ and depth $k$ Show that $-f$ and the boolean dual $f^\dagger$ are also computable by decision trees of size $s$ and depth $k$.
19. For each function in Exercise 1.1 with $4$ or fewer inputs, give a decision tree computing it. Try primarily to use the least possible depth, and secondarily to use the least possible size.
20. Prove Proposition 16.
21. Let $T$ be a decision tree over $n$ coordinates. We define choosing a random string ${\boldsymbol{x}} \in {\mathbb F}_2^n$ according to $T$ as follows. First we choose a random path $\boldsymbol{P}$ in $T$, meaning that we start at the root and at each internal node we (independently) proceed along one of the two outgoing edges with equal probability. Write $\boldsymbol{J}$ for the coordinates $\boldsymbol{P}$ encounters in its internal nodes and $\overline{\boldsymbol{J}} = [n] \setminus \boldsymbol{J}$. The path $\boldsymbol{P}$ also naturally defines a partial string $\boldsymbol{y} \in {\mathbb F}_2^{\boldsymbol{J}}$, where $\boldsymbol{y}_i = b$ if $\boldsymbol{P}$ proceeded along the $b$-edge from the internal node labelled $i$. Finally, we choose $\boldsymbol{z} \sim {\mathbb F}_2^{\overline{\boldsymbol{J}}}$ uniformly and form the composite string ${\boldsymbol{x}} = (\boldsymbol{y}, \boldsymbol{z}) \in {\mathbb F}_2^n$. Show that ${\boldsymbol{x}}$ has the uniform probability distribution on ${\mathbb F}_2^n$, no matter what $T$ is.
22. Let $f : {\mathbb F}_2^n \to \{-1,1\}$ be computed by a decision tree $T$ of size $s$ and let $\epsilon \in (0,1]$. Suppose each path in $T$ is truncated (if necessary) so that its length does not exceed $\log(s/\epsilon)$; new leaves with labels $-1$ and $1$ may be created in an arbitrary way as necessary. Show that the resulting decisions tree $T’$ computes a function which is $\epsilon$-close to $f$. Deduce Proposition 17.
23. A decision list is a decision tree in which every internal node has an outgoing edge to at least one leaf. Show that any function computable by a decision list is a linear threshold function.
24. A read-once decision tree is one in which every internal node queries a distinct variable. Bearing this in mind, show that the bound $k2^{k-1}$ in Theorem 4 cannot be reduced below $2^k – 1$.
25. Suppose that $f$ is computed by a read-once decision tree in which every root-to-leaf path has length $k$ and every internal node at the deepest level has one child (leaf) labelled $-1$ one one child labelled $1$. Compute the influence of each coordinate on $f$, and compute $\mathbf{I}[f]$.
26. The following are generalizations of decision trees:

Subcube partition: This is defined by a collection $C_1, \dots, C_s$ of subcubes which form a partition of ${\mathbb F}_2^n$, along with values $b_1, \dots, b_s \in {\mathbb R}$. It computes the function $f : {\mathbb F}_2^n \to {\mathbb R}$ which has value $b_i$ on all inputs in $C_i$. The subcube partition’s size is $s$ and its “codimension” $k$ (analogous to depth) is the maximum codimension of the cubes $C_i$.

Parity decision tree: This is similar to a decision tree except that the internal nodes are labelled by vectors $\gamma \in {\mathbb F}_2^n$. At such a node the computation path on input $x$ follows the edge labelled $\gamma \cdot x$. We insist that for each root-to-leaf path, the vectors appearing in its internal nodes are linearly independent. Size $s$ and depth $k$ are defined as with normal decision trees.

Affine subspace partition: This is similar to a subcube partition except the subcubes may be $C_i$ may be arbitrary affine subspaces.

1. Show that subcube partition size/codimension and parity decision tree size/depth generalize normal decision tree size/depth, and are generalized by affine subspace partition size/codimension.
2. Show that Proposition 16 holds also for the generalizations, except that the statement about degree need not hold for parity decision trees and affine subspace partitions.
27. Define $\mathrm{Equ}_3 : \{-1,1\}^3 \to \{-1,1\}$ by $\mathrm{Equ}_3(x) = -1$ if and only if $x_1 = x_2 = x_3$.
1. Show that $\deg(\mathrm{Equ}_3) = 2$.
2. Show that ${\mathrm{DT}}(\mathrm{Equ}_3) = 3$.
3. Show that $\mathrm{Equ}_3$ is computable by a parity decision tree of codimension $2$.
4. For $d \in {\mathbb N}$, define $f \{-1,1\}^{3^d} \to \{-1,1\}$ by $f = \mathrm{Equ}_3^{\otimes d}$ (using the notation from Definition 2.6). Show that $\deg(f) = 2^d$ but ${\mathrm{DT}}(f) = 3^d$.
28. Let $f : \{-1,1\}^n \to {\mathbb R}$ and $J \subseteq [n]$. Define $f^{\subseteq J} : \{-1,1\}^n \to {\mathbb R}$ by $f(x) = \mathop{\bf E}_{\boldsymbol{y} \sim \{-1,1\}^{\overline{J}}}[f(x_J, \boldsymbol{y})]$, where $x_J \in \{-1,1\}^J$ is the projection of $x$ to coordinates $J$. Compute the Fourier expansion of $f^{\subseteq J}$.
29. Suppose $f : \{-1,1\}^n \to {\mathbb R}$ is computable by a decision tree which has a leaf at depth $k$ labelled $b$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \geq b/2^k$. (Hint: you may find the previous exercise helpful.)
30. Prove Fact 25 by using Theorem 1.28 and Exercise 1.1(d).
1. Suppose $f : {\mathbb F}_2^n \to {\mathbb R}$ has $\mathrm{sparsity}(\widehat{f}) < 2^n$. Show that for any $\gamma \in \mathrm{supp}(\widehat{f})$ there exists nonzero $\beta \in {\widehat{{\mathbb F}_2^n}}$ such that $f_{\beta^\perp}$ has $\widehat{f}(\gamma)$ as a Fourier coefficient.
2. Prove by induction on $n$ that if $f : {\mathbb F}_2^n \to \{-1,1\}$ has $\mathrm{sparsity}(\widehat{f}) = s > 1$ then $\widehat{f}$ is $2^{1-\lfloor \log s \rfloor}$-granular. (Hint: Distinguish the cases $s = 2^n$ and $s < 2^n$. In the latter case use part (a).)
3. Prove that there are no functions $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{sparsity}(\widehat{f}) \in \{2, 3, 5, 6, 7, 9\}$.
31. Show that one can learn any target $f : \{-1,1\}^n \to \{-1,1\}$ with error $0$ from random examples only in time $\widetilde{O}(2^n)$.
32. Improve Proposition 31 as follows. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ and $g : \{-1,1\}^n \to {\mathbb R}$ satisfy $\|f – g\|_1 \leq \epsilon$. Pick $\boldsymbol{\theta} \in [-1,1]$ uniformly at random and define $\boldsymbol{h} : \{-1,1\}^n \to \{-1,1\}$ by $\boldsymbol{h}(x) = \mathrm{sgn}(g(x) – \boldsymbol{\theta})$. Show that $\mathop{\bf E}[\mathrm{dist}(f,\boldsymbol{h})] \leq \epsilon/2$.
1. For $n$ even, find a function $f : \{-1,1\}^n \to \{-1,1\}$ which is not $1/2$-concentrated on any $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| < 2^{n-1}$. (Hint: Exercise 1.1.)
2. Let $\boldsymbol{f} : \{-1,1\}^n \to \{-1,1\}$ be a random function as in Exercise 1.7. Show that with probability at least $1/2$, $\bf$ is not $1/4$-concentrated on degree up to $\lfloor n/2 \rfloor$.
33. Prove Theorem 36. (Hint: in light of Exercise 1.12 you may round off certain estimates with confidence.)
34. Show that each of the following classes $\mathcal{C}$ (ordered by inclusion) can be learned exactly (i.e., with error $0$) using membership queries in time $\mathrm{poly}(n, 2^k)$:
1. $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f$ is a $k$-junta$\}$. (Hint: estimate influences.)
2. $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid {\mathrm{DT}}(f) \leq k\}$.
3. $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\widehat{f}) \leq 2^{O(k)}\}$. (Hint: Exercise 31.)
35. Prove Theorem 38. (Hint: Exercise 15.)
36. Deduce Theorem 37 from the Goldreich–Levin algorithm.
37. Suppose $A$ learns $\mathcal{C}$ from random examples with error $\epsilon/2$ in time $T$ — with probability at least $9/10$.

1. After producing hypothesis $h$ on target $f : \{-1,1\}^n \to \{-1,1\}$, show that $A$ can “check” whether $h$ is a good hypothesis in time $\mathrm{poly}(n, T, 1/\epsilon) \cdot \log(1/\delta)$. Specifically, except with probability at most $\delta$, $A$ should output YES’ if $\mathrm{dist}(f, h) \leq \epsilon/2$ and NO’ if $\mathrm{dist}(f, h) > \epsilon$. (Hint: time $\mathrm{poly}(T)$ may be required for $A$ to evaluate $h(x)$.)
2. Show that for any $\delta \in (0, 1/2]$, there is a learning algorithm which learns $\mathcal{C}$ with error $\epsilon$ in time $\mathrm{poly}(n, T, \epsilon)\cdot \log(1/\delta)$ — with probability at least $1-\delta$.
1. Our description of the Low-Degree Algorithm with degree $k$ and error $\epsilon$ involved using a new batch of random examples to estimate each low-degree Fourier coefficient. Show that one can instead simply draw a single batch $\mathcal{E}$ of $\mathrm{poly}(n^k, 1/\epsilon)$ examples and use $\mathcal{E}$ to estimate each of the low-degree coefficients.
2. Show that when using the above form of the Low-Degree Algorithm, the final hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ is of the form $h(y) = \mathrm{sgn}\left(\sum_{(x, f(x)) \in \mathcal{E}} w(\mathrm{Dist}(y,x)) \cdot f(x) \right),$ for some function $w : \{0, 1, \dots, n\} \to {\mathbb R}$. In other words, the hypothesis on a given $y$ is equal to a weighted vote over all examples seen, where an example’s weight depends only on its Hamming distance to $y$. Simplify your expression for $w$ as much as you can.
38. Extend the Goldreich–Levin algorithm so that it works also for functions $f : \{-1,1\}^n \to [-1,1]$. (The learning model for targets $f : \{-1,1\}^n \to [-1,1]$ assumes that $f(x)$ is always a rational number expressible by $\mathrm{poly}(n)$ bits.)
1. Assume $\gamma, \gamma’ \in {\widehat{{\mathbb F}_2^n}}$ are distinct. Show that $\mathop{\bf Pr}_{{\boldsymbol{x}}}[\gamma \cdot {\boldsymbol{x}} = \gamma' \cdot {\boldsymbol{x}}] = 1/2$.
2. Fix $\gamma \in {\widehat{{\mathbb F}_2^n}}$ and suppose ${\boldsymbol{x}}^{(1)}, \dots, {\boldsymbol{x}}^{(m)} \sim {\mathbb F}_2^n$ are drawn uniformly and independently. Show that if $m = Cn$ for $C$ a sufficiently large constant then with high probability, the only $\gamma’ \in {\widehat{{\mathbb F}_2^n}}$ satisfying $\gamma’ \cdot {\boldsymbol{x}}^{(i)} = \gamma \cdot {\boldsymbol{x}}^{(i)}$ for all $i \in [m]$ is $\gamma’ = \gamma$.
3. Essentially improve on Exercise 1.26 by showing that the concept class of all linear functions ${\mathbb F}_2^n \to {\mathbb F}_2$ can be learned from random examples only, with error $0$, in time $\mathrm{poly}(n)$. (Remark: if $\omega \in {\mathbb R}$ is such that $n \times n$ matrix multiplication can be done in $O(n^\omega)$ time, then the learning algorithm also requires only $O(n^\omega)$ time.)
39. Let $\tau \geq 1/2 + \epsilon$ for some constant $\epsilon > 0$. Give an algorithm simpler than Goldreich and Levin’s which solves the following problem with high probability: Given query access to $f : \{-1,1\}^n \to \{-1,1\}$, in time $\mathrm{poly}(n, 1/\epsilon)$ find the unique $U \subseteq [n]$ such that $|\widehat{f}(U)| \geq \tau$, assuming it exists. (Hint: use Proposition 1.32 and Exercise 1.26.)
40. Informally: a “one-way permutation” is a bijective function $f : {\mathbb F}_2^n \to {\mathbb F}_2^n$ which is easy to compute but hard to invert; a “pseudorandom generator” is a function $g : {\mathbb F}_2^k \to {\mathbb F}_2^m$ for $m > k$ whose output on a random input “looks unpredictable” to any efficient algorithm. Goldreich and Levin proposed the following construction of the latter from the former: for $k = 2n$, $m = 2n+1$, define $g(r, s) = (r, f(s), r \cdot s),$ where $r, s \in {\mathbb F}_2^n$. When $g$’s input $(\boldsymbol{r}, \boldsymbol{s})$ is uniformly random then so is the first $2n$ bits of its output (using the fact that $f$ is a bijection). The key to the analysis is showing that the final bit, $\boldsymbol{r} \cdot \boldsymbol{s}$, is highly unpredictable to efficient algorithms even given the first $2n$ bits $(\boldsymbol{r}, f(\boldsymbol{s}))$. This is proved by contradiction.

1. Suppose that an adversary has a deterministic, efficient algorithm $A$ good at predicting the bit $\boldsymbol{r} \cdot \boldsymbol{s}$: $\mathop{\bf Pr}_{\boldsymbol{r}, \boldsymbol{s} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(\boldsymbol{s})) = \boldsymbol{r} \cdot \boldsymbol{s}] \geq \frac12 + \gamma.$ Show there exists $B \subseteq {\mathbb F}_2^n$ with $|B|/2^n \geq \frac12 \gamma$ such that $\mathop{\bf Pr}_{\boldsymbol{r} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(s)) = \boldsymbol{r} \cdot s] \geq \frac12 + \frac12 \gamma$ for all $s \in B$.
2. Switching to $\pm 1$ notation in the output, deduce $\widehat{A_{ \mid f(s)}}(s) \geq \gamma$ for all $s \in B$.
3. Show that the adversary can efficiently compute $s$ given $f(s)$ (with high probability) for any $s \in B$. If $\gamma$ is nonnegligible this contradicts the assumption that $f$ is “one-way”. (Hint: use the Goldreich–Levin algorithm.)
4. Deduce the same conclusion even if $A$ is a randomized algorithm.

### 23 comments to Chapter 3 exercises

• LYT

Ex 15: I am probably missing something here, but it seems to me that the assumption $\|f\|_2 \leq 1$ is actually not necessary, and we have a slightly stronger bound of $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$ (by using the L1 bound instead of Parseval’s)?

• Absolutely right, thanks Li-Yang! I fixed it.

• Noam lifshitz

In ex 1: is $M^{-T}$ a notation for $M^{-1}$?

• Transpose of the inverse. Does that make sense?

• Noam lifshitz

Yes. thank you!

• Noam Lifshitz

In ex. 4 you probably mean:
if $f(x_1,x_2,\ldots ,x_n,y)=0$ then $f(x_1,x_2,\ldots ,x_n,y)=0$ is of $deg\le k-1$

• Noam Lifshitz

*In ex. 4 you probably mean:
if f(x1,x2,…,xn,y)=0 then f(x1,x2,…,xn,-y)=0 is of deg≤k−1
for each constant y.

• Right, thanks! Fixed.

• Noam Lifshitz

In ex 22: *does not exceed

• Thanks!

• Noam Lifshitz

In ex 35: I think you mean ex 1.12 rather than ex 1.11 .

• Thanks!

• Noam Lifshitz

The link in ex 37 is broken.

• Fixed, I think.

• Noam Lifshitz

I don’t understand something in ex 6. For $f=\mathbb{1}_{(1,1,\ldots,1)}$ , $H=0$ , $z=(1,1,\ldots,1)$ I got that the restricted spectral norm is $1$ while the spectral norm is smaller.

• Shouldn’t the original spectral norm also be 1? The Fourier expansion of $f$ is just $\sum_{\gamma} 2^{-n} \chi_\gamma$.

• Noam Lifshitz

If I didn’t misunderstandand the definitions then:
$\hat{\|}f\hat{\|}_{p}^{p}=\sum2^{-np}<\sum2^{-n}=1$

• Ryan O'Donnell

You’re right — 6(a) is almost totally wrong! It should just ask you to prove it for $p = 1$. Thanks very much Noam!!

• Keep the corrections coming, by the way!

In 26, in the “Affine subspace partition” definition, “may be $C_i$ may be”.