Let $M : {\mathbb F}_2^n \to {\mathbb F}_2^n$ be an invertible linear transformation. Given $f : {\mathbb F}_2^n \to {\mathbb R}$, let $f \circ M : {\mathbb F}_2^n \to {\mathbb R}$ be defined by $f \circ M(x) = f(Mx)$. Show that $\widehat{f \circ M}(\gamma) = \widehat{f}(M^{-\top}\gamma)$, where $M^{-\top} = (M^{-1})^{\top}$ denotes the transpose of the inverse. What if $M$ is an invertible affine transformation? What if $M$ is not invertible?
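(Not part of the exercise, but the identity can be sanity-checked by brute force; here is a Python sketch for $n = 3$ with one fixed invertible $M$, using the convention $\widehat{f}(\gamma) = 2^{-n}\sum_x f(x)(-1)^{\gamma\cdot x}$:)

```python
import itertools, random

n = 3
cube = list(itertools.product((0, 1), repeat=n))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % 2

def matvec(M, x):
    return tuple(dot(row, x) for row in M)

def matmul(A, B):
    return tuple(tuple(sum(A[i][k] * B[k][j] for k in range(n)) % 2
                       for j in range(n)) for i in range(n))

def transpose(M):
    return tuple(tuple(M[i][j] for i in range(n)) for j in range(n))

def inverse(M):
    # brute-force inverse over F_2 (fine for n = 3)
    I = tuple(tuple(int(i == j) for j in range(n)) for i in range(n))
    for rows in itertools.product(cube, repeat=n):
        if matmul(M, rows) == I:
            return rows
    raise ValueError("M is not invertible")

def fourier(f):
    # hat f(gamma) = 2^{-n} sum_x f(x) (-1)^{gamma . x}
    return {g: sum(f[x] * (-1) ** dot(g, x) for x in cube) / 2 ** n
            for g in cube}

M = ((1, 1, 0), (0, 1, 1), (0, 0, 1))      # invertible over F_2
MinvT = transpose(inverse(M))

random.seed(0)
f = {x: random.choice((-1.0, 1.0)) for x in cube}
fM = {x: f[matvec(M, x)] for x in cube}    # (f o M)(x) = f(Mx)

lhs, rhs = fourier(fM), fourier(f)
assert all(abs(lhs[g] - rhs[matvec(MinvT, g)]) < 1e-9 for g in cube)
```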
Show that $\frac{2}{1-e^{-2}}$ is the smallest constant (not depending on $\delta$ or $n$) that can be taken in Proposition 3.
Generalize Proposition 3 by showing that any $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon$-concentrated on degree up to $1/\delta$ for $\epsilon = (\mathop{\bf E}[f^2] - \mathbf{Stab}_{1-\delta}[f])/(1-1/e)$.
Prove Lemma 5 by induction on $n$. (Hint: if one of the subfunctions $f(x_1, \dots, x_{n-1}, \pm 1)$ is identically $0$, show that the other has degree at most $k-1$.)
 Verify for all $p \in [1, \infty]$ that $\hat{\lVert} \cdot \hat{\rVert}_p$ is a norm on the vector space of functions $f : {\mathbb F}_2^n \to {\mathbb R}$.
 Let $f : {\mathbb F}_2^n \to {\mathbb R}$ and let $H \leq {\mathbb F}_2^n$, $z \in H^{\perp}$.
 Show that restriction reduces spectral $1$norm: $\hat{\lVert} f_{H \mid z} \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1$.
 Show that it also reduces Fourier sparsity: $\mathrm{sparsity}(\widehat{f_{H \mid z}}) \leq \mathrm{sparsity}(\widehat{f})$.
Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $0 < p \leq q \leq \infty$. Show that $\hat{\lVert} f \hat{\rVert}_p \geq \hat{\lVert} f \hat{\rVert}_q$. (Cf. Exercise 1.14.)
Let $f : \{-1,1\}^n \to {\mathbb R}$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \leq \|f\|_1$ and $\|f\|_\infty \leq \hat{\lVert} f \hat{\rVert}_1$. (These are easy special cases of the Hausdorff–Young inequality.)
Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is monotone. Show that $\widehat{f}(S) \leq \widehat{f}(i)$ whenever $i \in S \subseteq [n]$. Deduce that $\hat{\lVert} f \hat{\rVert}_\infty = \max_{S} \{\widehat{f}(S)\}$ is achieved by an $S$ of cardinality $0$ or $1$. (Hint: apply the previous exercise to $f$’s derivatives.)
 Prove Proposition 12.
 Verify Parseval’s Theorem for the Fourier expansion of subspaces given in Proposition 11.
 Let $f : {\mathbb F}_2^n \to \{0,1\}$ be the indicator of $A \subseteq {\mathbb F}_2^n$. We know that $\hat{\lVert} f \hat{\rVert}_1 = 1$ if $A$ is an affine subspace. So assume that $A$ is not an affine subspace.
 Show that there exists an affine subspace $B$ of dimension $2$ on which $f$ takes the value $1$ exactly $3$ times.
Let $b$ be the point in $B$ where $f$ is $0$ and let $\psi = \varphi_B - (1/2) \varphi_{b}$. Show that $\hat{\lVert} \psi \hat{\rVert}_\infty = 1/2$.
 Show that $\langle \psi, f \rangle = 3/4$ and deduce $\hat{\lVert} f \hat{\rVert}_1 \geq 3/2$.
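(For concreteness, the extremal case already appears at $n = 2$: the indicator of three of the four points of ${\mathbb F}_2^2$ has spectral $1$-norm exactly $3/2$, while an affine subspace's indicator has norm $1$. A quick Python check, not part of the exercise:)

```python
import itertools

n = 2
cube = list(itertools.product((0, 1), repeat=n))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % 2

def spectral_1norm(f):
    # sum over gamma of |hat f(gamma)|
    return sum(abs(sum(f[x] * (-1) ** dot(g, x) for x in cube)) / 2 ** n
               for g in cube)

# indicator of three of the four points (not an affine subspace)
f = {x: 1.0 if x != (1, 1) else 0.0 for x in cube}
assert abs(spectral_1norm(f) - 1.5) < 1e-9

# for comparison: indicator of the affine subspace {x : x_1 = 0}
g = {x: 1.0 if x[0] == 0 else 0.0 for x in cube}
assert abs(spectral_1norm(g) - 1.0) < 1e-9
```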
Suppose $f : \{-1,1\}^n \to {\mathbb R}$ satisfies $\mathop{\bf E}[f^2] \leq 1$. Show that $\hat{\lVert} f \hat{\rVert}_1 \leq 2^{n/2}$, and show that for any even $n$ the upper bound can be achieved by a function $f : \{-1,1\}^n \to \{-1,1\}$.
Given $f : {\mathbb F}_2^n \to {\mathbb R}$, define its (fractional) sparsity to be $\mathrm{sparsity}(f) = |\mathrm{supp}(f)|/2^n = \mathop{\bf Pr}_{{\boldsymbol{x}} \in {\mathbb F}_2^n}[f({\boldsymbol{x}}) \neq 0]$. In this exercise you will prove the uncertainty principle: if $f$ is nonzero then $\mathrm{sparsity}(f) \cdot \mathrm{sparsity}(\widehat{f}) \geq 1$.
Show we may assume $\|f\|_1 = 1$.
Suppose $\mathcal{F} = \{ \gamma : \widehat{f}(\gamma) \neq 0\}$. Show that $\hat{\lVert} f \hat{\rVert}^2_2 \leq |\mathcal{F}|$.
Suppose $\mathcal{G} = \{ x : f(x) \neq 0\}$. Show that $\|f\|_2^2 \geq 2^n/|\mathcal{G}|$, and deduce the uncertainty principle.
 Identify all cases of equality.
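(A numerical sanity check of the uncertainty principle, with the convention above that $\mathrm{sparsity}(f)$ is fractional while $\mathrm{sparsity}(\widehat{f})$ counts nonzero coefficients; a Python sketch for $n = 4$:)

```python
import itertools, random

n = 4
cube = list(itertools.product((0, 1), repeat=n))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % 2

def fourier(f):
    return {g: sum(f[x] * (-1) ** dot(g, x) for x in cube) / 2 ** n
            for g in cube}

def support(table):
    return sum(1 for v in table.values() if abs(v) > 1e-12)

random.seed(0)
for _ in range(100):
    # a random function vanishing on a random subset of inputs
    f = {x: random.choice((0.0, random.uniform(-1, 1))) for x in cube}
    if support(f) == 0:
        continue
    # fractional sparsity of f, times number of nonzero Fourier coefficients
    assert (support(f) / 2 ** n) * support(fourier(f)) >= 1 - 1e-9
```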
Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $\epsilon > 0$. Show that $f$ is $\epsilon$-concentrated on a collection $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$.
Suppose the Fourier spectrum of $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon_1$-concentrated on $\mathcal{F}$ and that $g : \{-1,1\}^n \to {\mathbb R}$ satisfies $\|f - g\|_2^2 \leq \epsilon_2$. Show that the Fourier spectrum of $g$ is $2(\epsilon_1+\epsilon_2)$-concentrated on $\mathcal{F}$.
 Show that every function $f : {\mathbb F}_2^n \to {\mathbb R}$ is computed by a decision tree with depth at most $n$ and size at most $2^n$.
Let $f : {\mathbb F}_2^n \to {\mathbb R}$ be computable by a decision tree of size $s$ and depth $k$. Show that $-f$ and the boolean dual $f^\dagger$ are also computable by decision trees of size $s$ and depth $k$.
 For each function in Exercise 1.1 with $4$ or fewer inputs, give a decision tree computing it. Try primarily to use the least possible depth, and secondarily to use the least possible size.
 Prove Proposition 16.
Let $T$ be a decision tree over $n$ coordinates. We define choosing a random string ${\boldsymbol{x}} \in {\mathbb F}_2^n$ according to $T$ as follows. First we choose a random path $\boldsymbol{P}$ in $T$, meaning that we start at the root and at each internal node we (independently) proceed along one of the two outgoing edges with equal probability. Write $\boldsymbol{J}$ for the coordinates $\boldsymbol{P}$ encounters in its internal nodes and $\overline{\boldsymbol{J}} = [n] \setminus \boldsymbol{J}$. The path $\boldsymbol{P}$ also naturally defines a partial string $\boldsymbol{y} \in {\mathbb F}_2^{\boldsymbol{J}}$, where $\boldsymbol{y}_i = b$ if $\boldsymbol{P}$ proceeded along the $b$-edge from the internal node labelled $i$. Finally, we choose $\boldsymbol{z} \sim {\mathbb F}_2^{\overline{\boldsymbol{J}}}$ uniformly and form the composite string ${\boldsymbol{x}} = (\boldsymbol{y}, \boldsymbol{z}) \in {\mathbb F}_2^n$. Show that ${\boldsymbol{x}}$ has the uniform probability distribution on ${\mathbb F}_2^n$, no matter what $T$ is.
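(If you want to convince yourself before proving it, the claim can be checked exactly with `Fraction` arithmetic on a small unbalanced tree; the encoding below, an internal node as `(i, child0, child1)` and a leaf as `None`, is just an assumption of this sketch:)

```python
import itertools
from fractions import Fraction

n = 3
cube = list(itertools.product((0, 1), repeat=n))

# internal node: (i, child0, child1); leaf: None.  The tree is unbalanced
# and queries different coordinates on different paths.
tree = (0,
        (1, None, None),
        (2, (1, None, None), None))

dist = {x: Fraction(0) for x in cube}

def walk(node, fixed, p):
    """Accumulate the exact law of x = (y, z): follow a uniform random
    path, then fill the unqueried coordinates uniformly."""
    if node is None:
        free = [i for i in range(n) if i not in fixed]
        for bits in itertools.product((0, 1), repeat=len(free)):
            full = {**fixed, **dict(zip(free, bits))}
            key = tuple(full[i] for i in range(n))
            dist[key] += p * Fraction(1, 2 ** len(free))
        return
    i, c0, c1 = node
    walk(c0, {**fixed, i: 0}, p * Fraction(1, 2))
    walk(c1, {**fixed, i: 1}, p * Fraction(1, 2))

walk(tree, {}, Fraction(1))
assert all(pr == Fraction(1, 2 ** n) for pr in dist.values())
```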
Let $f : {\mathbb F}_2^n \to \{-1,1\}$ be computed by a decision tree $T$ of size $s$ and let $\epsilon \in (0,1]$. Suppose each path in $T$ is truncated (if necessary) so that its length does not exceed $\log(s/\epsilon)$; new leaves with labels $-1$ and $1$ may be created in an arbitrary way as necessary. Show that the resulting decision tree $T’$ computes a function which is $\epsilon$-close to $f$. Deduce Proposition 17.
 A decision list is a decision tree in which every internal node has an outgoing edge to at least one leaf. Show that any function computable by a decision list is a linear threshold function.
A read-once decision tree is one in which every internal node queries a distinct variable. Bearing this in mind, show that the bound $k 2^{k-1}$ in Theorem 4 cannot be reduced below $2^k - 1$.
Suppose that $f$ is computed by a read-once decision tree in which every root-to-leaf path has length $k$ and every internal node at the deepest level has one child (leaf) labelled $-1$ and one child labelled $1$. Compute the influence of each coordinate on $f$, and compute $\mathbf{I}[f]$.
 The following are generalizations of decision trees:
Subcube partition: This is defined by a collection $C_1, \dots, C_s$ of subcubes which form a partition of ${\mathbb F}_2^n$, along with values $b_1, \dots, b_s \in {\mathbb R}$. It computes the function $f : {\mathbb F}_2^n \to {\mathbb R}$ which has value $b_i$ on all inputs in $C_i$. The subcube partition’s size is $s$ and its “codimension” $k$ (analogous to depth) is the maximum codimension of the cubes $C_i$.
Parity decision tree: This is similar to a decision tree except that the internal nodes are labelled by vectors $\gamma \in {\mathbb F}_2^n$. At such a node the computation path on input $x$ follows the edge labelled $\gamma \cdot x$. We insist that for each root-to-leaf path, the vectors appearing in its internal nodes are linearly independent. Size $s$ and depth $k$ are defined as with normal decision trees.
Affine subspace partition: This is similar to a subcube partition except the subcubes $C_i$ may be arbitrary affine subspaces.
 Show that subcube partition size/codimension and parity decision tree size/depth generalize normal decision tree size/depth, and are generalized by affine subspace partition size/codimension.
 Show that Proposition 16 holds also for the generalizations, except that the statement about degree need not hold for parity decision trees and affine subspace partitions.
Define $\mathrm{Equ}_3 : \{-1,1\}^3 \to \{-1,1\}$ by $\mathrm{Equ}_3(x) = 1$ if and only if $x_1 = x_2 = x_3$.
 Show that $\deg(\mathrm{Equ}_3) = 2$.
 Show that ${\mathrm{DT}}(\mathrm{Equ}_3) = 3$.
 Show that $\mathrm{Equ}_3$ is computable by a parity decision tree of codimension $2$.
For $d \in {\mathbb N}$, define $f : \{-1,1\}^{3^d} \to \{-1,1\}$ by $f = \mathrm{Equ}_3^{\otimes d}$ (using the notation from Definition 2.6). Show that $\deg(f) = 2^d$ but ${\mathrm{DT}}(f) = 3^d$.
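(Part (a) can be confirmed by brute force: the expansion works out to $\mathrm{Equ}_3(x) = -\frac12 + \frac12(x_1x_2 + x_1x_3 + x_2x_3)$, so the degree is $2$. A Python check:)

```python
import itertools

cube = list(itertools.product((-1, 1), repeat=3))

def equ3(x):
    return 1 if x[0] == x[1] == x[2] else -1

subsets = [S for k in range(4) for S in itertools.combinations(range(3), k)]

def chi(S, x):
    out = 1
    for i in S:
        out *= x[i]
    return out

# hat f(S) = E[f(x) chi_S(x)], computed exactly over all 8 inputs
coeffs = {S: sum(equ3(x) * chi(S, x) for x in cube) / 8 for S in subsets}

assert coeffs[()] == -0.5
assert all(coeffs[S] == 0.5 for S in subsets if len(S) == 2)
assert max(len(S) for S in subsets if coeffs[S] != 0) == 2   # deg = 2
```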
Let $f : \{-1,1\}^n \to {\mathbb R}$ and $J \subseteq [n]$. Define $f^{\subseteq J} : \{-1,1\}^n \to {\mathbb R}$ by $f^{\subseteq J}(x) = \mathop{\bf E}_{\boldsymbol{y} \sim \{-1,1\}^{\overline{J}}}[f(x_J, \boldsymbol{y})]$, where $x_J \in \{-1,1\}^J$ is the projection of $x$ to coordinates $J$. Compute the Fourier expansion of $f^{\subseteq J}$.
Suppose $f : \{-1,1\}^n \to {\mathbb R}$ is computable by a decision tree which has a leaf at depth $k$ labelled $b$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \geq |b|/2^k$. (Hint: you may find the previous exercise helpful.)
 Prove Fact 25 by using Theorem 1.28 and Exercise 1.1(d).

 Suppose $f : {\mathbb F}_2^n \to {\mathbb R}$ has $\mathrm{sparsity}(\widehat{f}) < 2^n$. Show that for any $\gamma \in \mathrm{supp}(\widehat{f})$ there exists nonzero $\beta \in {\widehat{{\mathbb F}_2^n}}$ such that $f_{\beta^\perp}$ has $\widehat{f}(\gamma)$ as a Fourier coefficient.
Prove by induction on $n$ that if $f : \{-1,1\}^n \to \{-1,1\}$ has $\mathrm{sparsity}(\widehat{f}) = s > 1$ then $\widehat{f}$ is $2^{1-\lfloor \log s \rfloor}$-granular. (Hint: Distinguish the cases $s = 2^n$ and $s < 2^n$. In the latter case use part (a).)
Prove that there are no functions $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{sparsity}(\widehat{f}) \in \{2, 3, 5, 6, 7, 9\}$.
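(Exhaustive search over all Boolean functions on $n \le 3$ inputs is cheap and agrees with part (c); for $n = 2$, for instance, the achievable sparsities are exactly $\{1, 4\}$. A Python check:)

```python
import itertools
from math import prod

def achievable_sparsities(n):
    """Set of Fourier sparsities attained by functions {-1,1}^n -> {-1,1}."""
    cube = list(itertools.product((-1, 1), repeat=n))
    subsets = [S for k in range(n + 1)
               for S in itertools.combinations(range(n), k)]
    seen = set()
    for values in itertools.product((-1, 1), repeat=2 ** n):
        f = dict(zip(cube, values))
        # count subsets S with sum_x f(x) chi_S(x) != 0
        seen.add(sum(1 for S in subsets
                     if sum(f[x] * prod(x[i] for i in S) for x in cube) != 0))
    return seen

for n in (1, 2, 3):
    assert achievable_sparsities(n).isdisjoint({2, 3, 5, 6, 7, 9})
```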
Show that one can learn any target $f : \{-1,1\}^n \to \{-1,1\}$ with error $0$ from random examples only in time $\widetilde{O}(2^n)$.
Improve Proposition 31 as follows. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ and $g : \{-1,1\}^n \to {\mathbb R}$ satisfy $\|f - g\|_1 \leq \epsilon$. Pick $\boldsymbol{\theta} \in [-1,1]$ uniformly at random and define $\boldsymbol{h} : \{-1,1\}^n \to \{-1,1\}$ by $\boldsymbol{h}(x) = \mathrm{sgn}(g(x) - \boldsymbol{\theta})$. Show that $\mathop{\bf E}[\mathrm{dist}(f,\boldsymbol{h})] \leq \epsilon/2$.

For $n$ even, find a function $f : \{-1,1\}^n \to \{-1,1\}$ which is not $1/2$-concentrated on any $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| < 2^{n-1}$. (Hint: Exercise 1.1.)
Let $\boldsymbol{f} : \{-1,1\}^n \to \{-1,1\}$ be a random function as in Exercise 1.7. Show that with probability at least $1/2$, $\boldsymbol{f}$ is not $1/4$-concentrated on degree up to $\lfloor n/2 \rfloor$.
 Prove Theorem 36. (Hint: in light of Exercise 1.12 you may round off certain estimates with confidence.)
 Show that each of the following classes $\mathcal{C}$ (ordered by inclusion) can be learned exactly (i.e., with error $0$) using membership queries in time $\mathrm{poly}(n, 2^k)$:
$\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f$ is a $k$-junta$\}$. (Hint: estimate influences.)
$\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid {\mathrm{DT}}(f) \leq k\}$.
$\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\widehat{f}) \leq 2^{O(k)}\}$. (Hint: Exercise 31.)
 Prove Theorem 38. (Hint: Exercise 15.)
 Deduce Theorem 37 from the Goldreich–Levin algorithm.
 Suppose $A$ learns $\mathcal{C}$ from random examples with error $\epsilon/2$ in time $T$ — with probability at least $9/10$.
 After producing hypothesis $h$ on target $f : \{1,1\}^n \to \{1,1\}$, show that $A$ can “check” whether $h$ is a good hypothesis in time $\mathrm{poly}(n, T, 1/\epsilon) \cdot \log(1/\delta)$. Specifically, except with probability at most $\delta$, $A$ should output `YES’ if $\mathrm{dist}(f, h) \leq \epsilon/2$ and `NO’ if $\mathrm{dist}(f, h) > \epsilon$. (Hint: time $\mathrm{poly}(T)$ may be required for $A$ to evaluate $h(x)$.)
Show that for any $\delta \in (0, 1/2]$, there is a learning algorithm which learns $\mathcal{C}$ with error $\epsilon$ in time $\mathrm{poly}(n, T, 1/\epsilon)\cdot \log(1/\delta)$ — with probability at least $1-\delta$.

Our description of the Low-Degree Algorithm with degree $k$ and error $\epsilon$ involved using a new batch of random examples to estimate each low-degree Fourier coefficient. Show that one can instead simply draw a single batch $\mathcal{E}$ of $\mathrm{poly}(n^k, 1/\epsilon)$ examples and use $\mathcal{E}$ to estimate each of the low-degree coefficients.
Show that when using the above form of the Low-Degree Algorithm, the final hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ is of the form \[ h(y) = \mathrm{sgn}\left(\sum_{(x, f(x)) \in \mathcal{E}} w(\mathrm{Dist}(y,x)) \cdot f(x) \right), \] for some function $w : \{0, 1, \dots, n\} \to {\mathbb R}$. In other words, the hypothesis on a given $y$ is equal to a weighted vote over all examples seen, where an example’s weight depends only on its Hamming distance to $y$. Simplify your expression for $w$ as much as you can.
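(One way to check a candidate simplification numerically: the kernel $\sum_{|S| \le k} \chi_S(x)\chi_S(y)$ depends only on $d = \mathrm{Dist}(y,x)$, and one candidate closed form, my guess to be verified rather than a stated answer, is the Krawtchouk-style sum $\sum_{j=0}^{k} \sum_a (-1)^a \binom{d}{a}\binom{n-d}{j-a}$. A Python comparison for $n=6$, $k=3$:)

```python
import itertools
from math import comb

n, k = 6, 3

def w_direct(d):
    # sum_{|S| <= k} prod_{i in S} x_i y_i for any fixed x, y at distance d
    x = [-1] * d + [1] * (n - d)
    y = [1] * n
    total = 0
    for j in range(k + 1):
        for S in itertools.combinations(range(n), j):
            p = 1
            for i in S:
                p *= x[i] * y[i]
            total += p
    return total

def w_candidate(d):
    # candidate closed form: sum of Krawtchouk polynomial values K_j(d)
    return sum(sum((-1) ** a * comb(d, a) * comb(n - d, j - a)
                   for a in range(j + 1))
               for j in range(k + 1))

assert all(w_direct(d) == w_candidate(d) for d in range(n + 1))
```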
Extend the Goldreich–Levin algorithm so that it works also for functions $f : \{-1,1\}^n \to [-1,1]$. (The learning model for targets $f : \{-1,1\}^n \to [-1,1]$ assumes that $f(x)$ is always a rational number expressible by $\mathrm{poly}(n)$ bits.)

 Assume $\gamma, \gamma’ \in {\widehat{{\mathbb F}_2^n}}$ are distinct. Show that $\mathop{\bf Pr}_{{\boldsymbol{x}}}[\gamma \cdot {\boldsymbol{x}} = \gamma' \cdot {\boldsymbol{x}}] = 1/2$.
 Fix $\gamma \in {\widehat{{\mathbb F}_2^n}}$ and suppose ${\boldsymbol{x}}^{(1)}, \dots, {\boldsymbol{x}}^{(m)} \sim {\mathbb F}_2^n$ are drawn uniformly and independently. Show that if $m = Cn$ for $C$ a sufficiently large constant then with high probability, the only $\gamma’ \in {\widehat{{\mathbb F}_2^n}}$ satisfying $\gamma’ \cdot {\boldsymbol{x}}^{(i)} = \gamma \cdot {\boldsymbol{x}}^{(i)}$ for all $i \in [m]$ is $\gamma’ = \gamma$.
 Essentially improve on Exercise 1.26 by showing that the concept class of all linear functions ${\mathbb F}_2^n \to {\mathbb F}_2$ can be learned from random examples only, with error $0$, in time $\mathrm{poly}(n)$. (Remark: if $\omega \in {\mathbb R}$ is such that $n \times n$ matrix multiplication can be done in $O(n^\omega)$ time, then the learning algorithm also requires only $O(n^\omega)$ time.)
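(A sketch of the natural algorithm here: draw $O(n)$ random examples and solve for the unknown $\gamma$ by Gaussian elimination over ${\mathbb F}_2$, redrawing in the unlikely event the examples fail to span. A Python sketch:)

```python
import random

n = 8
random.seed(0)
gamma = [random.randint(0, 1) for _ in range(n)]       # the unknown target

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % 2

def solve(examples):
    """Gaussian elimination over F_2 on the augmented matrix [X | labels].
    Returns the unique solution, or None if the x's don't span F_2^n."""
    rows = [list(x) + [y] for x, y in examples]
    m, pivots = len(rows), 0
    for col in range(n):
        for r in range(pivots, m):
            if rows[r][col]:
                rows[pivots], rows[r] = rows[r], rows[pivots]
                break
        else:
            return None                                # rank deficient
        for r in range(m):
            if r != pivots and rows[r][col]:
                rows[r] = [(a + b) % 2 for a, b in zip(rows[r], rows[pivots])]
        pivots += 1
    return [rows[i][n] for i in range(n)]

learned = None
while learned is None:
    batch = []
    for _ in range(2 * n):                             # O(n) examples suffice whp
        x = [random.randint(0, 1) for _ in range(n)]
        batch.append((x, dot(gamma, x)))
    learned = solve(batch)

assert learned == gamma
```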
Let $\tau \geq 1/2 + \epsilon$ for some constant $\epsilon > 0$. Give an algorithm simpler than Goldreich and Levin’s which solves the following problem with high probability: Given query access to $f : \{-1,1\}^n \to \{-1,1\}$, in time $\mathrm{poly}(n, 1/\epsilon)$ find the unique $U \subseteq [n]$ such that $\widehat{f}(U) \geq \tau$, assuming it exists. (Hint: use Proposition 1.32 and Exercise 1.26.)
Informally: a “one-way permutation” is a bijective function $f : {\mathbb F}_2^n \to {\mathbb F}_2^n$ which is easy to compute but hard to invert; a “pseudorandom generator” is a function $g : {\mathbb F}_2^k \to {\mathbb F}_2^m$ for $m > k$ whose output on a random input “looks unpredictable” to any efficient algorithm. Goldreich and Levin proposed the following construction of the latter from the former: for $k = 2n$, $m = 2n+1$, define \[ g(r, s) = (r, f(s), r \cdot s), \] where $r, s \in {\mathbb F}_2^n$. When $g$’s input $(\boldsymbol{r}, \boldsymbol{s})$ is uniformly random then so are the first $2n$ bits of its output (using the fact that $f$ is a bijection). The key to the analysis is showing that the final bit, $\boldsymbol{r} \cdot \boldsymbol{s}$, is highly unpredictable to efficient algorithms even given the first $2n$ bits $(\boldsymbol{r}, f(\boldsymbol{s}))$. This is proved by contradiction.
Suppose that an adversary has a deterministic, efficient algorithm $A$ good at predicting the bit $\boldsymbol{r} \cdot \boldsymbol{s}$: \[ \mathop{\bf Pr}_{\boldsymbol{r}, \boldsymbol{s} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(\boldsymbol{s})) = \boldsymbol{r} \cdot \boldsymbol{s}] \geq \frac12 + \gamma. \] Show there exists $B \subseteq {\mathbb F}_2^n$ with $|B|/2^n \geq \frac12 \gamma$ such that \[ \mathop{\bf Pr}_{\boldsymbol{r} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(s)) = \boldsymbol{r} \cdot s] \geq \frac12 + \frac12 \gamma \] for all $s \in B$.
 Switching to $\pm 1$ notation in the output, deduce $\widehat{A_{ \mid f(s)}}(s) \geq \gamma$ for all $s \in B$.
Show that the adversary can efficiently compute $s$ given $f(s)$ (with high probability) for any $s \in B$. If $\gamma$ is nonnegligible this contradicts the assumption that $f$ is “one-way”. (Hint: use the Goldreich–Levin algorithm.)
 Deduce the same conclusion even if $A$ is a randomized algorithm.
Ex 15: I am probably missing something here, but it seems to me that the assumption $\|f\|_2 \leq 1$ is actually not necessary, and we have a slightly stronger bound of $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$ (by using the L1 bound instead of Parseval’s)?
Absolutely right, thanks LiYang! I fixed it.
In ex 1: is $M^{-\top}$ a notation for $M^{-1}$?
Transpose of the inverse. Does that make sense?
Yes. thank you!
In ex. 4 you probably mean:
if $f(x_1,x_2,\ldots ,x_n,y)=0$ then $f(x_1,x_2,\ldots ,x_n,-y)$ is of $\deg \le k-1$,
for each constant $y$.
Right, thanks! Fixed.
In ex 22: *does not exceed
Thanks!
In ex 35: I think you mean ex 1.12 rather than ex 1.11 .
Thanks!
The link in ex 37 is broken.
Fixed, I think.
I don’t understand something in ex 6. For $f=\mathbb{1}_{(1,1,\ldots,1)}$ , $H=0$ , $z=(1,1,\ldots,1)$ I got that the restricted spectral norm is $1$ while the spectral norm is smaller.
Shouldn’t the original spectral norm also be 1? The Fourier expansion of $f$ is just $\sum_{\gamma} \pm 2^{-n} \chi_\gamma$.
If I didn’t misunderstand the definitions then:
$\hat{\lVert} f \hat{\rVert}_{p}^{p} = \sum_\gamma 2^{-np} < \sum_\gamma 2^{-n} = 1$ for $p > 1$.
You’re right — 6(a) is almost totally wrong! It should just ask you to prove it for $p = 1$. Thanks very much Noam!!
Keep the corrections coming, by the way!