- Let $M : {\mathbb F}_2^n \to {\mathbb F}_2^n$ be an invertible linear transformation. Given $f : {\mathbb F}_2^n \to {\mathbb R}$, let $f \circ M : {\mathbb F}_2^n \to {\mathbb R}$ be defined by $f \circ M(x) = f(Mx)$. Show that $\widehat{f \circ M}(\gamma) = \widehat{f}(M^{-\top}\gamma)$. What if $M$ is an invertible affine transformation? What if $M$ is not invertible?
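The identity above lends itself to a quick numerical sanity check. The sketch below is not a proof; $n = 2$ and the particular invertible $M$ (which happens to be its own inverse over ${\mathbb F}_2$) are assumptions hard-coded purely for illustration.

```python
import itertools, random

n = 2
M = [[1, 1], [0, 1]]        # invertible over F_2; this particular M is its own inverse,
M_inv_T = [[1, 0], [1, 1]]  # so M^{-T} is just the transpose of M

def mat_vec(A, x):
    # matrix-vector product over F_2
    return tuple(sum(A[i][j] * x[j] for j in range(n)) % 2 for i in range(n))

def fourier(f, pts):
    # \hat{f}(gamma) = E_x[ f(x) * (-1)^{gamma . x} ]
    dot = lambda g, x: sum(gi * xi for gi, xi in zip(g, x)) % 2
    return {g: sum(f[x] * (-1) ** dot(g, x) for x in pts) / 2 ** n for g in pts}

pts = list(itertools.product([0, 1], repeat=n))
f = {x: random.gauss(0, 1) for x in pts}      # a random real-valued function
fM = {x: f[mat_vec(M, x)] for x in pts}       # (f o M)(x) = f(Mx)

lhs, rhs = fourier(fM, pts), fourier(f, pts)
assert all(abs(lhs[g] - rhs[mat_vec(M_inv_T, g)]) < 1e-9 for g in pts)
print("change-of-variables identity verified on this example")
```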
- Show that $\frac{2}{1-e^{-2}}$ is the smallest constant (not depending on $\delta$ or $n$) that can be taken in Proposition 3.
- Generalize Proposition 3 by showing that any $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon$-concentrated on degree up to $1/\delta$ for $\epsilon = (\mathop{\bf E}[f^2] - \mathbf{Stab}_{1-\delta}[f])/(1-1/e)$.
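For reference, recall from Chapter 2 that $\mathbf{Stab}_{1-\delta}[f] = \sum_{S \subseteq [n]} (1-\delta)^{|S|} \widehat{f}(S)^2$, while Parseval gives $\mathop{\bf E}[f^2] = \sum_{S \subseteq [n]} \widehat{f}(S)^2$; so the numerator in the bound above is exactly the Fourier weight of $f$, discounted according to degree.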
- Prove Lemma 5 by induction on $n$. (Hint: if one of the subfunctions $f(x_1, \dots, x_{n-1}, \pm 1)$ is identically $0$, show that the other has degree at most $k-1$.)
- Verify for all $p \in [1, \infty]$ that $\hat{\lVert} \cdot \hat{\rVert}_p$ is a norm on the vector space of functions $f : {\mathbb F}_2^n \to {\mathbb R}$.
- Let $f : {\mathbb F}_2^n \to {\mathbb R}$ and let $H \leq {\mathbb F}_2^n$, $z \in H^{\perp}$.
- Show that restriction reduces spectral $1$-norm: $\hat{\lVert} f_{H \mid z} \hat{\rVert}_1 \leq \hat{\lVert} f \hat{\rVert}_1$.
- Show that it also reduces Fourier sparsity: $\mathrm{sparsity}(\widehat{f_{H \mid z}}) \leq \mathrm{sparsity}(\widehat{f})$.
- Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $0 < p \leq q \leq \infty$. Show that $\hat{\lVert} f \hat{\rVert}_p \geq \hat{\lVert} f \hat{\rVert}_q$. (Cf. Exercise 1.14.)
- Let $f : \{-1,1\}^n \to {\mathbb R}$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \leq \|f\|_1$ and $\|f\|_\infty \leq \hat{\lVert} f \hat{\rVert}_1$. (These are easy special cases of the Hausdorff–Young inequality.)
- Suppose $f : \{-1,1\}^n \to \{-1,1\}$ is monotone. Show that $|\widehat{f}(S)| \leq \widehat{f}(i)$ whenever $i \in S \subseteq [n]$. Deduce that $\hat{\lVert} f \hat{\rVert}_\infty = \max_{S} \{|\widehat{f}(S)|\}$ is achieved by an $S$ of cardinality $0$ or $1$. (Hint: apply the previous exercise to $f$’s derivatives.)
- Prove Proposition 12.
- Verify Parseval’s Theorem for the Fourier expansion of subspaces given in Proposition 11.
- Let $f : {\mathbb F}_2^n \to \{0,1\}$ be the indicator of $A \subseteq {\mathbb F}_2^n$. We know that $\hat{\lVert} f \hat{\rVert}_1 = 1$ if $A$ is an affine subspace. So assume that $A$ is not an affine subspace.
- Show that there exists an affine subspace $B$ of dimension $2$ on which $f$ takes the value $1$ exactly $3$ times.
- Let $b$ be the point in $B$ where $f$ is $0$ and let $\psi = \varphi_B - (1/2) \varphi_{b}$. Show that $\hat{\lVert} \psi \hat{\rVert}_\infty = 1/2$.
- Show that $\langle \psi, f \rangle = 3/4$ and deduce $\hat{\lVert} f \hat{\rVert}_1 \geq 3/2$.
- Suppose $f : \{-1,1\}^n \to {\mathbb R}$ satisfies $\mathop{\bf E}[f^2] \leq 1$. Show that $\hat{\lVert} f \hat{\rVert}_1 \leq 2^{n/2}$, and show that for any even $n$ the upper bound can be achieved by a function $f : \{-1,1\}^n \to \{-1,1\}$.
- Given $f : {\mathbb F}_2^n \to {\mathbb R}$, define its (fractional) sparsity to be $\mathrm{sparsity}(f) = |\mathrm{supp}(f)|/2^n = \mathop{\bf Pr}_{{\boldsymbol{x}} \in {\mathbb F}_2^n}[f({\boldsymbol{x}}) \neq 0]$. In this exercise you will prove the uncertainty principle: if $f$ is nonzero then $\mathrm{sparsity}(f) \cdot \mathrm{sparsity}(\widehat{f}) \geq 1$.
- Show we may assume $\|f\|_1 = 1$.
- Suppose $\mathcal{F} = \{ \gamma : \widehat{f}(\gamma) \neq 0\}$. Show that $\hat{\lVert} f \hat{\rVert}^2_2 \leq |\mathcal{F}|$.
- Suppose $\mathcal{G} = \{ x : f(x) \neq 0\}$. Show that $\|f\|_2^2 \geq 2^n/|\mathcal{G}|$, and deduce the uncertainty principle.
- Identify all cases of equality.
- Let $f : \{-1,1\}^n \to {\mathbb R}$ and let $\epsilon > 0$. Show that $f$ is $\epsilon$-concentrated on a collection $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$.
- Suppose the Fourier spectrum of $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon_1$-concentrated on $\mathcal{F}$ and that $g : \{-1,1\}^n \to {\mathbb R}$ satisfies $\|f - g\|_2^2 \leq \epsilon_2$. Show that the Fourier spectrum of $g$ is $2(\epsilon_1+\epsilon_2)$-concentrated on $\mathcal{F}$.
- Show that every function $f : {\mathbb F}_2^n \to {\mathbb R}$ is computed by a decision tree with depth at most $n$ and size at most $2^n$.
- Let $f : {\mathbb F}_2^n \to {\mathbb R}$ be computable by a decision tree of size $s$ and depth $k$. Show that $-f$ and the boolean dual $f^\dagger$ are also computable by decision trees of size $s$ and depth $k$.
- For each function in Exercise 1.1 with $4$ or fewer inputs, give a decision tree computing it. Try primarily to use the least possible depth, and secondarily to use the least possible size.
- Prove Proposition 16.
- Let $T$ be a decision tree over $n$ coordinates. We define choosing a random string ${\boldsymbol{x}} \in {\mathbb F}_2^n$ according to $T$ as follows. First we choose a random path $\boldsymbol{P}$ in $T$, meaning that we start at the root and at each internal node we (independently) proceed along one of the two outgoing edges with equal probability. Write $\boldsymbol{J}$ for the coordinates $\boldsymbol{P}$ encounters in its internal nodes and $\overline{\boldsymbol{J}} = [n] \setminus \boldsymbol{J}$. The path $\boldsymbol{P}$ also naturally defines a partial string $\boldsymbol{y} \in {\mathbb F}_2^{\boldsymbol{J}}$, where $\boldsymbol{y}_i = b$ if $\boldsymbol{P}$ proceeded along the $b$-edge from the internal node labelled $i$. Finally, we choose $\boldsymbol{z} \sim {\mathbb F}_2^{\overline{\boldsymbol{J}}}$ uniformly and form the composite string ${\boldsymbol{x}} = (\boldsymbol{y}, \boldsymbol{z}) \in {\mathbb F}_2^n$. Show that ${\boldsymbol{x}}$ has the uniform probability distribution on ${\mathbb F}_2^n$, no matter what $T$ is.
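The sampling procedure in the exercise above is easy to simulate. Below is a minimal sketch (with a small unbalanced tree hard-coded purely for illustration, and with sampling standing in for the proof the exercise asks for); the empirical distribution comes out uniform:

```python
import random
from collections import Counter

n = 3
# internal nodes are ("node", i, child_for_0, child_for_1); leaves are "leaf"
T = ("node", 0,
     "leaf",                                              # x_0 = 0: stop at depth 1
     ("node", 2, "leaf", ("node", 1, "leaf", "leaf")))    # deeper on the other side

def sample_according_to_T(T):
    x = [None] * n
    node = T
    while node != "leaf":
        _, i, child0, child1 = node
        b = random.randrange(2)          # follow a uniformly random outgoing edge
        x[i] = b
        node = (child0, child1)[b]
    # fill the unqueried coordinates uniformly
    return tuple(b if b is not None else random.randrange(2) for b in x)

counts = Counter(sample_according_to_T(T) for _ in range(80000))
print(sorted(counts.values()))           # all 8 counts should be close to 10000
```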
- Let $f : {\mathbb F}_2^n \to \{-1,1\}$ be computed by a decision tree $T$ of size $s$ and let $\epsilon \in (0,1]$. Suppose each path in $T$ is truncated (if necessary) so that its length does not exceed $\log(s/\epsilon)$; new leaves with labels $-1$ and $1$ may be created in an arbitrary way as necessary. Show that the resulting decision tree $T'$ computes a function which is $\epsilon$-close to $f$. Deduce Proposition 17.
- A decision list is a decision tree in which every internal node has an outgoing edge to at least one leaf. Show that any function computable by a decision list is a linear threshold function.
- A read-once decision tree is one in which every internal node queries a distinct variable. Bearing this in mind, show that the bound $k2^{k-1}$ in Theorem 4 cannot be reduced below $2^k - 1$.
- Suppose that $f$ is computed by a read-once decision tree in which every root-to-leaf path has length $k$ and every internal node at the deepest level has one child (leaf) labelled $-1$ and one child labelled $1$. Compute the influence of each coordinate on $f$, and compute $\mathbf{I}[f]$.
- The following are generalizations of decision trees:
Subcube partition: This is defined by a collection $C_1, \dots, C_s$ of subcubes which form a partition of ${\mathbb F}_2^n$, along with values $b_1, \dots, b_s \in {\mathbb R}$. It computes the function $f : {\mathbb F}_2^n \to {\mathbb R}$ which has value $b_i$ on all inputs in $C_i$. The subcube partition’s size is $s$ and its “codimension” $k$ (analogous to depth) is the maximum codimension of the cubes $C_i$.
Parity decision tree: This is similar to a decision tree except that the internal nodes are labelled by vectors $\gamma \in {\mathbb F}_2^n$. At such a node the computation path on input $x$ follows the edge labelled $\gamma \cdot x$. We insist that for each root-to-leaf path, the vectors appearing in its internal nodes are linearly independent. Size $s$ and depth $k$ are defined as with normal decision trees.
Affine subspace partition: This is similar to a subcube partition except that the subcubes $C_i$ may be arbitrary affine subspaces.
- Show that subcube partition size/codimension and parity decision tree size/depth generalize normal decision tree size/depth, and are generalized by affine subspace partition size/codimension.
- Show that Proposition 16 holds also for the generalizations, except that the statement about degree need not hold for parity decision trees and affine subspace partitions.
- Define $\mathrm{Equ}_3 : \{-1,1\}^3 \to \{-1,1\}$ by $\mathrm{Equ}_3(x) = -1$ if and only if $x_1 = x_2 = x_3$.
- Show that $\deg(\mathrm{Equ}_3) = 2$.
- Show that ${\mathrm{DT}}(\mathrm{Equ}_3) = 3$.
- Show that $\mathrm{Equ}_3$ is computable by a parity decision tree of codimension $2$.
- For $d \in {\mathbb N}$, define $f : \{-1,1\}^{3^d} \to \{-1,1\}$ by $f = \mathrm{Equ}_3^{\otimes d}$ (using the notation from Definition 2.6). Show that $\deg(f) = 2^d$ but ${\mathrm{DT}}(f) = 3^d$.
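A brute-force computation confirms the claim above that $\deg(\mathrm{Equ}_3) = 2$ (it only checks the number already stated and proves nothing about general $d$):

```python
from itertools import combinations, product

def equ3(x):
    # Equ_3(x) = -1 if and only if x_1 = x_2 = x_3
    return -1 if x[0] == x[1] == x[2] else 1

def chi(S, x):
    # the character chi_S(x) = prod_{i in S} x_i
    out = 1
    for i in S:
        out *= x[i]
    return out

cube = list(product([-1, 1], repeat=3))
subsets = [S for d in range(4) for S in combinations(range(3), d)]
coeffs = {S: sum(equ3(x) * chi(S, x) for x in cube) / 8 for S in subsets}
print(coeffs)    # 1/2 on the empty set, -1/2 on each pair, 0 elsewhere
print("deg =", max(len(S) for S, c in coeffs.items() if abs(c) > 1e-9))   # 2
```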
- Let $f : \{-1,1\}^n \to {\mathbb R}$ and $J \subseteq [n]$. Define $f^{\subseteq J} : \{-1,1\}^n \to {\mathbb R}$ by $f^{\subseteq J}(x) = \mathop{\bf E}_{\boldsymbol{y} \sim \{-1,1\}^{\overline{J}}}[f(x_J, \boldsymbol{y})]$, where $x_J \in \{-1,1\}^J$ is the projection of $x$ to coordinates $J$. Compute the Fourier expansion of $f^{\subseteq J}$.
- Suppose $f : \{-1,1\}^n \to {\mathbb R}$ is computable by a decision tree which has a leaf at depth $k$ labelled $b$. Show that $\hat{\lVert} f \hat{\rVert}_\infty \geq |b|/2^k$. (Hint: you may find the previous exercise helpful.)
- Prove Fact 25 by using Theorem 1.28 and Exercise 1.1(d).
- In this exercise you will show that Boolean functions of small Fourier sparsity have granular Fourier spectra; here we say that $\widehat{f}$ is $\epsilon$-granular if every $\widehat{f}(\gamma)$ is an integer multiple of $\epsilon$.
- Suppose $f : {\mathbb F}_2^n \to {\mathbb R}$ has $\mathrm{sparsity}(\widehat{f}) < 2^n$. Show that for any $\gamma \in \mathrm{supp}(\widehat{f})$ there exists nonzero $\beta \in {\widehat{{\mathbb F}_2^n}}$ such that $f_{\beta^\perp}$ has $\widehat{f}(\gamma)$ as a Fourier coefficient.
- Prove by induction on $n$ that if $f : {\mathbb F}_2^n \to \{-1,1\}$ has $\mathrm{sparsity}(\widehat{f}) = s > 1$ then $\widehat{f}$ is $2^{1-\lfloor \log s \rfloor}$-granular. (Hint: Distinguish the cases $s = 2^n$ and $s < 2^n$. In the latter case use part (a).)
- Prove that there are no functions $f : \{-1,1\}^n \to \{-1,1\}$ with $\mathrm{sparsity}(\widehat{f}) \in \{2, 3, 5, 6, 7, 9\}$.
- Show that one can learn any target $f : \{-1,1\}^n \to \{-1,1\}$ with error $0$ from random examples only in time $\widetilde{O}(2^n)$.
- Improve Proposition 31 as follows. Suppose $f : \{-1,1\}^n \to \{-1,1\}$ and $g : \{-1,1\}^n \to {\mathbb R}$ satisfy $\|f - g\|_1 \leq \epsilon$. Pick $\boldsymbol{\theta} \in [-1,1]$ uniformly at random and define $\boldsymbol{h} : \{-1,1\}^n \to \{-1,1\}$ by $\boldsymbol{h}(x) = \mathrm{sgn}(g(x) - \boldsymbol{\theta})$. Show that $\mathop{\bf E}[\mathrm{dist}(f,\boldsymbol{h})] \leq \epsilon/2$.
- For $n$ even, find a function $f : \{-1,1\}^n \to \{-1,1\}$ which is not $1/2$-concentrated on any $\mathcal{F} \subseteq 2^{[n]}$ with $|\mathcal{F}| < 2^{n-1}$. (Hint: Exercise 1.1.)
- Let $\boldsymbol{f} : \{-1,1\}^n \to \{-1,1\}$ be a random function as in Exercise 1.7. Show that with probability at least $1/2$, $\boldsymbol{f}$ is not $1/4$-concentrated on degree up to $\lfloor n/2 \rfloor$.
- Prove Theorem 36. (Hint: in light of Exercise 1.12 you may round off certain estimates with confidence.)
- Show that each of the following classes $\mathcal{C}$ (ordered by inclusion) can be learned exactly (i.e., with error $0$) using membership queries in time $\mathrm{poly}(n, 2^k)$:
- $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f$ is a $k$-junta$\}$. (Hint: estimate influences.)
- $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid {\mathrm{DT}}(f) \leq k\}$.
- $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathrm{sparsity}(\widehat{f}) \leq 2^{O(k)}\}$. (Hint: Exercise 31.)
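The influence-estimation hint in the junta part of the exercise above can be made concrete. Below is a minimal sketch; the callable `query`, the sample count, and the demo target are placeholders chosen only for illustration (the exercise asks you to work out the actual bounds):

```python
import random

def estimate_influence(query, n, i, samples=10000):
    # Inf_i[f] = Pr_x[ f(x) != f(x with coordinate i flipped) ], estimated by
    # sampling; `query` is assumed to answer membership queries f(x).
    hits = 0
    for _ in range(samples):
        x = [random.choice([-1, 1]) for _ in range(n)]
        y = x[:]
        y[i] = -y[i]
        if query(x) != query(y):
            hits += 1
    return hits / samples

# Demo target: f(x) = x_0 x_1, a 2-junta on n = 5 coordinates.
n = 5
f = lambda x: x[0] * x[1]
print([round(estimate_influence(f, n, i), 2) for i in range(n)])  # roughly [1, 1, 0, 0, 0]
```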
- Prove Theorem 38. (Hint: Exercise 15.)
- Deduce Theorem 37 from the Goldreich–Levin algorithm.
- Suppose $A$ learns $\mathcal{C}$ from random examples with error $\epsilon/2$ in time $T$ — with probability at least $9/10$.
- After producing hypothesis $h$ on target $f : \{-1,1\}^n \to \{-1,1\}$, show that $A$ can “check” whether $h$ is a good hypothesis in time $\mathrm{poly}(n, T, 1/\epsilon) \cdot \log(1/\delta)$. Specifically, except with probability at most $\delta$, $A$ should output `YES’ if $\mathrm{dist}(f, h) \leq \epsilon/2$ and `NO’ if $\mathrm{dist}(f, h) > \epsilon$. (Hint: time $\mathrm{poly}(T)$ may be required for $A$ to evaluate $h(x)$.)
- Show that for any $\delta \in (0, 1/2]$, there is a learning algorithm which learns $\mathcal{C}$ with error $\epsilon$ in time $\mathrm{poly}(n, T, 1/\epsilon)\cdot \log(1/\delta)$ — with probability at least $1-\delta$.
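A minimal sketch of the checking step in the first part of the exercise above, assuming the natural approach of estimating $\mathrm{dist}(f,h)$ from fresh random examples; the sample size below is a loose Chernoff-style placeholder rather than the bound the exercise asks you to derive:

```python
import math, random

def check_hypothesis(draw_example, h, eps, delta):
    # Estimate dist(f, h) empirically and output a verdict; with the threshold
    # 3*eps/4, a Chernoff bound separates the two cases in the exercise.
    m = math.ceil((32 / eps ** 2) * math.log(2 / delta))
    mistakes = 0
    for _ in range(m):
        x, y = draw_example()
        if h(x) != y:
            mistakes += 1
    return "YES" if mistakes / m <= 0.75 * eps else "NO"

# Demo where the hypothesis equals the target, so the verdict should be YES.
n = 5
f = lambda x: x[0] * x[1]
def draw_example():
    x = tuple(random.choice([-1, 1]) for _ in range(n))
    return x, f(x)
print(check_hypothesis(draw_example, f, eps=0.1, delta=0.01))
```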
- Our description of the Low-Degree Algorithm with degree $k$ and error $\epsilon$ involved using a new batch of random examples to estimate each low-degree Fourier coefficient. Show that one can instead simply draw a single batch $\mathcal{E}$ of $\mathrm{poly}(n^k, 1/\epsilon)$ examples and use $\mathcal{E}$ to estimate each of the low-degree coefficients.
- Show that when using the above form of the Low-Degree Algorithm, the final hypothesis $h : \{-1,1\}^n \to \{-1,1\}$ is of the form \[ h(y) = \mathrm{sgn}\left(\sum_{(x, f(x)) \in \mathcal{E}} w(\mathrm{Dist}(y,x)) \cdot f(x) \right), \] for some function $w : \{0, 1, \dots, n\} \to {\mathbb R}$. In other words, the hypothesis on a given $y$ is equal to a weighted vote over all examples seen, where an example’s weight depends only on its Hamming distance to $y$. Simplify your expression for $w$ as much as you can.
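A minimal sketch of the single-batch variant described above; the batch size, the degree bound $k$, and the $\mathrm{Maj}_3$ demo target are assumptions made only for illustration:

```python
import random
from itertools import combinations

def chi(S, x):
    # the character chi_S(x) = prod_{i in S} x_i
    out = 1
    for i in S:
        out *= x[i]
    return out

def low_degree_hypothesis(examples, n, k):
    # Reuse one batch of labelled examples (x, f(x)) to estimate every Fourier
    # coefficient of degree at most k, then output the sign of the estimated part.
    subsets = [S for d in range(k + 1) for S in combinations(range(n), d)]
    m = len(examples)
    est = {S: sum(y * chi(S, x) for x, y in examples) / m for S in subsets}
    return lambda z: 1 if sum(c * chi(S, z) for S, c in est.items()) >= 0 else -1

n, k = 3, 1
maj3 = lambda x: 1 if sum(x) > 0 else -1
examples = [(x, maj3(x)) for x in
            (tuple(random.choice([-1, 1]) for _ in range(n)) for _ in range(2000))]
h = low_degree_hypothesis(examples, n, k)
print("empirical error:", sum(h(x) != y for x, y in examples) / len(examples))
```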
- Extend the Goldreich–Levin algorithm so that it works also for functions $f : \{-1,1\}^n \to [-1,1]$. (The learning model for targets $f : \{-1,1\}^n \to [-1,1]$ assumes that $f(x)$ is always a rational number expressible by $\mathrm{poly}(n)$ bits.)
- Assume $\gamma, \gamma' \in {\widehat{{\mathbb F}_2^n}}$ are distinct. Show that $\mathop{\bf Pr}_{{\boldsymbol{x}}}[\gamma \cdot {\boldsymbol{x}} = \gamma' \cdot {\boldsymbol{x}}] = 1/2$.
- Fix $\gamma \in {\widehat{{\mathbb F}_2^n}}$ and suppose ${\boldsymbol{x}}^{(1)}, \dots, {\boldsymbol{x}}^{(m)} \sim {\mathbb F}_2^n$ are drawn uniformly and independently. Show that if $m = Cn$ for $C$ a sufficiently large constant then with high probability, the only $\gamma' \in {\widehat{{\mathbb F}_2^n}}$ satisfying $\gamma' \cdot {\boldsymbol{x}}^{(i)} = \gamma \cdot {\boldsymbol{x}}^{(i)}$ for all $i \in [m]$ is $\gamma' = \gamma$.
- Essentially improve on Exercise 1.26 by showing that the concept class of all linear functions ${\mathbb F}_2^n \to {\mathbb F}_2$ can be learned from random examples only, with error $0$, in time $\mathrm{poly}(n)$. (Remark: if $\omega \in {\mathbb R}$ is such that $n \times n$ matrix multiplication can be done in $O(n^\omega)$ time, then the learning algorithm also requires only $O(n^\omega)$ time.)
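One natural approach to the exercise above (presumably the intended one, given the remark about matrix multiplication) is to draw $O(n)$ random examples and solve the resulting linear system over ${\mathbb F}_2$; the previous exercise guarantees that the solution is unique with high probability. A minimal sketch, with the hidden $\gamma$ and the example count $3n$ chosen only for illustration:

```python
import random

def solve_gf2(xs, ys):
    # Gauss-Jordan elimination over F_2: find gamma with xs[i] . gamma = ys[i] for all i.
    n = len(xs[0])
    rows = [x[:] + [y] for x, y in zip(xs, ys)]       # augmented system
    pivots, r = [], 0
    for c in range(n):
        p = next((i for i in range(r, len(rows)) if rows[i][c]), None)
        if p is None:
            continue
        rows[r], rows[p] = rows[p], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    gamma = [0] * n
    for i, c in enumerate(pivots):
        gamma[c] = rows[i][-1]
    return gamma

n = 8
hidden = [random.randrange(2) for _ in range(n)]                       # the unknown gamma
xs = [[random.randrange(2) for _ in range(n)] for _ in range(3 * n)]   # random examples
ys = [sum(a * b for a, b in zip(x, hidden)) % 2 for x in xs]
print(solve_gf2(xs, ys) == hidden)    # True with high probability
```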
- Let $\tau \geq 1/2 + \epsilon$ for some constant $\epsilon > 0$. Give an algorithm simpler than Goldreich and Levin’s which solves the following problem with high probability: Given query access to $f : \{-1,1\}^n \to \{-1,1\}$, in time $\mathrm{poly}(n, 1/\epsilon)$ find the unique $U \subseteq [n]$ such that $|\widehat{f}(U)| \geq \tau$, assuming it exists. (Hint: use Proposition 1.32 and Exercise 1.26.)
- Informally: a “one-way permutation” is a bijective function $f : {\mathbb F}_2^n \to {\mathbb F}_2^n$ which is easy to compute but hard to invert; a “pseudorandom generator” is a function $g : {\mathbb F}_2^k \to {\mathbb F}_2^m$ for $m > k$ whose output on a random input “looks unpredictable” to any efficient algorithm. Goldreich and Levin proposed the following construction of the latter from the former: for $k = 2n$, $m = 2n+1$, define \[ g(r, s) = (r, f(s), r \cdot s), \] where $r, s \in {\mathbb F}_2^n$. When $g$’s input $(\boldsymbol{r}, \boldsymbol{s})$ is uniformly random then so are the first $2n$ bits of its output (using the fact that $f$ is a bijection). The key to the analysis is showing that the final bit, $\boldsymbol{r} \cdot \boldsymbol{s}$, is highly unpredictable to efficient algorithms even given the first $2n$ bits $(\boldsymbol{r}, f(\boldsymbol{s}))$. This is proved by contradiction. (A minimal code sketch of the construction itself appears after this exercise.)
- Suppose that an adversary has a deterministic, efficient algorithm $A$ good at predicting the bit $\boldsymbol{r} \cdot \boldsymbol{s}$: \[ \mathop{\bf Pr}_{\boldsymbol{r}, \boldsymbol{s} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(\boldsymbol{s})) = \boldsymbol{r} \cdot \boldsymbol{s}] \geq \frac12 + \gamma. \] Show there exists $B \subseteq {\mathbb F}_2^n$ with $|B|/2^n \geq \frac12 \gamma$ such that \[ \mathop{\bf Pr}_{\boldsymbol{r} \sim {\mathbb F}_2^n}[A(\boldsymbol{r}, f(s)) = \boldsymbol{r} \cdot s] \geq \frac12 + \frac12 \gamma \] for all $s \in B$.
- Switching to $\pm 1$ notation in the output, deduce $\widehat{A_{ \mid f(s)}}(s) \geq \gamma$ for all $s \in B$.
- Show that the adversary can efficiently compute $s$ given $f(s)$ (with high probability) for any $s \in B$. If $\gamma$ is nonnegligible this contradicts the assumption that $f$ is “one-way”. (Hint: use the Goldreich–Levin algorithm.)
- Deduce the same conclusion even if $A$ is a randomized algorithm.
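A minimal sketch of the construction described in the exercise above; the random permutation below is only a stand-in for a genuine one-way permutation, which no short snippet can provide:

```python
import random

n = 4
# A stand-in bijection on F_2^n; a real instantiation would use a one-way permutation.
points = [tuple((i >> j) & 1 for j in range(n)) for i in range(2 ** n)]
images = points[:]
random.shuffle(images)
perm = dict(zip(points, images))

def g(r, s):
    # The Goldreich-Levin generator: g(r, s) = (r, f(s), r . s), stretching 2n input
    # bits to 2n + 1 output bits.
    dot = sum(a * b for a, b in zip(r, s)) % 2
    return r + perm[s] + (dot,)

r = tuple(random.randrange(2) for _ in range(n))
s = tuple(random.randrange(2) for _ in range(n))
print(g(r, s), len(g(r, s)))          # 2n + 1 = 9 output bits
```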
Ex 15: I am probably missing something here, but it seems to me that the assumption $\|f\|_2 \leq 1$ is actually not necessary, and we have a slightly stronger bound of $|\mathcal{F}| \leq \hat{\lVert} f \hat{\rVert}_1^2/\epsilon$ (by using the L1 bound instead of Parseval’s)?
Absolutely right, thanks Li-Yang! I fixed it.
In ex 1: is $M^{-T}$ a notation for $M^{-1}$?
Transpose of the inverse. Does that make sense?
Yes. thank you!
In ex. 4 you probably mean: if $f(x_1,x_2,\ldots ,x_n,y)$ is identically $0$ then $f(x_1,x_2,\ldots ,x_n,-y)$ has degree at most $k-1$, for each constant $y$.
Right, thanks! Fixed.
In ex 22: *does not exceed
Thanks!
In ex 35: I think you mean ex 1.12 rather than ex 1.11 .
Thanks!
The link in ex 37 is broken.
Fixed, I think.
I don’t understand something in ex 6. For $f=\mathbb{1}_{(1,1,\ldots,1)}$ , $H=0$ , $z=(1,1,\ldots,1)$ I got that the restricted spectral norm is $1$ while the spectral norm is smaller.
Shouldn’t the original spectral norm also be 1? The Fourier expansion of $f$ is just $\sum_{\gamma} 2^{-n} \chi_\gamma$.
If I didn’t misunderstand the definitions then:
$\hat{\|}f\hat{\|}_{p}^{p}=\sum2^{-np}<\sum2^{-n}=1$
You’re right — 6(a) is almost totally wrong! It should just ask you to prove it for $p = 1$. Thanks very much Noam!!
Keep the corrections coming, by the way!
In 25 (also in the book) “one one child”.
Thanks! You have sharper eyes than the professional copyeditor
In 26, in the “Affine subspace partition” definition, “may be $C_i$ may be”.
Thanks, fixed!