Recall from Chapter 2.1 that a linear threshold function (abbreviated LTF) is a boolean-valued function $f : \{-1,1\}^n \to \{-1,1\}$ that can be represented as \begin{equation} \label{eqn:generic-LTF} f(x) = \mathrm{sgn}(a_0 + a_1 x_1 + \cdots + a_n x_n) \end{equation} for some constants $a_0, a_1, \dots, a_n \in {\mathbb R}$.

(For definiteness we’ll take $\mathrm{sgn}(0) = 1$. If we’re using the representation $f : \{-1,1\}^n \to \{0,1\}$ then $f$ is an LTF if it can be represented as $f(x) = 1_{\{a_0 + a_1 x_1 + \cdots + a_n x_n > 0\}}$.) Examples include majority, AND, OR, dictators, and decision lists (Exercise 3.23). Besides representing “weighted majority” voting schemes, LTFs play an important role in learning theory and in circuit complexity.
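For concreteness, here is a small Python sketch (the helper names `ltf` and `maj3` are ours, not the text's) evaluating an LTF via representation \eqref{eqn:generic-LTF}, checked against the $3$-bit majority function:

```python
from itertools import product

def ltf(a0, a, x):
    """Evaluate sgn(a0 + a1 x1 + ... + an xn), with the convention sgn(0) = 1."""
    s = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return 1 if s >= 0 else -1

def maj3(x):
    """Majority of 3 bits, each in {-1, 1}."""
    return 1 if sum(x) > 0 else -1

# Majority is the LTF with a0 = 0 and all weights equal to 1.
assert all(ltf(0, [1, 1, 1], x) == maj3(x) for x in product([-1, 1], repeat=3))
```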

There is also a geometric perspective on LTFs. Writing $\ell(x) = a_0 + a_1 x_1 + \cdots + a_n x_n$, we can think of $\ell$ as an affine function ${\mathbb R}^n \to {\mathbb R}$. Then $\mathrm{sgn}(\ell(x))$ is the $\pm 1$-indicator of a *halfspace* in ${\mathbb R}^n$. A boolean LTF is thus the restriction of such a halfspace-indicator to the discrete cube $\{-1,1\}^n \subset {\mathbb R}^n$. Equivalently, a function $f : \{-1,1\}^n \to \{-1,1\}$ is an LTF if and only if it has a “linear separator”; i.e., a hyperplane in ${\mathbb R}^n$ which separates the points $f$ labels $1$ from the points $f$ labels $-1$.

An LTF $f : \{-1,1\}^n \to \{-1,1\}$ can have several different representations as in \eqref{eqn:generic-LTF} — in fact it always has infinitely many. This is clear from the geometric viewpoint; any small enough perturbation to a linear separator will not change the way it partitions the discrete cube. Because we can make these perturbations, we may ensure that $a_0 + a_1 x_1 + \cdots + a_n x_n \neq 0$ for every $x \in \{-1,1\}^n$. We’ll usually insist that LTF representations have this property so that the nuisance of $\mathrm{sgn}(0)$ doesn’t arise. We also observe that we can scale all of the coefficients in an LTF representation by the same positive constant without changing the LTF. These observations can be used to show it’s always possible to take the $a_i$’s to be integers (see the exercises). However we will most often scale so that $\sum_{i=1}^n a_i^2 = 1$; this is convenient when using the Central Limit Theorem.

The most elegant result connecting LTFs and Fourier expansions is Chow’s Theorem, which says that a boolean LTF is completely determined by its degree-$0$ and degree-$1$ Fourier coefficients. In fact, it’s determined not just within the class of LTFs but within the class of all boolean functions:

**Theorem 1.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF and let $g : \{-1,1\}^n \to \{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \leq 1$ then $g = f$.

*Proof:* Let $f(x) = \mathrm{sgn}(\ell(x))$, where $\ell : \{-1,1\}^n \to {\mathbb R}$ has degree at most $1$ and is never $0$ on $\{-1,1\}^n$. For any $x \in \{-1,1\}^n$ we have $f(x)\ell(x) = |\ell(x)| \geq g(x)\ell(x)$, with equality if and only if $f(x) = g(x)$ (here we use $\ell(x) \neq 0$). Using this observation along with Plancherel’s Theorem (twice) we have \[ \sum_{|S| \leq 1} \widehat{f}(S) \widehat{\ell}(S) = \mathop{\bf E}[f({\boldsymbol{x}})\ell({\boldsymbol{x}})] \geq \mathop{\bf E}[g({\boldsymbol{x}}) \ell({\boldsymbol{x}})] = \sum_{|S| \leq 1} \widehat{g}(S) \widehat{\ell}(S). \] But by assumption the left-hand and right-hand sides above are equal, so the pointwise inequality $f(x)\ell(x) \geq g(x)\ell(x)$ must hold with equality for every $x \in \{-1,1\}^n$; i.e., $f(x) = g(x)$ for all $x$. $\Box$

In light of Chow’s Theorem, the $n+1$ numbers $\widehat{g}(\emptyset)$, $\widehat{g}(\{1\}), \dots, \widehat{g}(\{n\})$ are sometimes called the *Chow parameters* of the boolean function $g$.
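For small $n$ the Chow parameters can be computed directly from their definition $\widehat{g}(S) = \mathop{\bf E}[g({\boldsymbol{x}})\,x^S]$. A brute-force Python sketch (function names are ours):

```python
from itertools import product

def chow_parameters(f, n):
    """The degree-0 and degree-1 Fourier coefficients E[f(x)] and E[f(x) x_i],
    computed by averaging over all 2^n points of the cube."""
    cube = list(product([-1, 1], repeat=n))
    c0 = sum(f(x) for x in cube) / 2**n
    cs = [sum(f(x) * x[i] for x in cube) / 2**n for i in range(n)]
    return c0, cs

# Maj3 has Fourier expansion (1/2)(x1 + x2 + x3) - (1/2) x1 x2 x3,
# so its Chow parameters are 0 and (1/2, 1/2, 1/2).
maj3 = lambda x: 1 if sum(x) > 0 else -1
c0, cs = chow_parameters(maj3, 3)
assert c0 == 0.0 and cs == [0.5, 0.5, 0.5]
```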

As we will show later in this chapter, linear threshold functions are very noise-stable; hence they have a lot of their Fourier weight at low degrees. Here is a simple result along these lines:

**Theorem 2.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\leq 1}[f] \geq 1/2$.

*Proof:* Writing $f(x) = \mathrm{sgn}(\ell(x))$ we have \[ \|\ell\|_1 = \mathop{\bf E}[|\ell({\boldsymbol{x}})|] = \langle f, \ell \rangle = \langle f^{\leq 1}, \ell \rangle \leq \|f^{\leq 1}\|_2 \|\ell\|_2 = \sqrt{\mathbf{W}^{\leq 1}[f]} \cdot \|\ell\|_2, \] where the third equality follows from Plancherel and the inequality is Cauchy–Schwarz. Assume first that $\ell(x) = a_1 x_1 + \cdots + a_n x_n$ (i.e., $\ell(x)$ has no constant term). The Khintchine–Kahane inequality (Exercise 2.48) states that $\|\ell\|_1 \geq \frac{1}{\sqrt{2}} \|\ell\|_2$ and hence we deduce \[ \tfrac{1}{\sqrt{2}} \|\ell\|_2 \leq \sqrt{\mathbf{W}^{\leq 1}[f]} \cdot \|\ell\|_2. \] The conclusion $\mathbf{W}^{\leq 1}[f] \geq 1/2$ follows immediately (since $\|\ell\|_2$ cannot be $0$). The case when $\ell(x)$ has a constant term is left for the exercises. $\Box$
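Theorem 2 is easy to check numerically for small examples. In this sketch (ours, not part of the text) we brute-force $\mathbf{W}^{\leq 1}[f]$ for a dictator and for $\mathrm{Maj}_3$:

```python
from itertools import product

def weight_at_most_1(f, n):
    """W^{<=1}[f] = fhat(emptyset)^2 + sum_i fhat({i})^2, by brute force."""
    cube = list(product([-1, 1], repeat=n))
    coeffs = [sum(f(x) for x in cube) / 2**n]
    coeffs += [sum(f(x) * x[i] for x in cube) / 2**n for i in range(n)]
    return sum(c * c for c in coeffs)

# Sanity checks of Theorem 2: a dictator has all its Fourier weight at
# degree 1, and Maj3 has W^{<=1} = 3 * (1/2)^2 = 3/4.
dictator = lambda x: x[0]
maj3 = lambda x: 1 if sum(x) > 0 else -1
assert weight_at_most_1(dictator, 3) == 1.0
assert abs(weight_at_most_1(maj3, 3) - 0.75) < 1e-12
```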

From Exercise 2.18 we know that $\mathbf{W}^{\leq 1}[\mathrm{Maj}_n] = \mathbf{W}^{1}[\mathrm{Maj}_n] \geq 2/\pi$ for all $n$; it is reasonable to conjecture that majority is extremal for Theorem 2. This is an open problem.

**Conjecture.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\leq 1}[f] \geq 2/\pi$.

A natural generalization of linear threshold functions is *polynomial threshold functions*:

**Definition 3.** A function $f : \{-1,1\}^n \to \{-1,1\}$ is called a *polynomial threshold function (PTF)* of degree at most $k$ if it is expressible as $f(x) = \mathrm{sgn}(p(x))$ for some real polynomial $p : \{-1,1\}^n \to {\mathbb R}$ of degree at most $k$.

**Example 4.** Let $f : \{-1,1\}^4 \to \{-1,1\}$ be the $4$-bit equality function which is $1$ if and only if all input bits are equal. Then $f$ is a degree-$2$ PTF because it has the representation $f(x) = \mathrm{sgn}(-3 + x_1x_2 + x_1x_3 + x_1x_4 + x_2x_3 + x_2x_4 + x_3x_4)$.
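Example 4 can be verified by exhausting all $16$ inputs; a quick Python sketch (the names are ours):

```python
from itertools import combinations, product

def equality4(x):
    """The 4-bit equality function: 1 iff all bits are equal."""
    return 1 if len(set(x)) == 1 else -1

def ptf(x):
    """The claimed degree-2 PTF: sgn(-3 + sum of all 6 pairwise products).
    The argument is never 0, so the sgn(0) convention is irrelevant here."""
    p = -3 + sum(x[i] * x[j] for i, j in combinations(range(4), 2))
    return 1 if p > 0 else -1

assert all(ptf(x) == equality4(x) for x in product([-1, 1], repeat=4))
```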

*Every* boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is a PTF of degree at most $n$, since we can take the sign of its Fourier expansion. Thus we are usually interested in the case when the degree $k$ is “small”; say, $k = O_n(1)$. Low-degree PTFs arise frequently in learning theory; for example, as hypotheses in the Low-Degree Algorithm and many other practical learning algorithms. They also arise in circuit complexity, wherein a PTF representation \[ f(x) = \mathrm{sgn}\Bigl(\sum_{i=1}^s a_i x^{T_i}\Bigr) \] is thought of as a “threshold-of-parities circuit”: i.e., a depth-$2$ circuit with $s$ “parity gates” $x^{T_i}$ at layer $1$ and a single “(linear) threshold gate” at layer $2$. From this point of view, the size of the circuit corresponds to the *sparsity* of the PTF representation:

**Definition 5.** We say a PTF representation $f(x) = \mathrm{sgn}(p(x))$ has *sparsity* at most $s$ if $p(x)$ is a multilinear polynomial with at most $s$ terms.

For example, the PTF representation of the $4$-bit equality function from Example 4 has sparsity $7$.

Let’s extend the two theorems about LTFs we proved above to the case of PTFs. The generalization of Chow’s Theorem is straightforward; its proof is left to the exercises:

**Theorem 6.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be a PTF of degree at most $k$ and let $g : \{-1,1\}^n \to \{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \leq k$ then $g = f$.

We also have the following extension of Theorem 2:

**Theorem 7.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be a degree-$k$ PTF. Then $\mathbf{W}^{\leq k}[f] \geq e^{-2k}$.

*Proof:* Writing $f(x) = \mathrm{sgn}(p(x))$ for $p$ of degree $k$, we again have \[ \|p\|_1 = \mathop{\bf E}[|p({\boldsymbol{x}})|] = \langle f, p \rangle = \langle f^{\leq k}, p \rangle \leq \|f^{\leq k}\|_2 \|p\|_2 = \sqrt{\mathbf{W}^{\leq k}[f]} \cdot \|p\|_2. \] To complete the proof we need the fact that $\|p\|_2 \leq e^k \|p\|_1$ for any degree-$k$ polynomial $p : \{-1,1\}^n \to {\mathbb R}$; combining it with the above yields $\|p\|_2 \leq e^k \sqrt{\mathbf{W}^{\leq k}[f]} \cdot \|p\|_2$, and hence $\mathbf{W}^{\leq k}[f] \geq e^{-2k}$ (since $\|p\|_2 \neq 0$). We will prove this fact much later, in Chapter 10 on hypercontractivity. $\Box$
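As a numerical sanity check of Theorem 7 (a sketch of ours, not part of the text): the $4$-bit equality function of Example 4 is a degree-$2$ PTF, and brute-forcing its low-degree Fourier weight gives $\mathbf{W}^{\leq 2} = 15/16$, comfortably above $e^{-4}$:

```python
from itertools import product
from math import exp, prod

def fourier_weight_up_to(f, n, k):
    """W^{<=k}[f]: the total squared Fourier weight of f at degrees <= k,
    computed by brute force over all subsets S of [n] with |S| <= k."""
    cube = list(product([-1, 1], repeat=n))
    total = 0.0
    for mask in range(2**n):
        S = [i for i in range(n) if (mask >> i) & 1]
        if len(S) <= k:
            coeff = sum(f(x) * prod(x[i] for i in S) for x in cube) / 2**n
            total += coeff * coeff
    return total

# The 4-bit equality function is a degree-2 PTF, so Theorem 7 promises
# W^{<=2} >= e^{-4} ~ 0.018; in fact its low-degree weight is 15/16.
equality4 = lambda x: 1 if len(set(x)) == 1 else -1
w = fourier_weight_up_to(equality4, 4, 2)
assert w >= exp(-4)
```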

The $e^{-2k}$ in this theorem cannot be improved beyond $2^{1-k}$; see the exercises.

We close this section by discussing PTF sparsity. We begin with a (simpler) variant of Theorem 7 which is useful for proving PTF sparsity lower bounds:

**Theorem 8.** Let $f : \{-1,1\}^n \to \{-1,1\}$ be expressible as a PTF over the collection of monomials $\mathcal{F} \subseteq 2^{[n]}$; i.e., $f(x) = \mathrm{sgn}(p(x))$ for some polynomial $p(x) = \sum_{S \in \mathcal{F}} \widehat{p}(S) x^S$. Then $\sum_{S \in \mathcal{F}} |\widehat{f}(S)| \geq 1$.

*Proof:* Define $g : \{-1,1\}^n \to {\mathbb R}$ by $g(x) = \sum_{S \in \mathcal{F}} \widehat{f}(S)\,x^S$, so that $\hat{\lVert} g \hat{\rVert}_1 = \sum_{S \in \mathcal{F}} |\widehat{f}(S)|$. Since $\hat{\lVert} p \hat{\rVert}_\infty \leq \|p\|_1$ (Exercise 3.8) we have \[ \hat{\lVert} p \hat{\rVert}_\infty \leq \|p\|_1 = \mathop{\bf E}[f({\boldsymbol{x}})p({\boldsymbol{x}})] = \sum_{S \subseteq [n]} \widehat{f}(S) \widehat{p}(S) = \sum_{S \in \mathcal{F}} \widehat{g}(S) \widehat{p}(S) \leq \hat{\lVert} g \hat{\rVert}_1 \hat{\lVert} p \hat{\rVert}_\infty. \] Dividing through by $\hat{\lVert} p \hat{\rVert}_\infty$ (which is nonzero since $p \not\equiv 0$) gives $\hat{\lVert} g \hat{\rVert}_1 \geq 1$, as claimed. $\Box$

We can use this result to show that the “inner product mod $2$ function” (see Exercise 1.1) requires huge threshold-of-parities circuits:

**Corollary 9.** Any PTF representation of the inner product mod $2$ function $\mathrm{IP}_{2n} : {\mathbb F}_2^{2n} \to \{-1,1\}$ has sparsity at least $2^n$.

*Proof:* This follows immediately from Theorem 8 and the fact that $|\widehat{\mathrm{IP}}_{2n}(S)| = 2^{-n}$ for all $S \subseteq [2n]$ (Exercise 1.1). $\Box$
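The Fourier fact underlying Corollary 9 is easy to confirm numerically for small $n$; here is a brute-force sketch (ours) checking that every Fourier coefficient of $\mathrm{IP}_4$ has magnitude $2^{-2} = 1/4$:

```python
from itertools import product

def ip(bits):
    """Inner product mod 2 on 2n bits from F_2, as a +/-1-valued function."""
    n = len(bits) // 2
    dot = sum(bits[i] * bits[n + i] for i in range(n)) % 2
    return (-1) ** dot

def fourier_coeff(f, m, S):
    """fhat(S) = E[f(x) chi_S(x)], where chi_S(x) = (-1)^(sum of bits in S)."""
    total = sum(f(x) * (-1) ** sum(x[i] for i in S)
                for x in product([0, 1], repeat=m))
    return total / 2**m

# For n = 2 (m = 4 input bits), every one of the 16 coefficients is +/- 1/4.
m = 4
subsets = [[i for i in range(m) if (mask >> i) & 1] for mask in range(2**m)]
coeffs = [fourier_coeff(ip, m, S) for S in subsets]
assert all(abs(abs(c) - 0.25) < 1e-12 for c in coeffs)
```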

We can also show that any function $f : \{-1,1\}^n \to \{-1,1\}$ with small Fourier $1$-norm $\hat{\lVert} f \hat{\rVert}_1$ has a sparse PTF representation. In fact a stronger result holds: such a function can be additively *approximated* by a sparse polynomial:

**Theorem 10.** Let $f : \{-1,1\}^n \to {\mathbb R}$ be nonzero, let $\delta > 0$, and let $s \geq 4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$ be an integer. Then there is a multilinear polynomial $q : \{-1,1\}^n \to {\mathbb R}$ of sparsity at most $s$ such that $\|f - q\|_\infty < \delta$.

*Proof:* The proof is by the probabilistic method. Let $\boldsymbol{T} \subseteq [n]$ be randomly chosen according to the distribution ${\mathop{\bf Pr}[\boldsymbol{T} = T]} = \frac{|\widehat{f}(T)|}{\hat{\lVert} f \hat{\rVert}_1}$. Let $\boldsymbol{T}_1, \dots, \boldsymbol{T}_s$ be independent draws from this distribution and define the multilinear polynomial \[ \boldsymbol{p}(x) = \sum_{i=1}^s \mathrm{sgn}(\widehat{f}(\boldsymbol{T}_i))\,x^{\boldsymbol{T}_i}. \] When $x \in \{-1,1\}^n$ is fixed, each monomial $\mathrm{sgn}(\widehat{f}(\boldsymbol{T}_i))\,x^{\boldsymbol{T}_i}$ becomes a $\pm 1$-valued random variable with expectation \[ \sum_{T \subseteq [n]} \tfrac{|\widehat{f}(T)|}{\hat{\lVert} f \hat{\rVert}_1}\cdot \mathrm{sgn}(\widehat{f}(T))\,x^{T} = \tfrac{1}{\hat{\lVert} f \hat{\rVert}_1} \sum_{T \subseteq [n]} \widehat{f}(T)\,x^T = \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1}. \] Thus by a Chernoff bound, for any $\epsilon > 0$, \[ \mathop{\bf Pr}_{\boldsymbol{T}_1, \dots, \boldsymbol{T}_s}\left[\Bigl|\boldsymbol{p}(x) - \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1} s\Bigr| \geq \epsilon s\right] \leq 2\exp(-\epsilon^2 s/2). \] Selecting $\epsilon = \delta/\hat{\lVert} f \hat{\rVert}_1$ and using $s \geq 4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$, the probability is at most $2\exp(-2n) < 2^{-n}$. Taking a union bound over all $2^n$ choices of $x \in \{-1,1\}^n$, we conclude that there exists some $p(x) = \sum_{i=1}^s \mathrm{sgn}(\widehat{f}(T_i))\,x^{T_i}$ such that for all $x \in \{-1,1\}^n$, \[ \Bigl|p(x) - \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1} s\Bigr| < \epsilon s = \tfrac{\delta}{\hat{\lVert} f \hat{\rVert}_1} s \quad\Rightarrow\quad \Bigl|\tfrac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p(x) - f(x) \Bigr| < \delta. \] Thus we may take $q = \frac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p$. $\Box$
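The sampling argument in this proof is directly implementable. Below is a small numerical sketch (ours, with hypothetical helper names) running it on $\mathrm{Maj}_3$, whose Fourier $1$-norm is $2$, with $\delta = 1$:

```python
import random
from itertools import product

random.seed(0)

# Fourier expansion of Maj3: (1/2)(x1 + x2 + x3) - (1/2) x1 x2 x3.
fhat = {(0,): 0.5, (1,): 0.5, (2,): 0.5, (0, 1, 2): -0.5}
norm1 = sum(abs(c) for c in fhat.values())  # the Fourier 1-norm, here 2

def sample_monomial():
    """Draw T with probability |fhat(T)| / norm1; return T and sgn(fhat(T))."""
    r = random.uniform(0, norm1)
    for T, c in fhat.items():
        r -= abs(c)
        if r <= 0:
            return T, (1 if c > 0 else -1)
    return T, (1 if c > 0 else -1)  # guard against float round-off

def monomial(T, x):
    out = 1
    for i in T:
        out *= x[i]
    return out

s = 400  # well above 4 * n * norm1**2 / delta**2 = 48 for n = 3, delta = 1
terms = [sample_monomial() for _ in range(s)]

def q(x):
    """The rescaled empirical polynomial (norm1/s) * sum_i sgn(fhat(T_i)) x^{T_i}."""
    return norm1 / s * sum(sign * monomial(T, x) for T, sign in terms)

maj3 = lambda x: 1 if sum(x) > 0 else -1
max_err = max(abs(maj3(x) - q(x)) for x in product([-1, 1], repeat=3))
assert max_err < 1  # so sgn(q) recovers Maj3 exactly on every input
```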

**Corollary 11.** Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $f$ is expressible as a PTF of sparsity at most $s = \lceil 4n\hat{\lVert} f \hat{\rVert}_1^2\rceil$. Indeed $f$ can be represented as a majority of $s$ parities or negated-parities.

*Proof:* Apply Theorem 10 with $\delta = 1$. Since $f$ is $\pm 1$-valued and $\|f - q\|_\infty < 1$, we have $f(x) = \mathrm{sgn}(q(x))$ for all $x$. Moreover $q$ is a positive multiple of the polynomial $p$ from that proof, so also $f(x) = \mathrm{sgn}(p(x))$; the $s$ terms $\mathrm{sgn}(\widehat{f}(T_i))\,x^{T_i}$ of $p$ are the required parities/negated-parities, and $f$ is their majority. $\Box$

Though functions computable by small DNFs need not have small Fourier $1$-norm, it is a further easy corollary that they can be computed by sparse PTFs: see the exercises. We also remark that there is no good converse to Corollary 11: the $\mathrm{Maj}_n$ function has a PTF (indeed, an LTF) of sparsity $n$ but has exponentially large Fourier $1$-norm (as you will be asked to show in the exercises).
