§5.1: Linear threshold functions and polynomial threshold functions

Recall from Chapter 2.1 that a linear threshold function (abbreviated LTF) is a boolean-valued function $f : \{-1,1\}^n \to \{-1,1\}$ that can be represented as $$\label{eqn:generic-LTF} f(x) = \mathrm{sgn}(a_0 + a_1 x_1 + \cdots + a_n x_n)$$ for some constants $a_0, a_1, \dots, a_n \in {\mathbb R}$.

(For definiteness we’ll take $\mathrm{sgn}(0) = 1$. If we’re using the representation $f : \{-1,1\}^n \to \{0,1\}$ then $f$ is an LTF if it can be represented as $f(x) = 1_{\{a_0 + a_1 x_1 + \cdots + a_n x_n > 0\}}$.) Examples include majority, AND, OR, dictators, and decision lists (Exercise 3.23). Besides representing “weighted majority” voting schemes, LTFs play an important role in learning theory and in circuit complexity.

There is also a geometric perspective on LTFs. Writing $\ell(x) = a_0 + a_1 x_1 + \cdots + a_n x_n$, we can think of $\ell$ as an affine function ${\mathbb R}^n \to {\mathbb R}$. Then $\mathrm{sgn}(\ell(x))$ is the $\pm 1$-indicator of a halfspace in ${\mathbb R}^n$. A boolean LTF is thus the restriction of such a halfspace-indicator to the discrete cube $\{-1,1\}^n \subset {\mathbb R}^n$. Equivalently, a function $f : \{-1,1\}^n \to \{-1,1\}$ is an LTF if and only if it has a “linear separator”; i.e., a hyperplane in ${\mathbb R}^n$ which separates the points $f$ labels $1$ from the points $f$ labels $-1$.

An LTF $f : \{-1,1\}^n \to \{-1,1\}$ can have several different representations as in \eqref{eqn:generic-LTF} — in fact it always has infinitely many. This is clear from the geometric viewpoint; any small enough perturbation to a linear separator will not change the way it partitions the discrete cube. Because we can make these perturbations, we may ensure that $a_0 + a_1 x_1 + \cdots + a_n x_n \neq 0$ for every $x \in \{-1,1\}^n$. We’ll usually insist that LTF representations have this property so that the nuisance of $\mathrm{sgn}(0)$ doesn’t arise. We also observe that we can scale all of the coefficients in an LTF representation by the same positive constant without changing the LTF. These observations can be used to show it’s always possible to take the $a_i$’s to be integers (see the exercises). However we will most often scale so that $\sum_{i=1}^n a_i^2 = 1$; this is convenient when using the Central Limit Theorem. The most elegant result connecting LTFs and Fourier expansions is Chow’s Theorem, which says that a boolean LTF is completely determined by its degree-$0$ and degree-$1$ Fourier coefficients. In fact, it’s determined not just within the class of LTFs but within the class of all boolean functions:

Theorem 1 Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF and let $g : \{-1,1\}^n \to \{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \leq 1$ then $g = f$.

Proof: Let $f(x) = \mathrm{sgn}(\ell(x))$, where $\ell : \{-1,1\}^n \to {\mathbb R}$ has degree at most $1$ and is never $0$ on $\{-1,1\}^n$. For any $x \in \{-1,1\}^n$ we have $f(x)\ell(x) = |\ell(x)| \geq g(x)\ell(x)$, with equality if and only if $f(x) = g(x)$ (here we use $\ell(x) \neq 0$). Using this observation along with Plancherel’s Theorem (twice) we have $\sum_{|S| \leq 1} \widehat{f}(S) \widehat{\ell}(S) = \mathop{\bf E}[f({\boldsymbol{x}})\ell({\boldsymbol{x}})] \geq \mathop{\bf E}[g({\boldsymbol{x}}) \ell({\boldsymbol{x}})] = \sum_{|S| \leq 1} \widehat{g}(S) \widehat{\ell}(S).$ But by assumption, the left-hand and right-hand sides above are equal. Thus the inequality must be an equality for every choice of ${\boldsymbol{x}}$; i.e., $f(x) = g(x)$ $\forall x$. $\Box$

In light of Chow’s Theorem, the $n+1$ numbers $\widehat{g}(\emptyset)$, $\widehat{g}(\{1\}), \dots, \widehat{g}(\{n\})$ are sometimes called the Chow parameters of the boolean function $g$.

As we will show later in this chapter, linear threshold functions are very noise-stable; hence they have a lot of their Fourier weight at low degrees. Here is a simple result along these lines:

Theorem 2 Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\leq 1}[f] \geq 1/2$.

Proof: Writing $f(x) = \mathrm{sgn}(\ell(x))$ we have $\|\ell\|_1 = \mathop{\bf E}[|\ell({\boldsymbol{x}})|] = \langle f, \ell \rangle = \langle f^{\leq 1}, \ell \rangle \leq \|f^{\leq 1}\|_2 \|\ell\|_2 = \sqrt{\mathbf{W}^{\leq 1}[f]} \cdot \|\ell\|_2,$ where the third equality follows from Plancherel and the inequality is Cauchy–Schwarz. Assume first that $\ell(x) = a_1 x_1 + \cdots + a_n x_n$ (i.e., $\ell(x)$ has no constant term). The Khintchine–Kahane inequality (Exercise 2.48) states that $\|\ell\|_1 \geq \frac{1}{\sqrt{2}} \|\ell\|_2$ and hence we deduce $\tfrac{1}{\sqrt{2}} \|\ell\|_2 \leq \sqrt{\mathbf{W}^{\leq 1}[f]} \cdot \|\ell\|_2.$ The conclusion $\mathbf{W}^{\leq 1}[f] \geq 1/2$ follows immediately (since $\|\ell\|_2$ cannot be $0$). The case when $\ell(x)$ has a constant term is left for the exercises. $\Box$

From Exercise 2.18 we know that $\mathbf{W}^{\leq 1}[\mathrm{Maj}_n] = \mathbf{W}^{1}[\mathrm{Maj}_n] \geq 2/\pi$ for all $n$; it is reasonable to conjecture that majority is extremal for Theorem 2. This is an open problem.

Conjecture. Let $f : \{-1,1\}^n \to \{-1,1\}$ be an LTF. Then $\mathbf{W}^{\leq 1}[f] \geq 2/\pi$.

A natural generalization of linear threshold functions is polynomial threshold functions:

Definition 3 A function $f : \{-1,1\}^n \to \{-1,1\}$ is called a polynomial threshold function (PTF) of degree at most $k$ if it is expressible as $f(x) = \mathrm{sgn}(p(x))$ for some real polynomial $p : \{-1,1\}^n \to {\mathbb R}$ of degree at most $k$.

Example 4 Let $f : \{-1,1\}^4 \to \{-1,1\}$ be the $4$-bit equality function which is $1$ if and only if all input bits are equal. Then $f$ is a degree-$2$ PTF because it has the representation $f(x) = \mathrm{sgn}(-3 + x_1x_2 + x_1x_3 + x_1x_4 + x_2x_3 + x_2x_4 + x_3x_4)$.

Every boolean function $f : \{-1,1\}^n \to \{-1,1\}$ is a PTF of degree at most $n$, since we can take the sign of its Fourier expansion. Thus we are usually interested in the case when the degree $k$ is “small”; say, $k = O_n(1)$. Low-degree PTFs arise frequently in learning theory; for example, as hypotheses in the Low-Degree Algorithm and many other practical learning algorithms. They also arise in circuit complexity, wherein a PTF representation $f(x) = \mathrm{sgn}(\sum_{i=1}^s {a_i} x^{T_i})$ is thought of as a “threshold-of-parities circuit”: i.e., a depth-$2$ circuit with $s$ “parity gates” $x^{T_i}$ at layer $1$ and a single “(linear) threshold gate” at layer $2$. From this point of view, the size of the circuit corresponds to the sparsity of the PTF representation:

Definition 5 We say a PTF representation $f(x) = \mathrm{sgn}(p(x))$ has sparsity at most $s$ if $p(x)$ is a multilinear polynomial with at most $s$ terms.

For example, the PTF representation of the $4$-bit equality function from Example 4 has sparsity $7$. Let’s extend the two theorems about LTFs we proved above to the case of PTFs. The generalization of Chow’s Theorem is straightforward; its proof is left to the exercises:

Theorem 6 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a PTF of degree at most $k$ and let $g : \{-1,1\}^n \to \{-1,1\}$ be any function. If $\widehat{g}(S) = \widehat{f}(S)$ for all $|S| \leq k$ then $g = f$.

We also have the following extension of Theorem 2:

Theorem 7 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a degree-$k$ PTF. Then $\mathbf{W}^{\leq k}[f] \geq e^{-2k}$.

Proof: Writing $f(x) = \mathrm{sgn}(p(x))$ for $p$ of degree $k$, we again have $\|p\|_1 = \mathop{\bf E}[|p({\boldsymbol{x}})|] = \langle f, p \rangle = \langle f^{\leq k}, p \rangle \leq \|f^{\leq k}\|_2 \|p\|_2 = \sqrt{\mathbf{W}^{\leq k}[f]} \cdot \|p\|_2.$ To complete the proof we need the fact that $\|p\|_2 \leq e^k \|p\|_1$ for any degree-$k$ polynomial $p : \{-1,1\}^n \to {\mathbb R}$. We will prove this much later in Chapter 10 on hypercontractivity. $\Box$

The $e^{-2k}$ in this theorem cannot be improved beyond $2^{1-k}$; see the exercises.

We close this section by discussing PTF sparsity. We begin with a (simpler) variant of Theorem 7 which is useful for proving PTF sparsity lower bounds:

Theorem 8 Let $f : \{-1,1\}^n \to \{-1,1\}$ be expressible as a PTF over the collection of monomials $\mathcal{F} \subseteq 2^{[n]}$; i.e., $f(x) = \mathrm{sgn}(p(x))$ for some polynomial $p(x) = \sum_{S \in \mathcal{F}} \widehat{p}(S) x^S$. Then $\sum_{S \in \mathcal{F}} |\widehat{f}(S)| \geq 1$.

Proof: Define $g : \{-1,1\}^n \to {\mathbb R}$ by $g(x) = \sum_{S \in \mathcal{F}} \widehat{f}(S)\,x^S$. Since $\hat{\lVert} p \hat{\rVert}_\infty \leq \|p\|_1$ (Exercise 3.8) we have $\hat{\lVert} p \hat{\rVert}_\infty \leq \|p\|_1 = \mathop{\bf E}[f({\boldsymbol{x}})p({\boldsymbol{x}})] = \sum_{S \subseteq [n]} \widehat{f}(S) \widehat{p}(S) = \sum_{S \in \mathcal{F}} \widehat{g}(S) \widehat{p}(S) \leq \hat{\lVert} g \hat{\rVert}_1 \hat{\lVert} p \hat{\rVert}_\infty,$ and hence $\hat{\lVert} g \hat{\rVert}_1 \geq 1$ as claimed. $\Box$

We can use this result to show that the “inner product mod $2$ function” (see Exercise 1.1) requires huge threshold-of-parities circuits:

Corollary 9 Any PTF representation of the inner product mod $2$ function $\mathrm{IP}_{2n} : {\mathbb F}_2^{2n} \to \{-1,1\}$ has sparsity at least $2^n$.

Proof: This follows immediately from Theorem 8 and the fact that $|\widehat{\mathrm{IP}}_{2n}(S)| = 2^{-n}$ for all $S \subseteq [2n]$ (Exercise 1.1). $\Box$

We can also show that any function $f : \{-1,1\}^n \to \{-1,1\}$ with small Fourier $1$-norm $\hat{\lVert} f \hat{\rVert}_1$ has a sparse PTF representation. In fact a stronger result holds: such a function can be additively approximated by a sparse polynomial:

Theorem 10 Let $f : \{-1,1\}^n \to {\mathbb R}$ be nonzero, let $\delta > 0$, and let $s \geq 4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$ be an integer. Then there is a multilinear polynomial $q : \{-1,1\}^n \to {\mathbb R}$ of sparsity at most $s$ such that $\|f – q\|_\infty < \delta$.

Proof: The proof is by the probabilistic method. Let $\boldsymbol{T} \subseteq [n]$ be randomly chosen according to the distribution ${\mathop{\bf Pr}[\boldsymbol{T} = T]} = \frac{|\widehat{f}(T)|}{\hat{\lVert} f \hat{\rVert}_1}$. Let $\boldsymbol{T}_1, \dots, \boldsymbol{T}_s$ be independent draws from this distribution and define the multilinear polynomial $\boldsymbol{p}(x) = \sum_{i=1}^s \mathrm{sgn}(\widehat{f}(\boldsymbol{T}_i))\,x^{\boldsymbol{T}_i}.$ When $x \in \{-1,1\}^n$ is fixed, each monomial $\mathrm{sgn}(\widehat{f}(\boldsymbol{T}_i)\,x^{\boldsymbol{T}_i}$ becomes a $\pm 1$-valued random variable with expectation $\sum_{T \subseteq [n]} \tfrac{|\widehat{f}(T)|}{\hat{\lVert} f \hat{\rVert}_1}\cdot \mathrm{sgn}(\widehat{f}(T))\,x^{T} = \tfrac{1}{\hat{\lVert} f \hat{\rVert}_1} \sum_{T \subseteq [n]} \widehat{f}(T)\,x^T = \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1}.$ Thus by a Chernoff bound , for any $\epsilon > 0$, $\mathop{\bf Pr}_{\boldsymbol{T}_1, \dots, \boldsymbol{T}_s}\left[\Bigl|\boldsymbol{p}(x) - \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1} s\Bigr| \geq \epsilon s\right] \leq 2\exp(-\epsilon^2 s/2).$ Selecting $\epsilon = \delta/\hat{\lVert} f \hat{\rVert}_1$ and using $s \geq 4n\hat{\lVert} f \hat{\rVert}_1^2/\delta^2$, the probability is at most $2\exp(-2n) < 2^{-n}$. Taking a union bound over all $2^n$ choices of $x \in \{-1,1\}^n$, we conclude that there exists some $p(x) = \sum_{i=1}^s \mathrm{sgn}(\widehat{f}(T_i))\,x^{T_i}$ such that for all $x \in \{-1,1\}^n$, $\Bigl|p(x) - \tfrac{f(x)}{\hat{\lVert} f \hat{\rVert}_1} s\Bigr| < \epsilon s = \tfrac{\delta}{\hat{\lVert} f \hat{\rVert}_1} s \quad\Rightarrow\quad \Bigl|\tfrac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p(x) - f(x) \Bigr| < \delta.$ Thus we may take $q = \frac{\hat{\lVert} f \hat{\rVert}_1}{s} \cdot p$. $\Box$

Corollary 11 Let $f : \{-1,1\}^n \to \{-1,1\}$. Then $f$ is expressible as a PTF of sparsity at most $s = \lceil 4n\hat{\lVert} f \hat{\rVert}_1^2\rceil$. Indeed $f$ can be represented as a majority of $s$ parities or negated-parities.

Proof: Apply the previous theorem with $\delta = 1$; we then have $f(x) = \mathrm{sgn}(q(x))$. Since this is also equivalent to $\mathrm{sgn}(p(x))$, the terms $\mathrm{sgn}(\widehat{f}(T_i))\,x^{T_i}$ are the required parities/negated-parities. $\Box$

Though functions computable by small DNFs need not have small Fourier $1$-norm, it is a further easy corollary that they can be computed by sparse PTFs: see the exercises. We also remark that there is no good converse to Corollary 11: the $\mathrm{Maj}_n$ function has a PTF (indeed, an LTF) of sparsity $n$ but has exponentially large Fourier $1$-norm (as you will be asked to show in the exercises).

2 comments to §5.1: Linear threshold functions and polynomial threshold functions

• Avishay Tal

It seems to me that a square is missing in the expression for s in corollary 11.

• Fixed, thanks!