§3.4: Learning theory

Computational learning theory is an area of algorithms research devoted to the following task: given a source of “examples” $(x, f(x))$ from an unknown function $f$, compute a “hypothesis” function $h$ which is good at predicting $f(y)$ on future inputs $y$. We will focus on just one possible formulation of the task:

Definition 27 In the model of PAC (“Probably Approximately Correct”) learning under the uniform distribution on $\{-1,1\}^n$, a learning problem is identified with a concept class $\mathcal{C}$, which is just a collection of functions $f : \{-1,1\}^n \to \{-1,1\}$. A learning algorithm $A$ for $\mathcal{C}$ is a randomized algorithm which has limited access to an unknown target function $f \in \mathcal{C}$. The two access models, in increasing order of strength, are:

  • random examples, meaning $A$ can draw pairs $({\boldsymbol{x}}, f({\boldsymbol{x}}))$ where ${\boldsymbol{x}} \in \{-1,1\}^n$ is uniformly random;
  • queries, meaning $A$ can request the value $f(x)$ for any $x \in \{-1,1\}^n$ of its choice.

In addition, $A$ is given as input an accuracy parameter $\epsilon \in [0,1/2]$. The output of $A$ is required to be (the circuit representation of) a hypothesis function $h : \{-1,1\}^n \to \{-1,1\}$. We say that $A$ learns $\mathcal{C}$ with error $\epsilon$ if for any $f \in \mathcal{C}$, with high probability $A$ outputs an $h$ which is $\epsilon$-close to $f$: i.e., satisfies $\mathrm{dist}(f,h) \leq \epsilon$.

In the above definition, the phrase “with high probability” can be fixed to mean, say, “except with probability at most $1/10$”. (As is common with randomized algorithms, the choice of constant $1/10$ is unimportant; see the exercises.)

For us, the main desideratum of a learning algorithm is efficient running time. One can easily learn any function $f$ to error $0$ in time $\widetilde{O}(2^n)$ (exercise); however this is not very efficient. If the concept class $\mathcal{C}$ contains very complex functions then such exponential running time is necessary; however if $\mathcal{C}$ contains only relatively “simple” functions then more efficient learning may be possible. For example, in a later section we will show that the concept class \[ \mathcal{C} = \{f : {\mathbb F}_2^n \to \{-1,1\} \mid {\mathrm{DT}_{\mathrm{size}}}(f) \leq s\} \] can be learned with queries to error $\epsilon$ by an algorithm running in time $\mathrm{poly}(s, n, 1/\epsilon)$.

A common way of trying to learn an unknown target $f : \{-1,1\}^n \to \{-1,1\}$ is by discovering “most of” its Fourier spectrum. To formalize this, let’s generalize Definition 1:

Definition 28 Let $\mathcal{F}$ be a collection of subsets $S \subseteq [n]$. We say that the Fourier spectrum of $f : \{-1,1\}^n \to {\mathbb R}$ is $\epsilon$-concentrated on $\mathcal{F}$ if \[ \sum_{\substack{S \subseteq [n] \\ S \notin \mathcal{F}}} \widehat{f}(S)^2 \leq \epsilon. \] For $f : \{-1,1\}^n \to \{-1,1\}$ we can express this condition using the spectral sample: $\mathop{\bf Pr}_{\boldsymbol{S} \sim \mathscr{S}_{f}}[\boldsymbol{S} \notin \mathcal{F}] \leq \epsilon$.

Most functions don’t have their Fourier spectrum concentrated on a small collection (see the exercises). But for those that do, we may hope to discover “most of” their Fourier coefficients. The main result of this section is a kind of “meta-algorithm” for learning an unknown target $f$. It reduces the problem of learning $f$ to the problem of identifying a collection of characters on which $f$’s Fourier spectrum is concentrated.

Theorem 29 Assume learning algorithm $A$ has (at least) random example access to target $f : \{-1,1\}^n \to \{-1,1\}$. Suppose that $A$ can — somehow — identify a collection $\mathcal{F}$ of subsets on which $f$’s Fourier spectrum is $\epsilon/2$-concentrated. Then using $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ additional time, $A$ can with high probability output a hypothesis $h$ which is $\epsilon$-close to $f$.

The idea of the theorem is that $A$ will estimate all of $f$’s Fourier coefficients in $\mathcal{F}$, obtaining a good approximation to $f$’s Fourier expansion. Then $A$’s hypothesis will be the sign of this approximate Fourier expansion.

The first tool we need to prove Theorem 29 is the ability to accurately estimate any fixed Fourier coefficient:

Proposition 30 Given access to random examples from $f : \{-1,1\}^n \to \{-1,1\}$, there is a randomized algorithm which takes as input $S \subseteq [n]$, $0 < \delta, \epsilon \leq 1/2$, and outputs an estimate $\widetilde{f}(S)$ for $\widehat{f}(S)$ which satisfies \[ |\widetilde{f}(S) - \widehat{f}(S)| \leq \epsilon \] except with probability at most $\delta$. The running time is $\mathrm{poly}(n, 1/\epsilon) \cdot \log(1/\delta)$.

Proof: We have $\widehat{f}(S) = \mathop{\bf E}_{{\boldsymbol{x}}}[f({\boldsymbol{x}})\chi_S({\boldsymbol{x}})]$. Given random examples $({\boldsymbol{x}}, f({\boldsymbol{x}}))$, the algorithm can compute $f({\boldsymbol{x}}) \chi_S({\boldsymbol{x}}) \in \{-1,1\}$ and therefore empirically estimate $\mathop{\bf E}_{{\boldsymbol{x}}}[f({\boldsymbol{x}})\chi_S({\boldsymbol{x}})]$. A standard application of the Chernoff bound implies that $O(\log(1/\delta)/\epsilon^2)$ examples are sufficient to obtain an estimate within $\pm \epsilon$ with probability at least $1-\delta$. $\Box$

The second observation we need to prove Theorem 29 is the following:

Proposition 31 Suppose that $f : \{-1,1\}^n \to \{-1,1\}$ and $g : \{-1,1\}^n \to {\mathbb R}$ satisfy $\|f – g\|_2^2 \leq \epsilon$. Let $h : \{-1,1\}^n \to \{-1,1\}$ be defined by $h(x) = \mathrm{sgn}(g(x))$, with $\mathrm{sgn}(0)$ chosen arbitrarily from $\{-1,1\}$. Then $\mathrm{dist}(f,h) \leq \epsilon$.

Proof: Since $|f(x) – g(x)|^2 \geq 1$ whenever $f(x) \neq \mathrm{sgn}(g(x))$, we conclude \[ \mathrm{dist}(f,h) = \mathop{\bf Pr}_{{\boldsymbol{x}}}[f({\boldsymbol{x}}) \neq h({\boldsymbol{x}})] = \mathop{\bf E}_{{\boldsymbol{x}}}[\boldsymbol{1}_{f({\boldsymbol{x}}) \neq \mathrm{sgn}(g({\boldsymbol{x}}))}] \leq \mathop{\bf E}_{{\boldsymbol{x}}}[|f({\boldsymbol{x}}) - g({\boldsymbol{x}})|^2] = \|f – g\|_2^2. \quad \Box\]

(See the exercises for an improvement to this argument.)

We can now prove Theorem 29:

Proof of Theorem 29: For each $S \in \mathcal{F}$ the algorithm uses Proposition 30 to produce an estimate $\widetilde{f}(S)$ for $\widehat{f}(S)$ which satisfies $|\widetilde{f}(S) – \widehat{f}(S)| \leq \sqrt{\epsilon}/(2\sqrt{|\mathcal{F}|})$ except with probability at most $1/(10 |\mathcal{F}|)$. Overall this requires $\mathrm{poly}(|\mathcal{F}|, n, 1/\epsilon)$ time, and by the union bound, except with probability at most $1/10$ all $|\mathcal{F}|$ estimates have the desired accuracy. Finally, $A$ forms the real-valued function $g = \sum_{S \in \mathcal{F}} \widetilde{f}(S) \chi_S$ and outputs hypothesis $h = \mathrm{sgn}(g)$. By Proposition 31, it suffices to show that $\|f – g\|_2^2 \leq \epsilon$. And indeed, \begin{align*} \|f – g\|_2^2 &= \sum_{S \subseteq [n]} \widehat{f – g}(S)^2 \tag{Parseval} \\ &= \sum_{S \in \mathcal{F}} (\widehat{f}(S) – \widetilde{f}(S))^2 + \sum_{S \notin \mathcal{F}} \widehat{f}(S)^2 \\ &\leq \sum_{S \in \mathcal{F}} \left(\frac{\sqrt{\epsilon}}{2\sqrt{|\mathcal{F}|}}\right)^2 + \epsilon/2 \tag{estimates, concentration} \\ &= \epsilon/4 + \epsilon/2 \quad\leq\quad \epsilon, \end{align*} as desired. $\Box$

As we described, Theorem 29 reduces the algorithmic task of learning $f$ to the algorithmic task of identifying a collection $\mathcal{F}$ on which $f$’s Fourier spectrum is concentrated. In the next section we will describe the Goldreich–Levin algorithm, a sophisticated way to find such an $\mathcal{F}$ assuming query access to $f$. For now, though, we observe that for several interesting concept classes we don’t need to do any algorithmic searching for $\mathcal{F}$: we can just take $\mathcal{F}$ to be all sets of small cardinality. This works whenever all functions in $\mathcal{C}$ have low-degree spectral concentration.

The “Low-Degree Algorithm” Let $k \geq 1$ and let $\mathcal{C}$ be a concept class for which every function $f : \{-1,1\}^n \to \{-1,1\}$ in $\mathcal{C}$ is $\epsilon/2$-concentrated up to degree $k$. Then $\mathcal{C}$ can be learned from random examples only with error $\epsilon$ in time $\mathrm{poly}(n^k, 1/\epsilon)$.

Proof: Apply Theorem 29 with $\mathcal{F} = \{S \subseteq [n] : |S| \leq k\}$. We have $|\mathcal{F}| = \sum_{j=0}^k \binom{n}{j} \leq O(n^k)$. $\Box$

The Low-Degree Algorithm reduces the algorithmic problem of learning $\mathcal{C}$ from random examples to the analytic task of showing low-degree spectral concentration for the functions in $\mathcal{C}$. Using the results of Section 3.1 we can quickly obtain some learning-theoretic results. For example:

Corollary 32 For $t \geq 1$, let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathbf{I}[f] \leq t\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $n^{O(t/\epsilon)}$.

Proof: Use the Low-Degree Algorithm with $k = 2t/\epsilon$; the result follows from Proposition 2. $\Box$

Corollary 33 Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid f \text{ is monotone}\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $n^{O(\sqrt{n}/\epsilon)}$.

Proof: Follows from the previous corollary and Theorem 2.32. $\Box$

You might be concerned that a running time such as $n^{O(\sqrt{n})}$ does not seem very efficient. Still, it’s much better than the trivial running time of $\widetilde{O}(2^n)$. Further, as we will see in the next section, learning algorithms are sometimes used in attacks on cryptographic schemes, and in this context even subexponential-time algorithms are considered dangerous. Continuing with applications of the Low-Degree Algorithm:

Corollary 34 For $\delta \in (0,1/2]$, let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \mathbf{NS}_\delta[f] \leq \epsilon/6\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $\mathrm{poly}(n^{1/\delta}, 1/\epsilon)$.

Proof: Follows from Proposition 3. $\Box$

Corollary 35 Let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid {\mathrm{DT}_{\mathrm{size}}}(f) \leq s\}$. Then $\mathcal{C}$ is learnable from random examples with error $\epsilon$ in time $n^{O(\log(s/\epsilon))}$.

Proof: Follows from Proposition 17. $\Box$

With a slight extra twist one can also exactly learn the class of degree-$k$ functions in time $\mathrm{poly}(n^k)$ — see the exercises:

Theorem 36 Let $k \geq 1$ and let $\mathcal{C} = \{f : \{-1,1\}^n \to \{-1,1\} \mid \deg(f) \leq k\}$. (E.g., $\mathcal{C}$ contains all depth-$k$ decision trees.) Then $\mathcal{C}$ is learnable from random examples with error $0$ in time $n^k \cdot \mathrm{poly}(n,2^k)$.

4 comments to §3.4: Learning theory

Leave a Reply




You can use most standard LaTeX, as well as these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>