# §8.6: Highlight: Randomized decision tree complexity

A decision tree $T$ for $f : \{-1,1\}^n \to \{-1,1\}$ can be thought of as a deterministic algorithm which, given adaptive query access to the bits of an unknown string $x \in \{-1,1\}^n$, outputs $f(x)$. E.g., to describe a natural decision tree for $f = \mathrm{Maj}_3$ in words: “Query $x_1$, then $x_2$. If they are equal, output their value; otherwise, query and output $x_3$.” For a worst-case input (one where $x_1 \neq x_2$) this algorithm has a cost of $3$, meaning it makes $3$ queries. The cost of the worst-case input is the depth of the decision tree.

As is often the case with algorithms, it can be advantageous to allow randomization. For example, consider using the following randomized query algorithm for $\mathrm{Maj}_3$: “Choose two distinct input coordinates at random and query them. If they are equal, output their value; otherwise, query and output the third input coordinate.” Now for every input there is at least a $1/3$ chance that the algorithm will finish after only $2$ queries. Indeed, if we define the cost of an input $x$ to be the expected number of queries the algorithm makes on it, it is easy to see that the worst-case inputs for this algorithm have cost $(1/3) \cdot 2 + (2/3) \cdot 3 = 8/3 < 3$.
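As a quick sanity check (illustrative, not part of the text), the following Python sketch enumerates all eight inputs and all three random choices of the queried pair, and computes each input's cost exactly:

```python
from fractions import Fraction
from itertools import combinations, product

def queries_made(x, pair):
    """Queries used when the algorithm first queries the two coordinates in `pair`."""
    i, j = pair
    return 2 if x[i] == x[j] else 3

def cost(x):
    """Expected number of queries on input x, over the random choice of pair."""
    pairs = list(combinations(range(3), 2))
    return Fraction(sum(queries_made(x, p) for p in pairs), len(pairs))

worst = max(cost(x) for x in product([-1, 1], repeat=3))
print(worst)  # 8/3
```

The worst-case inputs are exactly those where the three bits are not all equal, matching the $8/3$ computation above.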

Let’s formalize the notion of a randomized decision tree:

Definition 60 Given $f : \{-1,1\}^n \to {\mathbb R}$, a (zero-error) randomized decision tree $\mathcal{T}$ computing $f$ is formally defined to be a probability distribution over (deterministic) decision trees which compute $f$. The cost of $\mathcal{T}$ on input $x \in \{-1,1\}^n$ is defined to be the expected number of queries $\boldsymbol{T}$ makes on $x$ when $\boldsymbol{T} \sim \mathcal{T}$. The cost of $\mathcal{T}$ itself is defined to be the maximum cost of any input. Finally, the (zero-error) randomized decision tree complexity of $f$, denoted $\mathrm{RDT}(f)$, is the minimum cost of a randomized decision tree computing $f$.

We can get further savings from randomization if we are willing to assume that the input $x$ is chosen randomly. For example, if ${\boldsymbol{x}} \sim \{-1,1\}^3$ is uniformly random then any of the deterministic decision trees for $\mathrm{Maj}_3$ will make $2$ queries with probability $1/2$ and $3$ queries with probability $1/2$, for an overall expected $5/2 < 8/3 < 3$ queries.
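The $5/2$ figure can likewise be checked by averaging the deterministic tree's query count over all eight inputs (an illustrative sketch):

```python
from fractions import Fraction
from itertools import product

def tree_queries(x):
    """The deterministic tree: query x1 and x2; if they agree stop, else query x3."""
    return 2 if x[0] == x[1] else 3

inputs = list(product([-1, 1], repeat=3))
avg = Fraction(sum(tree_queries(x) for x in inputs), len(inputs))
print(avg)  # 5/2
```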

Definition 61 Let $\mathcal{T}$ be a randomized decision tree. We define \begin{align} \delta_i(\mathcal{T}) &= \mathop{\bf Pr}_{\substack{{\boldsymbol{x}} \sim \{-1,1\}^n,\\ \boldsymbol{T} \sim \mathcal{T}}}[\boldsymbol{T} \text{ queries } {\boldsymbol{x}}_i], \nonumber\\ \Delta(\mathcal{T}) &= \sum_{i=1}^n \delta_i(\mathcal{T}) = \mathop{\bf E}_{\substack{{\boldsymbol{x}} \sim \{-1,1\}^n, \\ \boldsymbol{T} \sim \mathcal{T}}}[\# \text{ of coordinates queried by } \boldsymbol{T} \text{ on } {\boldsymbol{x}}]. \label{ex:expected-dt-cost} \end{align} Given $f : \{-1,1\}^n \to {\mathbb R}$, we define $\Delta(f)$ to be the minimum of $\Delta(\mathcal{T})$ over all randomized decision trees $\mathcal{T}$ computing $f$.

We can also generalize these definitions for functions $f \in L^2(\Omega, \pi^{\otimes n})$. A deterministic decision tree over domain $\Omega$ is the natural generalization in which each internal query node has $|\Omega|$ outgoing edges, labelled by the elements of $\Omega$. We write $\delta_i^{(\pi)}(\mathcal{T}), \Delta^{(\pi)}(\mathcal{T}), \Delta^{(\pi)}(f)$ for the generalizations to trees over $\Omega$; in the case of $L^2(\{-1,1\}^n, \pi_p^{\otimes n})$ we use the superscript $(p)$ instead of $(\pi_p)$ for brevity.

It follows immediately from the definitions that for any $f \in L^2(\Omega^n, \pi^{\otimes n})$, $\Delta^{(\pi)}(f) \leq \mathrm{RDT}(f) \leq {\mathrm{DT}}(f).$

Remark 62 In the definition of $\Delta^{(\pi)}(f)$ it is equivalent if we only allow deterministic decision trees; this is because in (1) we can always choose the “best” deterministic $T$ in the support of $\mathcal{T}$.

Example 63 It follows from our discussions that $\mathrm{RDT}(\mathrm{Maj}_3) \leq 8/3$ and $\Delta(\mathrm{Maj}_3) \leq 5/2$; indeed, it’s not hard to show that both of these bounds are equalities. In the exercises you are asked to generalize to the recursive majority of $3$ function on $n = 3^d$ inputs; it satisfies ${\mathrm{DT}}(\mathrm{Maj}_3^{\otimes d}) = 3^d = n$, but \begin{align*} \mathrm{RDT}(\mathrm{Maj}_3^{\otimes d}) &\leq (8/3)^d = n^{\log_3(8/3)} \approx n^{.89},\\ \Delta(\mathrm{Maj}_3^{\otimes d}) &\leq (5/2)^d = n^{\log_3(5/2)} \approx n^{.83}. \end{align*} Incidentally, these bounds are not asymptotically sharp; estimating $\mathrm{RDT}(\mathrm{Maj}_3^{\otimes d})$ in particular is a well-studied open problem.
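As an illustration (assuming the cost recursion sketched in the exercise, not derived here), the quoted bound and exponents can be checked numerically:

```python
from fractions import Fraction
import math

# Cost recursion for the natural randomized algorithm on recursive majority:
# evaluate two random subtrees; on a worst-case input they agree with
# probability only 1/3, so cost(d) = (1/3)*2*cost(d-1) + (2/3)*3*cost(d-1).
def rdt_cost(d):
    return Fraction(1) if d == 0 else Fraction(8, 3) * rdt_cost(d - 1)

d = 5
n = 3 ** d
# (8/3)^d = n^{log_3(8/3)}, an exponent of about .89
assert math.isclose(float(rdt_cost(d)), n ** math.log(8 / 3, 3))
print(round(math.log(8 / 3, 3), 2), round(math.log(5 / 2, 3), 2))  # 0.89 0.83
```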

Example 64 In the exercises you are asked to show that for the logical OR function, $\Delta^{(p)}(\mathrm{OR}_n) = \frac{1 - (1-p)^n}{p},$ which is roughly $2$ for $p = 1/2$ but is asymptotic to $n/(2 \ln 2)$ at the critical probability $p_c$.
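As a hedged illustration of this formula (the exercise shows the scan is optimal; here we only verify that it achieves the stated cost for small $n$), consider the left-to-right scan tree for $\mathrm{OR}_n$:

```python
from fractions import Fraction
from itertools import product

def scan_queries(x):
    """Left-to-right scan for OR_n: stop as soon as a True coordinate is seen."""
    for k, b in enumerate(x, start=1):
        if b:
            return k
    return len(x)

def scan_cost(n, p):
    """Exact expected number of queries under the p-biased product distribution."""
    total = Fraction(0)
    for x in product([False, True], repeat=n):
        weight = Fraction(1)
        for b in x:
            weight *= p if b else 1 - p
        total += weight * scan_queries(x)
    return total

p = Fraction(1, 3)
for n in range(1, 6):
    assert scan_cost(n, p) == (1 - (1 - p) ** n) / p
```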

Example 63 illustrates a mildly surprising phenomenon: using randomness it’s possible to evaluate certain unbiased $n$-bit functions $f$ while reading only a $1/n^{\Theta(1)}$ fraction of the input bits. This is even more interesting when $f$ is transitive-symmetric like $\mathrm{Maj}_3^{\otimes d}$. In that case it’s not hard to show (exercise) that any randomized decision tree $\mathcal{T}$ computing $f$ can be converted to one where $\Delta(\mathcal{T})$ remains the same but all $\delta_i(\mathcal{T})$ are equal to $\Delta(f)/n$. Then $f$ can be evaluated despite the fact that each input bit is only queried with probability $1/n^{\Theta(1)}$. In this section we explore the limits of this phenomenon. In particular, a longstanding conjecture of Yao [Yao77] says that this is not possible for monotone graph properties:

Yao’s Conjecture Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone $v$-vertex graph property, where $n = \binom{v}{2}$. Then $\mathrm{RDT}(f) \geq \Omega(n)$.

Towards this conjecture we will present a lower bound due to [OSSS05]. (Two other incomparable bounds are discussed in the notes for this chapter.) It has the advantages that it works for the more general class of transitive-symmetric functions and that it even lower-bounds $\Delta^{(p_c)}(f)$:

Theorem 65 Let $f : \{-1,1\}^n \to \{-1,1\}$ be a nonconstant monotone transitive-symmetric function with critical probability $p_c$. Then $\Delta^{(p_c)}(f) \geq (n/\sigma_c)^{2/3}.$

Theorem 65 is essentially sharp in several interesting cases. Whenever the critical probability $p_c$ is $\Theta(1/n)$ or $1 - \Theta(1/n)$ then $\sigma_c = \Theta(1/\sqrt{n})$ and Theorem 65 gives the strongest possible bound, $\Delta^{(p_c)}(f) \geq \Omega(n)$. This occurs for the $\mathrm{OR}_n$ function (Example 64), or more interestingly for the monotone graph property of “containing a triangle” ($\text{Clique}_3$). Furthermore, Theorem 65 can be tight up to a logarithmic factor when $p_c = 1/2$ as the following theorem of Benjamini, Schramm, and Wilson shows:

Theorem 66 ([BSW05].) There exists an infinite family of monotone transitive-symmetric functions $f_n : \{-1,1\}^n \to \{-1,1\}$ with critical probability $p_c = 1/2$ and $\Delta(f_n) \leq O(n^{2/3} \log n)$.

Theorem 65 follows easily from two inequalities, appearing first in [OSSS05] and [OS06,OS08b] respectively:

OSSS Inequality Let $f \in L^2(\Omega^n, \pi^{\otimes n})$ have range $\{-1,1\}$ and let $\mathcal{T}$ be any randomized decision tree computing $f$. Then $\mathop{\bf Var}[f] \leq \sum_{i=1}^n \delta_i^{(\pi)}(\mathcal{T}) \cdot \mathbf{Inf}_i[f].$

OS Inequality Let $f \in L^2(\{-1,1\}^n, \pi_p^{\otimes n})$. Then $\sum_{i=1}^n \widehat{f}(i) \leq \|f\|_2 \cdot \sqrt{\Delta^{(p)}(f)}.$

In particular, if $f$ has range $\{-1,1\}$ and is monotone then $\mathbf{I}[f] \leq \sigma\sqrt{\Delta^{(p)}(f)}$.

These two inequalities can be thought of as strengthenings of basic Fourier inequalities which take into account the decision tree complexity of $f$. The OSSS Inequality is a generalization of the Poincaré Inequality, discounting the influences of coordinates which are rarely read. The OS Inequality essentially generalizes the result that majority functions maximize $\sum_{i=1}^n \widehat{f}(i)$; i.e., Theorem 2.32.
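Before the proofs, it may help to see the OSSS Inequality instantiated concretely. The sketch below (illustrative, not from the text) checks it for $\mathrm{Maj}_3$ with uniform $\pi$ and the deterministic tree from the start of the section, for which $\delta_1 = \delta_2 = 1$ and $\delta_3 = 1/2$:

```python
from fractions import Fraction
from itertools import product

def maj3(x):
    return 1 if sum(x) > 0 else -1

cube = list(product([-1, 1], repeat=3))

def influence(i):
    """Inf_i[Maj3] under uniform pi: Pr[flipping coordinate i changes the output]."""
    flipped = sum(maj3(x) != maj3(x[:i] + (-x[i],) + x[i + 1:]) for x in cube)
    return Fraction(flipped, len(cube))

# delta_i for the tree "query x1, x2; if they disagree, query x3" on uniform input:
# x1 and x2 are always read, x3 only when x1 != x2.
delta = [Fraction(1), Fraction(1), Fraction(1, 2)]

mean = Fraction(sum(maj3(x) for x in cube), len(cube))
variance = 1 - mean ** 2                      # E[f^2] = 1 for +/-1-valued f
rhs = sum(d * influence(i) for i, d in enumerate(delta))
assert variance <= rhs                        # 1 <= 5/4
```

Each influence is $1/2$, so the right-hand side is $1 \cdot \tfrac12 + 1 \cdot \tfrac12 + \tfrac12 \cdot \tfrac12 = 5/4 \geq 1 = \mathop{\bf Var}[\mathrm{Maj}_3]$.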

We will first derive the query complexity lower bound Theorem 65 from the OSSS and OS Inequalities. We will then prove the latter two inequalities.

Proof of Theorem 65: We consider $f$ to be an element of $L^2(\{-1,1\}^n, \pi_{p_c}^{\otimes n})$. Let $\mathcal{T}$ be a randomized decision tree achieving $\Delta^{(p_c)}(f)$. In the OSSS Inequality, we have $\mathop{\bf Var}[f] = 1$ since $p_c$ is the critical probability and $\mathbf{Inf}_i[f] = \mathbf{I}[f]/n$ for each $i \in [n]$ since $f$ is transitive-symmetric. Thus $1 \leq \sum_{i=1}^n \delta_i^{(p_c)}(\mathcal{T}) \cdot \frac{\mathbf{I}[f]}{n} \quad\Rightarrow\quad n \leq \Delta^{(p_c)}(f) \cdot \mathbf{I}[f] \leq \sigma_c \Delta^{(p_c)}(f)^{3/2},$ where we used the OS Inequality. The theorem follows by rearranging. $\Box$

Now we prove the OSSS Inequality. We will need a simple lemma which uses the decomposition $f = \mathrm{E}_i f + \mathrm{L}_i f$.

Lemma 67 Let $f, g \in L^2(\Omega^n, \pi^{\otimes n})$ and let $j \in [n]$. Given $\omega \in \Omega$, write $f_{ \mid \omega}$ for the restriction of $f$ in which the $j$th coordinate is fixed to value $\omega$, and similarly for $g$. Then $\mathop{\bf Cov}[f,g] = \mathop{\bf E}_{\substack{\boldsymbol{\omega}, \boldsymbol{\omega}' \sim \pi \\ \text{independent}}}[\mathop{\bf Cov}[f_{ \mid \boldsymbol{\omega}}, g_{ \mid \boldsymbol{\omega}'}]] + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle.$

Proof: Since the covariances and Laplacians are unchanged when constants are added, we may assume without loss of generality that $\mathop{\bf E}[f] = \mathop{\bf E}[g] = 0$. Then $\mathop{\bf Cov}[f,g] = \langle f, g\rangle$ and \begin{multline*} \mathop{\bf E}_{\boldsymbol{\omega}, \boldsymbol{\omega}'}[\mathop{\bf Cov}[f_{ \mid \boldsymbol{\omega}}, g_{ \mid \boldsymbol{\omega}'}]] = \mathop{\bf E}_{\boldsymbol{\omega}, \boldsymbol{\omega}'}[\langle f_{ \mid \boldsymbol{\omega}},g_{ \mid \boldsymbol{\omega}'}\rangle - \mathop{\bf E}[f_{ \mid \boldsymbol{\omega}}]\mathop{\bf E}[g_{ \mid \boldsymbol{\omega}'}]] \\= \mathop{\bf E}_{\boldsymbol{\omega}, \boldsymbol{\omega}'}[\langle f_{ \mid \boldsymbol{\omega}},g_{ \mid \boldsymbol{\omega}'}\rangle] - \mathop{\bf E}[f]\mathop{\bf E}[g]= \mathop{\bf E}_{\boldsymbol{\omega}, \boldsymbol{\omega}'}[\langle f_{ \mid \boldsymbol{\omega}},g_{ \mid \boldsymbol{\omega}'}\rangle] = \langle \mathrm{E}_j f, \mathrm{E}_j g \rangle. \end{multline*} Thus the stated equality reduces to the basic identity $\langle f, g \rangle = \langle \mathrm{E}_j f, \mathrm{E}_j g \rangle + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle. \quad \Box$

Proof of the OSSS Inequality: More generally we show that if $g \in L^2(\Omega^n, \pi^{\otimes n})$ also has range $\{-1,1\}$ then $$\label{eqn:cov-osss} \mathop{\bf Cov}[f,g] \leq \sum_{i=1}^n \delta_i^{(\pi)}(\mathcal{T}) \cdot \mathbf{Inf}_i[g].$$ The result then follows by taking $g = f$. We may also assume that $\mathcal{T} = T$ is a single deterministic tree computing $f$; this is because \eqref{eqn:cov-osss} is linear in the quantities $\delta_i^{(\pi)}(\mathcal{T})$.

We prove \eqref{eqn:cov-osss} by induction on the structure of $T$. If $T$ is depth-$0$ then $f$ must be a constant function; hence $\mathop{\bf Cov}[f,g] = 0$ and \eqref{eqn:cov-osss} is trivial. Otherwise, let $j \in [n]$ be the coordinate queried at the root of $T$. For each $\omega \in \Omega$, write $T_\omega$ for the subtree of $T$ given by the $\omega$-labelled child of the root. By applying Lemma 67 and induction (noting that $T_\omega$ computes the restricted function $f_{ \mid \omega}$), we get \begin{align*} \mathop{\bf Cov}[f,g] &= \mathop{\bf E}_{\substack{\boldsymbol{\omega}, \boldsymbol{\omega}' \sim \pi \\ \text{independent}}}[\mathop{\bf Cov}[f_{ \mid \boldsymbol{\omega}}, g_{ \mid \boldsymbol{\omega}'}]] + \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle \\ &\leq \mathop{\bf E}_{\boldsymbol{\omega}, \boldsymbol{\omega}' \sim \pi}\Bigl[\sum_{i\neq j} \delta_i^{(\pi)}(T_{\boldsymbol{\omega}}) \cdot \mathbf{Inf}_i[g_{ \mid \boldsymbol{\omega}'}]\Bigr]+ \langle \mathrm{L}_j f, \mathrm{L}_j g \rangle \\ &= \sum_{i \neq j} \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g] + \langle f, \mathrm{L}_j g \rangle \tag{in part since $\mathop{\bf E}[\mathrm{L}_j g] = 0$} \\ &\leq \sum_{i \neq j} \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g] + \mathop{\bf E}[|\mathrm{L}_j g|] \tag{since $|f| \leq 1$} \\ &= \sum_{i=1}^n \delta_i^{(\pi)}(T) \cdot \mathbf{Inf}_i[g], \end{align*} where the last step used $\delta_j^{(\pi)}(T) = 1$ and Proposition 24. This completes the inductive proof of \eqref{eqn:cov-osss}. $\Box$

Finally, we prove the OS Inequality. For this we require a definition.

Definition 68 Let $(\Omega, \pi)$ be a finite probability space and $T$ a deterministic decision tree over $\Omega$. The decision tree process associated to $T$ generates a random string ${\boldsymbol{x}}$ distributed according to $\pi$ (and some additional random variables), as follows:

1. Start at the root node of $T$; say it queries coordinate $j_1$. Choose ${\boldsymbol{x}}_{j_1} \sim \pi$ and follow the outgoing edge labelled by the outcome.
2. Suppose the node of $T$ which is reached queries coordinate $\boldsymbol{j}_2$. Choose ${\boldsymbol{x}}_{\boldsymbol{j}_2} \sim \pi$ and follow the outgoing edge labelled by the outcome.
3. Repeat until a leaf node is reached. At this point, define $\boldsymbol{J} = \{j_1, \boldsymbol{j}_2, \boldsymbol{j}_3, \dots\} \subseteq [n]$ to be the set of coordinates queried.
4. Draw the as-yet-unqueried coordinates, denoted ${\boldsymbol{x}}_{\overline{\boldsymbol{J}}}$, from $\pi^{\otimes \overline{\boldsymbol{J}}}$.

Despite the fact that the coordinates ${\boldsymbol{x}}_i$ are drawn in a random, dependent order, it’s not hard to see (exercise) that the final string ${\boldsymbol{x}} = ({\boldsymbol{x}}_{\boldsymbol{J}}, {\boldsymbol{x}}_{\overline{\boldsymbol{J}}})$ is distributed according to the product distribution $\pi^{\otimes n}$.

Proof of the OS Inequality: We will prove the claim $\sum_{i=1}^n \widehat{f}(i) \leq \|f\|_2 \cdot \sqrt{\Delta^{(p)}(f)}$; the “in particular” statement follows immediately from Proposition 45. Fix a deterministic decision tree $T$ achieving $\Delta^{(p)}(f)$ (see Remark 62) and let ${\boldsymbol{x}} = ({\boldsymbol{x}}_{\boldsymbol{J}}, {\boldsymbol{x}}_{\overline{\boldsymbol{J}}})$ be drawn from the associated decision tree process. Using the notation $\phi$ from Definition 39 we have $\sum_{i=1}^n \widehat{f}(i) = \mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}, {\boldsymbol{x}}_{\overline{\boldsymbol{J}}}}[f({\boldsymbol{x}}) \mathop{{\textstyle \sum}}_{i=1}^n \phi({\boldsymbol{x}}_i)] = \mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}[f({\boldsymbol{x}}_{\boldsymbol{J}}) \mathop{\bf E}_{{\boldsymbol{x}}_{\overline{\boldsymbol{J}}}}[\mathop{{\textstyle \sum}}_{i=1}^n \phi({\boldsymbol{x}}_i)]].$ Here we abused notation slightly by writing $f({\boldsymbol{x}}_{\boldsymbol{J}})$; in the decision tree process, $f$’s value is determined once ${\boldsymbol{x}}_{\boldsymbol{J}}$ is. 
Since $\mathop{\bf E}[\phi({\boldsymbol{x}}_i)] = 0$ for each $i \not \in \boldsymbol{J}$ we may continue: \begin{align*} \mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}[f({\boldsymbol{x}}_{\boldsymbol{J}}) \mathop{\bf E}_{{\boldsymbol{x}}_{\overline{\boldsymbol{J}}}}[\mathop{{\textstyle \sum}}_{i=1}^n \phi({\boldsymbol{x}}_i)]] &= \mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}[f({\boldsymbol{x}}_{\boldsymbol{J}}) \mathop{{\textstyle \sum}}_{i=1}^n \boldsymbol{1}_{\{i \in \boldsymbol{J}\}} \phi({\boldsymbol{x}}_i)] \\ &\leq \sqrt{\mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}[f({\boldsymbol{x}}_{\boldsymbol{J}})^2]}\sqrt{\mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}\left[\Bigl(\mathop{{\textstyle \sum}}_{i=1}^n \boldsymbol{1}_{\{i \in \boldsymbol{J}\}} \phi({\boldsymbol{x}}_i)\Bigr)^2\right]}, \end{align*} by Cauchy–Schwarz. Now $\sqrt{\mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}[f({\boldsymbol{x}}_{\boldsymbol{J}})^2]}$ is simply $\|f\|_2$ since $T$ computes $f$. To complete the proof it therefore suffices to show that $\mathop{\bf E}_{\boldsymbol{J}, {\boldsymbol{x}}_{\boldsymbol{J}}}\bigl[\bigl(\mathop{{\textstyle \sum}}_{i=1}^n \boldsymbol{1}_{\{i \in \boldsymbol{J}\}} \phi({\boldsymbol{x}}_i)\bigr)^2\bigr] \leq \Delta^{(p)}(f)$. Expand the square. Conditioned on $i \in \boldsymbol{J}$ (an event determined by the path before ${\boldsymbol{x}}_i$ is drawn), the coordinate ${\boldsymbol{x}}_i$ is drawn freshly from $\pi_p$, so each diagonal term contributes $\mathop{\bf E}[\boldsymbol{1}_{\{i \in \boldsymbol{J}\}} \phi({\boldsymbol{x}}_i)^2] = \mathop{\bf Pr}[i \in \boldsymbol{J}]$; together the diagonal terms contribute $\mathop{\bf E}[\#\boldsymbol{J}] = \Delta^{(p)}(T) = \Delta^{(p)}(f)$. As for a cross term with $i \neq i'$, one of the two coordinates (say $i'$) is queried later in the decision tree process; conditioned on everything drawn before that query, ${\boldsymbol{x}}_{i'} \sim \pi_p$ and hence $\phi({\boldsymbol{x}}_{i'})$ has conditional mean zero. Thus every cross term vanishes, and the second factor above is exactly $\sqrt{\Delta^{(p)}(f)}$. $\Box$
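As a concrete check of the OS Inequality (illustrative, not from the text), take $f = \mathrm{Maj}_3$ at $p = 1/2$, where $\phi({\boldsymbol{x}}_i) = {\boldsymbol{x}}_i$ and the tree from the start of the section attains $\Delta(\mathrm{Maj}_3) = 5/2$:

```python
from fractions import Fraction
from itertools import product

def maj3(x):
    return 1 if sum(x) > 0 else -1

cube = list(product([-1, 1], repeat=3))

# Degree-1 Fourier coefficients: at p = 1/2 we have phi(x_i) = x_i.
fhat = [Fraction(sum(maj3(x) * x[i] for x in cube), len(cube)) for i in range(3)]

# Expected queries of the tree "query x1, x2; if they disagree, query x3"
# on uniform input, which attains Delta(Maj3) = 5/2.
delta = Fraction(sum(2 if x[0] == x[1] else 3 for x in cube), len(cube))

lhs = float(sum(fhat))       # sum of fhat(i) = 3/2
rhs = float(delta) ** 0.5    # ||f||_2 = 1 times sqrt(5/2), about 1.58
assert lhs <= rhs
```

Here the inequality $3/2 \leq \sqrt{5/2}$ is nearly tight, consistent with $\mathrm{Maj}_3$ being an extremal example for both sides.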
