Chapter 1 • Exercise 2

Part 1

Consider the finite class consisting of the two elements $\mathcal{F} = \{f_0, f_1\}$ where \begin{align} f_{i}(x) = i \quad \text{for all $x$} \end{align} Now consider the sequnce $(x_i, y_i)$ for $i = 1, \ldots 2T$ where \begin{align} y_{2k-1} = 0, \ y_{2k} = 1 \quad k = 1, \ldots T \end{align} For the indicator loss $l(\cdot, \cdot)$, we must have that \begin{align} \min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T \end{align} since both $f_1$ and $f_2$ achieve the same loss. Further, one empirical minimizer $\hat{f^t}$ (since at odd $t$, $\hat{f^{t}}$ is not uniquely defined) is given by \begin{align} \hat{f}{^{2k-1}} = f_1, \ \hat{f}{^{2k}} = f_0 \quad k = 1, \ldots T \end{align} Hence we have \begin{align} \sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) = 2T \end{align} and thus \begin{align} \sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) - \min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T = \Omega(2T) \end{align} as desired.

Note here, the choice of $\hat{f}^{2k-1}$ is sort of arbitrary. However this doesn't really matter. Let $p$ be the probability we get the next token right. As long as $p < 1$, on expectation our regret is still $\Omega(T)$.

Part 2

Big picture: since data is iid, we essentially show that ERM learns $f^{\star}$

Let $\hat{L}^{t}(f) = \frac{1}{t} \sum_{i=1}^{t} l(f(x), y)$ and $\hat{L}(f) = \hat{L}^{T}(f)$.

Note that since $f^{t}$ is a function of data $\mathcal{H}^{t-1}$ and is independent of $(x^t, y^t)$, we have \begin{align} \mathbb{E}\left[\sum_{t=1}^T l(\hat{f}^{t}(x_t), y_t) - T \min \hat{L}(f) \right] &= \sum_{t=1}^T L(\hat{f}^{t}) - T \mathbb{E} \left[ \min_{f \in \mathcal{F}} \hat{L}(f) \right] \end{align} If we let $f^\star = \arg\min L(f)$, then note that for all $t$, we have by proposition $1$ \begin{align} L(\hat{f}^{t}) - L(f^\star) &\lesssim \sqrt{\frac{\log|\mathcal{F}|}{t}} \\ \sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) &\lesssim \sqrt{T \log |\mathcal{F}|} \end{align} From proof of proposition 1, we know that with probability $1 - \delta$, for all $f \in \mathcal{F}$, we have \begin{align} L(f) - \hat{L}(f) \leq \sqrt{\frac{1}{2T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} By setting $\hat{f} = \argmin_{f \in \mathcal{F}} \hat{L}(f)$, we have with probability $1 - \delta$ \begin{align} \hat{L}(\hat{f}) - L(f^\star) &= \hat{L}(\hat{f}) - L(\hat{f}) + L(\hat{f}) - L(f^\star) \\ &\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} integrating out the tails, we have \begin{align} \mathbb{E}[\hat{L}(\hat{f})] - L(f^{\star}) &\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} Combining gives \begin{align} \sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) - T \left(\mathbb{E}(\hat{L}(\hat{f})) - L(f^\star)\right) &\lesssim \sqrt{|T| \log \mathcal{F}} \end{align} as desired.

Return home