Chapter 1 • Exercise 2

Part 1

Consider the finite class consisting of the two elements \(\mathcal{F} = \{f_0, f_1\}\) where \begin{align} f_{i}(x) = i \quad \text{for all $x$} \end{align} Now consider the sequnce \((x_i, y_i)\) for \(i = 1, \ldots 2T\) where \begin{align} y_{2k-1} = 0, \ y_{2k} = 1 \quad k = 1, \ldots T \end{align} For the indicator loss \(l(\cdot, \cdot)\), we must have that \begin{align} \min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T \end{align} since both \(f_1\) and \(f_2\) achieve the same loss. Further, one empirical minimizer \(\hat{f^t}\) (since at odd \(t\), \(\hat{f^{t}}\) is not uniquely defined) is given by \begin{align} \hat{f}{^{2k-1}} = f_1, \ \hat{f}{^{2k}} = f_0 \quad k = 1, \ldots T \end{align} Hence we have \begin{align} \sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) = 2T \end{align} and thus \begin{align} \sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) - \min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T = \Omega(2T) \end{align} as desired.

Note here, the choice of \(\hat{f}^{2k-1}\) is sort of arbitrary. However this doesn't really matter. Let \(p\) be the probability we get the next token right. As long as \(p < 1\), on expectation our regret is still \(\Omega(T)\).

Part 2

Big picture: since data is iid, we essentially show that ERM learns \(f^{\star}\)

Let \(\hat{L}^{t}(f) = \frac{1}{t} \sum_{i=1}^{t} l(f(x), y)\) and \(\hat{L}(f) = \hat{L}^{T}(f)\).

Note that since \(f^{t}\) is a function of data \(\mathcal{H}^{t-1}\) and is independent of \((x^t, y^t)\), we have \begin{align} \mathbb{E}\left[\sum_{t=1}^T l(\hat{f}^{t}(x_t), y_t) - T \min \hat{L}(f) \right] &= \sum_{t=1}^T L(\hat{f}^{t}) - T \mathbb{E} \left[ \min_{f \in \mathcal{F}} \hat{L}(f) \right] \end{align} If we let \(f^\star = \arg\min L(f)\), then note that for all \(t\), we have by proposition \(1\) \begin{align} L(\hat{f}^{t}) - L(f^\star) &\lesssim \sqrt{\frac{\log|\mathcal{F}|}{t}} \\ \sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) &\lesssim \sqrt{T \log |\mathcal{F}|} \end{align} From proof of proposition 1, we know that with probability \(1 - \delta\), for all \(f \in \mathcal{F}\), we have \begin{align} L(f) - \hat{L}(f) \leq \sqrt{\frac{1}{2T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} By setting \(\hat{f} = \argmin_{f \in \mathcal{F}} \hat{L}(f)\), we have with probability $1 - \delta$ \begin{align} \hat{L}(\hat{f}) - L(f^\star) &= \hat{L}(\hat{f}) - L(\hat{f}) + L(\hat{f}) - L(f^\star) \\ &\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} integrating out the tails, we have \begin{align} \mathbb{E}[\hat{L}(\hat{f})] - L(f^{\star}) &\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}} \end{align} Combining gives \begin{align} \sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) - T \left(\mathbb{E}(\hat{L}(\hat{f})) - L(f^\star)\right) &\lesssim \sqrt{|T| \log \mathcal{F}} \end{align} as desired.

Return home