Chapter 1 • Exercise 2
Part 1
Consider the finite class consisting of the two elements \(\mathcal{F} = \{f_0, f_1\}\) where
\begin{align}
f_{i}(x) = i \quad \text{for all $x$}
\end{align}
Now consider the sequnce \((x_i, y_i)\) for \(i = 1, \ldots 2T\) where
\begin{align}
y_{2k-1} = 0, \ y_{2k} = 1 \quad k = 1, \ldots T
\end{align}
For the indicator loss \(l(\cdot, \cdot)\), we must have that
\begin{align}
\min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T
\end{align}
since both \(f_1\) and \(f_2\) achieve the same loss.
Further, one empirical minimizer \(\hat{f^t}\) (since at odd \(t\), \(\hat{f^{t}}\) is not uniquely defined) is given by
\begin{align}
\hat{f}{^{2k-1}} = f_1, \ \hat{f}{^{2k}} = f_0 \quad k = 1, \ldots T
\end{align}
Hence we have
\begin{align}
\sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) = 2T
\end{align}
and thus
\begin{align}
\sum_{t=1}^{2T} l(\hat{f}^{t}(x_t), y^t) - \min_{f \in \mathcal{F}} \sum_{i=1}^{T} l(f(x_i), y_i) = T = \Omega(2T)
\end{align}
as desired.
Note here, the choice of \(\hat{f}^{2k-1}\) is sort of arbitrary. However this doesn't really matter. Let \(p\) be the probability we get the next token right. As long as \(p < 1\), on expectation our regret is still \(\Omega(T)\).
Part 2
Big picture: since data is iid, we essentially show that ERM learns \(f^{\star}\)
Let \(\hat{L}^{t}(f) = \frac{1}{t} \sum_{i=1}^{t} l(f(x), y)\) and \(\hat{L}(f) = \hat{L}^{T}(f)\).
Note that since \(f^{t}\) is a function of data \(\mathcal{H}^{t-1}\) and is independent of \((x^t, y^t)\), we have
\begin{align}
\mathbb{E}\left[\sum_{t=1}^T l(\hat{f}^{t}(x_t), y_t) - T \min \hat{L}(f) \right] &= \sum_{t=1}^T L(\hat{f}^{t}) - T \mathbb{E} \left[ \min_{f \in \mathcal{F}} \hat{L}(f) \right]
\end{align}
If we let \(f^\star = \arg\min L(f)\), then note that for all \(t\), we have by proposition \(1\)
\begin{align}
L(\hat{f}^{t}) - L(f^\star) &\lesssim \sqrt{\frac{\log|\mathcal{F}|}{t}} \\
\sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) &\lesssim \sqrt{T \log |\mathcal{F}|}
\end{align}
From proof of proposition 1, we know that with probability \(1 - \delta\), for all \(f \in \mathcal{F}\), we have
\begin{align}
L(f) - \hat{L}(f) \leq \sqrt{\frac{1}{2T} \log \frac{|\mathcal{F}|}{\delta}}
\end{align}
By setting \(\hat{f} = \argmin_{f \in \mathcal{F}} \hat{L}(f)\), we have with probability $1 - \delta$
\begin{align}
\hat{L}(\hat{f}) - L(f^\star) &= \hat{L}(\hat{f}) - L(\hat{f}) + L(\hat{f}) - L(f^\star) \\
&\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}}
\end{align}
integrating out the tails, we have
\begin{align}
\mathbb{E}[\hat{L}(\hat{f})] - L(f^{\star}) &\gtrsim - \sqrt{\frac{1}{T} \log \frac{|\mathcal{F}|}{\delta}}
\end{align}
Combining gives
\begin{align}
\sum_{t=1}^{T} (L(\hat{f}^{t}) - L(f^\star)) - T \left(\mathbb{E}(\hat{L}(\hat{f})) - L(f^\star)\right) &\lesssim \sqrt{|T| \log \mathcal{F}}
\end{align}
as desired.
Return home