Machine Learning Fundamentals for Economists
Given our understanding of the classic bias-variance tradeoff:
For complicated problems, the ERM optimization problem is not globally convex
Warning: active, incomplete literature in math, CS, and statistics.
But we can still give some intuition:
Without repeating the entire notation, we established our goal as minimizing the gap between the risk we can achieve and the ideal risk: \[ \mathbb{E}_{{\mathcal{D}}\sim \mu^*}\left[\min_{f \in {\mathcal{H}}} R(f, {\mathcal{D}}) - \min_{f \in {\mathcal{F}}} R(f, \mu^*)\right] = {\varepsilon_{\mathrm{app}}}({\mathcal{H}}) + {\varepsilon_{\mathrm{gen}}}({\mathcal{H}}) \]
That is, a classic tradeoff between approximation error and generalization error given some true distribution \(\mu^*\) and some samples \({\mathcal{D}}\)
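To make the two error terms concrete, here is a small simulation (the data-generating process, sample sizes, and nested polynomial hypothesis classes are illustrative choices, not from the lecture): a richer \({\mathcal{H}}\) drives in-sample error down, while held-out error, a proxy for risk under \(\mu^*\), need not improve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data-generating process: smooth signal plus noise
def simulate(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_train, y_train = simulate(30)
x_test, y_test = simulate(2000)  # large sample as a proxy for risk under mu*

def mse(coef, x, y):
    return np.mean((np.polyval(coef, x) - y) ** 2)

train_errs, test_errs = {}, {}
for degree in (1, 3, 9):  # nested polynomial hypothesis classes
    coef = np.polyfit(x_train, y_train, degree)
    train_errs[degree] = mse(coef, x_train, y_train)
    test_errs[degree] = mse(coef, x_test, y_test)
    print(f"degree {degree}: train {train_errs[degree]:.3f}, test {test_errs[degree]:.3f}")
```

Because the classes are nested, the in-sample error weakly decreases as the degree grows; comparing the train and test columns shows how the generalization gap behaves.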
Surprisingly, one empirical "solution" to overfitting is to keep adding parameters, i.e., double descent
from https://arxiv.org/abs/1812.11118
from https://arxiv.org/pdf/2105.14368.pdf
from https://arxiv.org/pdf/1912.02292.pdf
One interpretation is that ML optimization methods with sufficient parameters find a min-norm interpolating solution. Minimizing ERM: \[ \min_{f \in {\mathcal{H}}} \frac{1}{N}\sum_{n=1}^N \ell(f, x_n, y_n) \]
With large enough \({\mathcal{H}}\) this interpolates, i.e. \(\ell(f, x_n, y_n) = 0\) for all \(n\)
Remember underdetermined linear least squares and ridgeless regression
\[ \begin{aligned} \min_{f \in {\mathcal{H}}} &||f||_S\\ &\text{ s.t. } \ell(f, x, y) = 0,\,\text{ for all } (x, y) \in {\mathcal{D}} \end{aligned} \]
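A small numpy sketch of this constrained problem in the linear case, with the Euclidean norm standing in for \(\|\cdot\|_S\) and squared loss (illustrative choices): the pseudoinverse solution interpolates, and any other interpolating solution, obtained by adding a null-space component, has strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined linear system: 10 parameters, 4 observations,
# so infinitely many coefficient vectors fit the data exactly
X = rng.standard_normal((4, 10))
y = rng.standard_normal(4)

beta_min = np.linalg.pinv(X) @ y             # min-norm interpolating solution

# Perturb by a null-space direction: still interpolates, larger norm
P_null = np.eye(10) - np.linalg.pinv(X) @ X  # projector onto null(X)
beta_alt = beta_min + P_null @ rng.standard_normal(10)

print(np.linalg.norm(beta_min), np.linalg.norm(beta_alt))
```

Since `beta_min` lies in the row space of `X` and the perturbation in its null space, the two pieces are orthogonal and the alternative's norm is strictly larger, which is why the pseudoinverse solves the constrained minimization above.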
See https://www.cs.ubc.ca/~schmidtm/Courses/440-W22/L7.pdf
from ProbML Book 1 Section 13.5.6
from https://arxiv.org/pdf/2003.00307.pdf