
Machine Learning Fundamentals for Economists
Uniform bounds on errors: impossible in general without exponential cost in \(N\) \[ \min_{f\in {\mathcal{F}}}\sup_{(x,y) \in {\mathcal{X}} \times {\mathcal{Y}}} \left[(f(x) - y)^2\right] \]
Statistical learning: minimizes regions weighted by a distribution \(\mu^*\)
\[ \min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y) \sim \mu^*}\left[(f(x) - y)^2\right] \]
“A random variable that depends in a Lipschitz way on many independent variables (but not too much on any of them) is essentially constant.” (Ledoux (2001))
\[ {\mathbb{P}_{}\left( {|f(X) - \mathbb{E}[f(X)]| \geq \epsilon} \right)} \leq \exp(-c \epsilon^2 N) \]
for some constant \(c > 0\) independent of \(N\).

From Vershynin (2018): 2D Gaussian samples vs. random 2D projections of high-dimensional samples
Definition: Lipschitz Function
A function \(f: {\mathbb{R}}^N \to {\mathbb{R}}\) is \(L\)-Lipschitz if for all \(x, y \in {\mathbb{R}}^N\):
\[ |f(x) - f(y)| \leq L \|x - y\|_2 \]
Proposition: Concentration for Lipschitz Functions of Gaussians
Let \(X \sim \mathcal{N}(0_N, I_N)\) and let \(f: {\mathbb{R}}^N \to {\mathbb{R}}\) be \(L\)-Lipschitz. Then:
\[ {\mathbb{P}_{}\left( {|f(X) - \mathbb{E}[f(X)]| \geq t} \right)} \leq 2\exp\left(-\frac{t^2}{2L^2}\right) \]
The Gaussian concentration inequality requires:
1. \(X\) has independent (isotropic) Gaussian coordinates;
2. \(f\) is \(L\)-Lipschitz;
3. the Lipschitz constant \(L\) shrinks with \(N\) (e.g., \(L = 1/\sqrt{N}\) for the sample mean).

Violating (1) or (2) can break concentration, while violating (3) prevents improvement with dimension.
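As a quick numerical sanity check (not part of the proof), we can compare the empirical tail of a 1-Lipschitz function against the bound; here \(f(x) = \|x\|_2\), which is 1-Lipschitz by the reverse triangle inequality. All sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, draws, t = 100, 50_000, 1.5

# f(x) = ||x||_2 is 1-Lipschitz, so the proposition gives
# P(|f(X) - E f(X)| >= t) <= 2 exp(-t^2 / 2).
X = rng.standard_normal((draws, N))
f = np.linalg.norm(X, axis=1)
tail = np.mean(np.abs(f - f.mean()) >= t)   # empirical tail probability
bound = 2.0 * np.exp(-t**2 / 2.0)           # Gaussian concentration bound
print(f"empirical tail {tail:.4f} vs. bound {bound:.4f}")
```

The empirical tail sits far below the bound: the inequality is valid but not tight for this particular \(f\).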
Definition: Operator Norm and Spectral Radius
The operator norm (spectral norm) of \(A \in {\mathbb{R}}^{N \times N}\):
\[ {\left\| {A} \right\|_{\mathrm{op}}} = \sup_{x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} \]
The spectral radius of \(A\):
\[ {\varrho\!\left( {A} \right)} = \max_i |\lambda_i(A)| \]
Consider \(Z \sim \mathcal{N}(0, \Sigma)\) with general covariance \(\Sigma\)
Write \(Z = \Sigma^{1/2} X\) where \(X \sim \mathcal{N}(0, I_N)\)
For an \(L\)-Lipschitz function \(f\), define \(g(X) = f(\Sigma^{1/2} X)\)
\[ |g(x) - g(y)| = |f(\Sigma^{1/2} x) - f(\Sigma^{1/2} y)| \leq L \|\Sigma^{1/2}(x-y)\|_2 \]
Using operator norm: \(\|\Sigma^{1/2}(x-y)\|_2 \leq {\left\| {\Sigma^{1/2}} \right\|_{\mathrm{op}}} \|x-y\|_2 = \sqrt{{\varrho\!\left( {\Sigma} \right)}} \|x-y\|_2\)
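The identity \({\left\| {\Sigma^{1/2}} \right\|_{\mathrm{op}}} = \sqrt{{\varrho\!\left( {\Sigma} \right)}}\) for symmetric PSD \(\Sigma\) is easy to verify numerically; a minimal sketch with a randomly generated covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
G = rng.standard_normal((N, N))
Sigma = G @ G.T                     # a random PSD covariance matrix

# Symmetric square root via the eigendecomposition of Sigma.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

op_norm = np.linalg.norm(Sigma_half, ord=2)  # largest singular value of Sigma^{1/2}
spec_radius = np.max(np.abs(vals))           # spectral radius of Sigma
print(op_norm, np.sqrt(spec_radius))
```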
Proposition: Non-Isotropic Gaussian Concentration
For \(Z \sim \mathcal{N}(0, \Sigma)\) and \(f: {\mathbb{R}}^N \to {\mathbb{R}}\) that is \(L\)-Lipschitz:
\[ {\mathbb{P}_{}\left( {|f(Z) - \mathbb{E}[f(Z)]| \geq t} \right)} \leq 2\exp\left(-\frac{t^2}{2L^2 {\varrho\!\left( {\Sigma} \right)}}\right) \]
With perfectly correlated coordinates, \(\Sigma = \sigma^2 \mathbf{1}_N \mathbf{1}_N^\top\), the spectral radius is \({\varrho\!\left( {\Sigma} \right)} = \sigma^2 N\), and the concentration bound becomes:
\[ {\mathbb{P}_{}\left( {|f(Z) - \mathbb{E}[f(Z)]| \geq t} \right)} \leq 2\exp\left(-\frac{t^2}{2L^2 {\varrho\!\left( {\Sigma} \right)}}\right) = 2\exp\left(-\frac{t^2}{2L^2 \sigma^2 N}\right) \]
Concentration gets worse with dimension!
Intuition: With perfect correlation, all \(N\) variables move together
This is why independence (or weak dependence) is crucial for concentration
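A small simulation (with illustrative sizes) makes the contrast concrete: under perfect correlation the sample mean equals a single \(\mathcal{N}(0,1)\) draw, so its tail probability is flat in \(N\), while under independence it shrinks rapidly:

```python
import numpy as np

rng = np.random.default_rng(7)
draws, t = 10_000, 0.5
tails = {}
for N in (10, 1000):
    # Perfect correlation: Z = s * 1_N with s ~ N(0,1), i.e. Sigma = 11^T,
    # so the sample mean of Z is just s itself, whatever N is.
    corr_mean = rng.standard_normal(draws)
    # Independence: sample mean of N i.i.d. standard normals.
    indep_mean = rng.standard_normal((draws, N)).mean(axis=1)
    tails[N] = (np.mean(np.abs(corr_mean) >= t),
                np.mean(np.abs(indep_mean) >= t))
    print(N, tails[N])
```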
Consider \(f(X) = X_1\) with \(X \sim \mathcal{N}(0, I_N)\)
Gradient: \(\nabla f = (1, 0, \ldots, 0)^\top\), so \(L = 1\)
Concentration bound:
\[ {\mathbb{P}_{}\left( {|X_1| \geq t} \right)} \leq 2e^{-t^2/2} \]
No improvement with dimension! The bound is the same for \(N = 10\) or \(N = 10{,}000\)
\(X_1 \sim \mathcal{N}(0, 1)\) regardless of \(N\)—adding more coordinates doesn’t help
Compare with \(\bar{X} = \frac{1}{N}\sum_{i=1}^N X_i\)
Gradient: \(\nabla \bar{X} = \frac{1}{N}(1, 1, \ldots, 1)^\top\)
Lipschitz constant: \(L = \|\nabla \bar{X}\|_2 = \frac{1}{\sqrt{N}}\)
Concentration bound:
\[ {\mathbb{P}_{}\left( {|\bar{X}| \geq t} \right)} \leq 2\exp\left(-\frac{t^2 N}{2}\right) \]
Improves with dimension! Exponentially tighter as \(N\) grows
\(\bar{X} \sim \mathcal{N}(0, 1/N)\): variance shrinks as \(1/N\)
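The two examples can be checked side by side by simulation (sample sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
t, draws = 0.5, 10_000
results = {}
for N in (10, 1000):
    X = rng.standard_normal((draws, N))
    # f(X) = X_1 has L = 1; f(X) = mean has L = 1/sqrt(N).
    results[N] = (np.mean(np.abs(X[:, 0]) >= t),
                  np.mean(np.abs(X.mean(axis=1)) >= t))
    print(N, results[N])
```

The tail of \(X_1\) is essentially unchanged across \(N\), while the tail of \(\bar{X}\) collapses toward zero.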
When Does Concentration Improve with \(N\)?
Concentration improves with dimension when no single coordinate dominates:

- the gradient weight is spread across coordinates, so \(L = \|\nabla f\|_2\) shrinks with \(N\) (e.g., \(L = 1/\sqrt{N}\) for \(\bar{X}\));
- coordinates are independent or only weakly dependent, so \({\varrho\!\left( {\Sigma} \right)}\) stays bounded;
- in Ledoux's phrase, \(f\) depends on many variables "but not too much on any of them."

Proposition: Johnson–Lindenstrauss (JL) Lemma
For any \(\epsilon \in (0,1)\), \(\delta \in (0,1)\), and points \(\{x_1,\dots,x_M\}\subset\mathbb{R}^N\), let \(R \in \mathbb{R}^{k \times N}\) be a random matrix (e.g., Gaussian or Rademacher entries) scaled by \(1/\sqrt{k}\). If
\[ k \geq C\, \epsilon^{-2} \log(M^2/\delta) \]
for a sufficiently large absolute constant \(C\),
then with probability at least \(1-\delta\), for all \(i,j\):
\[ (1-\epsilon)\|x_i-x_j\|_2^2 \;\le\; \|R x_i - R x_j\|_2^2 \;\le\; (1+\epsilon)\|x_i-x_j\|_2^2. \]
Each individual distance concentrates:
\[ {\mathbb{P}_{}\left( {\left|\|Rx\|_2^2 - \|x\|_2^2\right| > \epsilon \|x\|_2^2} \right)} \lesssim e^{-c \epsilon^2 k} \]
Union bound over \(\mathrm{O}(M^2)\) pairs: multiply failure probability by \(M^2\)
To keep total failure \(\leq \delta\): need \(e^{-c\epsilon^2 k} \cdot M^2 \leq \delta\)
Solving: \(k \gtrsim \epsilon^{-2}(2\log M + \log(1/\delta)) = \epsilon^{-2} \log(M^2/\delta)\)
Dimension \(N\) appears nowhere! Randomness “averages out” across all \(N\) coordinates
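A minimal numerical sketch of the JL lemma, with illustrative sizes: project \(M = 50\) points from \(N = 5{,}000\) down to \(k = 1{,}000\) dimensions and measure the worst pairwise distortion:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, k = 5_000, 50, 1_000

X = rng.standard_normal((M, N))               # M points in R^N
R = rng.standard_normal((k, N)) / np.sqrt(k)  # Gaussian JL matrix
Y = X @ R.T                                   # projected points in R^k

def pairwise_sq_dists(Z):
    # ||z_i - z_j||^2 = ||z_i||^2 + ||z_j||^2 - 2 z_i . z_j
    sq = np.sum(Z**2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

iu = np.triu_indices(M, k=1)                  # each unordered pair once
ratio = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
max_distortion = np.max(np.abs(ratio - 1.0))
print(f"max distortion over {len(ratio)} pairs: {max_distortion:.3f}")
```

Note that the same \(R\) handles all \(\binom{M}{2}\) pairs at once, and nothing in the distortion depends on \(N\).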
Important Limitation
JL preserves distances among a fixed set of points \(\{x_1, \ldots, x_M\}\).
It provides no guarantee for distances involving a new point \(x_{M+1}\).
| Algorithm type | What is preserved | Economic application |
|---|---|---|
| JL / Random Projections | Pairwise geometry / inner products | High-dimensional moment inequalities |
| Hutchinson Trace | Spectral trace / quadratic forms | Variance decomposition |
| Randomized SVD / Sketching | Low-rank structure | Factor models, macro-finance |
| Stochastic Trace Estimation | Operator traces | Continuous-time asset pricing |
We now look at canonical economic applications.
Debiasing quadratic forms: for an unbiased estimator \(\hat{\beta}\) of \(\beta\), the plug-in quadratic form picks up a bias proportional to a trace,
\[ \mathbb{E}[\hat{\beta}^{\top} A \hat{\beta}] = \beta^{\top} A \beta + \sigma^2 \cdot \mathrm{Tr}(B), \]
where \(B\) is determined by the sampling variance of \(\hat{\beta}\) (for OLS with \(\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}\), take \(B = A (X^\top X)^{-1}\)).
To debias: estimate \(\mathrm{Tr}(B)\) via Hutchinson’s trick with Rademacher vectors \(z_j \in \{\pm1\}^N\):
\[ \mathrm{Tr}(B) = \mathbb{E}_{z}[z^\top B z] \approx \frac{1}{m}\sum_{j=1}^m z_j^\top (B z_j) \]
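A minimal sketch of Hutchinson's estimator on a synthetic matrix (the matrix \(B\) below is a stand-in; in applications one only ever needs matrix–vector products \(Bz\), never \(B\) itself):

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 200, 5_000

G = rng.standard_normal((N, N))
B = G @ G.T / N                    # stand-in for the trace target

# Hutchinson: Tr(B) = E[z^T B z] for i.i.d. Rademacher z; each sample
# needs only one matrix-vector product B z.
Z = rng.choice([-1.0, 1.0], size=(m, N))
samples = np.sum((Z @ B) * Z, axis=1)   # entry j is z_j^T B z_j
print(samples.mean(), np.trace(B))
```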
Consider a high-dimensional diffusion on state \(X_t \in \mathbb{R}^N\):
\[ dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad X_t \in \mathbb{R}^N, \quad W_t \in \mathbb{R}^K \text{ a standard Brownian motion} \]
The infinitesimal generator, \(\mathcal{A}\), for this process is
\[ \mathcal{A}f(X) = \mu(X)^\top \nabla f(X) + \frac{1}{2} {\mathrm{Tr}\left( {\sigma(X) \sigma(X)^\top\nabla^2 f(X)} \right)} \]
Simple HJBE for asset pricing,
\[ \rho V(X) = u(X) + \mathcal{A}V(X) = u(X) + \mu(X)^\top \nabla V(X) + \frac{1}{2} {\mathrm{Tr}\left( {\sigma(X) \sigma(X)^\top \nabla^2 V(X)} \right)}, \]
Goal: Compute \({\mathrm{Tr}\left( {\sigma(X)\sigma(X)^{\top} \nabla^2 V(X)} \right)}\)
Hutchinson estimator: For \(z \sim \mathcal{N}(0, I_K)\), let \(w = \sigma(X) z \in \mathbb{R}^N\)
By cyclicity of the trace, \({\mathrm{Tr}\left( {\sigma(X)\sigma(X)^\top \nabla^2 V(X)} \right)} = {\mathrm{Tr}\left( {\sigma(X)^\top \nabla^2 V(X)\, \sigma(X)} \right)}\), so
\[ {\mathrm{Tr}\left( {\sigma(X)^\top \nabla^2 V(X) \, \sigma(X)} \right)} = \mathbb{E}_{w\sim \mathcal{N}(0, \sigma(X) \sigma(X)^\top)}[w^\top \nabla^2 V(X) \, w] \]
Estimate with \(m\) samples (each with \(\mathrm{O}(N)\) cost): \[ {\mathrm{Tr}\left( {\sigma(X)^\top \nabla^2 V(X) \, \sigma(X)} \right)} \approx \frac{1}{m} \sum_{j=1}^m w_j^\top \left(\nabla^2 V(X) \, w_j\right), \quad w_j \sim \mathcal{N}(0, \sigma(X) \sigma(X)^\top) \]
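A sketch under simplifying assumptions: a hypothetical quadratic value function \(V(x) = \frac{1}{2} x^\top H x\), so \(\nabla^2 V = H\) is known in closed form; in practice \(\nabla^2 V(X)\, w\) would come from a Hessian–vector product via automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, m = 100, 20, 20_000

# Hypothetical ingredients: diffusion loading sigma and Hessian H of
# a quadratic value function V(x) = 0.5 x^T H x.
sigma = rng.standard_normal((N, K)) / np.sqrt(K)
H = np.diag(rng.uniform(0.5, 1.5, size=N))

exact = np.trace(sigma @ sigma.T @ H)   # the trace term in the HJBE

# Hutchinson: z ~ N(0, I_K), w = sigma z ~ N(0, sigma sigma^T);
# average w^T (H w) over m draws.
W = rng.standard_normal((m, K)) @ sigma.T   # rows are w_j = sigma z_j
estimate = np.mean(np.sum((W @ H) * W, axis=1))
print(estimate, exact)
```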
Goal: Estimate discrete-choice models with extremely high-dimensional regressors \(X^t \in \mathbb{R}^d\)
Use random projections to reduce the dimension from \(d\) to \(k \ll d\)
Estimation relies on cyclic monotonicity of choice probabilities
Cyclic Monotonicity Inequalities: For any cycle \(t_1,\dots,t_L\) with \(t_{L+1} \equiv t_1\):
\[ \sum_{\ell=1}^L \big(X^{t_{\ell+1}}\beta - X^{t_\ell}\beta \big)^\top p^{t_\ell} \;\le\; 0. \]
This yields a finite collection of inner-product inequalities in \(\beta\).
Apply a random projection \(\widetilde X^t = R X^t\), with \(k \ll d\).
Note
Essence: One random projection must preserve many inequalities simultaneously — this is exactly the JL regime.
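A minimal sketch of this point (all sizes and variables below are illustrative, not from the estimation procedure): a single projection \(R\) approximately preserves the inner products entering every inequality at once:

```python
import numpy as np

rng = np.random.default_rng(6)
d, k, T = 10_000, 500, 100

beta = rng.standard_normal(d) / np.sqrt(d)    # hypothetical coefficients
X = rng.standard_normal((T, d)) / np.sqrt(d)  # hypothetical regressor rows
R = rng.standard_normal((k, d)) / np.sqrt(k)  # one shared JL projection

# Inner products before vs. after projecting BOTH regressors and beta.
orig = X @ beta
proj = (X @ R.T) @ (R @ beta)
max_err = np.max(np.abs(orig - proj))
print(f"max inner-product error across {T} observations: {max_err:.3f}")
```

One draw of \(R\) keeps all \(T\) inner products within a small additive error, which is exactly what the union-bound argument behind JL delivers.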