Machine Learning Fundamentals for Economists
\[ \min_x \| Ax -b \|^2 \]
\[ A^\top A x = A^\top b \quad \Rightarrow \quad x = (A^\top A)^{-1} A^\top b \]
\[ (QR)^\top (QR) x = (QR)^\top b \]
\[ R^\top Q^\top Q R x = R^\top Q^\top b \]
\[ R x = Q^\top b \]
In lstsq and similar functions, QR is used for numerical stability.

import numpy as np
from scipy.linalg import lstsq, solve   # assumed imports for the snippet below

N, M = 10, 3
A = np.random.randn(N, M) + np.random.randn(N, 1)   # shared shock makes the columns correlated
b = np.random.randn(N)
# QR-based solution
x_lstsq, residuals, rank, s = lstsq(A, b)
# Normal equations (less stable but instructive)
x_normal = solve(A.T @ A, A.T @ b)
print(f"lstsq solution: {x_lstsq}")
print(f"Normal eqn sol: {x_normal}")

lstsq solution: [-0.4229563 0.63714154 -0.16902314]
Normal eqn sol: [-0.4229563 0.63714154 -0.16902314]
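To make the QR derivation above concrete, here is a minimal sketch of the same solve done explicitly (np.linalg.qr and scipy.linalg.solve_triangular are additions not in the original snippet):

from scipy.linalg import solve_triangular

Q, R = np.linalg.qr(A)                  # reduced QR: A = Q R, Q has orthonormal columns
x_qr = solve_triangular(R, Q.T @ b)     # back-substitution on R x = Q^T b
print(np.allclose(x_qr, x_lstsq))       # matches the lstsq solution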
When A is rank deficient, lstsq still returns a unique answer: the minimum-norm solution. Output from such an example (here the matrix has 3 columns but only rank 2):

Solution: [-3.05555556 0.11111111 3.27777778]
Rank: 2, Columns: 3
\[ \min_x \| x \|_2^2 \quad \text{s.t.}\, Ax = b \]
\[ \lim_{\lambda \to 0}\left[\min_x \| Ax -b \|_2^2 + \lambda \| x \|_2^2\right] \]
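A quick numerical check of this limit (a sketch; the rank-deficient design below is an assumption, not the data behind the output above): as \(\lambda \to 0\), the ridge solution converges to the minimum-norm solution given by the pseudoinverse.

rng = np.random.default_rng(0)
A_rd = rng.standard_normal((10, 2))
A_rd = np.column_stack([A_rd, A_rd[:, 0] + A_rd[:, 1]])   # third column is a linear combination: rank 2
b_rd = rng.standard_normal(10)

x_minnorm = np.linalg.pinv(A_rd) @ b_rd                   # minimum-norm least-squares solution
for lam in [1.0, 1e-2, 1e-4, 1e-8]:
    x_ridge = np.linalg.solve(A_rd.T @ A_rd + lam * np.eye(3), A_rd.T @ b_rd)
    print(lam, np.linalg.norm(x_ridge - x_minnorm))       # shrinks toward zero as lam -> 0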
Let \(\Omega\) be an open subset of \({\mathbb{R}}^n\), and consider functions \(u: \Omega \rightarrow {\mathbb{R}}\).
\(L^p(\Omega)\) space: A function \(f: \Omega \rightarrow {\mathbb{R}}\) is in \(L^p(\Omega)\) if:
\[ \int_\Omega |f|^p \, dx < \infty \]
A function \(u\) belongs to the Sobolev space \(W^{k,p}(\Omega)\) if:
\[ u \in L^p(\Omega) \]
and all its weak derivatives up to order \(k\) are also in \(L^p(\Omega)\).
A function \(v \in L^p(\Omega)\) is said to be the weak derivative \(D^\alpha u\) of \(u\) if:
\[ \int_\Omega u\, D^\alpha \phi \, dx = (-1)^{|\alpha|} \int_\Omega v\, \phi \, dx \]
for all smooth, compactly supported test functions \(\phi\); membership in \(W^{k,p}(\Omega)\) requires such a \(v\) to exist for every multi-index \(\alpha\) with \(|\alpha| \leq k\).
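A standard one-dimensional example may help fix ideas: on \(\Omega = (-1,1)\), the function \(u(x) = |x|\) has no classical derivative at \(0\), but it has the weak derivative \(v(x) = \operatorname{sign}(x)\), since integrating by parts on each half-interval gives
\[ \int_{-1}^{1} |x|\, \phi'(x) \, dx = -\int_{-1}^{1} \operatorname{sign}(x)\, \phi(x) \, dx \quad \text{for all } \phi \in C_c^\infty(-1,1). \]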
The Sobolev Norm for a function \(u \in W^{k,p}(\Omega)\) is defined as some variation on:
\[ \| u \|_{W^{k,p}(\Omega)} = \left( \sum_{|\alpha| \leq k} \int_\Omega |D^\alpha u|^p \, dx \right)^{1/p} \]
where \(\alpha\) is a multi-index and \(D^\alpha u\) represents the weak derivative of \(u\).
Different choices of \(k\), \(p\), and which multi-index terms \(\alpha\) to include give different (semi)norms.
The key one to keep in mind is \(W^{1,2}(\Omega)\): the space of square-integrable functions whose first (weak) derivatives are also square-integrable.
The corresponding (semi)norm keeps only the first-derivative term:
\[ \| u \|_{W^{1,2}(\Omega)} = \left( \int_\Omega |\nabla u|^2 \, dx \right)^{1/2} \]
For a linear function \(f(x) = \beta^\top x\) we have \(\nabla f = \beta\), so (on a domain of unit measure)
\[ \|f\|_{W^{1,2}} = \|\beta\|_2 \]
where \(\|\beta\|_2\) is the Euclidean norm of the vector (or the Frobenius norm if \(\beta\) is a matrix of coefficients).
Now let's reinterpret our “ridgeless regression” with this function-space norm in mind:
\[ \lim_{\lambda \to 0}\left[\min_{\beta} \| A\beta -b \|_2^2 + \lambda \| \beta \|_2^2\right] \]
Recall that this was also
\[ \min_{\beta} \| \beta \|_2^2 \quad \text{s.t.}\, A\beta = b \]
This isn’t just least squares: it is a constrained minimum-norm problem,
\[ \begin{aligned} \min_{\beta} & \,\| \beta \|_2^2\\ &\text{s.t.}\, A\beta = b \end{aligned} \]
Among all coefficient vectors that achieve the best fit (the projection of \(b\) onto the column space of the data), the min-norm solution is the one with the smallest norm.
For a given norm, it is the unique solution to a well-specified problem, one that can often be interpreted as an appeal to simplicity.
It is also the unique “most stable” solution for a given norm: loosely, min-norm solutions are the ones least sensitive to perturbations of the data. We will apply this interpretation to ML and nonlinear models as well.
This will also come up in Bayesian statistics: if we apply a prior that is asymptotically non-informative, the min-norm solution is the limit of the MAP solution.
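To spell out that connection (a sketch of the standard Gaussian-prior argument; the variances \(\sigma^2\) and \(\tau^2\) are introduced here for illustration): with likelihood \(y \mid \beta \sim N(X\beta, \sigma^2 I)\) and prior \(\beta \sim N(0, \tau^2 I)\), the MAP estimator solves
\[ \max_{\beta} \; -\frac{1}{2\sigma^2}\| X\beta - y \|_2^2 - \frac{1}{2\tau^2}\| \beta \|_2^2 \;=\; \min_{\beta} \; \| X\beta - y \|_2^2 + \frac{\sigma^2}{\tau^2}\| \beta \|_2^2, \]
i.e. ridge with \(\lambda = \sigma^2/\tau^2\); letting \(\tau^2 \to \infty\) (the prior becomes non-informative) sends \(\lambda \to 0\) and recovers the min-norm solution.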
\[ \min_{\beta} \underbrace{\frac{1}{2}\left[\| X\beta -y \|_2^2 + \lambda \| \beta \|_2^2\right]}_{\equiv f(X;\beta)} \]
\[ \nabla^2 f(X;\beta^*) = X^{\top} X + \lambda I \]
A (symmetric) matrix \(A\) is positive definite if \(x^{\top} A x > 0\) for all \(x \neq 0\) (positive semi-definite if \(\geq 0\)).
With the spectral decomposition of a symmetric \(A\),
\[ A = Q \Lambda Q^{\top} \]
where \(Q\) is orthogonal and \(\Lambda\) is diagonal, \(A\) is positive (semi-)definite exactly when every eigenvalue in \(\Lambda\) is positive (non-negative).
In more abstract, infinite-dimensional spaces, a linear operator \(A\) is positive definite if \(x \cdot A(x) > 0\) for all \(x \neq 0\).
Eigenvalues: [1.38196601 3.61803399] (all positive)
Eigenvalues: [0. 2.] (one is zero!)
Eigenvalues: [-1. 1.] (mixed signs = indefinite)
Eigenvalues after regularization: [0.1 2.1] (both positive now)
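The code behind these checks is not shown; a minimal sketch consistent with the printed eigenvalues (the specific matrices are assumptions chosen to reproduce them) is:

# Assumed matrices: chosen so that np.linalg.eigvalsh reproduces the eigenvalues printed above
A_pd = np.array([[2.0, -1.0], [-1.0, 3.0]])    # positive definite
A_psd = np.array([[1.0, 1.0], [1.0, 1.0]])     # positive semi-definite (singular)
A_ind = np.array([[0.0, 1.0], [1.0, 0.0]])     # indefinite
print("Eigenvalues:", np.linalg.eigvalsh(A_pd))
print("Eigenvalues:", np.linalg.eigvalsh(A_psd))
print("Eigenvalues:", np.linalg.eigvalsh(A_ind))
print("Eigenvalues after regularization:", np.linalg.eigvalsh(A_psd + 0.1 * np.eye(2)))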

import matplotlib.pyplot as plt

# P is assumed here: built as Q diag(Lambda) Q^T with Lambda = (1, 1), matching the title below
Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
              [-np.sqrt(2)/2, np.sqrt(2)/2]])
P = Q @ np.diag([1.0, 1.0]) @ Q.T
x_vals = np.linspace(-1, 1, 100)
X1, X2 = np.meshgrid(x_vals, x_vals)
Z = 0.5 * (P[0,0]*X1**2 + 2*P[0,1]*X1*X2 + P[1,1]*X2**2)   # f(x) = 0.5 x^T P x on the grid
plt.figure(figsize=(6, 5))
cs = plt.contour(X1, X2, Z, levels=20, cmap='viridis')
plt.clabel(cs, inline=True, fontsize=8)
plt.xlabel('x1'); plt.ylabel('x2')
plt.title(r'$f(x) = \frac{1}{2} x^\top P x$ with $\Lambda = (1, 1)$')
plt.gca().set_aspect('equal')
\[ \min_{x} \frac{1}{2}x^{\top} P x \]
\[ x^{i+1} = x^i - \eta \nabla f(x^i) = x^i - \eta P x^i \]
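Writing the update in the eigenbasis of \(P\) shows why conditioning matters: with \(P = Q \Lambda Q^{\top}\) and \(z^i \equiv Q^{\top} x^i\),
\[ z^{i+1} = (I - \eta \Lambda) z^i \quad \Rightarrow \quad z_j^{i+1} = (1 - \eta \lambda_j)\, z_j^i, \]
so each coordinate contracts at rate \(|1 - \eta \lambda_j|\): the step size \(\eta\) must be small enough for the largest eigenvalue, which makes progress along directions with small \(\lambda_j\) painfully slow.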
def plot_gd_steps(N_steps, x_0, Lambda, eta):
    # Build P = Q diag(Lambda) Q^T with an orthogonal (rotation) Q
    Q = np.array([[np.sqrt(2)/2, np.sqrt(2)/2],
                  [-np.sqrt(2)/2, np.sqrt(2)/2]])
    P = Q @ np.diag(Lambda) @ Q.T
    gd_step = lambda x: x - eta * P @ x   # gradient descent update for f(x) = 0.5 x^T P x
    x_vals = np.linspace(-1, 1, 100)
    X1, X2 = np.meshgrid(x_vals, x_vals)
    Z = 0.5 * (P[0,0]*X1**2 + 2*P[0,1]*X1*X2 + P[1,1]*X2**2)
    plt.figure(figsize=(6, 5))
    plt.contour(X1, X2, Z, levels=20, cmap='viridis')
    x_current = np.array(x_0)
    for i in range(N_steps):
        x_next = gd_step(x_current)
        plt.arrow(x_current[0], x_current[1],
                  x_next[0] - x_current[0], x_next[1] - x_current[1],
                  head_width=0.03, head_length=0.02, fc='red', ec='red')
        x_current = x_next
    plt.xlabel('x1'); plt.ylabel('x2')
    plt.gca().set_aspect('equal')
    return plt.gcf()

Figures (not shown): gradient descent paths for \(\Lambda = (1, 1)\) (well-conditioned), \(\Lambda = (1, 0.5)\) (moderately conditioned), \(\Lambda = (1, 0.1)\) (poorly conditioned), \(\Lambda = (1, 0.01)\) (terribly conditioned), and \(\Lambda = (1, 0)\) (singular, semi-definite).
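The calls that generated those panels are not included in the text; a hypothetical reconstruction (the step count, starting point, and step size \(\eta\) below are illustrative guesses) could look like:

# Illustrative values; the original calls are not shown
for Lam, label in [((1, 1), 'Well-conditioned'), ((1, 0.5), 'Moderately conditioned'),
                   ((1, 0.1), 'Poorly conditioned'), ((1, 0.01), 'Terribly conditioned'),
                   ((1, 0), 'Singular (semi-definite)')]:
    fig = plot_gd_steps(N_steps=20, x_0=[0.9, 0.9], Lambda=Lam, eta=0.5)
    plt.title(rf'$\Lambda = {Lam}$: {label}')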

Geometry, not dimensionality, is the key to understanding a large class of algorithms (anything you would use in high dimensions).
\[ \text{cond}(A) = \left|\frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}\right| \]
Regularization: adding a ridge term \(\alpha \|x\|_2^2\) shifts the spectrum to \(\lambda_i + \alpha\), bounding the condition number even when \(\lambda_{\min} = 0\).
See more in Mark Schmidt’s Notes on Gradient Descent
Since \(\text{cond}(X^{\top} X) = \text{cond}(X)^2\), forming the normal equations squares the condition number, and nearly collinear columns make it explode:

cond(X) = 1.34
cond(X.T @ X) = 1.80
cond(X_col.T @ X_col) = 1.65e+16
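The snippet that produced these numbers is not shown; a sketch of the same comparison (a random X and an X_col with a nearly collinear third column, both assumptions) is:

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
X_col = X.copy()
X_col[:, 2] = X_col[:, 0] + X_col[:, 1] + 1e-8 * rng.standard_normal(100)   # nearly collinear column
print(f"cond(X) = {np.linalg.cond(X):.2f}")
print(f"cond(X.T @ X) = {np.linalg.cond(X.T @ X):.2f}")
print(f"cond(X_col.T @ X_col) = {np.linalg.cond(X_col.T @ X_col):.2e}")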
For large problems, the lx.NormalCG() solver (from the lineax library) handles least squares via iterative methods: conjugate gradient applied to the normal equations.
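A minimal sketch of that route, assuming lineax is imported as lx (the exact argument names and tolerances here are assumptions; check the lineax documentation):

import jax.numpy as jnp
import jax.random as jr
import lineax as lx

key = jr.PRNGKey(0)
A = jr.normal(key, (10, 3))
b = jr.normal(jr.PRNGKey(1), (10,))

operator = lx.MatrixLinearOperator(A)        # wrap the (possibly non-square) matrix
solver = lx.NormalCG(rtol=1e-6, atol=1e-6)   # CG on the normal equations
solution = lx.linear_solve(operator, b, solver)
print(solution.value)                        # least-squares solution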