Machine Learning Fundamentals for Economists
\[ \min_{\theta} \mathcal{L}(\theta) \]
Will briefly introduce gradient descent and its stochastic variants
\[ \theta_{t+1} = \theta_t - \eta_t \nabla \mathcal{L}(\theta_t) \]
Skipping a million details, see ProbML Book 1 Section 8.2.2 and Mark Schmidt’s basic and more advanced notes
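As a concrete illustration, a minimal sketch of the update rule above on a hypothetical quadratic loss (the names `grad_descent`, `A`, and `b` are ours, invented for this example):

```python
import numpy as np

def grad_descent(grad, theta0, eta, T):
    # theta_{t+1} = theta_t - eta * grad(theta_t), with a constant step size
    theta = theta0
    for _ in range(T):
        theta = theta - eta * grad(theta)
    return theta

# Minimize L(theta) = 0.5 theta' A theta - b' theta, with gradient A theta - b
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
theta_star = grad_descent(lambda th: A @ th - b, np.zeros(2), eta=0.2, T=500)
# Converges toward the solution of A theta = b
```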
For strictly convex problems this converges to the global minimum; sufficient step-size conditions are the Robbins-Monro conditions
\[ \sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty \]
(which imply \(\lim_{t\to\infty}\eta_t = 0\))
For problems that are not globally convex this may only reach a local optimum; if the function is strictly convex in a neighborhood of that point, convergence to the local optimum is still guaranteed
For other types of functions (e.g., invex) it may still converge to the “right” solution in some important sense
As we saw analyzing LLS, badly conditioned problems converge slowly with iterative methods
We can precondition a problem as we did with linear systems, and it has the same stationary point
Choose some \(C_t\) for preconditioned gradient descent \[ \theta_{t+1} = \theta_t - \eta_t C_t \nabla \mathcal{L}(\theta_t) \]
We saw before that the Hessian tells us the geometry, so the optimal preconditioner must be related to \(\nabla^2 \mathcal{L}(\theta_t)\)
Choosing \(C_t = \left[\nabla^2 \mathcal{L}(\theta_t)\right]^{-1}\) gives Newton's method \[ \theta_{t+1} = \theta_t - \eta_t \left[\nabla^2 \mathcal{L}(\theta_t) \right]^{-1}\nabla \mathcal{L}(\theta_t) \]
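For a quadratic loss, a full Newton step (\(\eta_t = 1\)) lands on the minimizer in one iteration. A small sketch with a made-up `A` and `b`:

```python
import numpy as np

# Hypothetical quadratic loss: L(theta) = 0.5 theta' A theta - b' theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])

grad = lambda th: A @ th - b   # gradient
hess = lambda th: A            # Hessian (constant for a quadratic)

theta = np.zeros(2)
# One full Newton step: solve the Hessian system rather than forming the inverse
theta = theta - np.linalg.solve(hess(theta), grad(theta))
```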
Momentum/acceleration extrapolates using the previous iterate before evaluating the gradient \[ \begin{aligned} \hat{\theta}_{t+1} &= \theta_t + \beta_t(\theta_t - \theta_{t-1})\\ \theta_{t+1} &= \hat{\theta}_{t+1} - \eta_t \nabla \mathcal{L}(\hat{\theta}_{t+1}) \end{aligned} \]
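A sketch of this accelerated update on a hypothetical quadratic; the function name and the specific `eta`/`beta` values are illustrative, not tuned:

```python
import numpy as np

def accelerated_gd(grad, theta0, eta, beta, T):
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(T):
        theta_hat = theta + beta * (theta - theta_prev)  # extrapolation step
        theta_prev = theta
        theta = theta_hat - eta * grad(theta_hat)        # gradient step at theta_hat
    return theta

# Quadratic loss L(theta) = 0.5 theta' A theta - b' theta
A = np.array([[3.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
theta = accelerated_gd(lambda th: A @ th - b, np.zeros(2), eta=0.3, beta=0.27, T=300)
```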
Adding \(L^2\) regularization (i.e., weight decay) with strength \(\alpha\) \[ \min_{\theta}\left[\mathcal{L}(\theta) + \frac{\alpha}{2} ||\theta||^2\right] \] yields the update \[ \theta_{t+1} = \theta_t - \eta_t \left[\nabla \mathcal{L}(\theta_t) + \alpha \theta_t\right] \]
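A quick sketch of how this shifts the fixed point: for a hypothetical quadratic loss with Hessian `A`, the iterates now converge to the solution of \((A + \alpha I)\theta = b\) rather than \(A\theta = b\):

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 1.0]])  # Hessian of a hypothetical quadratic loss
b = np.array([1.0, 1.0])
alpha, eta = 0.5, 0.1

theta = np.zeros(2)
for _ in range(1000):
    grad = A @ theta - b                           # gradient of the unregularized loss
    theta = theta - eta * (grad + alpha * theta)   # extra alpha * theta term
# Fixed point solves (A + alpha I) theta = b instead of A theta = b
```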
In practice, it is often infeasible (or at least wasteful) to calculate the full gradient for large datasets
In GD, the gradient provided the direction of steepest descent
Consider an algorithm using \(g_t\), an unbiased estimate of the gradient
\[ \begin{aligned} \theta_{t+1} &= \theta_t - \eta_t g_t\\ \mathbb{E}[g_t] &= \nabla \mathcal{L}(\theta_t) \end{aligned} \]
Remember: we don’t need the actual value of the objective function to optimize it!
Will also often add regularization, which can help with generalization
\[ \min_{\theta}\overbrace{\mathbb{E}_{q(z)} \tilde{\mathcal{L}}(\theta, z)}^{\equiv \mathcal{L}(\theta)} \]
\[ \nabla \mathcal{L}(\theta) = \mathbb{E}_{q(z)}\left[\nabla \tilde{\mathcal{L}}(\theta, z)\right] \]
\[ \mathbb{E}_{q(z)}\left[ \nabla \tilde{\mathcal{L}}(\theta_t, z_t) \right] = \nabla \mathcal{L}(\theta_t) \]
\[ \theta_{t+1} = \theta_t - \eta_t \nabla \tilde{\mathcal{L}}(\theta_t, z_t) \]
Consider the special case where the loss function is an average of \(N\) terms, e.g., empirical risk minimization as used in LLS/etc.
\[ \mathcal{L}(\theta) = \frac{1}{N}\sum_{n=1}^N \tilde{\mathcal{L}}(\theta, z_n) \equiv \frac{1}{N}\sum_{n=1}^N \ell(\theta, x_n, y_n) \]
In this case, the randomness of \(z_t\) is which data point is chosen
\[ \theta_{t+1} = \theta_t - \eta_t \nabla_{\theta} \ell(\theta_t, x_t, y_t) \]
The variance of this single-sample estimator can be large \[ \mathbb{E}\left[\left\|\nabla_{\theta} \ell(\theta_t, x_t, y_t)- \nabla \mathcal{L}(\theta_t)\right\|^2\right] \]
Averaging over a minibatch \(B\) of indices reduces this variance \[ \frac{1}{|B|}\sum_{n \in B} \nabla_{\theta} \ell(\theta_t, x_n, y_n) \]
The minibatch SGD algorithm draws a set of indices \(B_t\) at each step and takes an SGD step \[ \begin{aligned} g_t &\equiv \frac{1}{|B_t|}\sum_{n \in B_t} \nabla_{\theta} \ell(\theta_t, x_n, y_n)\\ \theta_{t+1} &= \theta_t - \eta_t g_t \end{aligned} \]
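A self-contained sketch of minibatch SGD for a linear regression loss; the simulated data, "true" coefficients, and hyperparameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, batch_size, eta = 1000, 32, 0.1
theta_true = np.array([1.0, -2.0])            # hypothetical "true" coefficients
X = rng.normal(size=(N, 2))
y = X @ theta_true + 0.01 * rng.normal(size=N)

theta = np.zeros(2)
for t in range(2000):
    B = rng.choice(N, size=batch_size, replace=False)      # draw minibatch indices
    g = 2.0 / batch_size * X[B].T @ (X[B] @ theta - y[B])  # unbiased gradient estimate
    theta = theta - eta * g
# theta wanders near theta_true, up to noise from the constant step size
```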
Note that we never need to calculate \(\mathcal{L}(\theta_t)\) directly, so we can write our code to operate entirely on batches \(B_t\)
Then layer other tricks on top (e.g., momentum, preconditioning, etc.)
A standard way to do this for Empirical Risk Minimization/Regressions/etc. is to split the data into three parts: training, validation, and test sets
Not all problems will have this structure (in particular, a “validation” set).
Some common software components for optimization are the model/hypothesis class, the loss function, the optimizer, and the data loader
\[ \min_{\theta} \frac{1}{N} \sum_{n=1}^N \left[y_n - x_n \cdot \theta\right]^2 \]
\[ y \sim N(x \cdot \theta, \sigma^2) \]
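One way to simulate from this model and wrap it in a PyTorch DataLoader; the variable names, sample size, and batch size of 8 are illustrative choices:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
N, sigma = 500, 0.5
theta = torch.tensor([1.0, -2.0])
X = torch.randn(N, 2)
Y = X @ theta + sigma * torch.randn(N)  # simulate y ~ N(x . theta, sigma^2)

train_loader = DataLoader(TensorDataset(X, Y), batch_size=8, shuffle=True)
X_batch, Y_batch = next(iter(train_loader))  # one minibatch of (X, y) pairs
```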
Iterating over train_loader yields one batch_size chunk at a time, e.g., an [X_batch, Y_batch] pair:

[tensor([[ 0.6299, -0.0860],
         [ 1.0579,  0.2490],
         [-0.4264,  1.3422],
         [ 1.8625,  0.7344],
         [-1.1870, -0.9154],
         [ 1.1389, -1.5414],
         [-0.1945,  0.3964],
         [-0.9229, -1.5121]]), tensor([ 0.7331,  0.7783, -1.9794,  1.0244, -0.1371,  2.9250, -0.6530,  0.8181])]
nn.Module in PyTorch: a special case of Neural Networks, holding the parameters for your underlying model(s)
for ... in train_loader: repeats until the end of the data, then continues to the next epoch (i.e., pass through the data)
step using the optimizer, which is unaware of epochs/batches/etc.

for epoch in range(300):
    for X_batch, Y_batch in train_loader:
        optimizer.zero_grad()
        loss = residuals(model, X_batch, Y_batch)  # primal
        loss.backward()  # backprop/reverse-mode AD
        # Now the model.parameters have gradients updated, so...
        optimizer.step()  # Update parameters using the stored gradients

print(f"||theta - theta_hat|| = {torch.norm(theta - model.weight.squeeze())}")

||theta - theta_hat|| = 8.908539894036949e-05
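The training loop above references model, optimizer, and residuals without defining them. A minimal sketch of how they might be set up; the linear hypothesis class matches the regression objective, but the choice of SGD and the learning rate are assumptions:

```python
import torch
from torch import nn

model = nn.Linear(2, 1, bias=False)  # linear hypothesis class: x . theta
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

def residuals(model, X_batch, Y_batch):
    # mean squared residual over the minibatch
    return ((model(X_batch).squeeze(-1) - Y_batch) ** 2).mean()
```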
Iterates in batch_size chunks at a time, one pass through the data per epoch
val_loss is collected and displayed at the end of each epoch
This adds in support for Weights and Biases, and also demonstrates the use of a custom nn.Module for the hypothesis class
pip install -r requirements.txt
wandb login in the terminal to connect to your account
You will then be able to run these files and see results on wandb.ai
Pass hyperparameters with CLI flags such as --model.batch_size=32 and --model.lr=0.001, etc.
Use val_loss as a HPO objective
wandb sweep lectures/examples/linear_regression_pytorch_sweep.yaml
wandb agent <sweep_id> with the returned sweep id
Run wandb agent <sweep_id> on multiple computers to run in parallel