Machine Learning Fundamentals for Economists
Let \(x \in \mathbb{R}^M\) be observations and \(X \in \mathbb{R}^{N \times M}\) be the “design” matrix
To solve \(\min_{\theta}\left\{||X \theta - y||^2 + \lambda||\theta||^2\right\}\) for \(\lambda \geq 0\), form normal eqs.
\[ \theta = (X^{\top}X + \lambda I)^{-1} X^{\top} y \]
Without proof, turns out you can rearrange this expression to get the “other” normal equations
\[ \theta = X^{\top} (X X^{\top} + \lambda I)^{-1} y \]
Since \(M \ll N\) typically, often a bigger system to solve
\[ K_{ij} \equiv [X X^{\top}]_{ij} = \mathcal{K}(x_i, x_j)\,\text{ for i = 1, ..., N and j = 1, ..., N} \]
\[ \tilde{K}_{ij} \equiv \mathcal{K}(\tilde{x}_i, x_j),\,\text{ for i = 1, ..., P and j = 1, ..., N} \]
\[ \tilde{y} = \tilde{K} (K + \lambda I)^{-1} y \]
https://members.cbio.mines-paristech.fr/~jvert/svn/kernelcourse/slides/master2017/master2017.pdf
Generalizing our previous case, \(\mathcal{K}(x,x') \equiv (x \cdot x')^2\), to polynomial order \(p\) with cross-products \[ \mathcal{K}(x,x') \equiv (x \cdot x' + c)^p \]
The dimensionality of the underlying \(\phi(x)\) is combinatorial in \(p\)
But now we can just evaluate pairwise \(\mathcal{K}(x,x')\), just an inner product in \(M\) independent of \(p\)!
The cost will be that we need to store the \(N \times N\) matrix of pairwise comparisons
Our canonical example is \(\mathcal{K}(x,x') = x \cdot x'\), which is a measure of similarity between \(x\) and \(x'\). Note that \[ ||x - x'||_2 = \sqrt{\mathcal{K}(x, x) - 2 \mathcal{K}(x,x') + \mathcal{K}(x', x')} \]
Fix the lengths \(||x||_2 = ||x'||_2 = 1\) then notice that \(||x - x'||_2 = \sqrt{2 (1 - \mathcal{K}(x,x'))}\)
As \(\mathcal{K}(x,x')\) increases, the distance between \(x\) and \(x'\) decreases
Also notice that when \(\mathcal{K}(x,x') = 0\) (i.e. orthogonal since \(x \cdot x' = 0\)) we maximize the distance between \(x\) and \(x'\)
\[ \mathcal{K}(x, x';\ell) \equiv \exp(-\frac{||x - x'||_2^2}{2 \ell^2}) \]
Unsurprisingly, since Gaussian random variables can be characterized by the first two moments, so can GPs
Let the mean function be
\[ m(x) = \mathbb{E}[f(x)] \]
And the covariance function (kernel) be
\[ \mathcal{K}(x,x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))] \]
Then we can denote a GP as
\[ f(x) \sim GP(m(x), \mathcal{K}(x,x')) \]
\[ \begin{aligned} f_X\,|\,X &\sim \mathcal{N}(\mu_X, K_{XX})\\ \mu_X &\equiv [m(x_i)]_{i=1}^N\\ K_{XX} &\equiv [\mathcal{K}(x_i, x_j)]_{i,j=1}^N \end{aligned} \]
Let \(X\) and \(f_X\) be observable (without noise in simplest case)
Let \(\tilde{X}\) be new data where we want to forecast the distribution of \(f_{\tilde{X}}\)
See ProbML Book 2 Section 18.3 for more details
\[ \begin{aligned} f_{\tilde{X}}\,|\,X, f_X, \tilde{X} &\sim \mathcal{N}(\bar{\mu}_{\tilde{X}}, \Sigma_{\tilde{X}})\\ \bar{\mu}_{\tilde{X}} &\equiv \mu_{\tilde{X}} + K_{X\tilde{X}}^{\top} K_{XX}^{-1} (f_X - \mu_X)\\ \Sigma_{\tilde{X}} &\equiv K_{\tilde{X}\tilde{X}} - K_{X\tilde{X}}^{\top} K_{XX}^{-1} K_{X\tilde{X}} \end{aligned} \]
Can adjust for noisy observation \(y = f(x) + \sigma \epsilon\) for \(\epsilon \sim \mathcal{N}(0,I)\) easily
\[ \min_{f \in \mathcal{F}} \left\{\sum_{n=1}^N (y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2\right\} \]
\[ \min_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N(y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
See ProbML Book 2 Section 18.3.7.3 for more details
The representer theorem says that a solution to ERM in a function space
\[ f^* = \operatorname{argmin}_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N\ell(f, x_n, y_n) + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
\[ f^*(x) = \sum_{n=1}^N \alpha_n \mathcal{K}(x, x_n) \]
Recall the min-norm, ridgeless OLS: \[ \lim_{\lambda \to 0} \operatorname{argmin}_{\alpha} ||\alpha \cdot X - y||^2_2 + \lambda ||\alpha||^2_2 = \operatorname{argmin}_{\alpha} ||\alpha||_2^2 \text{ s t. } \alpha \cdot X = y \]
With interpolating problem (i.e., \(\mathbf{r}(\cdot) = 0\) possible) take \(\lambda \to 0\) in Representer
\[ \begin{aligned} \min_{\mathbf{\alpha}, \eta} &\left\{\mathbf{\alpha}^{\top}K_{XX}(\theta)\mathbf{\alpha}\right\}\\ \text{s.t. }\, & \mathbf{r}(K_{XX}(\theta)\cdot \mathbf{\alpha}+ \mathbf{m}(\eta, \theta);\theta, \eta) = 0 \end{aligned} \]
gpytorch.means.ConstantMean(), but could be function of \(x\)class ExactGPModel(gpytorch.models.ExactGP):
def __init__(self, train_x, train_y, likelihood):
super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
self.mean_module = gpytorch.means.ConstantMean()
self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())
def forward(self, x):
mean_x = self.mean_module(x)
covar_x = self.covar_module(x)
return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1) # Includes GaussianLikelihood parameters
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
print(mll)ExactMarginalLogLikelihood(
(likelihood): GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
(model): ExactGPModel(
(likelihood): GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
(mean_module): ConstantMean()
(covar_module): ScaleKernel(
(base_kernel): RBFKernel(
(raw_lengthscale_constraint): Positive()
)
(raw_outputscale_constraint): Positive()
)
)
)
Iter=0, Loss=0.9389403462409973
Iter=10, Loss=0.5049580931663513
Iter=20, Loss=0.1683662384748459
Iter=30, Loss=-0.04883788898587227
Iter=40, Loss=-0.040030136704444885
Iter=50, Loss=-0.05462135374546051
Iter=60, Loss=-0.059855423867702484
Iter=70, Loss=-0.059758834540843964
Iter=80, Loss=-0.06091545149683952
Iter=90, Loss=-0.06081642210483551
GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
with torch.no_grad(), gpytorch.settings.fast_pred_var():
test_x = torch.linspace(0, 1, 51)
observed_pred = likelihood(model(test_x))
fig, ax = plt.subplots()
lower, upper = observed_pred.confidence_region()
ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence']) 