Machine Learning Fundamentals for Economists
Let \(x \in \mathbb{R}^M\) be a single observation and \(X \in \mathbb{R}^{N \times M}\) be the “design” matrix of \(N\) stacked observations
To solve \(\min_{\theta}\left\{||X \theta - y||^2 + \lambda||\theta||^2\right\}\) for \(\lambda \geq 0\), form the normal equations
\[ \theta = (X^{\top}X + \lambda I)^{-1} X^{\top} y \]
Without proof: it turns out you can rearrange this expression to get the “other” normal equations
\[ \theta = X^{\top} (X X^{\top} + \lambda I)^{-1} y \]
Since typically \(M \ll N\), this is often a bigger system to solve (an \(N \times N\) system rather than \(M \times M\))
\[ K_{ij} \equiv [X X^{\top}]_{ij} = \mathcal{K}(x_i, x_j)\,\text{ for i = 1, ..., N and j = 1, ..., N} \]
\[ \tilde{K}_{ij} \equiv \mathcal{K}(\tilde{x}_i, x_j),\,\text{ for i = 1, ..., P and j = 1, ..., N} \]
\[ \tilde{y} = \tilde{K} (K + \lambda I)^{-1} y \]
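As an illustration (not from the source), a minimal NumPy sketch of these kernel ridge equations on synthetic data, using the quadratic kernel \(\mathcal{K}(x,x') = (x \cdot x')^2\); all names and values are assumptions:

import numpy as np

def kernel(A, B):
    # Quadratic kernel K(x, x') = (x . x')^2, evaluated pairwise between rows of A and B
    return (A @ B.T) ** 2

rng = np.random.default_rng(0)
N, M, P, lam = 50, 3, 10, 0.1
X = rng.normal(size=(N, M))            # training inputs
y = rng.normal(size=N)                 # training targets
X_tilde = rng.normal(size=(P, M))      # new points to forecast

K = kernel(X, X)                       # N x N Gram matrix
K_tilde = kernel(X_tilde, X)           # P x N cross-kernel matrix
y_tilde = K_tilde @ np.linalg.solve(K + lam * np.eye(N), y)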
https://members.cbio.mines-paristech.fr/~jvert/svn/kernelcourse/slides/master2017/master2017.pdf
Generalizing our previous case, \(\mathcal{K}(x,x') \equiv (x \cdot x')^2\), to polynomial order \(p\) with cross-products \[ \mathcal{K}(x,x') \equiv (x \cdot x' + c)^p \]
The dimensionality of the underlying \(\phi(x)\) is combinatorial in \(p\)
But now we can simply evaluate \(\mathcal{K}(x,x')\) pairwise, an inner product in \(M\) dimensions independent of \(p\)!
The cost will be that we need to store the \(N \times N\) matrix of pairwise comparisons
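To make this concrete (my own illustration, not in the source): for \(M = 2\) and \(p = 2\) the explicit feature map \(\phi(x)\) already has six coordinates, while the kernel evaluation remains a single inner product in \(\mathbb{R}^2\):

import numpy as np

def phi(x, c):
    # Explicit feature map for (x . x' + c)^2 with M = 2: six coordinates
    return np.array([x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1],
                     np.sqrt(2 * c) * x[0],
                     np.sqrt(2 * c) * x[1],
                     c])

x, x_prime, c = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 1.0
lhs = (x @ x_prime + c) ** 2           # kernel trick: O(M) work, never forms phi
rhs = phi(x, c) @ phi(x_prime, c)      # explicit features: dimension blows up with p
assert np.isclose(lhs, rhs)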
Our canonical example is \(\mathcal{K}(x,x') = x \cdot x'\), which is a measure of similarity between \(x\) and \(x'\). Note that \[ ||x - x'||_2 = \sqrt{\mathcal{K}(x, x) - 2 \mathcal{K}(x,x') + \mathcal{K}(x', x')} \]
Fix the lengths \(||x||_2 = ||x'||_2 = 1\) then notice that \(||x - x'||_2 = \sqrt{2 (1 - \mathcal{K}(x,x'))}\)
As \(\mathcal{K}(x,x')\) increases, the distance between \(x\) and \(x'\) decreases
Also notice that when \(\mathcal{K}(x,x') = 0\) (i.e. orthogonal since \(x \cdot x' = 0\)) we maximize the distance between \(x\) and \(x'\)
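A quick numerical check of this kernel-to-distance identity (illustrative only, using the linear kernel):

import numpy as np

x, x_prime = np.array([0.2, 0.5, 1.0]), np.array([1.0, -0.3, 0.4])
k = lambda a, b: a @ b                 # linear kernel K(x, x') = x . x'
dist = np.sqrt(k(x, x) - 2 * k(x, x_prime) + k(x_prime, x_prime))
assert np.isclose(dist, np.linalg.norm(x - x_prime))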
A workhorse choice is the RBF (squared-exponential) kernel with lengthscale \(\ell\), \[ \mathcal{K}(x, x';\ell) \equiv \exp\left(-\frac{||x - x'||_2^2}{2 \ell^2}\right) \]
Unsurprisingly, since Gaussian random variables can be characterized by the first two moments, so can GPs
Let the mean function be
\[ m(x) = \mathbb{E}[f(x)] \]
And the covariance function (kernel) be
\[ \mathcal{K}(x,x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))] \]
Then we can denote a GP as
\[ f(x) \sim GP(m(x), \mathcal{K}(x,x')) \]
\[ \begin{aligned} f_X\,|\,X &\sim \mathcal{N}(\mu_X, K_{XX})\\ \mu_X &\equiv [m(x_i)]_{i=1}^N\\ K_{XX} &\equiv [\mathcal{K}(x_i, x_j)]_{i,j=1}^N \end{aligned} \]
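For intuition (an illustrative sketch, not from the source), we can draw sample paths from this finite-dimensional prior on a grid, assuming a zero mean function and the RBF kernel above:

import numpy as np

def rbf_kernel(x1, x2, ell=0.2):
    # K(x, x'; ell) = exp(-||x - x'||^2 / (2 ell^2)), pairwise over 1-D grids
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * ell**2))

rng = np.random.default_rng(0)
x_grid = np.linspace(0, 1, 100)
mu = np.zeros_like(x_grid)                      # m(x) = 0
K = rbf_kernel(x_grid, x_grid)
jitter = 1e-8 * np.eye(len(x_grid))             # numerical stabilization
f_samples = rng.multivariate_normal(mu, K + jitter, size=3)   # three prior draws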
Let \(X\) and \(f_X\) be observable (without noise, in the simplest case)
Let \(\tilde{X}\) be new data where we want to forecast the distribution of \(f_{\tilde{X}}\)
See ProbML Book 2 Section 18.3 for more details
\[ \begin{aligned} f_{\tilde{X}}\,|\,X, f_X, \tilde{X} &\sim \mathcal{N}(\bar{\mu}_{\tilde{X}}, \Sigma_{\tilde{X}})\\ \bar{\mu}_{\tilde{X}} &\equiv \mu_{\tilde{X}} + K_{X\tilde{X}}^{\top} K_{XX}^{-1} (f_X - \mu_X)\\ \Sigma_{\tilde{X}} &\equiv K_{\tilde{X}\tilde{X}} - K_{X\tilde{X}}^{\top} K_{XX}^{-1} K_{X\tilde{X}} \end{aligned} \]
We can easily adjust for noisy observations \(y = f(x) + \sigma \epsilon\) with \(\epsilon \sim \mathcal{N}(0,I)\) by replacing \(K_{XX}\) with \(K_{XX} + \sigma^2 I\)
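A minimal NumPy sketch of these posterior formulas with the noisy-observation adjustment (all names and the toy data are mine):

import numpy as np

def rbf_kernel(x1, x2, ell=0.2):
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2 * ell**2))

rng = np.random.default_rng(0)
sigma = 0.1
X = np.sort(rng.uniform(0, 1, 20))             # observed inputs
y = np.sin(2 * np.pi * X) + sigma * rng.normal(size=20)
X_tilde = np.linspace(0, 1, 50)                # points where we forecast

K_XX = rbf_kernel(X, X) + sigma**2 * np.eye(len(X))   # noise adjustment
K_XXt = rbf_kernel(X, X_tilde)                 # K_{X X~}
K_XtXt = rbf_kernel(X_tilde, X_tilde)

# Posterior mean and covariance (zero prior mean, so mu_X = mu_X~ = 0)
mu_bar = K_XXt.T @ np.linalg.solve(K_XX, y)
Sigma = K_XtXt - K_XXt.T @ np.linalg.solve(K_XX, K_XXt)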
\[ \min_{f \in \mathcal{F}} \left\{\sum_{n=1}^N (y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2\right\} \]
Restricting to the RKHS \(\mathcal{H}_{\mathcal{K}}\) induced by the kernel \(\mathcal{K}\), \[ \min_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N(y_n - f(x_n))^2 + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
See ProbML Book 2 Section 18.3.7.3 for more details
The representer theorem says that the solution to ERM in the RKHS
\[ f^* = \operatorname{argmin}_{f \in \mathcal{H}_{\mathcal{K}}} \left\{\sum_{n=1}^N\ell(f, x_n, y_n) + \frac{\lambda}{2}||f||^2_{\mathcal{H}_{\mathcal{K}}}\right\} \]
can be written as a finite expansion in the kernel evaluated at the training points
\[ f^*(x) = \sum_{n=1}^N \alpha_n \mathcal{K}(x, x_n) \]
Recall the min-norm, ridgeless OLS: \[ \lim_{\lambda \to 0} \operatorname{argmin}_{\alpha} ||\alpha \cdot X - y||^2_2 + \lambda ||\alpha||^2_2 = \operatorname{argmin}_{\alpha} ||\alpha||_2^2 \text{ s.t. } \alpha \cdot X = y \]
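A quick numerical check (my own, using the \(\theta\)/design-matrix notation from the ridge equations above): in an underdetermined problem the ridge solution converges to the minimum-norm interpolant given by the pseudoinverse as \(\lambda \to 0\):

import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 50                                  # underdetermined: more features than observations
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

theta_min_norm = np.linalg.pinv(X) @ y         # min ||theta||_2 s.t. X theta = y
lam = 1e-10
theta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)   # "other" normal equations
assert np.allclose(theta_ridge, theta_min_norm)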
For an interpolating problem (i.e., \(\mathbf{r}(\cdot) = 0\) is attainable), take \(\lambda \to 0\) in the representer theorem
\[ \begin{aligned} \min_{\mathbf{\alpha}, \eta} &\left\{\mathbf{\alpha}^{\top}K_{XX}(\theta)\mathbf{\alpha}\right\}\\ \text{s.t. }\, & \mathbf{r}(K_{XX}(\theta)\cdot \mathbf{\alpha}+ \mathbf{m}(\eta, \theta);\theta, \eta) = 0 \end{aligned} \]
class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(ExactGPModel, self).__init__(train_x, train_y, likelihood)
        # gpytorch.means.ConstantMean(), but could be a function of x
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1) # Includes GaussianLikelihood parameters
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
print(mll)
ExactMarginalLogLikelihood(
(likelihood): GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
(model): ExactGPModel(
(likelihood): GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
(mean_module): ConstantMean()
(covar_module): ScaleKernel(
(base_kernel): RBFKernel(
(raw_lengthscale_constraint): Positive()
)
(raw_outputscale_constraint): Positive()
)
)
)
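The training loop that produced the iteration log below is not shown; a minimal sketch following the standard GPyTorch exact-GP pattern (the 100-iteration count and print frequency are assumptions) would be:

for i in range(100):
    optimizer.zero_grad()
    output = model(train_x)          # latent GP at the training inputs
    loss = -mll(output, train_y)     # negative marginal log likelihood
    loss.backward()
    optimizer.step()
    if i % 10 == 0:
        print(f"Iter={i}, Loss={loss.item()}")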
Iter=0, Loss=0.950839102268219
Iter=10, Loss=0.5059625506401062
Iter=20, Loss=0.16909293830394745
Iter=30, Loss=-0.05211631953716278
Iter=40, Loss=-0.04419197142124176
Iter=50, Loss=-0.05923301726579666
Iter=60, Loss=-0.06417526304721832
Iter=70, Loss=-0.06441757082939148
Iter=80, Loss=-0.06555411964654922
Iter=90, Loss=-0.06544952094554901
GaussianLikelihood(
(noise_covar): HomoskedasticNoise(
(raw_noise_constraint): GreaterThan(1.000E-04)
)
)
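Before computing the predictions below, the model and likelihood are typically switched into evaluation (posterior-predictive) mode; this step is implied but not shown above:

model.eval()       # posterior-predictive mode for the GP
likelihood.eval()  # predictive distribution includes observation noise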
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    test_x = torch.linspace(0, 1, 51)
    observed_pred = likelihood(model(test_x))
fig, ax = plt.subplots()
lower, upper = observed_pred.confidence_region()
ax.plot(train_x.numpy(), train_y.numpy(), 'k*')
ax.plot(test_x.numpy(), observed_pred.mean.numpy(), 'b')
# Shade between the lower and upper confidence bounds
ax.fill_between(test_x.numpy(), lower.numpy(), upper.numpy(), alpha=0.5)
ax.set_ylim([-3, 3])
ax.legend(['Observed Data', 'Mean', 'Confidence']) 