Machine Learning Fundamentals for Economists
We will showcase a few examples using the Gemini API; set GEMINI_API_KEY as an environment variable.
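A minimal setup sketch, assuming the google-genai Python SDK is installed and GEMINI_API_KEY is set; later snippets assume client, types, model, and the IPython display helpers are available (the default model name below is illustrative).
# Minimal client setup for the examples in this section (assumes the
# google-genai package is installed and GEMINI_API_KEY is in the environment).
import os
from google import genai
from google.genai import types
from IPython.display import Image as IPImage, display

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
model = "gemini-2.5-flash"  # illustrative text model name used in the chat example below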
Generative AI uses algorithms to create new content, such as text, images, audio, and video, that resembles the data it was trained on.
Generative AI refers to a class of machine learning models capable of producing novel, realistic data instances that resemble a given training dataset, including text, images, audio, and synthetic data useful for econometric applications.
In Gemini, the system context is set via a specific parameter in the generation configuration (a system instruction), ensuring the model adheres to the persona throughout the generation.
Generative artificial intelligence encompasses algorithms and models capable of producing novel, realistic data instances that resemble a given training dataset, thereby enabling the creation of new content across various modalities, such as text, images, and audio.
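For example, a persona can be fixed for an entire generation by passing a system instruction in the configuration; a sketch, where the persona text and model name are illustrative:
# Sketch: fixing a persona for the whole generation via a system instruction.
config = types.GenerateContentConfig(
    system_instruction="You are an econometrics professor. Answer precisely and concisely."
)
response = client.models.generate_content(
    model="gemini-2.5-flash",   # illustrative model name
    contents="Describe the concept of generative AI in one sentence.",
    config=config,
)
print(response.text)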
# Generate an illustrative image with the Gemini image model and display it inline
prompt = """
A comic-book stylized visualization
of mapping to an embedding manifold, showing data
points clustering in lower dimensions."""
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"]
    )
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes, format='png'))
Statistical learning studies how, given finite samples of random variables drawn from an underlying joint distribution, we can infer functions or probabilistic models that generalize beyond the observed sample
Observed “data” are realizations from an unknown population (data-generating) distribution
\[ (x, y) \sim \mu^* \]
In supervised learning, one object of interest is the conditional distribution \(y \mid x\)
Many problems in ML, econometrics, and numerical analysis can be framed as finding an \(f \in {\mathcal{F}}\) (e.g., a prediction function, policy, or operator) such that \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \underbrace{\mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]}_{\equiv R(f, \mu^*)}. \]
One canonical example is prediction under squared error loss \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[\|y - f(x)\|_2^2\right]. \]
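A quick Monte Carlo sketch (with an assumed sin-plus-noise data-generating process) showing that the conditional mean attains lower risk than another candidate predictor:
# Sketch: the conditional mean E[y|x] minimizes expected squared error.
# Assumed DGP for illustration: y = sin(x) + noise.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200_000)
y = np.sin(x) + rng.normal(scale=0.5, size=x.shape)

f_star = np.sin(x)          # conditional mean under this DGP
f_other = 0.8 * np.sin(x)   # some other candidate predictor

print("risk of E[y|x]:", np.mean((y - f_star) ** 2))   # ~0.25 (the noise variance)
print("risk of other :", np.mean((y - f_other) ** 2))  # strictly larger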
More generally, \(f\) may parameterize a full conditional distribution of \(y\) given \(x\) \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right]. \]
Equivalently, this minimizes the expected KL divergence (see KL Divergence) \[ \mathbb{E}_{x\sim\mu^*} \mathrm{KL}\!\left(\mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x)\right) \]
In many economic and numerical problems there is no target variable \(y\)
Instead, the goal is to find a function \(f \in {\mathcal{F}}\) satisfying conditions at each state \(x\) \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{x\sim \mu^*} \left[\ell(f, x)\right]. \]
If \(\ell({f^*}, x) = 0\) for all \(x \in {\mathcal{X}}\), then \({f^*}\) solves the functional equation pointwise
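A minimal sketch of this residual-minimization view for an assumed toy functional equation \(f'(x) + f(x) = 0\) with \(f(0) = 1\) (exact solution \(e^{-x}\)), using a polynomial \(f_\theta\) and a generic optimizer:
# Sketch: minimize a mean squared residual loss ell(f, x) over a polynomial
# family for the toy equation f'(x) + f(x) = 0 with f(0) = 1.
import numpy as np
from numpy.polynomial import polynomial as P
from scipy.optimize import minimize

x_grid = np.linspace(0.0, 2.0, 200)   # draws of x (here a fixed grid)

def loss(theta):
    resid = P.polyval(x_grid, P.polyder(theta)) + P.polyval(x_grid, theta)
    boundary = (P.polyval(0.0, theta) - 1.0) ** 2    # penalize violating f(0) = 1
    return np.mean(resid ** 2) + boundary

theta0 = np.zeros(6)                   # degree-5 polynomial coefficients
sol = minimize(loss, theta0, method="BFGS")
print(np.max(np.abs(P.polyval(x_grid, sol.x) - np.exp(-x_grid))))  # small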
Frequently, we will assume IID draws, \({\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*\), but this can be relaxed
The empirical counterpart to \(\arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]\) is
\[ {\theta^*}\equiv \arg\min_{\theta \in \Theta}\underbrace{\frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \ell(f_{\theta}, x, y)}_{\equiv \hat{R}(\theta,{\mathcal{D}})} \]
The MLE case uses the negative log-likelihood loss \(\ell(f_\theta, x, y) = -\log \mathbb{P}_{\theta}(y \mid x)\)
The ERM objective becomes
\[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \left[-\log \mathbb{P}_{\theta}(y \mid x)\right] \]
Adding an \(\ell_1\) penalty (as in Lasso) yields the regularized objective \[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \left[-\log \mathbb{P}_{\theta}(y \mid x)\right] + \lambda \|\theta\|_1 \]
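A minimal sketch of this Lasso-style objective using scikit-learn's L1-penalized logistic regression (assuming scikit-learn is installed; its C parameter plays the role of an inverse penalty strength, roughly \(1/\lambda\)):
# Sketch: L1-regularized negative log-likelihood (logistic regression) on
# synthetic data with a sparse true coefficient vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # sparse truth
prob = 1 / (1 + np.exp(-X @ beta_true))
y = rng.binomial(1, prob)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print("nonzero coefficients:", np.sum(clf.coef_ != 0))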
Recall: with \(x \in {\mathcal{X}}\), population risk minimization is \(\arg\min_{f \in {\mathcal{F}}}\mathbb{E}_{x\sim \mu^*} \left[\ell(f, x)\right]\)
Then the empirical problem is
\[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{x \in {\mathcal{D}}} \ell(f_{\theta}, x) \]
The optimization problem is a means to an end:
How well does \(f_{{\theta^*}}\) approximate the population risk minimizer?
Crucially, this is not the same goal as minimizing the uniform error \[ \arg\min_{f \in {\mathcal{F}}} \max_{(x,y)\in {\mathcal{X}}\times{\mathcal{Y}}} \left[\ell(f, x, y)\right]. \]
Population Risk
\[ {f^*}= \arg\min_{f \in {\mathcal{F}}} \underbrace{\mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]}_{\equiv R(f, \mu^*)} \]
Empirical Risk
\[ {\theta^*}= \arg\min_{\theta \in \Theta}\underbrace{\frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \ell(f_{\theta}, x, y)}_{\equiv \hat{R}(\theta,{\mathcal{D}})} \]
\[ \small \mathbb{E}_{{\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*}\left[\min_{\theta \in \Theta} \hat{R}(\theta, {\mathcal{D}}) - \min_{f \in \mathcal{F}} R(f, \mu^*)\right] = \underbrace{ R({f_{{\theta^*}}}, \mu^*) - R({f^*}, \mu^*)}_{\equiv {\varepsilon_{\mathrm{app}}}({f_{{\theta^*}}})} + \underbrace{\mathbb{E}_{{\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*}\left[\hat{R}(\theta^*, {\mathcal{D}}) - R({f_{{\theta^*}}}, \mu^*)\right]}_{\equiv {\varepsilon_{\mathrm{gen}}}({f_{{\theta^*}}})} \]
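A rough Monte Carlo sketch of the gap between empirical and population risk, under an assumed sin-plus-noise data-generating process, with a large fresh sample standing in for \(\mu^*\):
# Sketch: empirical risk on a small training sample versus (approximate)
# population risk on a large fresh sample, for polynomial regression.
import numpy as np

rng = np.random.default_rng(0)

def dgp(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.3, size=n)

x_tr, y_tr = dgp(50)            # small dataset D
x_pop, y_pop = dgp(200_000)     # stand-in for the population distribution

for d in (1, 3, 10):
    theta = np.polyfit(x_tr, y_tr, d)                      # empirical risk minimizer
    emp = np.mean((y_tr - np.polyval(theta, x_tr)) ** 2)   # R_hat(theta, D)
    pop = np.mean((y_pop - np.polyval(theta, x_pop)) ** 2) # approx R(f_theta, mu*)
    print(f"degree {d:2d}: empirical {emp:.3f}  population {pop:.3f}")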
\[ \mathbb{P}\left[x_{T+1}\,|\,x_T, \ldots, x_1\right] \]
Frontier LLMs circa 2025: vocabulary sizes of \(K \approx 100{,}000\) tokens and context windows of \(T \approx 1{-}2\) million
GPT-4 class models: \(|\Theta| \approx 1{-}2\) trillion parameters approximating
\[ \mathbb{P} : \{1, \ldots, K\}^{T+1} \to [0,1], \quad K^{T+1} \approx (10^5)^{10^6} = 10^{5 \times 10^6} \]
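A back-of-the-envelope check of that count, working in base-10 logarithms:
# Sketch: the number of distinct contexts K^(T+1) is astronomically large,
# so compute its order of magnitude in logs rather than directly.
import math

K, T = 100_000, 1_000_000
log10_contexts = (T + 1) * math.log10(K)      # log10 of K^(T+1)
print(f"K^(T+1) ~ 10^{log10_contexts:,.0f}")  # ~ 10^5,000,005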
Paraphrasing Belkin (2023): like reconstructing an entire library from a molecule of ink
They cannot possibly work on the entire space, or directly estimate \(\mathbb{P}\) as a table
Suppose \(x \in {\mathbb{R}}\) and we want to approximate \(f(x)\) with a polynomial of degree \(d\)
For some polynomial basis \(T_1(x), \ldots, T_d(x)\) (e.g., monomials, Chebyshev, Legendre, etc.) \[ \phi(x) = \begin{bmatrix}1 & T_1(x) & T_2(x) & \cdots & T_d(x)\end{bmatrix}^{\top} \]
Approximate \(f_{\theta}(x)\) with \(h_{\theta}(z) = W^{\top} z\), where \(W \in {\mathbb{R}}^{d+1}\) is part of the parameters \(\theta\) \[ f_{\theta}(x) \equiv W^{\top} \phi(x) = \sum_{i=0}^d W_i T_i(x) \]
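A minimal sketch of this linear-in-features approximation: a Chebyshev basis built with numpy's chebvander and fit by least squares to an assumed target function:
# Sketch: f_theta(x) = W' phi(x) with a Chebyshev basis, fit by least squares.
import numpy as np
from numpy.polynomial.chebyshev import chebvander

f = lambda x: np.exp(x) * np.cos(4 * x)      # assumed target function
x = np.linspace(-1, 1, 400)
d = 10

Phi = chebvander(x, d)                       # columns: T_0(x), ..., T_d(x)
W, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
print("max abs error:", np.max(np.abs(Phi @ W - f(x))))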
Recall the discrete case, with conditional probabilities \(\mathbb{P}_{\theta}(y = k \mid x)\) for \(k \in \{1, \ldots, K\}\)
Stack these into a vector using \(\mathrm{softmax}: {\mathbb{R}}^K \to {\mathbb{R}}^K\) with the pointwise \(\exp(\cdot)\)
\[ \mathrm{softmax}(z) \equiv \frac{\exp(z)}{\mathbf{1}^{\top}\exp(z)} \in {\mathbb{R}}^K, \quad \text{where } \mathbf{1}^{\top} \mathrm{softmax}(z) = 1 \]
Nesting the transformations with \(\phi_{\theta} : {\mathcal{X}}\to {\mathbb{R}}^L\) and head \(h_{\theta} : {\mathbb{R}}^L \to {\mathbb{R}}^K\)
\[ \mathbb{P}_{\theta}(y \mid x) = h_{\theta}(\phi_{\theta}(x)) \in {\mathbb{R}}^K \]
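A minimal sketch of a numerically stable softmax and a linear head mapping an \(L\)-dimensional representation to \(K\) class probabilities (random, untrained parameters for illustration):
# Sketch: a linear head h(z) = softmax(A z + b) mapping an L-dimensional
# representation to K class probabilities (random, untrained parameters).
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

L, K = 8, 5
rng = np.random.default_rng(0)
A, b = rng.normal(size=(K, L)), rng.normal(size=K)

phi_x = rng.normal(size=L)             # stand-in for a learned representation phi(x)
p = softmax(A @ phi_x + b)             # P_theta(y = k | x), k = 1..K
print(p, p.sum())                      # probabilities summing to 1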
LLMs roughly build a latent representation \(\phi(\cdot)\) and a prediction head \(h(\cdot)\)
\[ \mathbb{P}_{\theta}(y \mid x) = h_{\theta}(\phi_{\theta}(x)) \in {\mathbb{R}}^K \]
Once learned, \(\phi_{\theta}(\cdot)\) is useful for downstream tasks: embeddings, imputation, classification, etc.
The mapping to outputs \(h(\cdot)\) is often “shallow” (e.g., linear or low-order polynomial)
In contrast, the transformation \(\phi(\cdot)\) into representation space is usually “deep”
\[ \phi \equiv \phi_L \circ \cdots \circ \phi_1 \]
As data becomes richer, unstructured, and higher-dimensional, transformations become harder to design manually
Neural networks compose simple nonlinear functions to learn complicated transformations
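A minimal sketch of this deep/shallow split (assuming PyTorch is available): a deep \(\phi\) built by composing simple nonlinear layers and a shallow linear head \(h\):
# Sketch (assumes PyTorch): deep representation phi = phi_3 . phi_2 . phi_1
# and a shallow linear head h; logits feed a softmax as in the classification case.
import torch
import torch.nn as nn

L, K = 16, 5
phi = nn.Sequential(               # "deep" transformation into representation space
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, L),
)
h = nn.Linear(L, K)                # "shallow" head

x = torch.randn(3, 10)             # a batch of 3 inputs
probs = torch.softmax(h(phi(x)), dim=-1)
print(probs.shape, probs.sum(dim=-1))   # (3, K), rows sum to 1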
prompt = """
A stylized illustration of representation learning as a
smooth change of variables: a tangled, high-dimensional
data manifold being unfolded into a flat, low-dimensional
coordinate system. The left side shows intertwined curves
and knots; the right side shows clean, orthogonal axes
with separated clusters. Etching / wood-cut / scientific
engraving style, high contrast, minimal color palette."""
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"]
    )
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes, format='png'))
prompt = """
An artistic visualization of representation learning
where entangled threads of data are transformed into
independent latent factors. On the left, a dense braid
of overlapping fibers; on the right, parallel strands
aligned along clear axes. Emphasize symmetry, order,
and factorization. Rendered in a vintage wood-engraving
or linocut style."""
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"]
    )
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes, format='png'))
prompt = """
A visual metaphor for representation learning as
information compression: raw, noisy data clouds
are compressed through a narrow bottleneck into
a compact latent space that preserves structure.
Before-and-after panels. Use engraved, chalkboard,
or woodcut academic illustration style."""
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"]
    )
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes, format='png'))
prompt = """
A two-panel educational illustration explaining
representation learning. Left panel: high-dimensional,
entangled observations with overlapping features. Right
panel: low-dimensional latent representation with
disentangled, interpretable axes. Clean academic diagram
style with subtle wood-engraving texture."""
response = client.models.generate_content(
    model="gemini-2.5-flash-image",
    contents=prompt,
    config=types.GenerateContentConfig(
        response_modalities=["IMAGE"]
    )
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes, format='png'))
See here for more details.
Open the command palette with <Ctrl+Shift+P> (or <Cmd+Shift+P> on Mac), type > Git: Clone, and choose https://github.com/jlperla/grad_econ_ML_notebooks, then run uv sync. See here for more details.
Install Julia with winget install julia -s msstore in a terminal (Windows) or curl -fsSL https://install.julialang.org | sh in a terminal (macOS/Linux). Then open the command palette with <Ctrl+Shift+P> (or <Cmd+Shift+P> on Mac), type > Git: Clone, and choose https://github.com/jlperla/grad_econ_ML_notebooks
Instantiate packages by running ] instantiate in the VS Code terminal, where ] enters package mode
Then use VS Code to open any of the notebooks in that folder
Note: the same cloned repo can work for both Julia and Python
Use client.chats.create() to manage state. The chat object automatically tracks history so you don’t have to pass it back manually.
chat = client.chats.create(model=model, config=config)

res1 = chat.send_message("Describe the concept of generative AI in one sentence.")
print(f"Step 1: {res1.text}\n")

# Contextual follow-up
res2 = chat.send_message("Explain in one sentence how that relates to sampling from probability distributions.")
print(f"Step 2: {res2.text}")

Step 1: Generative artificial intelligence refers to a class of algorithms capable of generating novel, realistic, and often complex data instances that resemble a training dataset, effectively learning the underlying distribution of that data and sampling from it to create new content.
Step 2: Generative AI models, having learned the probability distribution of the training data, function by effectively drawing samples from this learned distribution to create new data instances.
Let \(\mu^*(y \mid x)\) denote the true conditional distribution, and \(\mathbb{P}_f(y \mid x)\) the model-implied conditional distribution
Take the expected negative log-likelihood, condition on \(x\), and use the LIE (law of iterated expectations) \[ \begin{aligned} \mathbb{E}_{(x,y)\sim\mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right] &= \mathbb{E}_{x\sim\mu^*} \left[ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[-\log \mathbb{P}_f(y \mid x)\right] \right]. \end{aligned} \]
Add and subtract \(\log \mu^*(y \mid x)\) inside the inner expectation \[ \begin{aligned} &= \mathbb{E}_{x\sim\mu^*} \Big[ \underbrace{ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[ \log \frac{\mu^*(y \mid x)}{\mathbb{P}_f(y \mid x)} \right] }_{\mathrm{KL}(\mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x))} + \underbrace{ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[-\log \mu^*(y \mid x)\right] }_{\text{does not depend on } f} \Big]. \end{aligned} \]
Therefore, minimizing expected log-loss is equivalent to minimizing KL \[ \mathbb{E}_{(x,y)\sim\mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right] = \mathbb{E}_{x\sim\mu^*} \left[ \mathrm{KL}\!\left( \mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x) \right) \right] + \text{constant}. \]
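A quick numerical check of this identity at a single \(x\), with small assumed discrete distributions:
# Sketch: verify E[-log P_f(y|x)] = KL(mu*(.|x) || P_f(.|x)) + entropy(mu*(.|x))
# for one value of x, with small assumed discrete distributions.
import numpy as np

mu = np.array([0.5, 0.3, 0.2])       # true conditional mu*(y | x)
pf = np.array([0.4, 0.4, 0.2])       # model P_f(y | x)

nll = -(mu * np.log(pf)).sum()       # expected negative log-likelihood
kl = (mu * np.log(mu / pf)).sum()    # KL(mu* || P_f)
ent = -(mu * np.log(mu)).sum()       # entropy term (does not depend on f)
print(np.isclose(nll, kl + ent))     # True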