Machine Learning Fundamentals for Economists
GEMINI_API_KEY as an environment variableWe will showcase a few examples using the Gemini API.
Generative AI uses algorithms to create new content, such as text, images, audio, and video, that resembles existing data but is not simply a copy of it.
Generative AI refers to algorithms and models that can create new, original content, such as text, images, audio, or code, based on the patterns and structures they learn from training data.
In Gemini, “System Context” is a specific parameter in the configuration, ensuring the model adheres to the persona throughout the generation.
Generative artificial intelligence refers to a class of machine learning models capable of generating new, original content, such as text, images, audio, or other data, that resembles the data on which they were trained.
prompt = """
A comic-book stylized visualization
of mapping to an embedding manifold, showing data
points clustering in lower dimensions."""
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=prompt,
config=types.GenerateContentConfig(
response_modalities=["IMAGE"]
)
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes,
format='png'))
Statistical learning studies how, given finite samples of random variables drawn from an underlying joint distribution, we can infer functions or probabilistic models that generalize beyond the observed sample
Observed “data” are realizations from an unknown population (data-generating) distribution
\[ (x, y) \sim \mu^* \]
In supervised learning, one object of interest is the conditional distribution \(y \mid x\)
Many problems in ML, econometrics, and numerical analysis can be framed as finding a function \(f \in {\mathcal{F}}\) (e.g., a function, policy, or operator) such that \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \underbrace{\mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]}_{\equiv R(f, \mu^*)}. \]
One canonical example evaluates the squared error loss of “prediction” \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[\|y - f(x)\|_2^2\right]. \]
More generally, \(f\) may parameterize a full conditional distribution of \(y\) given \(x\) \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right]. \]
Equivalently, minimizes the expected KL divergence (see KL Divergence) \[ \mathbb{E}_{x\sim\mu^*} \mathrm{KL}\!\left(\mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x)\right), \]
In many economic and numerical problems there is no target variable \(y\)
Instead, the goal is to find a function \(f \in {\mathcal{F}}\) satisfying conditions at each state \(x\) \[ {f^*} = \arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{x\sim \mu^*} \left[\ell(f, x)\right]. \]
If \(\ell({f^*}, x) = 0\) for all \(x \in {\mathcal{X}}\), then \({f^*}\) solves the functional equation pointwise
Frequently, we will assume IID draws, \({\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*\), but this can be relaxed
The empirical counterpart to \(\arg\min_{f \in {\mathcal{F}}} \mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]\) is
\[ {\theta^*}\equiv \arg\min_{\theta \in \Theta}\underbrace{\frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \ell(f_{\theta}, x, y)}_{\equiv \hat{R}(\theta,{\mathcal{D}})} \]
The MLE case uses the negative log-likelihood loss \(\ell(f_\theta, x, y) = -\log \mathbb{P}_{\theta}(y \mid x)\)
The ERM objective becomes
\[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \left[-\log \mathbb{P}_{\theta}(y \mid x)\right] \]
\[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \left[-\log \mathbb{P}_{\theta}(y \mid x)\right] + \lambda \|\theta\|_1 \]
Recall With \(x \in {\mathcal{X}}\), population risk minimization is \(\arg\min_{f \in {\mathcal{F}}}\mathbb{E}_{x\sim \mu^*} \left[\ell(f, x)\right]\)
Then the empirical problem is
\[ {\theta^*}= \arg\min_{\theta \in \Theta} \frac{1}{|{\mathcal{D}}|}\sum_{x \in {\mathcal{D}}} \ell(f_{\theta}, x) \]
The optimization problem is a means-to-and-end:
TOPIC How well does a \(f_{{\theta^*}}\) approximate the population risk minimizer?
Crucially, this is not the same goal as minimizing the uniform error \[ \arg\min_{f \in {\mathcal{F}}} \max_{(x,y)\in {\mathcal{X}}\times{\mathcal{Y}}} \left[\ell(f, x, y)\right]. \]
Population Risk
\[ {f^*}= \arg\min_{f \in {\mathcal{F}}} \underbrace{\mathbb{E}_{(x,y)\sim \mu^*} \left[\ell(f, x, y)\right]}_{\equiv R(f, \mu^*)} \]
Empirical Risk
\[ {\theta^*}= \arg\min_{\theta \in \Theta}\underbrace{\frac{1}{|{\mathcal{D}}|}\sum_{(x,y) \in {\mathcal{D}}} \ell(f_{\theta}, x, y)}_{\equiv \hat{R}(\theta,{\mathcal{D}})} \]
\[ \small \mathbb{E}_{{\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*}\left[\min_{\theta \in \Theta} \hat{R}(\theta, {\mathcal{D}}) - \min_{f \in \mathcal{F}} R(f, \mu^*)\right] = \underbrace{ R({f_{{\theta^*}}}, \mu^*) - R({f^*}, \mu^*)}_{\equiv {\varepsilon_{\mathrm{app}}}({f_{{\theta^*}}})} + \underbrace{\mathbb{E}_{{\mathcal{D}}\overset{\mathrm{iid}}{\sim}\mu^*}\left[\hat{R}(\theta^*, {\mathcal{D}}) - R({f_{{\theta^*}}}, \mu^*)\right]}_{\equiv {\varepsilon_{\mathrm{gen}}}({f_{{\theta^*}}})} \]
\[ \mathbb{P}\left[x_{T+1}\,|\,x_T, \ldots, x_1\right] \]
Frontier LLMs circa 2025: of \(K \approx 100{,}000\), context windows of \(T \approx 1{-}2\) million
GPT-4 class models: \(|\Theta| \approx 1{-}2\) trillion parameters approximating
\[ \mathbb{P} : \{1, \ldots, K\}^{T+1} \to [0,1], \quad K^{T+1} \approx (10^5)^{10^6} = 10^{5 \times 10^6} \]
Paraphrasing Belkin (2023): like reconstructing an entire library from a molecule of ink
They cannot possibly work on the entire space, or directly estimate \(\mathbb{P}\) as a table
Suppose \(x \in {\mathbb{R}}\) and we want to approximate \(f(x)\) with a polynomial of degree \(d\)
For some polynomial basis \(T_1(x), \ldots, T_d(x)\) (e.g., monomials, Chebyshev, Legendre, etc.) \[ \phi(x) = \begin{bmatrix}1 & T_1(x) & T_2(x) & \cdots & T_d(x)\end{bmatrix}^{\top} \]
Approximate \(f_{\theta}(x)\) with \(h_{\theta}(z) = W^{\top} z\), where \(W \in {\mathbb{R}}^{d+1}\) and \(W \in \theta\) \[ f_{\theta}(x) \equiv W^{\top} \phi(x) = \sum_{i=0}^d W_i T_i(x) \]
Recall cases of \(\mathbb{P}_{\theta}(y = k \mid x)\) for \(k \in \{1, \ldots, K\}\)
Stack the into a vector using \(\mathrm{softmax}: {\mathbb{R}}^K \to {\mathbb{R}}^K\) and pointwise \(\exp(\cdot)\)
\[ \mathrm{softmax}(z) \equiv \frac{\exp(z)}{\mathbf{1}^{\top}\exp(z)} \in {\mathbb{R}}^K, \quad \text{where } \mathbf{1}^{\top} \mathrm{softmax}(z) = 1 \]
Nesting the transformations with \(\phi_{\theta} : {\mathcal{X}}\to {\mathbb{R}}^L\) and head \(h_{\theta} : {\mathbb{R}}^L \to {\mathbb{R}}^K\)
\[ \mathbb{P}_{\theta}(y \mid x) = h_{\theta}(\phi_{\theta}(x)) \in {\mathbb{R}}^K \]
LLMs roughly build a latent representation \(\phi(\cdot)\) and \(h(\cdot)\)
\[ \mathbb{P}_{\theta}(y \mid x) = h_{\theta}(\phi_{\theta}(x)) \in {\mathbb{R}}^K \]
Once learned, \(\phi_{\theta}(\cdot)\) is useful for downstream tasks: embeddings, imputation, classification, etc.
The mapping to outputs \(h(\cdot)\) is often “shallow” (e.g., linear or low-order polynomial)
In contrast, the transformation \(\phi(\cdot)\) into representation space is usually “deep”
\[ \phi \equiv \phi_L \circ \cdots \circ \phi_1 \]
As data becomes richer, unstructured, and higher-dimensional, transformations become harder to design manually
TOPIC Neural networks compose simple nonlinear functions to learn complicated transformations
prompt = """
A stylized illustration of representation learning as a
smooth change of variables: a tangled, high-dimensional
data manifold being unfolded into a flat, low-dimensional
coordinate system. The left side shows intertwined curves
and knots; the right side shows clean, orthogonal axes
with separated clusters. Etching / wood-cut / scientific
engraving style, high contrast, minimal color palette."""
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=prompt,
config=types.GenerateContentConfig(
response_modalities=["IMAGE"]
)
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes,
format='png'))
prompt = """
An artistic visualization of representation learning
where entangled threads of data are transformed into
independent latent factors. On the left, a dense braid
of overlapping fibers; on the right, parallel strands
aligned along clear axes. Emphasize symmetry, order,
and factorization. Rendered in a vintage wood-engraving
or linocut style."""
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=prompt,
config=types.GenerateContentConfig(
response_modalities=["IMAGE"]
)
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes,
format='png'))
prompt = """
A visual metaphor for representation learning as
information compression: raw, noisy data clouds
are compressed through a narrow bottleneck into
a compact latent space that preserves structure.
Before-and-after panels. Use engraved, chalkboard,
or woodcut academic illustration style."""
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=prompt,
config=types.GenerateContentConfig(
response_modalities=["IMAGE"]
)
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes,
format='png'))
prompt = """
A two-panel educational illustration explaining
representation learning. Left panel: high-dimensional,
entangled observations with overlapping features. Right
panel: low-dimensional latent representation with
disentangled, interpretable axes. Clean academic diagram
style with subtle wood-engraving texture."""
response = client.models.generate_content(
model="gemini-2.5-flash-image",
contents=prompt,
config=types.GenerateContentConfig(
response_modalities=["IMAGE"]
)
)
generated_img = response.parts[0].as_image()
display(IPImage(data=generated_img.image_bytes,
format='png'))
See here for more details.
<Ctrl+Shift+P> or <Cmd+Shift+P> on mac and type > Git: Clone and choose https://github.com/jlperla/grad_econ_ML_notebooksuv syncSee here for more details.
winget install julia -s msstore in a terminalcurl -fsSL https://install.julialang.org | sh in a terminalOpen the command palette with <Ctrl+Shift+P> or <Cmd+Shift+P> on mac and type > Git: Clone and choose https://github.com/jlperla/grad_econ_ML_notebooks
Instantiate packages by running VS Code terminal
] instantiate, where ] enters package modeThen use VS Code to open any of the notebooks in that folder
Note: the same clone’d repo can work for both Julia and Python
Use client.chats.create() to manage state. The chat object automatically tracks history so you don’t have to pass it back manually.
chat = client.chats.create(model=model, config=config)
res1 = chat.send_message("Describe the concept of generative AI in one sentence.")
print(f"Step 1: {res1.text}\n")
# Contextual follow-up
res2 = chat.send_message("Explain in one sentence how that relates to sampling from probability distributions.")
print(f"Step 2: {res2.text}")Step 1: Generative artificial intelligence refers to a class of machine learning models that learn the underlying patterns and structure of input data and subsequently generate new data instances that plausibly could have been drawn from the same distribution.
Step 2: Generative AI models, after learning the probability distribution of the training data, effectively sample from that learned distribution to create new data points.
Let \(\mu^*(y \mid x)\) denote the true conditional distribution, and \(\mathbb{P}_f(y \mid x)\) the model-implied conditional distribution
Take the the expected negative log-likelihood, condition on \(x\) and use the LIE \[ \begin{aligned} \mathbb{E}_{(x,y)\sim\mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right] &= \mathbb{E}_{x\sim\mu^*} \left[ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[-\log \mathbb{P}_f(y \mid x)\right] \right]. \end{aligned} \]
Add and subtract \(\log \mu^*(y \mid x)\) inside the inner expectation \[ \begin{aligned} &= \mathbb{E}_{x\sim\mu^*} \Big[ \underbrace{ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[ \log \frac{\mu^*(y \mid x)}{\mathbb{P}_f(y \mid x)} \right] }_{\mathrm{KL}(\mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x))} & + \underbrace{ \mathbb{E}_{y\sim\mu^*(\cdot\mid x)} \left[-\log \mu^*(y \mid x)\right] }_{\text{does not depend on } f} \Big]. \end{aligned} \]
Therefore, minimizing expected log-loss is equivalent to minimizing KL \[ \mathbb{E}_{(x,y)\sim\mu^*} \left[-\log \mathbb{P}_f(y \mid x)\right] = \mathbb{E}_{x\sim\mu^*} \left[ \mathrm{KL}\!\left( \mu^*(y\mid x)\,\|\,\mathbb{P}_f(y\mid x) \right) \right] + \text{constant}. \]