Graduate Quantitative Economics and Datascience
To use the OpenAI API, go to the API keys tab of your account and create a key, then set the environment variable OPENAI_API_KEY to this value (see here). The chat and embedding models are then constructed with a model= argument (with the key read from the environment variable); particular models might be chosen for speed, cost, etc.
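As a point of reference, here is a minimal setup sketch consistent with the code below. It assumes the langchain_openai package; the chat model name gpt-4o-mini is an illustrative choice, while text-embedding-3-large matches the embedding model reported in the output further down.

import numpy as np
import tiktoken
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

openai_chat_model = "gpt-4o-mini"                    # illustrative; any chat model works
openai_embedding_model = "text-embedding-3-large"    # embedding model used below

# Both clients read OPENAI_API_KEY from the environment
llm = ChatOpenAI(model=openai_chat_model)
embedder = OpenAIEmbeddings(model=openai_embedding_model)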
enc = tiktoken.encoding_for_model(openai_embedding_model)
print(f"Tokens in vocabulary for {openai_embedding_model}: {enc.n_vocab}")
print(f"'Hello world' -> {enc.encode('Hello world')}")
print(f"'hello world' -> {enc.encode('hello world')}")
print(f"'hello world 67' -> {enc.encode('hello world 67')}")
Tokens in vocabulary for text-embedding-3-large: 100277
'Hello world' -> [9906, 1917]
'hello world' -> [15339, 1917]
'hello world 67' -> [15339, 1917, 220, 3080]
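Going the other way, the tokenizer's decode method maps token ids back to text. A quick round-trip check (not in the original) using the ids printed above:

print(enc.decode([15339, 1917]))   # 'hello world'
print(enc.decode([9906, 1917]))    # 'Hello world'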
resp = llm.invoke([
    SystemMessage(content="You provide 2 short bullet points, technical answers."),
    HumanMessage(content="What is an embedding?")
])
resp_text = resp.content
print(resp_text)
- An embedding is a mathematical representation of items (such as words, sentences, or images) in a continuous vector space, where similar items are mapped to nearby points in that space.
- Embeddings are commonly used in machine learning and natural language processing to capture semantic meaning and relationships, enabling algorithms to process and analyze complex data more effectively.
The LangChain message types map onto the OpenAI chat roles:

- SystemMessage → OpenAI “system” role (sets behavior/instructions)
- HumanMessage → OpenAI “user” role (your prompt/query)
- AIMessage → OpenAI “assistant” role (model responses)

With llm.invoke([...]) and related APIs you pass the entire history of the conversation each time.

resp = llm.invoke([
    SystemMessage(content="You provide 2 short bullet points, technical answers."),
    HumanMessage(content="What is an embedding?"),
    AIMessage(content=resp_text),
    HumanMessage(content="What is the relationship to latent spaces?")
])
print(resp.content)
- Latent spaces are abstract, lower-dimensional representations of data that capture essential features, often used in models like autoencoders and generative adversarial networks (GANs), where embeddings can serve as points within this latent space.
- Both embeddings and latent spaces aim to reduce dimensionality while preserving important information and relationships, allowing for effective data representation and manipulation in machine learning tasks.
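For comparison, the same role structure expressed directly with the OpenAI Python client looks as follows. This is a sketch for illustration only (it assumes the openai package and uses an illustrative model name), not part of the original lecture:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
raw_resp = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You provide 2 short bullet points, technical answers."},
        {"role": "user", "content": "What is an embedding?"},
        {"role": "assistant", "content": resp_text},
        {"role": "user", "content": "What is the relationship to latent spaces?"},
    ],
)
print(raw_resp.choices[0].message.content)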
def cos_sim(a, b):
    # cosine similarity: dot product normalized by the vector lengths
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return np.dot(a, b) / denom
print(f"cos_sim([1, 0], [0, 1]) = {cos_sim(np.array([1, 0]), np.array([0, 1]))}")
print(f"cos_sim([1, 0], [0, 2]) = {cos_sim(np.array([1, 0]), np.array([0, 2]))}")
print(f"cos_sim([1, 0], [1, 0]) = {cos_sim(np.array([1, 0]), np.array([1, 0]))}")
print(f"cos_sim([1, 0], [2, 0]) = {cos_sim(np.array([1, 0]), np.array([2, 0]))}")
print(f"cos_sim([1, 0], [-2, 0]) = {cos_sim(np.array([1, 0]), np.array([-2, 0]))}")
cos_sim([1, 0], [0, 1]) = 0.0
cos_sim([1, 0], [0, 2]) = 0.0
cos_sim([1, 0], [1, 0]) = 1.0
cos_sim([1, 0], [2, 0]) = 1.0
cos_sim([1, 0], [-2, 0]) = -1.0
embed_bank = embedder.embed_query("bank")
embed_banks = embedder.embed_query("banks")
embed_river = embedder.embed_query("river")
embed_money = embedder.embed_query("money")
print(f"sim(bank, banks) = {cos_sim(embed_bank, embed_banks):.4f}")
print(f"sim(bank, river) = {cos_sim(embed_bank, embed_river):.4f}")
print(f"sim(bank, money) = {cos_sim(embed_bank, embed_money):.4f}")
print(f"sim(river, money) = {cos_sim(embed_river, embed_money):.4f}")
sim(bank, banks) = 0.7882
sim(bank, river) = 0.4143
sim(bank, money) = 0.4347
sim(river, money) = 0.3747
Embedding models (such as text-embedding-3-large) are trained with self-supervised objectives to produce general-purpose semantic vectors.

e_1 = embedder.embed_query("The man bites the dog")
e_2 = embedder.embed_query("The dog chased the man")
e_3 = embedder.embed_query("The man was chased by the dog")
print(f"sim(e_1, e_2) = {cos_sim(e_1, e_2):.4f}")
print(f"sim(e_1, e_3) = {cos_sim(e_1, e_3):.4f}")
print(f"sim(e_2, e_3) = {cos_sim(e_2, e_3):.4f}")
sim(e_1, e_2) = 0.5414
sim(e_1, e_3) = 0.5226
sim(e_2, e_3) = 0.7911
e_1 = embedder.embed_query("The man bites the dog")
e_2 = embedder.embed_query("The dog chased the man")
e_3 = embedder.embed_query("The man was chased by the dog")
e_4 = embedder.embed_query("The man chased the dog")
print(f"sim(e_1, e_2) = {cos_sim(e_1, e_2):.4f}")
print(f"sim(e_1, e_3) = {cos_sim(e_1, e_3):.4f}")
print(f"sim(e_2, e_3) = {cos_sim(e_2, e_3):.4f}")
print(f"sim(e_3, e_4) = {cos_sim(e_3, e_4):.4f}")
sim(e_1, e_2) = 0.5414
sim(e_1, e_3) = 0.5226
sim(e_2, e_3) = 0.7911
sim(e_3, e_4) = 0.8598
Note that sim(e_3, e_4) > sim(e_3, e_2), even though they seem to have the exact opposite meaning!

# Sample a small list of common words and embed them via OpenAIEmbeddings
sampled_tokens = [
    "bank", "river", "money", "finance", "water", "dog", "cat", "animal", "pet", "tree",
    "forest", "city", "village", "road", "car", "bus", "train", "doctor", "nurse", "hospital",
    "school", "student", "teacher", "book", "library", "music", "guitar", "piano", "art", "painting",
    "computer", "algorithm", "data", "model", "economics", "market", "price", "inflation", "policy",
    "riverbank", "beach", "mountain", "valley", "ocean", "lake", "software", "hardware", "network", "cloud"
]
embeddings = embedder.embed_documents(sampled_tokens) # list[list[float]]
embeddings = np.array(embeddings)
print(f"collected {embeddings.shape[0]} embeddings of dimension {embeddings.shape[1]}")
collected 49 embeddings of dimension 3072
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, random_state=0)  # approx with 2D embedding
embeddings_2d = tsne.fit_transform(embeddings)
# Plot the t-SNE results
plt.figure()
for i, token in enumerate(sampled_tokens):
    x, y = embeddings_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.1, y + 0.1, token, fontsize=9)
plt.title('t-SNE visualization of OpenAI text embeddings')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.show()
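t-SNE is stochastic and its axes have no intrinsic meaning; as a sanity check, a deterministic linear projection via PCA can be used instead. A minimal sketch (assuming scikit-learn; not part of the original lecture):

from sklearn.decomposition import PCA

# Project the same embeddings onto their first two principal components
pca_2d = PCA(n_components=2).fit_transform(embeddings)

plt.figure()
for i, token in enumerate(sampled_tokens):
    x, y = pca_2d[i]
    plt.scatter(x, y)
    plt.text(x + 0.1, y + 0.1, token, fontsize=9)
plt.title('PCA projection of OpenAI text embeddings')
plt.show()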
We often want to map a latent space to a probability distribution over \(K\) outcomes.
Given a set of \(K\) outcomes, define the logit/softmax map, which takes a vector \(y \in \mathbb{R}^K\) to a probability distribution over the \(K\) outcomes: \[ \mathrm{softmax}(y)_i = \frac{e^{y_i}}{\sum_{j=1}^K e^{y_j}}, \quad i = 1, \ldots, K \]
Typically this is combined with a mapping of a latent space, \(z \in \mathbb{R}^L\), with some matrix \(W \in \mathbb{R}^{K \times L}\) so that \(\mathrm{softmax}(W z)\) is a probability distribution over \(K\) outcomes
\[ \mathbb{P}\left(x = k \,|\, z\right) = \mathrm{softmax}(W z)_k \]
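To make this concrete, here is a minimal NumPy sketch of the mapping; the dimensions, the matrix W, and the latent vector z below are made up for illustration.

def softmax(y):
    y = y - np.max(y)          # shift for numerical stability; does not change the result
    e = np.exp(y)
    return e / e.sum()

L, K = 3, 4                    # illustrative latent and outcome dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(K, L))    # maps latent z in R^L to K logits
z = rng.normal(size=L)

p = softmax(W @ z)
print(p, p.sum())              # a probability distribution over the K outcomes (sums to 1)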
- An embedding is a representation of discrete objects, such as words or items, in a continuous vector space, allowing for mathematical operations and capturing semantic relationships.
- It is commonly used in machine learning and natural language processing (NLP) to facilitate tasks like similarity measurement and clustering by translating high-dimensional data into lower-dimensional spaces.
- Embeddings can be considered a type of representation within a latent space, where the latent space captures the underlying structure and relationships of the data in a compressed format.
- Latent spaces are often learned through techniques like autoencoders or generative models, where embeddings serve as points in this space, reflecting the intrinsic properties of the original data.