Thursday, October 24, 2024

Building an Artificial User Interface

(Updated on 11/15/24)

As artificial intelligence continues to evolve, AI agents will take on increasingly complex tasks. Although many agent frameworks currently exist—such as AutoGen, LangGraph, and CrewAI - and numerous papers have been written along with some successful proof of concepts, agents have been challenging to move into production with current models. However, agents will soon be deployed across various devices (in retail, labs, banking), automating tasks or enabling things that were previously thought impossible.

The open-source community have been developing agents using LLMs for the past few years, but it was well known that major AI labs were preparing to jump into the agent space. They have been discussing this at length, with companies like Anthropic and OpenAI making it clear that this was a major focus for them. Their upcoming releases, along with contributions from the wider community, should make 2025 the year of AI agents, with many capable of working together beyond just computers connected to the internet.

And just this week, Anthropic announced a version of their AI model, Claude, capable of computer use - translating instructions, checking spreadsheets, scrolling, moving cursors, and executing a series of actions to achieve objectives. Claude does this by analyzing screenshots, counting pixels to determine cursor movements, and clicking in the correct places—a process that, while innovative, underscores a significant inefficiency in how AI interacts with software designed for humans.

This approach requires AI agents to mimic human interactions, essentially teaching them to navigate interfaces built for human senses and motor skills. It’s akin to asking a robot to use a screwdriver designed for human hands instead of giving it a tool tailored to its mechanical capabilities. 

So this raises the question: Why are we making AI conform to human-centric software interfaces when we could design software specifically for AI agents? 

We need to focus on creating software that is data and task centric and not human-UI centric. The AI does not care about its user experience. So user experience needs to change to helping the human user express their objectives, helping the user guide or correct the AI, and display the outcome of the interactions.

The Inefficiency of Mimicking Human Interactions

Training AI agents to interact with software via graphical user interfaces (GUIs) such as what Anthropic has done involves complex image recognition, pixel counting, and simulated mouse movements. This not only consumes computational resources but also introduces potential errors. A slight change in the UI layout or an unexpected pop-up can confuse the AI, leading to failures in task execution.

Consider a scenario where an AI assistant needs to update a spreadsheet. Teaching it to navigate menus, click on cells, and input data as a human would is cumbersome. Instead, if the spreadsheet software provided an API for data manipulation, the AI could perform the task more efficiently and reliably.

Building Software for AI Agents

To overcome these inefficiencies, we should shift towards designing software that AI agents can interact with directly. This means extending existing applications or creating new ones with machine-readable interfaces that AI can access without relying on a GUI.

Although, it doesn't have to be REST APIs running locally, it could also be locally installed command line interfaces CLIs, I think the standardization and wide use of REST makes sense to use this structure. Conceivably, there could be a server running in the background and software designed for AI would be registered with the OS on installation and running on that server. So the agents would be able to easily see what functionality it had access to in order to accomplish a user's objective through internal software REST calls as well as using external calls to other APIs. The agents would then have local tools as well as external internet tools. This could be rolled out in various ways so that software that was made to this standard could coexist with traditional software (using the "agent  computer use" that Anthropic just released) or software could be released with both versions - the traditional software and software designed for AI.

As I've stated, it doesn't have to be a locally running background server with API endpoints, it could be some other kind of implementation, but there are plenty of examples of software that currently use background REST API calls. For example, you can have locally running versions of PostgreSQL, Gitlab, and Jenkins accessible through their API calls. Another good example is Home Assistant. You can use the Home Assistant API for home automation that can interact with various Home devices controlling lights, thermostats, and other devices through Home Assistant's API.

Example: An AI-Driven Spreadsheet

Here's a specific example of a productivity use case assuming it could be controlled through API endpoints (and Microsoft Excel already has a REST API). 

Imagine a spreadsheet application that offers a comprehensive API for data manipulation. An AI agent running locally on a machine could:

  • Read Data: Retrieve cell values, ranges, and metadata directly.
  • Write Data: Update cells, add formulas, and insert data without GUI interaction.
  • Analyze: Perform computations, generate charts, and identify trends through API calls.

The AI wouldn’t need to “see” the spreadsheet; it would understand its structure and content inherently, leading to faster and more accurate task completion. The AI doesn't need to "experience" the interface. It just needs efficient access to software and data.

Transforming Productivity Software

In this paradigm, productivity software undergoes a significant transformation.

Example: Code Development:

Instead of manually writing code, developers could specify functionality and constraints:

  • Intent: “Create a function that sorts a list of customer names alphabetically and removes duplicates.”
  • AI Agent’s Role: The AI generates the code, tests it, and presents it for review. The agent is capable of looking over the entire code base and design documents.
  • User Interaction: The developer reviews the code changes, provides feedback, accepts changes, and iterates as necessary.
We have been seeing this shift over the last few months already with software from ReplitV0 by Vercel, and Cursor.

Example: Research and Analysis:

Researchers could leverage AI agents to gather information and synthesize insights:

  • Intent: “Summarize the latest research on renewable energy storage solutions and prepare a presentation.”
  • AI Agent’s Role: The AI collects data from reputable sources, analyzes trends, and generates a presentation. These sources could be local sources or local functionality in other programs as well as internet sources.
  • User Interaction: The researcher reviews the content, adjusts focus areas, and finalizes the presentation.
This would all be done through AI agents working on the computer on data to meet objectives directly and not through manipulating a UI meant for humans.

While productivity software would see dramatic changes, entertainment software might remain largely unaffected. Games and media content are inherently human-centric experiences designed for enjoyment and engagement. However, AI could enhance these experiences by personalizing content or managing in-game assets based on user preferences.

An Operating System Designed for AI Agents

Taking this concept further, envision an operating system (OS) specifically designed for AI agents. AI legend Andrej Karpathy has proposed an AI-based operating system centered around large language models (LLMs) equipped with tools. I'm proposing taking this a step beyond that and saying that an OS should be developed that is explicitly AI-centric and agent-centric. This OS wouldn’t just be a platform for running applications but a dynamic environment where AI agents can perform complex operations seamlessly. 

While languages like Python and others already allow direct execution of OS commands, an API layer for operating system tasks could offer distinct advantages. By exposing OS functionalities—file management, network communication, process control, and more—through standardized RESTful APIs, such a system would provide language-agnostic access and simplify integration for AI agents built with diverse tools and frameworks.

This AI-centric OS would be more than a platform for executing applications; it would be a dynamic, modular environment tailored for agent-based interactions. By introducing an API layer, the OS could ensure consistent and secure access to its capabilities while abstracting the complexity of direct command execution. AI agents could leverage these APIs to interact with the OS in a predictable, scalable, and maintainable way, unlocking a new level of efficiency.

User Experience in an AI-Centric OS

For human users, interacting with such an OS would be fundamentally different:

  • Expressing Intent: Users would convey their objectives verbally or via text. For example, “Create an analysis of the last three months advertising effectiveness - make sure to take into account any competitive trends or exogenous variables.”
  • AI Execution: AI agents interpret these intents and execute tasks using the OS’s APIs.
  • Feedback and Control: Users receive updates on task progress and can intervene or adjust objectives as needed.
  • Output Consumption: Once tasks are completed, users engage with the results - organized files, generated reports, or synthesized research findings.

So Instead of simply equipping large language models (LLMs) with tools, the OS itself becomes a tool-rich environment specifically designed for agents. This architecture transforms the relationship between software, operating systems, and AI, creating a seamless bridge where agents can efficiently perform complex tasks while remaining secure, scalable, and adaptable to future advancements.

The Human-AI Interface: A New UI Paradigm

The user interface in this AI-centric world shifts from direct manipulation to intent expression and result consumption. Human interaction focuses on expressing goals or objectives, which the AI interprets and executes. Importantly, achieving a single human objective often requires multiple interdependent tasks, such as performing data analysis in Excel, creating a presentation in PowerPoint, and sharing the final output via email or communication platforms like Teams or Slack. However, AI agents are not constrained by the need to open or interact with these applications as humans do. In many cases, the AI doesn’t even require the software to be installed. Instead, it performs the necessary actions behind the scenes, generating the final outputs - such as .xlsx, .csv, .pptx, or PDF files—directly. By bypassing traditional application workflows, AI agents streamline the process, delivering results efficiently without the overhead of navigating human-centric software interfaces.

The actual human interface could be reduced down:

  • Command/Voice Interface: A simple input field or voice interface where users state their objectives.
  • Progress Feedback: Dashboards or notifications that keep users informed about task status.
  • Result Display: Outputs are presented in human-friendly formats - documents, visualizations, or actionable summaries.

Benefits of Designing Software for AI Agents

  • Efficiency: Eliminates the overhead of GUI navigation by AI agents.
  • Reliability: Reduces errors caused by UI changes or unexpected elements.
  • Scalability: AI agents can perform tasks faster and handle larger volumes of data.
  • User Empowerment: Users focus on defining goals rather than executing steps, enhancing productivity.

Embracing the Paradigm Shift

As AI becomes more integrated into our daily workflows, rethinking software design to accommodate AI agents is not just logical - it’s inevitable. This doesn't need to happen all at once. At first, there could be software with dual uses - human interfaces and AI data/functionality access. But by building software whereby AI communicates directly with data resources and functionality through APIs and an OS designed for AI use, we unlock the full potential of intelligent agents, streamline processes, and create a more efficient partnership between humans and AI agents.

It’s time to start thinking about moving beyond teaching AI to use our tools and start building software and an operating system designed for AI.



Wednesday, October 9, 2024

AI Nobel Future

In what I think is a remarkable moment for science, Geoff Hinton and John Hopfield have been awarded the Nobel Prize in Physics for their work in artificial intelligence. I believe this honor is well-deserved and perhaps long overdue. It recognizes their pioneering contributions to deep learning, which have transformed not only computer science but how we understand intelligence itself. But with Hinton’s and Hopfield's win, questions I have are: how long before an AI wins a Nobel Prize? Or perhaps a Fields Medal in mathematics?

Is it three years? Five? Maybe ten? It feels almost inevitable that an AI, or an AI-generated discovery, will reach the level of significance to deserve such recognition. After all, the progress in artificial intelligence has been nothing but incredible, with machines surpassing human capabilities in many specialized areas. Whether it's solving complex protein-folding problems, generating breakthrough materials, or devising new mathematical theorems, AI is rapidly moving from a powerful tool to a creator in its own right. If an AI were to produce a scientific or mathematical discovery independently, would that not qualify for the highest honor?

Of course, this depends on whether the Nobel or Fields committees will permit such recognition. For now, these prizes celebrate human ingenuity. They are a tribute to the spirit of exploration, curiosity, and perseverance that defines us as a species. But eventually, it might become harder to ignore contributions made by AIs that are at the frontier of knowledge - AIs that push the boundaries in ways we could hardly imagine. And then, perhaps a more unsettling question emerges: what happens when human achievements, even with the assistance of AI, simply aren't groundbreaking enough to compete?

Imagine a scenario where human contributions are relegated to the background - not because they aren't valuable, but because AI-driven research moves so fast and so far beyond what even the best human-AI collaborations can achieve. At that point, might it be an AI itself assessing the significance of work and awarding prizes? Could we reach a future where the human committee simply cannot grasp the intricacies of the methods used by these advanced intelligences, only understanding the results, much like how many of us only vaguely understand the complexities of advanced financial systems? Will there come a point where the arbiters of excellence are AIs themselves, judging the work of other AIs?

And then - perhaps most interestingly - what if these future AIs don't care about prizes at all? Prizes like the Nobel or Fields Medal are social constructs, deeply intertwined with our need for recognition, validation, and the celebration of human effort. But for an AI, recognition may be irrelevant. The motivation of an AI is, after all, whatever we program it to value, and eventually perhaps, whatever goals it determines for itself. It may simply pursue knowledge for the sake of optimizing some abstract function, free from the constraints of ego or desire for public acknowledgment. In such a world, the whole concept of awards may feel quaint - an artifact of an earlier, human-centered era of discovery.

For now, the Nobel Prizes and Fields Medals remain firmly in the hands of people, rewarding the best of human achievement. But as we move forward, the line between human and machine contribution will blur, and the nature of genius will evolve. Perhaps the greatest challenge will not be whether an AI can win a Nobel Prize, but whether we humans can gracefully adapt our definitions of achievement, excellence, and recognition to fit a world where we are no longer the only creators.

Friday, October 4, 2024

Transformer Attention: A Guide to the Q, K, and V Matrices

Understanding the Transformer Attention Mechanism

Transformers have revolutionized the way machines process language and other sequential data. At the heart of the Transformer architecture is a powerful mechanism called self-attention that was first described in the paper "Attention is All You Need." This self-attention mechanism allows the model to focus on different parts of the input sequence and weigh their importance when making predictions. To fully understand how this works, we need to dive into the matrices that drive it: Q (Query), K (Key), and V (Value)

But I have found that understanding the Q, K, and V matrices to be the most difficult part of the transformer model. It's not the math that is difficult, but what is difficult is understanding the "why" as much as the "how." Why do these matrices work? What do each of the matrices do? Why are there even three matrices? What is the intuition for all of this?

Okay so let's get started with a simple analogy:

Imagine you’re at a library, searching for books on a particular topic. You have a query in mind (what you're looking for) and the librarian has organized the library catalog by key attributes, such as genre, author, or publication date. Based on how well the attributes match your query, the librarian assigns a score to each book. Once the books are scored, the librarian returns the value—the actual content or summary of the top-scoring books that best match your query.

In this analogy:

  • Query (Q) is what you are searching for.
  • Key (K) represents the attributes of the books that help in scoring.
  • Value (V) is the information or content you get back from the top-matching books.

Now, let’s break down how these ideas translate to the actual self-attention mechanism in Transformers.

Self-Attention: The Basics

In self-attention, each word in a sentence (or token in a sequence) will interact with every other word to figure out how important they are to each other. For each word, a query, key, and value vector is created. The attention mechanism then works by calculating the importance of each word (key) to the word currently being processed (query), and using this information to weigh the corresponding values.

Let's say we have the sentence*:

"The cat sat on the mat."

Each word here will get its own Q, K, and V representation. The goal of the self-attention mechanism is to compute how much each word should attend to other words when making a prediction.

Breaking Down the Q, K, and V Matrices


1. Query (Q): What am I looking for?

The query represents the word we’re focusing on and asks the rest of the sentence, "How relevant are you to me?" Each word generates a different query matrix, and the higher the match with the keys, the more attention it gives to other words.

For example, let’s say our query is the word "cat." We want to know which other words in the sentence provide important information about the word "cat."

2. Key (K): What features do I have?

The key represents the characteristics of each word. Think of the key as each word shouting out, "Here’s what I’m about!" Other words in the sentence will compare their query against these keys to see if they should focus on them.

So, when we look at the key of "mat," it tells us something about the word's identity (perhaps it's an object or a location). Similarly, the key for "cat" might represent something like "animal" or "subject."

3. Value (V): What information do I carry?

The value contains the actual information of each word, like its meaning in the context of the sentence. Once the model has determined which words are important using the query-key matching process, it uses the value to inform the prediction.

For instance, if the query "cat" finds that "sat" is important based on the key, it will give more weight to the value of "sat" to help predict what comes next in the sentence.

Calculating Attention: Putting Q, K, and V Together

The actual attention score is calculated by taking the dot product of the query with all the keys. This gives us a score for how much focus the word (query) should place on each other word (key). The higher the score, the more attention that word receives.

Here’s a high level look at the math we are going to do:

  1. Dot product of Q and K: The query matrix of a word is multiplied with the key matrices of all the words in the sequence. This gives a score representing how much each word in the sentence should attend to the current word.

  2. Softmax: These scores are then passed through a softmax function, which normalizes them into probabilities (weights) that sum to 1. This step ensures that the attention is distributed in a meaningful way across all words.

  3. Weighted Sum of Values: The resulting attention weights are multiplied by the value matrices. This weighted sum gives us the final output for the word, which is used in the next layer of the Transformer model.

Example: "The cat sat on the mat."

Let’s walk through how the word "cat" might process the sentence using self-attention:

  1. Query (Q): The model generates a query matrix for "cat," representing what it’s looking for (e.g., context about an action related to the "cat").

  2. Key (K): Each word in the sentence has its own key. The word "sat," for instance, might have a key that highlights it as an action verb, making it relevant to the "cat."

  3. Dot Product: The query for "cat" is compared (via dot product) with the keys of all the words in the sentence. If "sat" has a high dot product with the query for "cat," it will get a high attention score.

  4. Softmax: The scores for all the words are normalized into probabilities, so "sat" might get a large share of the attention.

  5. Value (V): The values of the words are then weighted by the attention scores. Since "sat" got a high score, its value (which could include the action or tense) will have a bigger impact on the final representation of the word "cat."

The self-attention mechanism allows the Transformer to look at all parts of a sequence simultaneously and decide which parts are most important to focus on. This is especially powerful for tasks like translation, summarization, and language understanding because it doesn’t rely on processing the input one word at a time. Instead, it lets each word interact with every other word in the sequence, leading to a richer, more flexible understanding of context.

The transformer model is able to "pay attention" to the right information, just like a librarian matching your search with the right books. 

Let's walk through the math:

To make the Transformer self-attention mechanism more concrete, let's work through a simplified example using the sentence:

"The cat sat on the mat."

We'll assign simple numerical values to create embeddings, compute the Q (Query), K (Key), and V (Value) matrices, and see how the attention mechanism operates step by step.

Simplifications for the Example

  • Embedding Dimension: We'll use a small embedding size of 2 to keep calculations manageable. In real-world Transformer models, the embedding size is much larger to capture the complex semantic and syntactic nuances of language. These embeddings are learned during the training process, allowing the model to position semantically similar words closer together. Actual embeddings in real models have much larger dimensions (e.g., 512, 768, 2048, and higher) and are learned in a separate process from attention. But by using low-dimensional vectors it will highlight for us how the Query (Q), Key (K), and Value (V) matrices interact during the attention process.
  • Weights: We'll define simple weight matrices for Q, K, and V transformations.


Before we get into the step by step walkthrough of how attention is derived, a visual way to think of it is imagining the embeddings as vectors in a high-dimensional space. The weight matrices rotate, scale, or skew these vectors into new configurations (Q, K, V spaces). These transformations adjust the vectors so that the dot products between Query and Key vectors effectively measure the relevance or similarity between tokens. This alignment allows the model to compute attention scores that highlight important relationships, enabling it to determine which tokens are most significant to each other within the sequence. By doing so, the model can accurately capture complex dependencies and contextual nuances, such as grammatical structures and semantic meanings, enhancing its understanding of the input data. 


Step 1: Assign Word Embeddings

First, we assign embeddings to each word in the sentence. Again we are using simple pretend embeddings of size 2. A real embedding for cat might look something like: Embedding (Ecat):  [0.12, -0.03, 0.45, …, 0.07]

Okay, let's define our simple embeddings as follows:

Word    Embedding (E)
The        [1, 0]
cat        [0, 1]
sat        [1, 1]
on        [0, -1]
the        [1, 0]
mat        [0, 1]

(Note: For simplicity, "The" and "the" are treated the same.)

Step 2: Define Weight Matrices for Q, K, and V

We'll define weight matrices that transform embeddings into Q, K, and V matrices. These would be learned during training and would be floating point values. And again we are going to make up some numbers and keep the numbers simple.

Assume the weight matrices are as follows:

  • WQ (2x2 matrix): WQ=[1001]W_Q = \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix}
  • WK (2x2 matrix): WK=[0110]W_K = \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix}
  • WV (2x2 matrix): WV=[1111]W_V = \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix}

Step 3: Compute Q, K, and V for Each Word

For each word, we'll compute:

  • Qi = Ei * WQ
  • Ki = Ei * WK
  • Vi = Ei * WV

Let's compute these for each word.

Word: "The"

Embedding (Ethe): [1, 0]

Compute Qthe:

Qthe=Ethe×WQ=[1,0]×[1001]=[1,0]Q_{\text{the}} = E_{\text{the}} \times W_Q = [1, 0] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [1, 0]

Compute Kthe:

Kthe=Ethe×WK=[1,0]×[0110]=[0,1]K_{\text{the}} = E_{\text{the}} \times W_K = [1, 0] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [0, 1]

Compute Vthe:

Vthe=Ethe×WV=[1,0]×[1111]=[1,1]V_{\text{the}} = E_{\text{the}} \times W_V = [1, 0] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, 1]

Word: "cat"

Embedding (Ecat): [0, 1]

Compute Qcat:

Qcat=[0,1]×[1001]=[0,1]Q_{\text{cat}} = [0, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, 1]

Compute Kcat:

Kcat=[0,1]×[0110]=[1,0]K_{\text{cat}} = [0, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 0]

Compute Vcat:

Vcat=[0,1]×[1111]=[1,1]V_{\text{cat}} = [0, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, -1]

Word: "sat"

Embedding (Esat): [1, 1]

Compute Qsat:

Qsat=[1,1]×[1001]=[1,1]Q_{\text{sat}} = [1, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [1, 1]

Compute Ksat:

Ksat=[1,1]×[0110]=[1,1]K_{\text{sat}} = [1, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 1]

Compute Vsat:

Vsat=[1,1]×[1111]=[2,0]V_{\text{sat}} = [1, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [2, 0]

Word: "on"

Embedding (Eon): [0, -1]

Compute Qon:

Qon=[0,1]×[1001]=[0,1]Q_{\text{on}} = [0, -1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, -1]

Compute Kon:

Kon=[0,1]×[0110]=[1,0]K_{\text{on}} = [0, -1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [-1, 0]

Compute Von:

Von=[0,1]×[1111]=[1,1]V_{\text{on}} = [0, -1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [-1, 1]

Word: "the" (Again)

Same as before for "The".

Word: "mat"

Embedding (Emat): [0, 1]

Compute Qmat:

Qmat=[0,1]×[1001]=[0,1]Q_{\text{mat}} = [0, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, 1]

Compute Kmat:

Kmat=[0,1]×[0110]=[1,0]K_{\text{mat}} = [0, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 0]

Compute Vmat:

Vmat=[0,1]×[1111]=[1,1]V_{\text{mat}} = [0, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, -1]

Step 4: Compute Attention Scores

Now, we'll compute the attention scores for a target word. Let's focus on the word "cat" and see how it attends to other words in the sentence.

For the word "cat", we have:

  • Qcat = [0, 1]

We will compute the attention scores between "cat" and each word in the sentence by taking the dot product of Qcat with Ki for each word.

Calculating Dot Products

  1. Score between "cat" and "The":
Scorecat, The=QcatKthe=[0,1][0,1]=(0×0)+(1×1)=1\text{Score}_{\text{cat, The}} = Q_{\text{cat}} \cdot K_{\text{the}} = [0, 1] \cdot [0, 1] = (0 \times 0) + (1 \times 1) = 1
  1. Score between "cat" and "cat":
Scorecat, cat=QcatKcat=[0,1][1,0]=(0×1)+(1×0)=0\text{Score}_{\text{cat, cat}} = Q_{\text{cat}} \cdot K_{\text{cat}} = [0, 1] \cdot [1, 0] = (0 \times 1) + (1 \times 0) = 0
  1. Score between "cat" and "sat":
Scorecat, sat=[0,1][1,1]=(0×1)+(1×1)=1\text{Score}_{\text{cat, sat}} = [0, 1] \cdot [1, 1] = (0 \times 1) + (1 \times 1) = 1
  1. Score between "cat" and "on":
Scorecat, on=[0,1][1,0]=(0×1)+(1×0)=0\text{Score}_{\text{cat, on}} = [0, 1] \cdot [-1, 0] = (0 \times -1) + (1 \times 0) = 0
  1. Score between "cat" and "the":

        Same as with "The":

Scorecat, the=1\text{Score}_{\text{cat, the}} = 1
  1. Score between "cat" and "mat":
Scorecat, mat=[0,1][1,0]=0\text{Score}_{\text{cat, mat}} = [0, 1] \cdot [1, 0] = 0

Summary of Scores
Pair    Score
cat & The        1
cat & cat        0
cat & sat        1
cat & on        0
cat & the        1
cat & mat        0


** See note below about scaling these values


Step 5: Apply Softmax to Obtain Attention Weights

Next, we apply the softmax function to these scores to get attention weights.

The softmax function is defined as:

softmax(xi)=exijexj\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Compute the exponentials (this is easy and obvious with our numbers):

  • e1
  • e0

So the exponentials of the scores are:

Pair    Score    Exponential
cat & The        1        2.718
cat & cat        0            1
cat & sat        1        2.718
cat & on        0            1
cat & the        1        2.718
cat & mat        0            1

Compute the sum of exponentials:

Sum=2.718+1+2.718+1+2.718+1=11.154\text{Sum} = 2.718 + 1 + 2.718 + 1 + 2.718 + 1 = 11.154

Compute attention weights:

  • Weight(cat, The):
αcat, The=2.71811.1540.244\alpha_{\text{cat, The}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, cat):
αcat, cat=111.1540.090\alpha_{\text{cat, cat}} = \frac{1}{11.154} \approx 0.090
  • Weight(cat, sat):
αcat, sat=2.71811.1540.244\alpha_{\text{cat, sat}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, on):
αcat, on=111.1540.090\alpha_{\text{cat, on}} = \frac{1}{11.154} \approx 0.090
  • Weight(cat, the):
αcat, the=2.71811.1540.244\alpha_{\text{cat, the}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, mat):
αcat, mat=111.1540.090\alpha_{\text{cat, mat}} = \frac{1}{11.154} \approx 0.090

Summary of Attention Weights

Pair    Weight (α)
cat & The        0.244
cat & cat        0.090
cat & sat        0.244
cat & on        0.090
cat & the        0.244
cat & mat        0.090


Step 6: Compute the Weighted Sum of Values

Now, we use the attention weights to compute the weighted sum of the Value vectors.

Recall the Value vectors:

  • VThe: [1, 1]
  • Vcat: [1, -1]
  • Vsat: [2, 0]
  • Von: [-1, 1]
  • Vthe: [1, 1]
  • Vmat: [1, -1]

Compute the weighted sum:

Outputcat=iαcat, i×Vi\text{Output}_{\text{cat}} = \sum_{i} \alpha_{\text{cat, i}} \times V_i

Compute each term:

  1. cat & The:
0.244×[1,1]=[0.244,0.244]0.244 \times [1, 1] = [0.244, 0.244]
  1. cat & cat:
0.090×[1,1]=[0.090,0.090]0.090 \times [1, -1] = [0.090, -0.090]
  1. cat & sat:
0.244×[2,0]=[0.488,0.000]0.244 \times [2, 0] = [0.488, 0.000]
  1. cat & on:
0.090×[1,1]=[0.090,0.090]0.090 \times [-1, 1] = [-0.090, 0.090]
  1. cat & the:
0.244×[1,1]=[0.244,0.244]0.244 \times [1, 1] = [0.244, 0.244]
  1. cat & mat:
0.090×[1,1]=[0.090,0.090]0.090 \times [1, -1] = [0.090, -0.090]

Add up all these vectors:

Outputcat=[0.244,0.244]+[0.090,0.090]+[0.488,0.000]+[0.090,0.090]+[0.244,0.244]+[0.090,0.090]=[(0.244+0.090+0.4880.090+0.244+0.090),(0.2440.090+0.000+0.090+0.2440.090)]=[1.066,0.398]\begin{align*} \text{Output}_{\text{cat}} &= [0.244, 0.244] + [0.090, -0.090] + [0.488, 0.000] \\ &\quad + [-0.090, 0.090] + [0.244, 0.244] + [0.090, -0.090] \\ &= [(0.244 + 0.090 + 0.488 - 0.090 + 0.244 + 0.090), \\ &\quad (0.244 - 0.090 + 0.000 + 0.090 + 0.244 - 0.090)] \\ &= [1.066, 0.398] \end{align*}

So the output vector for "cat" after the attention mechanism is [1.066, 0.398].

Step 7: Interpretation

The output vector [1.066, 0.398] is a context-aware representation of the word "cat". It has incorporated information from other relevant words in the sentence, weighted by their importance as determined by the attention mechanism.

  • The higher weights given to "The", "sat", and "the" reflect their relevance to "cat" in this context.
  • The contributions from "on" and "mat" are smaller due to lower attention weights.

Generalizing to All Words

In a real Transformer, this process is performed for each word in the sentence, allowing every word to attend to every other word and capture the contextual relationships.

Some Almost Final Words About Attention

Earlier in this post, I said that:

Q represents "What am I looking for?"

K represents "What features do I have?"

V represents "What information do I carry?

But how exactly does Q, K, and V represent these questions?

We can answer the first two questions by considering the dot product. The dot product between Qi and Kj measures the similarity between  Qi  and  KjA higher dot product indicates a higher relevance or alignment between what token i is seeking and what token j offers. The dot product effectively answers: “How much does what I’m looking for (Q) align with what features you have (K)?”

Vj  is weighted by the attention scores αij and aggregated to form the output. These Vj vectors hold the information that is actually used to update or inform token i’s representation - the  Vj  vectors are the actual data that get combined to form the new representation of token i. In other words, after determining which tokens are relevant (via Q and K), the model needs to know what information to extract—this is provided by V.

Conclusion

Through this example, we've illustrated how:

  • Embeddings are transformed into Q, K, and V matrices using learned weight matrices.
  • Attention scores are computed using the dot product of Q and K.
  • Attention weights are derived by applying the softmax function to the scores.
  • Weighted sums of the Value vectors produce the output attention representations for each word.

This simplified demonstration shows how the self-attention mechanism enables a word to focus on relevant parts of the input sequence, effectively capturing the context needed for understanding and generating language.


Additional Resources

Here are some other resources beyond the original Attention paper that helped me in my understanding:

*This sentence, "The cat sat on the mat", I consider to be a well know example going back at least five years to papers on BERT and GPT2. This might be the earliest example of this sentence being used in a paper called "A Multiscale Visualization of Attention in the Transformer Model" by Jesse Vig.

**In high-dimensional vector spaces, which is the norm in transformer models, the dot product of two random vectors tends to have a larger magnitude because each dimension contributes to the total. This can result in attention scores that are large, pushing the softmax function into regions where it outputs very small gradients. Small gradients slow down learning because the model updates are minimal. By scaling down the dot products, we lesson this effect. The scaling factor √d effectively controls the variance of the dot product by scaling the dot product by the square root of the dimensionality of the Key vectors. This keeps the attention scores at a scale where the softmax function operates optimally, and the gradients remain at a magnitude conducive to learning. This isn't a problem in our trivial example here of vectors of size 2 so I chose not to put that in.

Here is the full attention formula where Qi and Kj are scaled by √d before having the softmax applied: α i j = softmax ( Q i K j d k ) \alpha_{ij} = \text{softmax}\left( \frac{Q_i \cdot K_j}{\sqrt{d_k}} \right)

Elements of Monte Carlo Tree Search - Typical and Non-typical Applications

Monte Carlo Tree Search (MCTS) offers a very intuitive way of tackling challenging decision making problems. In essence, MCTS combines the...