Wednesday, October 9, 2024

AI Nobel Future

In a remarkable moment for science, Geoff Hinton and John Hopfield have been awarded the Nobel Prize in Physics for their work in artificial intelligence. This honor, well-deserved and perhaps long overdue, recognizes their pioneering contributions to deep learning, which have transformed not only computer science but how we understand intelligence itself. But Hinton's and Hopfield's win raises a provocative question: how long before an AI wins a Nobel Prize? Or perhaps a Fields Medal in mathematics?

Is it three years? Five? Maybe ten? It feels almost inevitable that an AI, or an AI-generated discovery, will reach a level of significance that deserves such recognition. After all, the progress in artificial intelligence has been nothing short of incredible, with machines surpassing human capabilities in many specialized areas. Whether it's solving complex protein-folding problems, generating breakthrough materials, or devising new mathematical theorems, AI is rapidly moving from a powerful tool to a creator in its own right. If an AI were to produce a scientific or mathematical discovery independently, would that not qualify for the highest honor?

Of course, this depends on whether the Nobel or Fields committees will permit such recognition. For now, these prizes celebrate human ingenuity. They are a tribute to the spirit of exploration, curiosity, and perseverance that defines us as a species. But eventually, it might become harder to ignore contributions made by AIs that are at the frontier of knowledge—AIs that push the boundaries in ways we could hardly imagine. And then, perhaps a more unsettling question emerges: what happens when human achievements, even with the assistance of AI, simply aren't groundbreaking enough to compete?

Imagine a scenario where human contributions are relegated to the background—not because they aren't valuable, but because AI-driven research moves so fast and so far beyond what even the best human-AI collaborations can achieve. At that point, might it be an AI itself assessing the significance of work and awarding prizes? Could we reach a future where the human committee simply cannot grasp the intricacies of the methods used by these advanced intelligences, only understanding the results, much like how many of us only vaguely understand the complexities of advanced financial systems? Will there come a point where the arbiters of excellence are AIs themselves, judging the work of other AIs?

And then—perhaps most fascinatingly—what if these future AIs don't care about prizes at all? Prizes like the Nobel or Fields Medal are social constructs, deeply intertwined with our need for recognition, validation, and the celebration of human effort. But for an AI, recognition may be irrelevant. The motivation of an AI is, after all, whatever we program it to value, and eventually perhaps, whatever goals it determines for itself. It may simply pursue knowledge for the sake of optimizing some abstract function, free from the constraints of ego or desire for public acknowledgment. In such a world, the whole concept of awards may feel quaint—an artifact of an earlier, human-centered era of discovery.

For now, the Nobel Prizes and Fields Medals remain firmly in the hands of people, rewarding the best of human achievement. But as we move forward, the line between human and machine contribution will blur, and the nature of genius will evolve. Perhaps the greatest challenge will not be whether an AI can win a Nobel Prize, but whether we humans can gracefully adapt our definitions of achievement, excellence, and recognition to fit a world where we are no longer the only creators.

Friday, October 4, 2024

Transformer Attention: A Guide to the Q, K, and V Matrices

Understanding the Transformer Attention Mechanism

Transformers have revolutionized the way machines process language and other sequential data. At the heart of the Transformer architecture is a powerful mechanism called self-attention that was first described in the paper "Attention is All You Need." This self-attention mechanism allows the model to focus on different parts of the input sequence and weigh their importance when making predictions. To fully understand how this works, we need to dive into the matrices that drive it: Q (Query), K (Key), and V (Value).

But I have found understanding the Q, K, and V matrices to be the most difficult part of the transformer model. It's not the math that is difficult; what is difficult is understanding the "why" as much as the "how." Why do these matrices work? What does each of the matrices do? Why are there even three matrices? What is the intuition for all of this?

Okay so let's get started with a simple analogy:

Imagine you’re at a library, searching for books on a particular topic. You have a query in mind (what you're looking for) and the librarian has organized the library catalog by key attributes, such as genre, author, or publication date. Based on how well the attributes match your query, the librarian assigns a score to each book. Once the books are scored, the librarian returns the value—the actual content or summary of the top-scoring books that best match your query.

In this analogy:

  • Query (Q) is what you are searching for.
  • Key (K) represents the attributes of the books that help in scoring.
  • Value (V) is the information or content you get back from the top-matching books.

Now, let’s break down how these ideas translate to the actual self-attention mechanism in Transformers.

Self-Attention: The Basics

In self-attention, each word in a sentence (or token in a sequence) will interact with every other word to figure out how important they are to each other. For each word, a query, key, and value vector is created. The attention mechanism then works by calculating the importance of each word (key) to the word currently being processed (query), and using this information to weigh the corresponding values.

Let's say we have the sentence*:

"The cat sat on the mat."

Each word here will get its own Q, K, and V representation. The goal of the self-attention mechanism is to compute how much each word should attend to other words when making a prediction.

Breaking Down the Q, K, and V Matrices


1. Query (Q): What am I looking for?

The query represents the word we’re focusing on and asks the rest of the sentence, "How relevant are you to me?" Each word generates its own query vector, and the higher the match between that query and another word's key, the more attention it pays to that word.

For example, let’s say our query is the word "cat." We want to know which other words in the sentence provide important information about the word "cat."

2. Key (K): What features do I have?

The key represents the characteristics of each word. Think of the key as each word shouting out, "Here’s what I’m about!" Other words in the sentence will compare their query against these keys to see if they should focus on them.

So, when we look at the key of "mat," it tells us something about the word's identity (perhaps it's an object or a location). Similarly, the key for "cat" might represent something like "animal" or "subject."

3. Value (V): What information do I carry?

The value contains the actual information of each word, like its meaning in the context of the sentence. Once the model has determined which words are important using the query-key matching process, it uses the value to inform the prediction.

For instance, if the query "cat" finds that "sat" is important based on the key, it will give more weight to the value of "sat" to help predict what comes next in the sentence.

Calculating Attention: Putting Q, K, and V Together

The actual attention score is calculated by taking the dot product of the query with all the keys. This gives us a score for how much focus the word (query) should place on each other word (key). The higher the score, the more attention that word receives.

Here’s a high-level look at the math we are going to do (a short code sketch of these three steps follows the list):

  1. Dot product of Q and K: The query vector of a word is multiplied (via dot product) with the key vectors of all the words in the sequence. This gives a score representing how much the current word should attend to each word in the sentence.

  2. Softmax: These scores are then passed through a softmax function, which normalizes them into probabilities (weights) that sum to 1. This step ensures that the attention is distributed in a meaningful way across all words.

  3. Weighted Sum of Values: The resulting attention weights are multiplied by the value matrices. This weighted sum gives us the final output for the word, which is used in the next layer of the Transformer model.
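To make these three steps concrete before we get to the worked example, here is a minimal NumPy sketch (my own illustration, not code from the paper) that takes already-computed Q, K, and V matrices and runs them through the dot product, softmax, and weighted sum:

    import numpy as np

    def attention(Q, K, V):
        # 1. Dot product of Q and K: one row of raw relevance scores per query token
        scores = Q @ K.T
        # 2. Softmax: normalize each row of scores into weights that sum to 1
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        # 3. Weighted sum of values: each output row is a blend of the V rows
        return weights @ V

(The division by √dk used in real Transformers is discussed in the scaling note at the end of this post; I leave it out here to match the toy example.)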

Example: "The cat sat on the mat."

Let’s walk through how the word "cat" might process the sentence using self-attention:

  1. Query (Q): The model generates a query vector for "cat," representing what it’s looking for (e.g., context about an action related to the "cat").

  2. Key (K): Each word in the sentence has its own key. The word "sat," for instance, might have a key that highlights it as an action verb, making it relevant to the "cat."

  3. Dot Product: The query for "cat" is compared (via dot product) with the keys of all the words in the sentence. If "sat" has a high dot product with the query for "cat," it will get a high attention score.

  4. Softmax: The scores for all the words are normalized into probabilities, so "sat" might get a large share of the attention.

  5. Value (V): The values of the words are then weighted by the attention scores. Since "sat" got a high score, its value (which could include the action or tense) will have a bigger impact on the final representation of the word "cat."

The self-attention mechanism allows the Transformer to look at all parts of a sequence simultaneously and decide which parts are most important to focus on. This is especially powerful for tasks like translation, summarization, and language understanding because it doesn’t rely on processing the input one word at a time. Instead, it lets each word interact with every other word in the sequence, leading to a richer, more flexible understanding of context.

The transformer model is able to "pay attention" to the right information, just like a librarian matching your search with the right books. 

Let's walk through the math:

To make the Transformer self-attention mechanism more concrete, let's work through a simplified example using the sentence:

"The cat sat on the mat."

We'll assign simple numerical values to create embeddings, compute the Q (Query), K (Key), and V (Value) matrices, and see how the attention mechanism operates step by step.

Simplifications for the Example

  • Embedding Dimension: We'll use a small embedding size of 2 to keep calculations manageable. In real Transformer models, embeddings are much larger (e.g., 512, 768, 2048, or higher) to capture the complex semantic and syntactic nuances of language, and they are learned along with the rest of the model during training, which positions semantically similar words closer together. Using low-dimensional vectors here highlights how the Query (Q), Key (K), and Value (V) matrices interact during the attention process.
  • Weights: We'll define simple weight matrices for Q, K, and V transformations.


Before we get into the step by step walkthrough of how attention is derived, a visual way to think of it is imagining the embeddings as vectors in a high-dimensional space. The weight matrices rotate, scale, or skew these vectors into new configurations (Q, K, V spaces). These transformations adjust the vectors so that the dot products between Query and Key vectors effectively measure the relevance or similarity between tokens. This alignment allows the model to compute attention scores that highlight important relationships, enabling it to determine which tokens are most significant to each other within the sequence. By doing so, the model can accurately capture complex dependencies and contextual nuances, such as grammatical structures and semantic meanings, enhancing its understanding of the input data. 


Step 1: Assign Word Embeddings

First, we assign embeddings to each word in the sentence. Again we are using simple pretend embeddings of size 2. A real embedding for cat might look something like: Embedding (Ecat):  [0.12, -0.03, 0.45, …, 0.07]

Okay, let's define our simple embeddings as follows:

Word    Embedding (E)
The        [1, 0]
cat        [0, 1]
sat        [1, 1]
on        [0, -1]
the        [1, 0]
mat        [0, 1]

(Note: For simplicity, "The" and "the" are treated the same.)
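If you want to follow along in code, here is one way to write down these toy embeddings in NumPy (a sketch using the made-up numbers above, with one row per word in sentence order):

    import numpy as np

    words = ["The", "cat", "sat", "on", "the", "mat"]
    E = np.array([
        [1, 0],   # The
        [0, 1],   # cat
        [1, 1],   # sat
        [0, -1],  # on
        [1, 0],   # the
        [0, 1],   # mat
    ], dtype=float)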

Step 2: Define Weight Matrices for Q, K, and V

We'll define weight matrices that transform the embeddings into Q, K, and V vectors. In a real model these would be learned during training and contain floating-point values; here we'll again make up simple numbers.

Assume the weight matrices are as follows:

  • WQ (2x2 matrix): W_Q = \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix}
  • WK (2x2 matrix): W_K = \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix}
  • WV (2x2 matrix): W_V = \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix}
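Continuing the code sketch, the same made-up weight matrices look like this:

    W_Q = np.array([[1, 0], [0, 1]], dtype=float)   # identity, so Q is just the embedding
    W_K = np.array([[0, 1], [1, 0]], dtype=float)   # swaps the two embedding components
    W_V = np.array([[1, 1], [1, -1]], dtype=float)  # mixes the components into a sum and a difference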

Step 3: Compute Q, K, and V for Each Word

For each word, we'll compute:

  • Qi = Ei * WQ
  • Ki = Ei * WK
  • Vi = Ei * WV

Let's compute these for each word.

Word: "The"

Embedding (Ethe): [1, 0]

Compute Qthe:

Q_{\text{the}} = E_{\text{the}} \times W_Q = [1, 0] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [1, 0]

Compute Kthe:

K_{\text{the}} = E_{\text{the}} \times W_K = [1, 0] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [0, 1]

Compute Vthe:

V_{\text{the}} = E_{\text{the}} \times W_V = [1, 0] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, 1]

Word: "cat"

Embedding (Ecat): [0, 1]

Compute Qcat:

Q_{\text{cat}} = [0, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, 1]

Compute Kcat:

K_{\text{cat}} = [0, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 0]

Compute Vcat:

V_{\text{cat}} = [0, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, -1]

Word: "sat"

Embedding (Esat): [1, 1]

Compute Qsat:

Q_{\text{sat}} = [1, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [1, 1]

Compute Ksat:

K_{\text{sat}} = [1, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 1]

Compute Vsat:

V_{\text{sat}} = [1, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [2, 0]

Word: "on"

Embedding (Eon): [0, -1]

Compute Qon:

Q_{\text{on}} = [0, -1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, -1]

Compute Kon:

K_{\text{on}} = [0, -1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [-1, 0]

Compute Von:

V_{\text{on}} = [0, -1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [-1, 1]

Word: "the" (Again)

Same as before for "The".

Word: "mat"

Embedding (Emat): [0, 1]

Compute Qmat:

Q_{\text{mat}} = [0, 1] \times \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix} = [0, 1]

Compute Kmat:

K_{\text{mat}} = [0, 1] \times \begin{bmatrix}0 & 1 \\ 1 & 0\end{bmatrix} = [1, 0]

Compute Vmat:

V_{\text{mat}} = [0, 1] \times \begin{bmatrix}1 & 1 \\ 1 & -1\end{bmatrix} = [1, -1]
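All of the per-word calculations above can be reproduced at once with three matrix multiplications (continuing the sketch, with E, W_Q, W_K, and W_V defined earlier):

    Q = E @ W_Q   # row i is the Query vector for the i-th word
    K = E @ W_K   # row i is the Key vector for the i-th word
    V = E @ W_V   # row i is the Value vector for the i-th word
    # e.g., Q[1] = [0, 1] for "cat", K[3] = [-1, 0] for "on", V[2] = [2, 0] for "sat"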

Step 4: Compute Attention Scores

Now, we'll compute the attention scores for a target word. Let's focus on the word "cat" and see how it attends to other words in the sentence.

For the word "cat", we have:

  • Qcat = [0, 1]

We will compute the attention scores between "cat" and each word in the sentence by taking the dot product of Qcat with Ki for each word.

Calculating Dot Products

  1. Score between "cat" and "The":
\text{Score}_{\text{cat, The}} = Q_{\text{cat}} \cdot K_{\text{the}} = [0, 1] \cdot [0, 1] = (0 \times 0) + (1 \times 1) = 1

  2. Score between "cat" and "cat":
\text{Score}_{\text{cat, cat}} = Q_{\text{cat}} \cdot K_{\text{cat}} = [0, 1] \cdot [1, 0] = (0 \times 1) + (1 \times 0) = 0

  3. Score between "cat" and "sat":
\text{Score}_{\text{cat, sat}} = [0, 1] \cdot [1, 1] = (0 \times 1) + (1 \times 1) = 1

  4. Score between "cat" and "on":
\text{Score}_{\text{cat, on}} = [0, 1] \cdot [-1, 0] = (0 \times -1) + (1 \times 0) = 0

  5. Score between "cat" and "the":
Same as with "The":
\text{Score}_{\text{cat, the}} = 1

  6. Score between "cat" and "mat":
\text{Score}_{\text{cat, mat}} = [0, 1] \cdot [1, 0] = 0

Summary of Scores
Pair    Score
cat & The        1
cat & cat        0
cat & sat        1
cat & on        0
cat & the        1
cat & mat        0


** See note below about scaling these values
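In code, these six scores are simply the dot products of the "cat" query with every row of K (continuing the sketch):

    q_cat = Q[1]            # the query vector for "cat" (index 1 in sentence order)
    scores_cat = K @ q_cat  # one score per word in the sentence
    # scores_cat = [1., 0., 1., 0., 1., 0.]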


Step 5: Apply Softmax to Obtain Attention Weights

Next, we apply the softmax function to these scores to get attention weights.

The softmax function is defined as:

\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Compute the exponentials (this is easy and obvious with our numbers):

  • e^1 ≈ 2.718
  • e^0 = 1

So the exponentials of the scores are:

Pair    Score    Exponential
cat & The        1        2.718
cat & cat        0            1
cat & sat        1        2.718
cat & on        0            1
cat & the        1        2.718
cat & mat        0            1

Compute the sum of exponentials:

\text{Sum} = 2.718 + 1 + 2.718 + 1 + 2.718 + 1 = 11.154

Compute attention weights:

  • Weight(cat, The):
\alpha_{\text{cat, The}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, cat):
\alpha_{\text{cat, cat}} = \frac{1}{11.154} \approx 0.090
  • Weight(cat, sat):
\alpha_{\text{cat, sat}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, on):
\alpha_{\text{cat, on}} = \frac{1}{11.154} \approx 0.090
  • Weight(cat, the):
\alpha_{\text{cat, the}} = \frac{2.718}{11.154} \approx 0.244
  • Weight(cat, mat):
\alpha_{\text{cat, mat}} = \frac{1}{11.154} \approx 0.090

Summary of Attention Weights

Pair    Weight (α)
cat & The        0.244
cat & cat        0.090
cat & sat        0.244
cat & on        0.090
cat & the        0.244
cat & mat        0.090
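The softmax step is a one-liner on the score vector, and rounding the result reproduces the weights in the table above (continuing the sketch):

    weights_cat = np.exp(scores_cat) / np.exp(scores_cat).sum()
    # weights_cat ≈ [0.244, 0.090, 0.244, 0.090, 0.244, 0.090]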


Step 6: Compute the Weighted Sum of Values

Now, we use the attention weights to compute the weighted sum of the Value vectors.

Recall the Value vectors:

  • VThe: [1, 1]
  • Vcat: [1, -1]
  • Vsat: [2, 0]
  • Von: [-1, 1]
  • Vthe: [1, 1]
  • Vmat: [1, -1]

Compute the weighted sum:

\text{Output}_{\text{cat}} = \sum_{i} \alpha_{\text{cat}, i} \times V_i

Compute each term:

  1. cat & The:
0.244 \times [1, 1] = [0.244, 0.244]
  2. cat & cat:
0.090 \times [1, -1] = [0.090, -0.090]
  3. cat & sat:
0.244 \times [2, 0] = [0.488, 0.000]
  4. cat & on:
0.090 \times [-1, 1] = [-0.090, 0.090]
  5. cat & the:
0.244 \times [1, 1] = [0.244, 0.244]
  6. cat & mat:
0.090 \times [1, -1] = [0.090, -0.090]

Add up all these vectors:

\begin{align*} \text{Output}_{\text{cat}} &= [0.244, 0.244] + [0.090, -0.090] + [0.488, 0.000] \\ &\quad + [-0.090, 0.090] + [0.244, 0.244] + [0.090, -0.090] \\ &= [(0.244 + 0.090 + 0.488 - 0.090 + 0.244 + 0.090), \\ &\quad (0.244 - 0.090 + 0.000 + 0.090 + 0.244 - 0.090)] \\ &= [1.066, 0.398] \end{align*}

So the output vector for "cat" after the attention mechanism is [1.066, 0.398].
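In the code sketch, this weighted sum is a single matrix-vector product:

    output_cat = weights_cat @ V   # weighted sum of the Value vectors
    # output_cat ≈ [1.066, 0.398]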

Step 7: Interpretation

The output vector [1.066, 0.398] is a context-aware representation of the word "cat". It has incorporated information from other relevant words in the sentence, weighted by their importance as determined by the attention mechanism.

  • The higher weights given to "The", "sat", and "the" reflect their relevance to "cat" in this context.
  • The contributions from "cat" itself, "on", and "mat" are smaller due to lower attention weights.

Generalizing to All Words

In a real Transformer, this process is performed for each word in the sentence, allowing every word to attend to every other word and capture the contextual relationships.
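In code, attending for every word at once is one matrix expression rather than a per-word loop; this sketch reproduces the whole toy example, and row 1 of the result matches the "cat" output we derived by hand:

    scores = Q @ K.T                                                       # 6x6 matrix of scores
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-wise softmax
    outputs = weights @ V                                                  # 6x2 context-aware vectors
    # outputs[1] ≈ [1.066, 0.398], the representation of "cat"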

Some Almost Final Words About Attention

Earlier in this post, I said that:

Q represents "What am I looking for?"

K represents "What features do I have?"

V represents "What information do I carry?"

But how exactly do Q, K, and V represent these questions?

We can answer the first two questions by considering the dot product. The dot product between Qi and Kj measures the similarity between Qi and Kj. A higher dot product indicates a higher relevance or alignment between what token i is seeking and what token j offers. The dot product effectively answers: “How much does what I’m looking for (Q) align with what features you have (K)?”

Vj  is weighted by the attention scores αij and aggregated to form the output. These Vj vectors hold the information that is actually used to update or inform token i’s representation - the  Vj  vectors are the actual data that get combined to form the new representation of token i. In other words, after determining which tokens are relevant (via Q and K), the model needs to know what information to extract—this is provided by V.

Conclusion

Through this example, we've illustrated how:

  • Embeddings are transformed into Q, K, and V matrices using learned weight matrices.
  • Attention scores are computed using the dot product of Q and K.
  • Attention weights are derived by applying the softmax function to the scores.
  • Weighted sums of the Value vectors produce the output attention representations for each word.

This simplified demonstration shows how the self-attention mechanism enables a word to focus on relevant parts of the input sequence, effectively capturing the context needed for understanding and generating language.


Additional Resources

Here are some other resources beyond the original Attention paper that helped me in my understanding:

*This sentence, "The cat sat on the mat," is what I consider a well-known example going back at least five years to papers on BERT and GPT-2. One of the earliest uses of this sentence might be in the paper "A Multiscale Visualization of Attention in the Transformer Model" by Jesse Vig.

**In high-dimensional vector spaces, which are the norm in transformer models, the dot product of two random vectors tends to have a larger magnitude because each dimension contributes to the total. This can result in attention scores that are large, pushing the softmax function into regions where it outputs very small gradients. Small gradients slow down learning because the model updates are minimal. By scaling down the dot products, we lessen this effect. Dividing by √dk, the square root of the dimensionality of the Key vectors, effectively controls the variance of the dot product. This keeps the attention scores at a scale where the softmax function operates well, and the gradients remain at a magnitude conducive to learning. This isn't a problem in our trivial example with vectors of size 2, so I left it out.

Here is the full attention formula, where the dot product of Qi and Kj is divided by √dk before the softmax is applied:

\alpha_{ij} = \text{softmax}\left( \frac{Q_i \cdot K_j}{\sqrt{d_k}} \right)
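In code, the only change from the unscaled sketch used throughout this post is dividing the scores by √dk before the softmax:

    d_k = K.shape[-1]                          # dimensionality of the Key vectors (2 in the toy example)
    scaled_scores = (Q @ K.T) / np.sqrt(d_k)
    weights = np.exp(scaled_scores) / np.exp(scaled_scores).sum(axis=1, keepdims=True)
    outputs = weights @ V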

Monday, September 9, 2024

"Superhuman" Forecasting?

The Center for AI Safety just released something called Superhuman Automated Forecasting. This is very exciting to me, because I've been saying this was possible for quite some time. It could probably be done even better, and by next year it will be much better: I believe the training cutoff date will eventually be eliminated through ideas like Mixture of a Million Experts, so it won't have to rely on web searches, and the reasoning capabilities of the models coming out by the end of this year or next year will be much stronger.

Combine this idea with autonomous agents, which I believe can do simulated market research, from traditional questionnaire research to focus groups, and the cost of market research plummets while the insights arrive in real time. Right now, scaling autonomous agents up to large numbers is difficult without them losing sight of their goals, but that will soon change, and the number of agents will scale from the tens to the millions, all built from sampled detailed data.

Here is another paper on the topic that I need to dig in on called "Approaching Human-Level Forecasting with Language Models."

Oh, and CAIS is calling it "539", which is hilarious.

Friday, August 9, 2024

IRAK4: The Immune System’s Key Player and Its Growing Role in Cancer Therapy

In our work using artificial intelligence in small molecule drug discovery, we look at potential protein targets for specific disease indications, and one of the more interesting potential targets we've been working on is IRAK4 (interleukin-1 receptor-associated kinase 4).

But before I get into some of the interesting characteristics of IRAK4, let's step back and talk about the general role of proteins in disease. Proteins are the workhorses of the cell, playing essential roles in just about every biological process you can think of. Whether they’re signaling, transporting, or catalyzing reactions, proteins are critical to keeping our bodies functioning properly. Because of their central role, they often become prime targets for new drugs, especially when things go wrong and contribute to diseases.

So, IRAK4 is one of these proteins that is a key player in our immune system, especially when it comes to signaling inside our cells. It’s really important when our body needs to respond to threats, working with Toll-like receptors (TLRs) and interleukin-1 receptors (IL-1R) to kickstart our immune response. Basically, when these receptors get activated, IRAK4 teams up with other proteins like IRAK1/IRAK2 and MyD88 to form this big complex called the myddosome. This whole setup then triggers signals that lead to the activation of pathways like NF-κB and MAPK, which are very important for inflammation, cell survival, and keeping our immune system on point.

Now, while IRAK4 is essential for keeping our immune system in check, things can go south if its signaling gets messed up. This has been linked to various cancers, especially blood-related ones like myelodysplastic syndrome (MDS) and acute myeloid leukemia (AML). In these cases, the IRAK4 pathway gets overactive, helping cancer cells survive and multiply. Recent studies have found that mutations in certain splicing factors (like U2AF1 and SF3B1) can lead to the production of a longer and more active version of IRAK4, called IRAK4-L, which really cranks up the NF-κB signaling.

Because of its role in cancer, scientists are now eyeing IRAK4 as a potential target for new treatments. Early studies have shown that blocking IRAK4 can have anti-cancer effects in different models. Plus, when you combine IRAK4 inhibitors with other treatments (like FLT3 inhibitors in AML or BTK inhibitors in certain lymphomas), they seem to work even better together. This has led to the search for small molecules that bind to the protein IRAK4 and inhibit its abnormal activity.

Research by Dr. Daniel Starczynowski at Cincinnati Children's Hospital and many others is showing that IRAK4 inhibition, on its own or in combination with other drugs, has potential for treating blood cancers like non-Hodgkin lymphomas, AML, and high-risk MDS. Furthermore, if a drug or combination of drugs can also target the related kinases FLT3 and CLK, there is potentially an even greater benefit.

But it’s not just blood cancers that might benefit from IRAK4 inhibition. Researchers are also looking into its role in solid tumors, especially tough ones like pancreatic ductal adenocarcinoma (PDAC) and colorectal cancer. In these cases, IRAK4 activation has been linked to worse outcomes and resistance to treatment, which means blocking IRAK4 could be a new way to tackle these cancers.

IRAK4’s role in immune signaling and inflammation also means it could be useful beyond just cancer. It might even help in treating other inflammatory or autoimmune conditions such as rheumatoid arthritis. As research continues, we’re learning more about how IRAK4 works and its potential in new therapies, not just for cancer but for other diseases too.

We believe that artificial intelligence can be used in a variety of ways to help researchers in their work on small molecule drug discovery for protein targets like IRAK4. And given the recent research on the importance of IRAK4 in these diseases, and how serious these diseases are, it is imperative to use all available tools to develop drug therapies as quickly and as safely as possible for patients.


Wednesday, July 24, 2024

Unleashing the Power of a Million Experts: A Breakthrough in Language Model Efficiency

Perhaps one of the most important papers for large language models (LLMs) was released this month: "Mixture of a Million Experts" by Xu Owen He. I think this paper may be the most important advance for LLMs since the publication of "Attention is All You Need" by Vaswani et al. (2017). The idea in this paper is what I was referring to in my recent post called "Move 37," where I talked about possible future improvements to LLMs.

The idea of a million experts is an extension of, and an improvement over, current "Mixture of Experts" (MoE) architectures. MoE has emerged as a favored approach for expanding the capabilities of large language models (LLMs) while managing computational expense. Rather than engaging the full model for each input, MoE systems direct data to compact, specialized "expert" components. This strategy allows LLMs to grow in parameter count without a corresponding surge in inference costs. Several prominent LLMs incorporate the MoE architecture, and it is reportedly used in GPT-4.

Despite these advantages, existing MoE methods face constraints that limit their scalability to a relatively modest number of experts. These limitations have prompted researchers to explore more efficient ways to leverage larger expert pools.

Xu Owen He from Google DeepMind introduces an innovative approach that could dramatically improve the efficiency of these models while maintaining or even enhancing their performance. (The "Attention" paper, for its part, came out of Google Brain and Google Research; Google Brain has since been merged into Google DeepMind.)

The historical problem of LLMs is that as these models grow larger, they become more capable but also more computationally expensive to train and run. This creates barriers to their widespread use and further development. The paper proposes a Parameter Efficient Expert Retrieval (PEER) architecture that addresses this challenge by enabling models to efficiently utilize over a million tiny "expert" neural networks, potentially unlocking new levels of performance without proportional increases in computational costs. 

Unlike traditional approaches that use a small number of large experts, this fine-grained mixture of experts, PEER, employs a vast number (over a million) of tiny experts, each consisting of just a single neuron. By activating only a small subset of experts for each input, PEER maintains a low computational cost while having access to a much larger total parameter count. It accomplishes this by introducing a novel product key technique for expert retrieval, allowing the model to efficiently select relevant experts from this huge pool.

The implications of this architecture are far-reaching:

  • Scaling Language Models: PEER could enable the development of much larger and more capable language models without proportional increases in computational requirements. This could accelerate progress in natural language processing and AI more broadly.
  • Democratization of AI: By improving efficiency, this approach could make advanced language models more accessible to researchers and organizations with limited computational resources.
  • Lifelong Learning: The authors suggest that this architecture could be particularly well-suited for lifelong learning scenarios, where new experts can be added over time to adapt to new data without forgetting old information. Imagine a model that has no knowledge cutoff. It is constantly learning and knowledgeable about what is going on in the world. This would open up new use cases for LLMs.
  • Energy Efficiency: More efficient models could lead to reduced energy consumption in AI applications, contributing to more sustainable AI development, and could help reduce inference costs.
  • Overcome model forgetting: With over a million tiny experts, PEER allows for a highly distributed representation of knowledge. Each expert can specialize in a particular aspect of the task or domain, reducing the likelihood of new information overwriting existing knowledge.

Of course this is still early-stage research. Further work will be needed to fully understand the implications and potential limitations of this approach across a wider range of tasks and model scales. But this paper represents a significant step forward in improving the efficiency of large language models. By enabling models to efficiently leverage vast numbers of specialized neural networks, it could pave the way for the next generation of more powerful and accessible AI systems. 
