Thursday, May 25, 2023

Andrej Karpathy's Explanations of GPT

For deeply understanding GPT models like the ones from OpenAI, I think there is no one who explains them better than Andrej Karpathy. He was part of the original group at OpenAI, then left to become director of AI at Tesla, and since February he has been back at OpenAI. He also makes YouTube videos explaining the underlying technology.

I want to point out two recent videos of his that are essential if you really want to understand how something like ChatGPT works.

The first is the talk he just gave at Microsoft Build. There were some important announcements at MS Build 2023 regarding AI, and I encourage you to check them out, but Andrej's talk, even though it wasn’t an announcement talk, should be really valuable for anyone using a GPT-based tool. In the first part he explains at a high level how GPT works, and in the second part he explains why different types of prompts work.


I generally have a problem with the term “prompt engineering” because it’s not engineering, and getting what you want from an AI is often just common sense. But admittedly it does involve understanding how working with GPT as an AI assistant differs from communicating with a human assistant. Andrej explains prompt strategies like “chain of thought,” using an ensemble of multiple attempts, reflection, and simulating how humans reason toward an optimal solution. He also talks a little about AutoGPT and the hype surrounding it, but says that it’s “inspirationally interesting.” He also mentions a paper that just came out, called “Tree of Thoughts,” about using a tree search in prompting.
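The “ensemble of multiple attempts” idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampled answers are hard-coded stand-ins for what a model might return (no real API is called), and the function name majority_vote is made up for this example.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer among several sampled attempts."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize lightly so "42" and "42 " count as the same answer.
    cleaned = [a.strip().lower() for a in answers]
    # most_common(1) returns a list holding one (answer, count) pair.
    winner, _count = Counter(cleaned).most_common(1)[0]
    return winner


# Hypothetical final answers from five separate attempts at the same prompt.
sampled = ["42", "41", "42 ", "42", "40"]
print(majority_vote(sampled))
```

The point is that one attempt can go wrong, while the answer most attempts agree on is more likely to be right.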

The second video is from Andrej’s YouTube channel. It is called “Let's build GPT: from scratch, in code, spelled out.” He has other videos there, such as the “Makemore” series, that are also really good, but this one is THE BEST explanation of how a transformer/attention-based model like GPT works. All of today's models are what they are because of the seminal paper from Google called “Attention Is All You Need” and also because of other refinements like reinforcement learning.

But if you want to understand how these models actually work and how they are built, and you found other explanations too general or found diagrams like this baffling or unsatisfactory, then this is the explanation.



Especially if the way you understand something like this is to see actual working code being built up. Understanding his video does take knowledge of Python and some knowledge of PyTorch. But even if you haven’t done much in PyTorch, you can follow along. Plus, this is the perfect opportunity to build something in PyTorch if you haven’t before. His explanations are extremely clear. He goes step by step through building a simple GPT model, which you can do regardless of how powerful your machine is or what access to a GPU you have.

So the first talk I’m recommending gives more of a high-level understanding of GPT, while the second is more technical if you want a deeper understanding of the engineering. Both talks are excellent. I know it’s a bold statement, because there are a lot of people who are also really good at this, but I think that besides being a brilliant engineer, Andrej Karpathy is the best at educating people in AI right now.

Wednesday, May 10, 2023

Mojo: Is it the Last Programming Paradigm We Need?

I think it’s safe to say that anyone who has been writing code for any length of time has had to learn multiple languages. This is partly because of who we are: anyone doing some kind of engineering or science is always learning and on the lookout for a new language that handles certain situations better than their current “main” language. Maybe it goes faster, uses memory better or more safely, or works with a new OS or hardware. Maybe it's specialized for the front end or back end, works close to the metal or is abstracted from the hardware, or just follows new and better software design principles.

Regardless of the reason, constantly learning new and changing languages is a requirement for working in software technology. So the question is whether yet another language is really necessary, that language being Mojo from a company named Modular.

Before we get into why I think the answer could be yes, a little history. My very first language was Fortran, because at the time it was the best language, meaning fastest, for doing any kind of mathematical or scientific programming. At that time, there were very few packaged libraries. If you wanted special clustering options for a K-means cluster, you wrote that from scratch. If you wanted to convert a string to a number, you looped through each place in the array, subtracted 48 from it and built the number up. And there was always at least one person in an organization who knew Assembler for those heavily travelled pieces of functionality. Good times.
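For anyone who never had to do it by hand, the string-to-number trick looks roughly like this in Python. It is a sketch of the idea rather than the original Fortran, and the function name digits_to_int is made up for this example. The 48 is the ASCII code for the character "0".

```python
def digits_to_int(text):
    """Convert a string of ASCII digits to an int, one character at a time."""
    if not text:
        raise ValueError("empty string")
    number = 0
    for ch in text:
        code = ord(ch)  # ASCII code of the character, e.g. "7" is 55
        if not 48 <= code <= 57:
            raise ValueError(f"not a digit: {ch!r}")
        # Shift what we have one decimal place left, then add the new digit.
        number = number * 10 + (code - 48)
    return number


print(digits_to_int("1977"))
```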

But I quickly moved on from Fortran and became involved with C/C++ because of “modern language”, “Windows programming!”, and Object Oriented Programming. OOP was going to be the ultimate way of structuring large, complicated ideas into code, and the OOP paradigm ruled across many different languages for many years (well, until functional programming challenged some of its core ideas).

From C/C++, I went to C# and .Net for many years (which I loved), and of course Java, a myriad of JavaScript-based front-end languages, and many others. All great languages, but none of them compared to Python, which has been my language of choice for many years. With Python's easy-to-read syntax, productivity, and all of its supporting libraries, it’s really like having a superpower.

This is all to set up why I hope a new language called Mojo can take Python to an even higher level and maybe take me and others off the rollercoaster of changing programming paradigms, at least for a while.

So what is Mojo?

Mojo is being put out by a company called Modular, led by Chris Lattner. If that name sounds familiar, it’s because he’s the person who created Swift, LLVM, and Clang. LLVM has many parts, but at its core is a low-level, assembly-like intermediate representation (IR) that many, many languages use or can use, including C#, Scala, Ruby, Julia, Kotlin, Rust, and the list goes on and on.

LLVM is an important part of the Mojo story because, even though Mojo is not based directly on LLVM, it is based on a newer technology that grew out of LLVM called MLIR, another project led by Chris Lattner. One of the many things MLIR does is abstract away the targeting of a variety of hardware features and devices, such as threads, vectors, GPUs, and TPUs, which are really important for modern AI programming.

All of this is to say that Mojo has years of technical expertise in compiler design behind it.

But what does Mojo mean to the regular data engineer or scientist who doesn’t care about all of the details of how it compiles and just wants to get stuff done? 

The short answer would be that Mojo is (or hopes to be) a superset of Python that is much faster than Python and that also makes it much easier to target typical machine learning hardware.

A “superset” of Python? Yes, the goal is that all existing Python code and libraries will work without changing anything. A superset language may bring to mind C++ and TypeScript, which are supersets of C and JavaScript respectively. I’m not sure the comparison will turn out to be completely accurate, though. For one, TypeScript has enough idiosyncrasies that some would argue over whether it’s correct, or even important, to call it a superset. And as for C++, I think the transition for Python programmers writing Mojo code might be easier than it was for pure C programmers adding C++ to their C code. But this all remains to be seen.


Advantages of Mojo over Python:

Outside of the aforementioned advantage of Mojo being a superset of Python, the main advantage is speed, pure and simple. One of the advantages of Python is that it’s interpreted, which increases productivity and ease of use, but that comes at the price of being really slow compared to languages like C/C++ and Rust.

Enter Mojo. As you can see in this demo, under the right circumstances Mojo can be up to 35,000x faster than Python. In the video, you can see Jeremy Howard, who is an advisor on Mojo, step through different optimizations that speed up a Python use case. But even if you don’t do all of the optimizations, you can see, starting at 1:27, that taking Python code and changing nothing except running it with the Mojo compiler gave him over an 8x speedup.

There are many speedup opportunities, too many to list, but it’s important to know that Mojo can do explicit parallelism in a way that Python simply can’t, and it can elegantly take advantage of different hardware types like GPUs because of MLIR.

And because of MLIR, it’s not just targeting GPUs; it could potentially take advantage of any emerging hardware, which is why it could have real staying power.

Finally, another important advantage is that Mojo offers better memory management, similar to Rust's, and allows for strict type checking, which can eliminate many errors.


Reservations about Mojo:

Okay, this all sounds great, but what are the potential downsides? Well, there are a few, none of which I believe are big enough to dissuade anyone from trying out Mojo.

  • It’s not open source – yet

This, I think, is the biggest concern. Their intention is to open source it “soon,” once it reaches a certain level. Their rationale is that they want to iron out the fundamentals and that their dedicated team can move faster on it before open sourcing it, which I can understand. I’m not an open source absolutist, but the concern is that the best way for a new language to break through the noise and reach wide adoption is for it to be open source. And it might have been better to wait until they were ready to open source it before making the announcement.

  • It’s not generally available yet 

You can sign up and be put on a waitlist for access to try it out on their notebook server. For a new language, it’s important that anyone can download it and test it locally on their own machine, but even once you get access you can't run it locally yet. So again, it might have been better for them to hold the release announcement until this was possible.

  • Productionizing

I think with any new language or even new versions of Python, there are always questions of how this will work in an existing pipeline and infrastructure. Not a huge problem, but something to note.


So what should you do now? 

If you are intrigued by Mojo, then you should read through their documentation and definitely sign up to get access here: https://www.modular.com/mojo. That doesn’t mean the great speedups that have been happening in Python 3.1x should be ignored or discounted, or that any effort you have put into solving speed problems with Rust code talking to Python, or with Julia, should be discarded.

But Mojo is something to have on the radar. If they are able to accomplish what they are setting out to do, it could turn out to define not just AI programming but a new programming paradigm for a whole host of applications.

 

Thursday, May 4, 2023

Hyena: Toward Better Context Memory in AI models

One of the main purposes of this blog is to document the progress toward artificial general intelligence (AGI), a form of artificial intelligence (AI) that is capable of performing any intellectual task that a human being can do. Although definitions of AGI vary, an AGI would be able to reason, learn, perceive, understand natural language, and adapt to new situations, among other abilities. Unlike most current AI systems, which are designed to perform specific tasks, an AGI would be able to apply its intelligence to a wide range of tasks, much like humans can. AGI is sometimes referred to as "Strong AI" or "Human-Level AI" because it is expected to have human-like cognitive abilities.

In an upcoming post I will be talking about what I believe are the requirements for AGI, the path to get there, and what I perceive as the timeline. But suffice it to say for now that the timeline is shorter than what many people have said in the past.

GPT-based systems like GPT 3.5 and GPT 4 already have significant generalized capabilities, but they don’t rise to the level of AGI, at least not yet. One requirement that I believe is necessary for any such system to reach AGI is significant short-term and long-term memory. For an AI to have human-comparable capabilities, it needs a “sense” of the past, present, and future, and it needs to be able to plan based on its past.

There are two elements of memory: 
  • The short-term memory of the current conversation (token length).
  • The ability to refer to information from past conversations.
I want to focus in this post on some recent developments in short-term memory, but it should be mentioned that there has been a tremendous amount of progress in long-term memory with tools built around vector indexing, such as Pinecone, LlamaIndex, and LangChain, to mention just a few.

A powerful technique for achieving a level of long-term memory is vector indexing and vector storage. Vector indexing represents words and phrases as vectors in a high-dimensional space, aiming to capture the complex meaning of whole phrases in the same way as individual words. The vectors are learned by machine learning algorithms that analyze large amounts of text data, and they encode the semantics of the phrases in their relative positions in the vector space.

By storing the semantic meaning of past conversations in vector-based storage, those conversations can be easily retrieved and used later. One can expect the quality and usage of long-term memory implementations to keep improving over the coming months and years.
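To make the retrieval idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the three-dimensional vectors are made up by hand (real systems use learned embeddings with hundreds or thousands of dimensions), the function names are hypothetical, and it does not use the API of Pinecone, LlamaIndex, LangChain, or any other product. It only shows the core step of ranking stored snippets by cosine similarity to a query vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, memory, top_k=1):
    """Return the top_k stored texts whose vectors are closest to the query."""
    ranked = sorted(
        memory,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _vec in ranked[:top_k]]


# Toy "long-term memory": (text, hand-made vector) pairs, not real embeddings.
memory = [
    ("We discussed the Fortran migration plan.", [0.9, 0.1, 0.0]),
    ("User prefers Python for data work.", [0.1, 0.9, 0.1]),
    ("Meeting moved to Friday.", [0.0, 0.1, 0.9]),
]

# A made-up query vector that points mostly along the second axis.
query = [0.2, 0.8, 0.0]
print(retrieve(query, memory))
```

In a real setup the retrieved text would then be placed into the model's prompt, which is how past conversations get "remembered."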

As for short-term memory, large language models (LLMs) have a limited number of tokens they can use in their current context. GPT 3.5 has a context limit of 4,096 tokens. GPT 4 raises that limit to as much as 32,768 tokens in its largest variant (8,192 in the standard one). Because of this increase, GPT 4 is significantly better at holding the thread of a conversation and can be given much more data to work with in a conversation. However, it is still limited and not what one would expect to be human comparable. But what if the length of that context could be roughly doubled to 64,000 tokens, or increased to 100,000 or even 1,000,000?

The problem with increasing the context is that it becomes increasingly costly and slow as more tokens are added. With standard attention the cost grows quadratically with the number of tokens, so the time it takes GPT to generate a response increases at an accelerating rate with the amount of input provided. If the input becomes too large, either response generation slows down or more GPUs are required to keep up, which can significantly increase computing demands. This can happen when there is an excessive amount of data in the prompt or during a prolonged conversation.
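A little arithmetic shows what quadratic growth means at these context sizes. The sketch below counts only the token-to-token comparisons that attention makes (n times n for n tokens) and ignores every other part of a model's cost, so treat the ratios as rough illustrations rather than measured timings. The function names are made up for this example.

```python
def attention_pairs(context_tokens):
    """Number of token-to-token comparisons for a context of this length."""
    return context_tokens * context_tokens


def relative_cost(new_context, old_context):
    """How many times more comparisons the new context needs than the old."""
    return attention_pairs(new_context) / attention_pairs(old_context)


for n in (4_096, 32_768, 65_536, 1_000_000):
    print(
        f"{n:>9,} tokens: {attention_pairs(n):>17,} pairs, "
        f"{relative_cost(n, 4_096):,.0f}x the 4,096-token cost"
    )
```

Going from 4,096 to 32,768 tokens is 8 times the context but 64 times the comparisons, which is why simply making the window bigger gets expensive so quickly.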

The reason the current GPT architecture is so effective is a combination of throwing more and more parameters at these large language models (LLMs) and the introduction of attention-based transformer models in 2017 in the landmark AI paper 'Attention Is All You Need.' In that paper, Google scientist Ashish Vaswani and colleagues introduced the world to Google's Transformer architecture. The transformer became the basis for every one of the recent large language models.

However, as amazing as these attention-based models are, their drawback is the quadratic cost. There have been several attempts to reduce this cost, and a recent paper by Michael Poli et al. of Stanford proposes a sub-quadratic solution called Hyena.

In their paper, 'Hyena Hierarchy: Towards Larger Convolutional Language Models,' they describe a new architecture that borrows some old ideas from convolutional networks and signal processing. The new network does away with attention and replaces it with “parametrized long convolutions and data-controlled gating.” The result is a speed advantage that grows as the context gets longer: they report a 100x speedup at a sequence length of 64K.
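To give a feel for what “long convolutions and data-controlled gating” could mean, here is a deliberately simplified toy in plain Python. It is not the Hyena operator from the paper: the filter values are hand-picked rather than produced by a learned parametrization, there is a single stage instead of a hierarchy, and the convolution is computed directly, which is itself quadratic. The sub-quadratic cost in practice comes from evaluating long convolutions with the FFT, which this sketch leaves out. All names here are made up for illustration.

```python
def causal_long_conv(signal, filt):
    """Causal convolution with a filter as long as the sequence.

    Output at position t mixes all inputs at positions 0..t, so every
    token can be influenced by the whole history before it.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for s in range(t + 1):
            acc += filt[t - s] * signal[s]
        out.append(acc)
    return out


def gated_long_conv(values, gate, filt):
    """Convolve the values, then scale each position by a gate.

    In a real model the gate would be computed from the input itself,
    which is what makes the gating data-controlled.
    """
    mixed = causal_long_conv(values, filt)
    return [g * m for g, m in zip(gate, mixed)]


values = [1.0, 2.0, 3.0]   # toy per-position inputs
filt = [1.0, 0.5, 0.25]    # hand-picked filter, decaying with distance
gate = [1.0, 0.0, 1.0]     # a gate of 0 suppresses position 1 entirely
print(gated_long_conv(values, gate, filt))
```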

But how does such a radically different approach compare to the stunning results of models like GPT 3.5 and GPT 4? In their paper, they describe comparable results on a variety of tests such as SuperGLUE using a fraction of the parameters. The benchmark results are encouraging, but as much as the authors try to make an apples-to-apples comparison between their model and the biggest attention-based models, it is difficult to make such a comparison until the models are truly put into practice. And I believe it is possible that the best result will not be one approach or the other but a mix of attention-based models and an innovative approach like Hyena.

Regardless of the progression of this model, the Hyena model is an example of the type of innovation needed to take GPT models to another level.
