Thursday, May 25, 2023
Andrej Karpathy's Explanations of GPT
For a deep understanding of GPT models like those from OpenAI, I think no one explains them better than Andrej Karpathy. Karpathy was part of the original group at OpenAI, then left to become director of AI at Tesla, but as of February 2023 he is back at OpenAI. He also makes YouTube videos explaining the underlying technology.
Wednesday, May 10, 2023
Mojo: Is it the Last Programming Paradigm We Need?
I think it’s safe to say that anyone who has been writing code for any length of time has had to learn multiple languages. That is partly the nature of who we are: anyone doing some kind of engineering or science is always learning and on the lookout for that new language that handles certain situations better than our current “main” language. Maybe it runs faster, uses memory more efficiently or safely, or works with a new OS or new hardware. Maybe it’s specialized for front-end or back-end work, works close to the metal or abstracts away the hardware, or simply follows newer and better software design principles.
Regardless of the reason, constantly learning new languages is a requirement for working in software technology. So the question is whether yet another language is really necessary – that language being Mojo, from a company named Modular.
Before we get into why I think the answer could be yes, a little history. My very first language was Fortran, because at the time it was the best language – meaning the fastest – for any kind of mathematical or scientific programming. There were very few packaged libraries back then. If you wanted special clustering options for K-means, you wrote them from scratch. If you wanted to convert a string to a number, you looped through each place in the array, subtracted 48 (the ASCII code for '0') from each character, and built the number up. And there was always at least one person in an organization who knew Assembler for those heavily travelled pieces of functionality. Good times.
But I quickly moved on from Fortran to C/C++ because of “modern language”, “Windows programming!”, and Object-Oriented Programming. OOP was going to be the ultimate way of structuring large, complicated ideas into code, and the OOP paradigm ruled across many different languages for many years (well, until functional programming challenged some of its core ideas).
From C/C++, I went to C# and .NET for many years (which I loved), and of course Java, a myriad of JavaScript-based front-end languages, and many others. All great languages, but none of them compared to Python, which has been my language of choice for many years. With its easy-to-read syntax, its productivity, and all of its supporting libraries, Python really is like having a superpower.
This is all to set up why I hope a new language called Mojo can take Python to an even higher level and maybe take me and others off the rollercoaster of changing programming paradigms – at least for a while.
So what is Mojo?
Mojo is being put out by a company called Modular, led by Chris Lattner. If that name sounds familiar, it’s because he is the person who created Swift, LLVM, and Clang. LLVM has many parts, but at its core is a low-level, assembly-like representation called an intermediate representation (IR) that many, many languages use or can use – including C#, Scala, Ruby, Julia, Kotlin, Rust, and the list goes on and on.
LLVM is an important part of the Mojo story because, even though Mojo is not based on LLVM itself, it is based on a newer technology that grew out of LLVM called MLIR, another project led by Chris Lattner. One of the many things MLIR does is abstract away the targeting of a variety of hardware – threads, vectors, GPUs, TPUs – things that are really important for modern AI programming.
All of this is to say that Mojo has some major history of years of technical expertise in compiler design behind it.
But what does Mojo mean to the regular data engineer or scientist who doesn’t care about all of the details of how it compiles and just wants to get stuff done?
The short answer is that Mojo is (or hopes to be) a superset of Python that is much faster and that makes it much easier to target typical machine learning hardware.
A “superset” of Python? Yes, that means all existing Python code and libraries will work without changing anything. As a superset language, it may bring to mind C++ and TypeScript, supersets of C and JavaScript respectively. I’m not sure either comparison will turn out completely accurate. TypeScript has enough idiosyncrasies that some would argue whether it is correct, or even important, to call it a superset at all. And for C++, I think the transition for Python programmers adopting Mojo might be easier than it was for pure C programmers introducing C++ into their C code – but this all remains to be seen.
Advantages of Mojo over Python:
Outside of the aforementioned advantage of being a superset of Python, the main advantage is speed, pure and simple. One of the advantages of Python is that it’s interpreted, which increases productivity and ease of use, but that comes at the price of being really slow compared to languages like C/C++ and Rust.
Enter Mojo. As you can see in this demo, under the right circumstances Mojo can be up to 35,000x faster than Python. In the video, Jeremy Howard, who is an advisor on Mojo, steps through different optimizations that speed up a Python use case. But even without all of the optimizations: starting at 1:27, he takes Python code and changes nothing except running that same code through the Mojo compiler, and gets over an 8x speed-up.
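To give a feel for the kind of workload in such benchmarks, here is a minimal pure-Python sketch of a naive matrix multiply, the classic example these demos optimize (a hypothetical sketch, not the actual demo code). Every one of the inner-loop iterations pays interpreter overhead, which is exactly what compiling the same code removes:

```python
import random
import time

# Naive pure-Python matrix multiply. Every one of the n^3 inner
# iterations goes through the Python interpreter, which is why a
# compiler (or Mojo running the same code) can be dramatically faster.
def matmul(a, b):
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

if __name__ == "__main__":
    n = 128
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    print(f"pure Python: {time.perf_counter() - start:.3f}s")
```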
There are many speed-up opportunities, too many to list, but it’s important to know that Mojo can do explicit parallelism in a way that Python simply can’t, and, because of MLIR, it can elegantly take advantage of different hardware types like GPUs. And because of MLIR, it’s not just targeting GPUs; it could potentially take advantage of any emerging hardware – which is why it could have real staying power.
Finally, another important advantage is that Mojo offers better memory management, similar to Rust’s, and allows for strict type checking, which can eliminate many errors.
Reservations about Mojo:
Okay, this all sounds great, but what are the potential downsides? Well, there are a few, none of which I believe are big enough to dissuade anyone from trying out Mojo.
- It’s not open source – yet
This, I think, is the biggest concern. Modular’s intention is to open source Mojo “soon”, once it reaches a certain level of maturity. Their rationale is that they want to iron out the fundamentals first and can move faster with their dedicated team before open sourcing it, which I can understand. I’m not an open source absolutist, but the concern is that the best way for a new language to break through the noise and reach wide adoption is for it to be open source. Maybe it would have been better to wait until they were ready to open source it before making the announcement.
- It’s not generally available yet
You can sign up to be put on a waitlist for access to try it out on their notebook server. For a new language, it’s important that anyone can download it and test it locally on their own machine, but even once you get access, you can’t run it locally yet. So again, it might have been better to hold the release announcement until this was possible.
- Productionizing
I think with any new language, or even a new version of Python, there are always questions about how it will fit into an existing pipeline and infrastructure. Not a huge problem, but something to note.
So what should you do now?
If you are intrigued by Mojo, you should read through the documentation and definitely sign up for access here: https://www.modular.com/mojo. That doesn’t mean the great speed-ups that have been happening in Python 3.1x should be ignored or discounted, or that efforts to solve speed problems with Rust code talking to Python, or with Julia, should be discarded.
But Mojo is something to have on the radar. If Modular is able to accomplish what it is setting out to do, Mojo could turn out to define not just AI programming but a new programming paradigm for a whole host of applications.
Thursday, May 4, 2023
Hyena: Toward Better Context Memory in AI models
I will be talking about what I believe are the requirements for AGI, the path to get there, and what I perceive as the timeline in an upcoming post. But suffice it to say for now that the timeline is shorter than what many people have predicted in the past.
There are two elements of memory:
- The short-term memory of the current conversation (token length).
- The ability to refer to information from past conversations.
A powerful technique for achieving a level of long-term memory is vector indexing and vector storage. Vector indexing represents words and phrases as vectors in a high-dimensional space with the aim of capturing their meaning. The vectors are learned using machine learning algorithms that analyze large amounts of text data, and they encode the semantics of words and phrases by their relative positions in the vector space.
By storing the semantic meaning of past conversations in vector-based storage, those past conversations can be easily retrieved and used. One can expect the quality and usage of long-term memory implementations to keep improving over the coming months and years. A minimal sketch of the idea is below.
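Here is a minimal Python sketch of such a vector store. The embed() function is a hypothetical stand-in for a real embedding model (for example, sentence-transformers or the OpenAI embeddings API); retrieval is just cosine similarity between the query vector and the stored vectors:

```python
import numpy as np

# Placeholder embedding: a hash-seeded random unit vector so the example
# runs self-contained. A real system would call an embedding model here.
def embed(text: str, dim: int = 64) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class VectorStore:
    def __init__(self):
        self.texts, self.vectors = [], []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def search(self, query: str, k: int = 3) -> list[str]:
        # Cosine similarity reduces to a dot product on unit vectors.
        sims = np.array(self.vectors) @ embed(query)
        return [self.texts[i] for i in np.argsort(sims)[::-1][:k]]

store = VectorStore()
store.add("The user prefers Python for data work.")
store.add("Last week we discussed Mojo's speed-ups.")
store.add("The user's dog is named Rex.")
# With a real embedding model this would return the semantically closest
# memory; the placeholder embed() returns an arbitrary match.
print(store.search("What language does the user like?", k=1))
```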
For short-term memory, large language models (LLMs) have a limited number of tokens they can use in their current context. GPT-3.5 has a context limit of 4,096 tokens; GPT-4 has an increased context limit of 32,768 tokens. Because of this increase in context tokens, GPT-4 is significantly better at holding the thread of a conversation and can draw on much more data within it. However, it is still limited and not what one would expect to be comparable to a human. But what if the length of that context could be doubled to 64,000 tokens, or increased to 100,000 or even 1,000,000 tokens?
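To make the limit concrete, here is a short sketch using OpenAI's tiktoken library (assuming it is installed) to count how many context tokens a piece of text consumes:

```python
import tiktoken

# cl100k_base is the tokenizer used by GPT-3.5 and GPT-4.
enc = tiktoken.get_encoding("cl100k_base")

text = "Mojo is a superset of Python aimed at AI workloads."
tokens = enc.encode(text)

# Every token of the prompt and the reply counts against the context
# limit, e.g. 4,096 for GPT-3.5 or 32,768 for GPT-4.
print(f"{len(tokens)} tokens: {tokens[:8]}...")
```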
Why is the current GPT architecture so effective? Partly because of throwing more and more parameters at these large language models (LLMs), and partly because of the discovery of attention-based transformer models in 2017 in the landmark AI paper, 'Attention Is All You Need.' In that paper, Google scientist Ashish Vaswani and colleagues introduced the transformer architecture, which became the basis for every one of the recent large language models.
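As a reminder of what attention computes, and why its cost grows quadratically with context length, here is a minimal numpy sketch of scaled dot-product attention (a textbook illustration, not any particular model's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scores compare every position with every other position: an
    # (n, n) matrix -- the O(n^2) cost that limits context length.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

n, d = 8, 16          # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)   # (8, 16)
```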
In the paper 'Hyena Hierarchy: Towards Larger Convolutional Language Models,' the researchers describe a new architecture that borrows from older ideas in convolutional networks and signal processing. The new network does away with attention and replaces it with “parametrized long convolutions and data-controlled gating.” The result is an advantage that grows as the context gets longer: they report a 100x speed-up at a sequence length of 64K tokens.
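To show the core idea only, here is a toy numpy sketch of a long convolution evaluated with the FFT in O(n log n), far cheaper than attention's O(n^2), combined with element-wise, data-controlled gating. This illustrates the ingredients, not the actual Hyena implementation:

```python
import numpy as np

def long_conv_gated(x, h, gate):
    """Toy version of the core operation: a sequence-length convolution
    evaluated via the FFT (O(n log n)), followed by element-wise gating
    where the gate is derived from the input data itself."""
    n = len(x)
    # Zero-pad to 2n so the FFT gives a linear, not circular, convolution.
    X = np.fft.rfft(x, 2 * n)
    H = np.fft.rfft(h, 2 * n)
    y = np.fft.irfft(X * H, 2 * n)[:n]
    return gate * y

n = 1 << 14                       # a 16K-long sequence
rng = np.random.default_rng(0)
x = rng.standard_normal(n)        # input signal
h = np.exp(-np.arange(n) / 512)   # a long, implicitly parametrized filter
gate = 1 / (1 + np.exp(-x))       # sigmoid gate computed from the input
print(long_conv_gated(x, h, gate).shape)   # (16384,)
```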