Thursday, May 4, 2023

Hyena: Toward Better Context Memory in AI Models

One of the main purposes of this blog is to document the progress toward artificial general intelligence (AGI), a form of artificial intelligence (AI) capable of performing any intellectual task that a human being can. Although definitions of AGI vary, an AGI would be able to reason, learn, perceive, understand natural language, and adapt to new situations, among other abilities. Unlike most current AI systems, which are designed to perform specific tasks, AGI would be able to apply its intelligence to a wide range of tasks, much as humans can. AGI is sometimes referred to as "Strong AI" or "Human-Level AI" because it is expected to have human-like cognitive abilities.

In an upcoming post I will talk about what I believe are the requirements for AGI, the path to get there, and my view of the timeline. But suffice it to say for now that the timeline is shorter than many people have predicted in the past.

GPT-based systems such as GPT 3.5 and GPT 4 already have significant generalized capabilities, but they do not rise to the level of AGI - yet. One requirement I believe is necessary for any such system to reach AGI is significant short-term and long-term memory. For an AI to have human-comparable capabilities, it needs a "sense" of past, present, and future, and it needs to be able to plan based on that past.

There are two elements of memory: 
  • The short-term memory of the current conversation (token length).
  • The ability to refer to information from past conversations.
In this post I want to focus on some recent developments in short-term memory, but it should be mentioned that there has been tremendous progress in long-term memory using vector indexing, with tools such as Pinecone, LlamaIndex, and LangChain, to mention just a few.

A powerful technique for achieving a level of long-term memory is vector indexing and vector storage. Vector indexing represents words and phrases as vectors in a high-dimensional space, aiming to capture their meaning through their relative positions in that space. The vectors are learned by machine learning algorithms that analyze large amounts of text data, so that semantically similar phrases end up close together in the vector space.

By storing the semantic meaning of past conversations in vector-based storage, those conversations can be easily retrieved and reused. One can expect the quality and use of long-term memory implementations to keep improving over the coming months and years.
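To make the idea concrete, here is a minimal sketch of vector-based recall. The `embed()` function below is a hypothetical placeholder (a crude word-hashing scheme just so the example runs); a real system would call an embedding model and likely a vector database such as the ones mentioned above.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: a real system would call an embedding model here.
    # Words are hashed into a fixed-size vector just to make the example runnable.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

# "Long-term memory": store past conversation snippets alongside their embeddings.
memory = [
    "User prefers Python examples over pseudocode.",
    "We discussed the quadratic cost of attention last week.",
    "The user is building a chatbot with a 4,096 token context limit.",
]
memory_vectors = np.stack([embed(m) for m in memory])

def recall(query: str, k: int = 2) -> list[str]:
    """Return the k stored snippets most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = memory_vectors @ q          # vectors are unit-length, so dot product = cosine
    best = np.argsort(scores)[::-1][:k]
    return [memory[i] for i in best]

print(recall("Why is attention expensive for long contexts?"))
```

The retrieved snippets can then be placed back into the model's prompt, which is how the long-term store feeds the short-term context discussed next.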

For short-term memory, large language models (LLMs) have a limited number of tokens they can use in their current context. GPT 3.5 has a context limit of 4,096 tokens, while GPT 4 increases that to 32,768 tokens. Because of this larger context, GPT 4 is significantly better at holding the thread of a conversation and at drawing on much more supporting data. However, it is still limited and not what one would consider human-comparable. But what if that context length could be doubled to roughly 64,000 tokens, or increased to 100,000 or even 1,000,000 tokens?
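In practice, applications have to work around this limit by trimming the conversation to fit. Here is a rough sketch of that idea; the "1 token is about 4 characters" ratio is only a common rule of thumb, and the budget numbers are illustrative, not tied to any particular API.

```python
# Minimal sketch: trim a conversation so it fits a fixed token budget.
# Token counts are estimated with the rough "1 token ~ 4 characters" heuristic;
# a real application would use the model's actual tokenizer.

CONTEXT_LIMIT = 4096      # e.g. GPT 3.5
RESERVED_FOR_REPLY = 512  # leave room for the model's answer

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fit_to_context(turns: list[str]) -> list[str]:
    """Keep the most recent turns whose estimated token count fits the budget."""
    budget = CONTEXT_LIMIT - RESERVED_FOR_REPLY
    kept = []
    for turn in reversed(turns):          # walk backward from the newest turn
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return list(reversed(kept))           # restore chronological order

conversation = ["turn %d: %s" % (i, "some text " * 50) for i in range(100)]
print(len(fit_to_context(conversation)), "of", len(conversation), "turns kept")
```

Everything that falls outside the window is simply forgotten unless it has been saved to long-term storage, which is exactly why a larger context window matters so much.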

The problem with increasing the context is that it becomes increasingly costly and slow as more tokens are added. The cost of attention grows quadratically with the number of tokens, which means the time it takes GPT to generate a response increases at an accelerating rate with the amount of input provided. If the input becomes too large, either response generation slows down or more GPU chips are required to keep up, which significantly increases computing demands. This can happen when there is an excessive amount of data in the prompt or when a conversation with the program runs on for an extended period.
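A quick back-of-the-envelope calculation shows why quadratic scaling bites so hard. The numbers below count only the entries in the attention matrix and ignore everything else, so treat them as an illustration rather than a cost model.

```python
# Each attention layer compares every token with every other token,
# so the attention matrix has n * n entries for a context of n tokens.
for n in (4_096, 32_768, 65_536, 1_000_000):
    entries = n * n
    ratio = entries / (4_096 * 4_096)
    print(f"{n:>9,} tokens -> {entries:>18,} pairwise scores "
          f"({ratio:,.0f}x the 4,096-token case)")
```

Going from 4,096 to roughly 64,000 tokens multiplies the pairwise work by 256, and a million-token context would multiply it by tens of thousands, which is why simply widening the window does not scale.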

Current GPT models are as effective as they are for two reasons: throwing more and more parameters at these large language models (LLMs), and the discovery of attention-based transformer models in 2017, introduced in the landmark AI paper 'Attention Is All You Need.' In that paper, Google scientist Ashish Vaswani and colleagues introduced the world to Google's Transformer architecture. The transformer became the basis for every one of the recent large language models.
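For readers who want to see what "attention" means at the matrix level, here is a minimal NumPy sketch of scaled dot-product attention, the core operation from that paper, with a single head, no learned projections, and random data, purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: every query attends to every key (the n x n cost)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) matrix of pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over each row
    return weights @ V                                # weighted sum of the values

n, d = 8, 16                                          # tiny sequence for demonstration
x = np.random.randn(n, d)
out = scaled_dot_product_attention(x, x, x)           # self-attention: Q = K = V = x
print(out.shape)                                      # (8, 16)
```

The (n, n) score matrix in the middle is exactly the object whose size grows quadratically with context length.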

However, as amazing as these attention-based models are, their drawback is the quadratic cost. Although there have been several attempts to reduce this cost, a recent paper by Michael Poli et al. of Stanford proposes a sub-quadratic solution called Hyena.

In their paper, 'Hyena Hierarchy: Towards Larger Convolutional Language Models,' they describe a new architecture that borrows from older ideas in convolutional networks and signal processing. The new network does away with attention and replaces it with "parametrized long convolutions and data-controlled gating." The resulting speedup over attention grows with context length; they report a 100x speedup at a sequence length of 64K tokens.
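The following is not the authors' implementation, only a simplified sketch of the two ingredients the paper names: a long convolution computed via FFT (roughly O(n log n) rather than the O(n^2) of pairwise attention) and element-wise, data-controlled gating. The filter and the gate here are crude stand-ins for the learned, implicitly parametrized ones in Hyena, and a real Hyena block stacks several such operators.

```python
import numpy as np

def fft_long_convolution(x, h):
    """Circular convolution of a length-n signal with a length-n filter via FFT.
    Cost is O(n log n) rather than the O(n^2) of pairwise attention."""
    n = len(x)
    return np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h), n=n)

n = 1024
x = np.random.randn(n)          # one channel of the input sequence
h = np.random.randn(n) / n      # stand-in for Hyena's learned long filter
gate = 1 / (1 + np.exp(-x))     # data-controlled gate: a sigmoid of the input itself

y = gate * fft_long_convolution(x, h)   # one gated long-convolution step, Hyena style
print(y.shape)                          # (1024,)
```

The key point is that the filter spans the entire sequence, so every position can still influence every other position, but the FFT keeps the cost well below quadratic.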

But how does such a radically different approach compare to the stunning results of models like GPT 3.5 and GPT 4? In their paper, the authors report comparable results on a variety of benchmarks such as SuperGLUE while using a fraction of the parameters. But as much as they try to make an apples-to-apples comparison with the biggest attention-based models, such a comparison is difficult until the architecture is truly put into practice, even though the benchmark results are encouraging. And I believe the best result may not be one approach or the other, but a mix of attention-based models and an innovative approach like Hyena.

Regardless of how this particular model progresses, Hyena is an example of the type of innovation needed to take GPT models to another level.

1 comment:

  1. Another paper released recently that addresses context memory in an entirely different way is "Scaling Transformer to 1M tokens and beyond with RMT" (https://arxiv.org/abs/2304.11062) by Bulatov, Kuratov, and Burtsev. It combines ideas from recurrent neural networks with the current transformer architecture to carry information across segments of context.


"Superhuman" Forecasting?

This just came out from the Center for AI Safety  called Superhuman Automated Forecasting . This is very exciting to me, because I've be...