Wednesday, June 12, 2024

GPT-4o and Google I/O 2024

It's been a few weeks since both OpenAI's announcements and Google I/O, which has given us some time to step back and assess the big news from both events.

Google introduced several new AI products and features, including an AI-powered search engine, AI helpers in Workspace apps (Gmail, Drive, Docs, etc.), and Project Astra, their vision for a future universal AI assistant. Gemini, Google's AI model, was featured prominently with updates like Gemini Live, Gemini Nano, and Gemini 1.5 Pro, showcasing capabilities in voice interaction, image analysis, and on-device AI.

They also announced an expanded context window of 2 million tokens that will be rolled out at some point. Right now the context window is 1 million tokens, which is much larger than either OpenAI's or Anthropic's largest context windows of 128,000 and 200,000 tokens respectively. With 2 million tokens, that would be about 64,000 lines of code, 1.4 million words, or a couple of hours of video. In a blog post in February, they talked about internally testing a 10-million-token context window.
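As a quick back-of-the-envelope check on those figures (the words-per-token and tokens-per-line ratios below are my own rough assumptions, not numbers from Google):

```python
# Rough sanity check of the 2M-token figures quoted above.
# The conversion ratios are approximations I'm assuming, not official numbers.
context_tokens = 2_000_000

words_per_token = 0.7        # typical English text averages roughly 0.7 words per token
tokens_per_code_line = 31    # assumed average tokens per line of source code

print(f"~{context_tokens * words_per_token / 1e6:.1f}M words")          # ~1.4M words
print(f"~{context_tokens / tokens_per_code_line:,.0f} lines of code")   # ~64,500 lines
```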

If these large context windows can keep track of all of that information without having to do retrieval-augmented generation (RAG), then they obviously have major implications for everything from code development to video summarization. For example, a developer tasked with a large feature could hand the model an entire code base, design documents, and even mock-ups, and have it create the code for that feature, with the developer acting more as a code reviewer. And to work with video, large context windows are a necessity.
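As a rough sketch of what that workflow could look like with a long-context model, here is the general shape using the google-generativeai Python package; the repository path, prompt, and overall setup are illustrative assumptions, not a tested implementation:

```python
# Minimal sketch: stuff an entire (small-ish) code base plus design docs into one
# long-context request instead of building a RAG pipeline.
# Assumes the google-generativeai package and a GOOGLE_API_KEY environment variable;
# the repo path, file selection, and prompt are placeholders.
import os
import pathlib
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

repo = pathlib.Path("my_project")  # hypothetical repository
corpus = "\n\n".join(
    f"=== {path} ===\n{path.read_text(errors='ignore')}"
    for path in repo.rglob("*.py")
)

response = model.generate_content(
    [corpus,
     "Using the code above, implement the feature described in DESIGN.md "
     "and return the new and changed files."]
)
print(response.text)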

However, this expanded context window is only coming to the Gemini 1.5 Pro model for now, first in private preview and then generally available later in the year. And that's the main problem with most of Google's announcements: they won't be available until some time later in the year. With the field moving this fast, that's far too distant in the future to be touting most of your new AI features.

OpenAI's announcement mainly centered around the multi-modal capabilities and the speed of GPT-4o. After using it regularly since the announcement, I find it does seem to be slightly better than GPT-4 Turbo. Having it work across vision, audio, and text through the same API endpoint is very welcome, instead of having a separate API model for vision. I have also been using the released Mac app, which is a nice convenience.
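For example, a single chat completions request can now mix text and an image; this is just a minimal sketch, with a placeholder image URL and prompt:

```python
# Sketch: one chat.completions call handling both text and an image with gpt-4o,
# rather than routing vision to a separate model. The URL and prompt are
# placeholders; assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/diagram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```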

By this point, everyone has probably seen the voice and vision capabilities of GPT-4o, which were crazy. Because the model is trained on sound, images, and text together, the multi-modal capabilities bring more than just improved speed: it can vary its voice in context, understand emotion, and adjust its speaking pace. As usual with OpenAI, they don't say exactly how they accomplished this, but it clearly looks like a different architecture from GPT-3.5/4. To get a rough idea of how this is probably being done, we can look at other projects like Chameleon from Meta, which uses "early fusion" to train on different modalities.
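To make the early-fusion idea concrete, here's a toy sketch: quantize images into discrete tokens, interleave them with text tokens, and run one transformer over the combined sequence. Everything below is a simplified stand-in of my own, not the actual Chameleon or GPT-4o architecture:

```python
# Toy sketch of "early fusion": images are quantized into discrete tokens,
# interleaved with text tokens, and a single transformer processes the mixed
# sequence. All components are simplified stand-ins for illustration only.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000
IMAGE_VOCAB = 8_192          # e.g. codes from a VQ-style image tokenizer
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

class EarlyFusionLM(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)   # one table for both modalities
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):                      # tokens: (batch, seq)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)
        h = self.blocks(h)                          # same stack sees text and image tokens
        return self.head(h)

# Interleave text and image tokens into one sequence; image codes are offset
# so the two vocabularies don't collide.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 32))
image_tokens = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB
sequence = torch.cat([text_tokens, image_tokens], dim=1)

logits = EarlyFusionLM()(sequence)                  # shape: (1, 96, VOCAB)
print(logits.shape)
```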

However, this new voice capability wasn't released at the announcement; it's supposed to arrive this month. It's halfway through June and it's not here yet, but I'm very much looking forward to it when it is released. I currently have code based on the existing models that does voice->text->voice, which introduces some delay, and getting rid of that delay would be a big deal. Of course, it also opens up a lot of possibilities beyond just being faster.
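For reference, that pipeline looks roughly like the following sketch (file names and the voice choice are placeholders, and this is a simplification of my actual code):

```python
# Simplified sketch of the voice -> text -> voice round trip: transcribe with
# Whisper, reason with a chat model, then synthesize speech. Each hop adds
# latency, which is what natively speech-capable GPT-4o should remove.
# File names and the voice are placeholders; assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

# 1. Speech to text
with open("question.wav", "rb") as audio_in:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_in)

# 2. Text to text
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text back to speech
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
speech.write_to_file("answer.mp3")
```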

With both Google's and OpenAI's big announcements, these are exciting times to be working in AI - and that's not to mention the great things happening at Anthropic and in open source. What will happen in the next six months? Who knows? But I expect Google to release what they previewed at I/O and OpenAI to release the voice features. Will OpenAI release a new model before the end of the year? The rumored 5.0, or maybe they call it 4.5? Or will they just release incremental improvements to 4o? No one knows for sure, but I believe OpenAI will release a new model by October or November that is a significant jump in capabilities.

Oh, and that's not to ignore the partnership between OpenAI and Apple - "Apple Intelligence" and everything that partnership means, which immediately caused Elon Musk to throw a tantrum. But that would be a topic for another post.
