Wednesday, December 27, 2023

Reflections on Sam Altman's "What I Wish Someone Had Told Me"

On Dec 21, Sam Altman wrote a blog post titled "What I Wish Someone Had Told Me." He lists 17 points that serve as a guide not only to starting a venture but also to running one successfully. What is interesting about the post is that it is not the usual internet clickbait list of "Top 10 Ideas to Start a Business." You can tell these are ideas he has been thinking about a lot lately.

Here is his list:

  1. Optimism, obsession, self-belief, raw horsepower and personal connections are how things get started.
  2. Cohesive teams, the right combination of calmness and urgency, and unreasonable commitment are how things get finished. Long-term orientation is in short supply; try not to worry about what people think in the short term, which will get easier over time.
  3. It is easier for a team to do a hard thing that really matters than to do an easy thing that doesn’t really matter; audacious ideas motivate people.
  4. Incentives are superpowers; set them carefully.
  5. Concentrate your resources on a small number of high-conviction bets; this is easy to say but evidently hard to do. You can delete more stuff than you think.
  6. Communicate clearly and concisely.
  7. Fight bullshit and bureaucracy every time you see it and get other people to fight it too. Do not let the org chart get in the way of people working productively together.
  8. Outcomes are what count; don’t let good process excuse bad results.
  9. Spend more time recruiting. Take risks on high-potential people with a fast rate of improvement. Look for evidence of getting stuff done in addition to intelligence.
  10. Superstars are even more valuable than they seem, but you have to evaluate people on their net impact on the performance of the organization.
  11. Fast iteration can make up for a lot; it’s usually ok to be wrong if you iterate quickly. Plans should be measured in decades, execution should be measured in weeks.
  12. Don’t fight the business equivalent of the laws of physics.
  13. Inspiration is perishable and life goes by fast. Inaction is a particularly insidious type of risk.
  14. Scale often has surprising emergent properties.
  15. Compounding exponentials are magic. In particular, you really want to build a business that gets a compounding advantage with scale.
  16. Get back up and keep going.
  17. Working with great people is one of the best parts of life.


Let's dive deeper into some key takeaways and explore their implications.

The Genesis of Great Ventures: Optimism and Obsession

Altman begins with the foundational elements of any successful venture: optimism, obsession, self-belief, raw horsepower, and personal connections. This combination is potent. Optimism fuels the journey, obsession drives the focus, and self-belief acts as a shield against naysayers. You have to be a "true believer" and surround yourself with others who are committed to that vision, because if you don't believe in the vision, no one else will.

From Start to Finish: Cohesion, Commitment, and Long-Term Orientation

While starting is about energy and vision, finishing is about execution. Altman emphasizes the importance of cohesive teams, the right mix of calmness and urgency, and an "unreasonable" commitment to the cause. Importantly, he advises a long-term orientation, advocating for a focus on the big picture and not getting bogged down by short-term opinions.

The Power of Audacious Goals and Incentives

I love this. Do hard stuff. Stuff that everyone else thinks is impossible. That's motivating. Altman highlights that teams are more motivated by challenging yet meaningful tasks than by easy, insignificant ones. He also touches on the superpower of incentives, advising careful consideration in their setup. This aligns with the idea that what gets measured and rewarded gets done.

Resource Allocation and Communication

Stressing the importance of concentrating resources on high-conviction bets, Altman suggests that it's possible to achieve more by doing less but doing it well. He also underscores the importance of clear and concise communication, a vital skill in any leader's toolkit.

Fighting Bureaucracy and Focusing on Outcomes

Yes! In a call to resist bureaucracy and inefficiency, Altman encourages a culture of fighting against these elements. He reminds us that in the end, outcomes are what count, and good processes shouldn't be used to excuse poor results. Good ideas should not follow a chain of command. Everyone has the potential to generate great ideas.

The Importance of Recruitment and Valuing Superstars

Altman places a high emphasis on recruitment, urging leaders to take risks on high-potential individuals and to value evidence of accomplishment as much as intelligence. He also notes the significant impact superstars can have on an organization, provided their net impact is positive.

The Magic of Fast Iteration and the Laws of Business Physics

I would like to hear more about this - what particular experiences is he alluding to? But I read him as continuing his earlier points: the ability to iterate quickly is a crucial business skill, and execution should be swift even when plans are laid out over decades. I don't know if he's exaggerating with "decades" - but let's say plans with a long-term focus. He also cautions against fighting the fundamental principles of business, likening them to the laws of physics. I think the principles he's referring to are situations where businesses keep doing things out of inertia instead of being adaptive.

The Power of Compounding and Perseverance

Again, I would like to hear more examples from his personal experience, but Altman is touching on the magical properties of compounding exponentials, especially in business growth, and he emphasizes the importance of building a business that gains a compounding advantage with scale. The final point isn't particularly unique, but he ends with a reminder of the importance of resilience, urging readers to get back up and keep going.

Working with Great People

Finally, Altman concludes by stating that one of the best parts of life is working with great people. This not only enhances the quality of work but also enriches the journey.

Conclusion

Sam Altman's post moves beyond the initial steps of starting a business into the nuances of nurturing and leading a venture. His advice, rooted in his extensive experience at Y Combinator and OpenAI, resonates, providing a balanced mix of strategy and personal growth. I think it's particularly insightful for those on the entrepreneurial path, offering guidance that is both reflective and grounded in practicality.


I tend to follow what Sam Altman says fairly closely, along with others at OpenAI, in order to read the tea leaves about the direction they believe AI is heading. You also have to respect someone who headed up one of the most wildly successful software launches in history with a staff that is a fraction of the size of other tech giants', and who helped create such an innovative environment of true believers that so many people were willing to quit during the failed attempt to force him out.


So that is my interpretation of the post, and hopefully he will give more in-depth explanations in one of his many speaking engagements or interviews. But suffice it to say, Sam Altman's blog post offers a thoughtful and pragmatic perspective on entrepreneurship.

Sunday, December 17, 2023

AI Book Club

I have a new post in my podcast/YouTube channel here

I started this podcast almost three years ago, significantly before the release of ChatGPT. At the time, there were open source models and Web interfaces that could be used, like GPT-J and GPT-Neo, with work being done by Hugging Face, EleutherAI, and others, while the latest version from OpenAI was GPT-3.

The aim of the podcast was twofold:

1) Measure the progress in large language models 

2) Have some fun with LLMs by using them to have a "discussion" such as people might have in discussing books.

I used different LLMs and did some fine-tuning to try to give the AIs unique personalities. I created two different AIs, Marie and Charles, who interacted through the program I built - none of it scripted beyond my having an idea of the topics and questions I wanted to ask. I then ran the text of our conversations through Google's text-to-speech service and video-synching Python code that created deepfake renditions for the two video avatars.

All of these versions performed fairly well (at least I thought so at the time), but they did have their shortcomings - many of which I outlined in this video in March of 2022. One of the biggest issues was that they made up their own facts about the book we were discussing. At that time, I don't remember anyone using the term "hallucinating" the way everyone commonly does now with LLMs, but that's what they were doing.

However, they did create some very original and oftentimes surprising discussion.

In this newest podcast, where we discuss Alice in Wonderland, I added a third AI that I call Beth. LLMs are much better now than when I first started out. Even though everyone likes to talk about LLM hallucinations, the models are much more factually grounded than they were two years ago.

Before, I was very hesitant to ask them about details of the plot, because they might make up parts of the story that didn't happen. So I would steer the conversation toward ideas from the plot: letting them be creative about their "interpretation" of what I described worked, but it didn't work when they invented, for example, characters that never existed in the story.

Now, however, the newest LLMs know the story: they can repeat plot points and comment on them, and what they create is not the plot but their "ideas" about the plot and the characters. So it's a much better conversation.

In addition, I'm now taking all of the conversations, loading them into vector databases, and working on using MemGPT to give the AIs long-term memory so they have continuity across episodes. I hope this will not only give them consistency, but also - since they make up their own backstories when I ask what they have been up to - keep them from contradicting in a current video something they said they were doing in a previous one.
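
As a rough sketch of the vector-database half of that (this is illustrative, not my production code - the collection name, IDs, and snippets are invented, and I'm using the Chroma library simply as one example of a vector store):

import chromadb

client = chromadb.Client()
memory = client.create_collection('podcast_memory')  # hypothetical collection name

# Store utterances from past episodes; Chroma embeds the text automatically
memory.add(
    ids=['ep12-beth-001'],
    documents=["Beth: I've been learning to paint watercolors since our last episode."],
    metadatas=[{'speaker': 'Beth', 'episode': 12}],
)

# Before recording a new episode, retrieve what Beth previously claimed
recall = memory.query(
    query_texts=['What has Beth been up to lately?'],
    n_results=3,
)
print(recall['documents'])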

You can view the latest video or any of the videos here.

Saturday, November 18, 2023

Sam Altman Firing

About an hour after I wrote this post, it was announced that Sam Altman had been fired. Since then, Greg Brockman and others have quit. Without speculating too much about what this will mean over the coming days, weeks, and months, suffice it to say that it increases the unpredictability around OpenAI in particular and will have ripple effects across the industry. Those affected most will be people who built solutions on top of OpenAI or organizations with dependencies on OpenAI.

There are many brilliant people at OpenAI; as long as the brain drain doesn't turn into a flood, OpenAI should be okay. Since it seems the deceleration camp has won out at OpenAI, we'll just have to wait and see how that affects their product roadmap.

Sam Altman, Greg Brockman, and others will form new companies, maybe together, maybe separately, while raising large amounts of money from investors. Whatever the new venture or ventures may be, I don't think this will dissuade him at all from pursuing AGI. It's his life's work, and he will surround himself with those who also want to build AGI.

This will not cause a long delay in the arrival of AGI. The splintering may actually accelerate it. And AGI was never dependent on what happened at OpenAI - there are many people working on it.

Whether AGI is delayed or accelerated, it will happen.

Friday, November 17, 2023

ChatGPT Agents

On November 6th, OpenAI announced at its DevDay the release of GPT Apps, which allow users to create custom GPTs that can be shared, all with "low code" or "no code." This feature is only available to ChatGPT Plus users.

The custom GPT demos during DevDay were very impressive, but let's dive in and look at what creating a "GPT" actually is and what the future implications could be.

What is an OpenAI GPT?

First, let's describe what it's not. It's not a separate GPT model you are creating, and you are not "fine-tuning a model."

On the OpenAI screen, they simply describe it as customizing a version of ChatGPT for a specific purpose. When you enter the screen to create your GPT, it asks what kind of GPT you want to create. What follows is a back and forth of questions that customize the purpose of the GPT. It creates an app name and icon. As you answer the questions, it builds the app, and you can try it out on the right-hand side. This literally takes minutes and is probably the ultimate "no code" experience. You can always iterate and conversationally make changes, or click the "Configure" tab to make specific edits. It also suggests "conversation starters" to help the user interact with your app.

So it is a fun way to create a very targeted prompt for your purpose that might otherwise have been difficult to write. That may or may not sound impressive, but the experience of doing it is. Plus, there are features that put it above a mere prompt organizer: for example, you can enable Web search, DALL-E image creation, and the code interpreter. You can also upload files for the app to use. Using any of these greatly increases the impressiveness of the app. At that point you can share your app with specific people or with the public.

But maybe the most important feature, and the one that can separate an app from other apps, is the ability to use "custom actions." Custom actions allow the app to call APIs. Using custom actions turns it from a "no code" into a "low code" experience.

But if it's this easy and a person can build something pretty cool in minutes, what's to stop someone else from doing the same or copying it? I think there are two things that can make an app unique.

1) Use file uploads to augment the app's retrieval. If you work at a company that has ChatGPT Enterprise, which emphasizes security, you might create an app centered on your department or initiative, upload some files for it to use, and then share it with others in your company. What's also nice about the Enterprise version is that you can choose to share it with just your organization and not the public.

2) By using APIs such as those from Zapier, your app can stand out with capabilities that people limited to no-code tools would not easily be able to duplicate (see the sketch below). If you take it a step further and have an external API that you own, whose information is central to the app, that becomes a barrier in itself.
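
To make point 2 concrete, here is a minimal sketch of the kind of external API an app could call through a custom action (the endpoint, names, and data are hypothetical). One nice property of a framework like FastAPI is that it auto-generates the OpenAPI schema (served at /openapi.json) that a custom action needs:

from fastapi import FastAPI

app = FastAPI(title='Run Route API')  # hypothetical service you own

@app.get('/routes')
def get_routes(city: str, max_miles: float = 10.0):
    # In a real service this would query your proprietary route data
    return {
        'city': city,
        'routes': [{'name': 'Riverfront Loop', 'miles': 5.2}],
    }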

App Store

One of the other big announcements is an App Store coming by the end of this month. We'll see what the actual timing is; few details of how the store will work have been released so far. Many people are comparing it to the introduction of the Apple App Store, but there are important differences. No one knows how the economics of OpenAI's App Store will work or whether there will be a separate cost, but I doubt it will be like Apple's. Also, no one knows what the approval process or requirements might be. One of the big questions for me is: if it's this easy to create an app, will the App Store be overrun with submissions, duplicates, or near-duplicates? Or will some kind of cost or approval barrier prevent that? We don't know the answers yet, but we will soon - I expect answers to some of these questions before the end of the year.

My Experiment

With it being so easy to create an app, I had to try it out. If you have ChatGPT Plus, I encourage you to try building your own.

I like to run. My idea as a runner was to create a running app that, given a location, would tell me the best routes in the area. When I travel, I often don't know the best places to run, which entails a lot of googling.

https://chat.openai.com/g/g-C7mXg4svV-run-planner

Given a location, the app searches the Web across many different sites to find popular running routes. It's interesting watching it search different sites - many of them wouldn't have occurred to me if I were doing this manually.

Searching Wheeling, IL, it gave me a list of possible routes, some with links for more information. It even told me that because Wheeling has variable weather at different times of the year, some of the trails can be slick, so check the weather before going out.

I also added to the app the ability to search for upcoming races in the area and to give training plans for different types of races.

The experience was fun and took me probably less than 30 minutes.

Future Implications

It's going to be very interesting to see how the future plays out for all of this.

First, the App Store: there are a lot of questions surrounding it, as mentioned above. But I also had to wonder how an App Store aligns with OpenAI's "core values." Recently, to much media fanfare, OpenAI changed their "core values" page, with the main difference being -

AGI focus

We are committed to building safe, beneficial AGI that will have a massive positive impact on humanity's future.

Anything that doesn’t help with that is out of scope. 

So does having an App Store seem out of scope for building AGI? It seems so to me. Right after announcing that their only focus is building AGI, they announce an App Store?

Second, this is another surge in the swell of "low code" and "no code" solutions. This is just going to accelerate as tasks become more automated and expert skills shift downstream or to different use cases.

Third, right now what you can change on the Configure tab is pretty limited. But we can expect that to become much more feature rich in the future. As that part gets built out, it's those new capabilities that could make GPT Apps really take off.

Fourth, one of the other things OpenAI introduced was the Assistants API, which has threading, coding in a sandbox, retrieval, and function calling. Somewhat similarly, what if GPT Apps could themselves be called as autonomous agents? The potential would be endless.
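
For context, here is roughly what the Assistants API flow looks like in the openai Python package, based on the beta as announced (a minimal sketch; the model name, instructions, and message are placeholders):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# An assistant with the sandboxed code tool enabled
assistant = client.beta.assistants.create(
    name='Data Helper',
    instructions='Help analyze uploaded data.',
    model='gpt-4-1106-preview',
    tools=[{'type': 'code_interpreter'}],
)

# Conversations live in threads; a run executes the assistant on a thread
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role='user', content='Summarize the trends.'
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=assistant.id
)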

Summary

GPT Apps were just one of the big announcements. The much-anticipated DevDay lived up to the hype for me. If you have access to ChatGPT Plus, you should try it out.

As big a year as OpenAI had this year, I'm expecting 2024 to be even bigger for them - especially with Sam Altman saying in an interview that they are working on GPT-5 and openly talking about research into building AGI and "how to build superintelligence."

Tuesday, September 26, 2023

Top 10 Languages/Tools for Data Science in 2023

Let's talk top 10 languages/tools for data science in 2023. Obviously, this is going to be very opinionated, and everyone's list will differ depending on personal experiences in specific industries. But I want to keep this as general as possible regarding what I think someone should know in 2023. I'm also keeping the list focused on tools and software used in data science, not concepts or algorithms - I think that would make another great list ("Top 10 Data Science Concepts"), and I'll try to do that in a later post.

So why 10? Why not 5 or 20 or even 11? I'm not a big fan of top 10 lists, because 10 is such an arbitrary cutoff, but we can blame them on the fact that we have 10 fingers and that humans generally like round numbers. It has always fascinated me that if we had 8 fingers we would primarily count in base 8, and I wonder how different society would be if we weren't base 10.

To go further down the rabbit hole: base-10 systems developed independently across different civilizations, but it was Indian mathematicians who perfected the system's use, and along with Arab traders they spread it throughout the known world starting in the 7th century, enabling advances in commerce and science.

But back to my top 10 list:

1) Python

You need to know Python in 2023. It is where the most cutting-edge development in AI is happening. You need to know it deeply, not just superficially. And you need to be able to write not just Jupyter Notebooks but Python (.py) scripts. It is THE language for machine learning. But it doesn't just excel at machine learning; it excels at ETL and general process automation. It's also making inroads into front-end development with the advent of PyScript in the last year.

It routinely sits at the top of popularity lists and right now is #1 in the TIOBE index - this is due to its relative simplicity combined with its ability to accomplish complex tasks. As a general purpose language it can be applied well beyond data science. Anyone with "Analyst" in their title, even if they aren't doing data science or machine learning, needs to know Python in 2023, if only to automate parts of their job.

2) Pandas

Know the Python library Pandas. You won't be able to read through any machine learning code without knowing Pandas really well. Plus, knowing Pandas is like having a superpower for data manipulation.
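
A tiny illustration of that superpower (with made-up data): a group-and-aggregate that would take a loop and manual bookkeeping in plain Python is a single expression in Pandas:

import pandas as pd

df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b'],
    'score': [10, 12, 7, 9],
})

# Mean score per team in one line
print(df.groupby('team')['score'].mean())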

3) Matplotlib

You need to be able to wield a charting library and Matplotlib is the most popular. It is hard to find programs, articles, books, or examples that don't include some kind of Matplotlib chart. I find its syntax to not be the cleanest and there are other good libraries like Plotly and Seaborn to name a few, but it's hard to get by without being able to use Matplotlib.

4) SQL

You need to get your data from somewhere, right? And most likely some form of SQL knowledge (whether that's Postgres, MS-SQL, MySQL, Oracle, etc.) is going to be required (more on NoSQL databases below).

5) Scikit-learn

Scikit-learn should be your bread and butter for what you use to solve most business related problems. Its rich library of machine learning and statistical models can be applied to a wide variety of everyday data analytics. Plus, the documentation is superb.

6) PyTorch

PyTorch has been moving fast up my list and has overtaken TensorFlow for me and for many others. Its "pythonic" nature and object-oriented approach have real appeal. It's also arguably easier to debug, and the PyTorch Foundation does an excellent job of maintaining its ecosystem. But it's the fact that PyTorch increasingly appears in some of the most cutting-edge research at some of the most important companies in AI, especially in large language modeling, that is really driving its rise.

7) TensorFlow

I personally love TensorFlow, and you should know it for neural networks - I like its overall structure - but it's losing ground in my mind for some of the reasons outlined above.

8) Streamlit

I'm cheating a little bit, because I could easily have put Gradio here (or Dash, for that matter). But what's important is that data scientists need a good way to communicate quickly through applications.

9) Selenium

Being a data scientist is about being able to get data, and some of that data might not be available in your database or through an API, so being able to scrape data is a valuable skill.

10) MongoDB

I'm cheating again here, because I think being familiar with NoSQL databases in general, whether that's MongoDB or another, is important, along with knowing their advantages (and disadvantages). Knowing MongoDB also shows that you are familiar with JSON-style data (BSON in this case) and can parse data with a key/value structure, because working with MongoDB takes some of the same skills you need to work with API data and Python dictionaries. And being able to work with key/value data and dictionaries in Python is a fundamental skill you should have.
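
As a quick illustration of that shared skill set (a sketch with a made-up document, assuming a local MongoDB instance and the pymongo package):

from pymongo import MongoClient

client = MongoClient()  # assumes MongoDB running locally on the default port
molecules = client['demo']['molecules']  # hypothetical database and collection

# Documents are just nested key/value data, like API JSON or Python dicts
molecules.insert_one({'name': 'aspirin', 'props': {'mol_wgt': 180.16, 'logp': 1.2}})

found = molecules.find_one({'name': 'aspirin'})
print(found['props']['mol_wgt'])  # navigate it exactly like a Python dictionary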

So that's it.

What's not on my list? Well, to mention a few: Julia, Mojo, R, Hadoop, Hive, JavaScript, C/C++, Rust, Scala, SAS, SPSS, Matlab, Java, NLP packages such as NLTK and spaCy, Excel, and VBA.

I have a lot of respect for Julia as a language, so it just barely misses the list. The same goes for Mojo, partly because Mojo has only just been released for Linux. It still needs to be rolled out for Mac/Windows, but it may make the list next year.

I almost put JavaScript on the list, primarily because as one of the most popular languages it is hard to escape, and believing JavaScript is just a front-end language is wrong. JavaScript is everywhere - NodeJS, stored procedures, etc. Plus, everyone should know some JS (along with some HTML).

I also didn't single out NumPy as a specific Python package like I did with Pandas. I just thought it was obvious that you should know NumPy as a Python user and that it is really important. It's an arbitrary decision to not list it separately.

I could easily have put one or both of the NLP packages NLTK and spaCy on the list, and maybe next time I will.

As for C/C++ and Rust, I could make a strong case for one or both to be on the list for performance reasons. C/C++ is really important in the Python/machine learning ecosystem, and there are some great things happening with Rust, such as Polars. Rust and Python work really well together, and it's going to be interesting to see how the various performance initiatives in AI turn out, considering the current work to make Python faster, languages like Julia and Mojo, and interfacing directly with languages like C/C++ or Rust.

I think the writing is on the wall for VBA as an analytics language, even though there's a lot of VBA code out there. One reason is that Microsoft's decision to include Python in Excel is going to supplant many of the VBA use cases. Plus, you can tell Microsoft has not been putting resources into it: the UI and the error messages are terrible. It looks like it was written in the 90's and seems like it hasn't improved, or even changed, in years.

Many people will be upset by this, but the trend for R is not good. Its raison d'être in the 90's was as a reaction to SAS and especially SPSS, with their onerous licensing. It became very popular as many people jumped in to create some pretty great statistical packages for it, especially in academic circles. It will continue to have its specialized niches, but it has little future in cutting-edge AI research. As a side anecdote: in the data science class I teach, we used to do a very basic introduction to R toward the end, with most of the course being Python and other tools. But the university has now decided to remove R after evaluating the market, so none of the students coming out of this program will have even a basic familiarity with it.

Excel is not on the list because everyone knows Excel and knowing Excel doesn't make you a data scientist.

I'm leaving off Big Data solutions for now, but if I would put some on the list it would be Spark based ones or proprietary solutions from Snowflake, AWS, Google, Databricks etc. and not MapReduce based solutions that involve Hive or Hadoop. 

I'm also leaving off proprietary machine learning solutions like AWS SageMaker for now.

I also didn't discuss editors or IDEs. I use VS Code primarily because I can have one IDE for any language, but I think there's a lot of good choices and if you use PyCharm or any of the others that is fine. I don't think it matters as long as you are comfortable with it. Or if you want to be really cool use VIM.

So that's the list for 2023. I'll revisit in 2024 to see what has changed.

Saturday, September 23, 2023

Processing Monster Datasets with Polars

Over the last year, I've been turning more and more to the Python module Polars for handling large datasets that can't fit into memory on a single machine. Before, I used the library Dask for these occasions, and what I liked most about Dask was its very close syntax to Pandas. But the bottom line is that Polars is just so incredibly fast, and probably a better fit for single-machine processing than Dask. The Polars syntax isn't difficult and has similarities to Spark. For single-machine processing, where I don't need the overhead and complexity of the distributed processing that Dask and PySpark excel at, Polars is a great alternative.

You can find many speed comparisons between Pandas and Polars, such as here and here, but I've found it is commonly about 10-100 times faster. And that includes being faster than the new-ish 2.0 version of Pandas using PyArrow.

In my work in medicinal chemistry, I've been processing the ADME (absorption, distribution, metabolism, excretion) characteristics of molecules, the synthesizability of molecules, and their patent availability, which can be very compute-intensive. Millions of these records can be processed in times that would not be practically achievable with Pandas.

Does that mean you should just write everything in Polars and not use Pandas anymore? No! First, Polars is going to be overkill when your data size is relatively small, and it's even possible it would be slower in some edge cases on small datasets. Second, Pandas dataframes are ubiquitous, and you can count on virtually every Python module ever built being able to work with them. That is not true of Polars dataframes, and in those cases you will have to convert the dataframe from Polars to Pandas.
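
That conversion is a one-liner (assuming the pyarrow package is installed, which Polars uses for the handoff; the file name is hypothetical):

import polars as pl

pl_df = pl.read_csv('molecules.csv')
pd_df = pl_df.to_pandas()  # hand the data to any Pandas-only library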

But what exactly is Polars?

Polars is a high-performance dataframe library for Python written in Rust. And that's the key: it's written in Rust. It is designed to provide fast and efficient data processing, even for large datasets that might not fit into memory. Polars combines the flexibility and user-friendliness of Python with the speed and scalability of Rust, making it a compelling choice for a wide range of data processing tasks, including data wrangling and cleaning, data exploration and visualization, machine learning and data science, and data pipelines and ETL.

Polars offers a number of features:

  • Speed: Polars is one of the fastest dataframe libraries available, with performance comparable to Arrow on many tasks. Much of this is due to its being written in Rust and its ability to use all of the machine's cores in parallel.
  • Scalability: Polars can handle datasets much larger than your available RAM, thanks to its lazy execution model, so it optionally doesn't need to load all of the data into memory (see the sketch after this list). I have used it to read in a massive dataset that contained all known patented molecules.
  • Ease of use: Polars has a familiar and intuitive API that is easy to learn for users with experience with other dataframe libraries, such as Pandas and PySpark.
  • Feature richness: Polars offers a comprehensive set of features for data processing that parallels those of Pandas or PySpark, including filtering, sorting, grouping, aggregation, and joining.
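
Here is a minimal sketch of that lazy execution model (hypothetical file, with column names matching the examples below). With scan_csv, Polars only builds a query plan; nothing is read until collect() is called, which lets it push filters down and avoid loading the whole file:

import polars as pl

# Build a lazy query; no data is loaded yet
lazy_df = (
    pl.scan_csv('molecules.csv')
    .filter(pl.col('Mol_Wgt') < 500)
    .groupby('Protein')
    .agg(pl.col('Mol_Wgt').mean())
)

# Execution (and plan optimization) happens here
result = lazy_df.collect()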

You can create a Polars DataFrame from a variety of data sources, such as CSV files, Parquet files, and Arrow buffers. Once you have a DataFrame, you can use Polars to perform a wide range of data processing tasks.

For drug discovery, I have used Polars on large data sets to analyze molecular interactions, predict pharmacokinetic properties, and optimize drug design. For example, I have used it to help calculate the binding energy of a molecule to its target where the data sets are large. This can be useful for identifying drug candidates that are likely to have high affinity for their targets.

In the examples below, which cover common Pandas use cases, I show the Polars code with the equivalent Pandas code as a comment underneath when it differs:

import polars as pl
import pandas as pd

# Load data about molecules from a CSV file
df = pl.read_csv('molecules.csv')
# df = pd.read_csv('molecules.csv')

Selecting columns:
# Select specific columns about a molecule
selected_df = df.select(['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors'])
# Equivalent Pandas code
# selected_df = df[['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors']]
selected_df.head()
Selecting rows:

# Filter the df
filtered_df = df.filter(pl.col('Mol_Wgt') < 500)

# Equivalent Pandas code using .loc
# filtered_df = df.loc[df['Mol_Wgt'] < 500]

Grouping data:

# Group on Protein and calculate the mean Mol Wgt by Protein
grouped_df = df.groupby(by='Protein').agg(pl.col('Mol_Wgt').mean())

# Equivalent Pandas code
# grouped_df = df.groupby('Protein')[['Mol_Wgt']].mean()

Sorting rows by a column:

# Sort rows by Mol Wgt
sorted_df = df.sort(by='Mol_Wgt')

# Equivalent Pandas code
# sorted_df = df.sort_values('Mol_Wgt')

As you can see, the syntax is not that different. Because of its dataframe structure and expressive syntax, Polars works great for large datasets that can't fit into memory and need processing speed. With all of these considerations, Polars is a great tool for any data scientist to add.

Tuesday, July 4, 2023

Quantum Computing in Healthcare

In 1981, Richard Feynman gave a talk at a conference called "The Physics of Computation," which was later detailed in the book Feynman Lectures on Computation. Feynman observed that the world - its physics - is quantum, and that if one wanted to simulate that physics, a quantum computer would be the way to do it. He then went on to outline many of the main concepts still used in the field of quantum computing.

However, quantum computing was out of reach for the technology of the 1980's, and even now, 40+ years later, we are still in a nascent stage. The running joke for a long time has been that quantum computing (like fusion energy) is always just 10 years away; then 10 years later, it is again just 10 years away. But recently there have been announcements that lead me to believe we could be arriving at a massive inflection point in turning theory into extremely practical applications. For example, IBM recently announced that, through a partnership with the University of Chicago and the University of Tokyo, it plans to build a 100,000-qubit computer within 10 years. Alphabet spin-off SandboxAQ is looking to do molecular simulations of interactions for drug discovery. And startups looking to use quantum computing in drug discovery include Aqemia, doing structure-based design of drug candidates, and Qubit Pharmaceuticals, using quantum computing to simulate molecules.

Before I get into what I think will be some of the biggest applications, a brief discussion of the characteristics that make quantum computing special compared to classical computing is warranted. By classical computing, I mean the everyday binary computing of 0's and 1's that powers everything we do.

These 0's and 1's (bits) of classical computing are processed through logical operations, such as AND, OR, and NOT gates, to perform computations. However, in quantum computing, we utilize quantum bits, or qubits, which can exist in a superposition of states.

A qubit can represent both 0 and 1 simultaneously, thanks to the property of superposition. This means that a qubit can be in a state that is a combination of 0 and 1 at the same time. For example, a qubit can be 0 and 1 with a certain probability assigned to each state. This allows quantum computers to perform multiple computations in parallel, exponentially increasing their computational power compared to classical computers.
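
In standard notation, that superposition is written as

$$|\psi\rangle = \alpha\,|0\rangle + \beta\,|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1,$$

where measuring the qubit yields 0 with probability $|\alpha|^2$ and 1 with probability $|\beta|^2$ - these squared amplitudes are the "certain probability assigned to each state" mentioned above.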

Furthermore, qubits can also exhibit a property called entanglement. When qubits become entangled, the state of one qubit becomes correlated with the state of another qubit, regardless of the distance between them. This entanglement enables quantum computers to perform operations on multiple qubits simultaneously, leading to the potential for massive computational speedups.

To manipulate qubits and perform computations, quantum computers rely on quantum gates. These gates are analogous to the logical gates in classical computing and allow for operations such as changing the probability distribution of a qubit's states or entangling qubits together.

If you want to dive deeper into the details of quantum computing, IBM has a great resource that goes through how it works. What's really cool is that with their Python-based Qiskit library you can get real experience in quantum computing using their cloud quantum computers: you can install it, get some hands-on experience, and even upload your code to run on one of their machines.

https://www.ibm.com/topics/quantum-computing
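
As a taste of that hands-on experience, here is a minimal Qiskit sketch (my own toy example, not from IBM's material) that builds the classic Bell state, combining the superposition and entanglement described above:

from qiskit import QuantumCircuit
from qiskit.primitives import Sampler

# Hadamard puts qubit 0 into superposition; CNOT entangles qubit 1 with it
qc = QuantumCircuit(2, 2)
qc.h(0)
qc.cx(0, 1)
qc.measure([0, 1], [0, 1])

# Sample on the local simulator; expect roughly 50/50 outcomes of '00' and '11'
result = Sampler().run(qc).result()
print(result.quasi_dists[0])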

Also, here is a great series again from IBM that explains quantum computing.

"Understanding Quantum Information and Computation"

Quantum computing holds great promise for solving complex problems that are computationally infeasible for classical computers. Certain algorithms, such as Shor's algorithm, can factor large numbers exponentially faster on a quantum computer, posing a potential threat to modern cryptographic systems. Quantum computers also excel in optimization, simulation of quantum systems, and machine learning tasks. Specifically, quantum computing could be applied to financial modeling, materials design, and transportation scheduling.

And as a side note, quantum computing can be applied to natural language processing. In October 2021, Cambridge Quantum Computing (CQC) announced the release of the first Quantum Natural Language Processing (QNLP) toolkit and library, which can translate sentences into quantum circuits. As powerful as transformer/attention models like GPT from companies like OpenAI have become, there will need to be another breakthrough - or several - to reach sophisticated AGI, and quantum computing may be one factor enabling an acceleration toward artificial superintelligence (ASI).

With all of these possible applications of quantum computing, some of the most intriguing are in healthcare. In March of this year, IBM and Cleveland Clinic announced a partnership featuring the first on-site quantum computer at a healthcare institution. A few of the applications will include prediction models for cardiovascular disease, drug discovery, and genetics.

It is these areas, drug discovery and genetics, that I believe hold the greatest promise in healthcare.

Drug Discovery

  • Simulation of molecular behavior: Quantum computers can simulate the behavior of molecules at the quantum level with high precision. This capability is crucial for understanding the interactions between drugs and their target molecules, as well as predicting their effectiveness and potential side effects. By providing more accurate and detailed simulations, quantum computers can accelerate the drug discovery process and improve the success rate of drug candidates.
  • Drug optimization and molecular design: Quantum computing can aid in optimizing drug molecules and designing new ones. Quantum algorithms can explore vast chemical space to identify compounds with specific properties, such as high potency, selectivity, and bioavailability. This can potentially lead to the discovery of novel drugs or optimization of existing ones, making the drug development process more efficient and cost-effective.
  • Quantum machine learning for drug discovery: Quantum machine learning algorithms running on quantum computers can analyze large datasets related to drug targets, molecular structures, and biological interactions. These algorithms can extract valuable insights, identify patterns, and make predictions to guide drug discovery efforts. Quantum machine learning has the potential to improve target identification, lead optimization, and personalized medicine.
  • Quantum chemistry calculations: Quantum computers can perform complex quantum chemistry calculations, such as calculating molecular energies, reaction rates, and properties of chemical systems. These calculations are computationally intensive and often intractable for classical computers. By harnessing quantum algorithms, quantum computers can provide more accurate and efficient solutions, enabling researchers to better understand chemical processes and accelerate drug discovery.

Genetics

Considering the large number of base pairs in the human genome, as explained in this paper, an entire human genome could be stored in ~34 qubits. Amazingly, doubling the number of qubits could theoretically store the genome of every person on the planet - illustrating that the potential of quantum computing lies not just in its speed but in its ability to store and process vast amounts of data (see the back-of-the-envelope arithmetic after this list). This power could revolutionize genetic medicine in two ways:

  • Genomic data analysis: Quantum computing can help analyze vast amounts of genomic data, such as DNA sequences, gene expression profiles, and genetic variations. Quantum algorithms can enhance the efficiency of processing and analyzing this data, enabling researchers to identify genetic patterns, disease markers, and potential drug targets more accurately and quickly.
  • Precision medicine and personalized therapies: Quantum computing can aid in the development of personalized medicine approaches by analyzing individual genetic data and matching it with specific drug responses. Quantum algorithms can process and analyze large-scale genomic datasets to identify genetic markers that influence drug efficacy, toxicity, and treatment outcomes. This information can guide the development of tailored therapies based on an individual's genetic profile.
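
The back-of-the-envelope arithmetic behind the ~34 qubit figure above, under my reading that $n$ qubits can index $2^n$ basis states: a diploid human genome has roughly $6 \times 10^9$ base pairs at 2 bits each, or about $1.2 \times 10^{10}$ bits, and

$$2^{34} \approx 1.7 \times 10^{10} > 1.2 \times 10^{10}.$$

Doubling to 68 qubits gives $2^{68} \approx 3 \times 10^{20}$, comfortably more than the roughly $10^{20}$ bits needed for 8 billion such genomes.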

For both drug discovery and genetics, combining quantum computing with artificial intelligence would be completely transformative in finding therapies and cures in healthcare.

So what should you do?

What should you do as an organization, or as a technical person, looking at the possible quantum landscape? Every company should periodically evaluate its preparedness and long-range opportunities, because even though practical applications may not be here yet, there will be a moment when that changes - and the opportunities could be enormous for well-positioned companies in the right industries. For individuals, there is currently a knowledge shortage around quantum computing, and I believe there will be one for a long time as the field accelerates. So it is never too early to start learning as much as you can about it.

Summary

Quantum computing has great potential; however, it is important not to think of it as a replacement for classical computing. Quantum computing has very specific situations where it should be applied, just as the GPU in machine learning has its specific uses. And just as with GPUs, the ideal machine will be a marriage of quantum and classical computing.

However, as great as the promise of quantum computing is, it is still in its early stages. And although the announcements from IBM about building a 100,000-qubit computer by the end of the decade, along with announcements from Google and others, are encouraging, many technical challenges remain. Building stable and error-resistant qubits, minimizing environmental interference, and developing error-correcting codes are among the significant hurdles researchers are working to overcome. Despite these challenges, quantum computing has the potential to revolutionize many industries, healthcare among them.

Thursday, May 25, 2023

Andrej Karpathy's Explanations of GPT

For deeply understanding GPT models like the ones from OpenAI, I think no one explains them better than Andrej Karpathy. Karpathy was part of the original group at OpenAI, then left to become director of AI at Tesla, but since February he has been back at OpenAI. He also makes YouTube videos explaining the underlying technology.

I want to point out two recent videos he did that are essential if you want to really understand how something like ChatGPT works.

First is the talk he just did at Microsoft Build. There were some important AI announcements at MS Build 2023, and I encourage you to check them out, but Andrej's talk, even though it wasn't an announcement talk, should be really valuable for anyone using a GPT-based tool. At a high level, he explains how GPT works, and in the second part of his talk he explains why different types of prompts work.


I generally have a problem with the term "prompt engineering," as it's not engineering, and getting what you want from an AI is often just common sense. But admittedly it does involve understanding how to communicate with GPT as an AI assistant versus a human assistant. Andrej explains prompt strategies like "chain of thought," using an ensemble of multiple attempts, reflection, and simulating how humans reason toward an optimal solution. He also talks a bit about AutoGPT and the hype surrounding it, calling it "inspirationally interesting." He also mentions a paper that just came out on using tree search in prompting, called "Tree of Thoughts."

The second video is from Andrej's YouTube channel. It is called "Let's build GPT: from scratch, in code, spelled out." He has other videos on his channel, the "makemore" series, that are also really good, but this one is THE BEST explanation of how a transformer/attention-based model like GPT works. All of today's models are what they are because of the seminal Google paper "Attention Is All You Need," along with other refinements like reinforcement learning.

But if you want to understand how these models actually work and how they are built, and have found other explanations too general, or diagrams like this baffling or unsatisfactory, then this is the explanation for you.



This is especially true if the way you understand something is by seeing actual working code built up. His video does require knowledge of Python and some knowledge of PyTorch, but even if you haven't done much in PyTorch, you can follow along. Plus, it's the perfect opportunity to build something in PyTorch if you haven't before. His explanations are extremely clear. He goes step by step through building a simple GPT model, which you can do regardless of how powerful your machine is or what your access to a GPU might be.

So the first talk I'm recommending is a more high-level understanding of GPT, while the second is more technical, for a deeper understanding of the engineering. Both are excellent, and I know it's a bold statement because a lot of people are really good at this, but I think that besides being a brilliant engineer, Andrej Karpathy is the best at educating people in AI right now.

Wednesday, May 10, 2023

Mojo: Is it the Last Programming Paradigm We Need?

I think it's safe to say that anyone who has been writing code for any length of time has had to learn multiple languages. This is partially because of who we are - people doing some kind of engineering or science - always learning and on the lookout for that new language that handles situations better than our current "main" language. Maybe it's to go faster, to use memory better or more safely, or to work with a new OS or new hardware. Maybe it's a language specialized for front end or back end, one that works close to the metal or is abstracted from the hardware, or one that simply follows new and better software design principles.

Regardless of the reason, constantly learning new languages is a requirement for working in software technology. So the question is whether yet another language is really necessary - that language being Mojo, from a company named Modular.

Before we get into why I think the answer could be yes, a little history. My very first language was Fortran, because at the time it was the best language, meaning fastest, for doing any kind of mathematical or scientific programming. At that time, there were very few packaged libraries. If you wanted special clustering options for a K-means cluster, you wrote that from scratch. If you wanted to convert a string to a number, you looped through each place in the array, subtracted 48 from it and built the number up. And there was always at least one person in an organization who knew Assembler for those heavily travelled pieces of functionality. Good times.

But I quickly moved on from Fortran and got involved with C/C++ because "modern language," "Windows programming!", and Object-Oriented Programming - OOP was going to be the ultimate way of structuring large, complicated ideas into code, and the OOP paradigm ruled across many different languages for many years (well, until functional programming challenged some of its core ideas).

From C/C++, I went to C# and .NET for many years (which I loved), and of course Java, a myriad of JavaScript-based front-end languages, and many others. All great languages, but none of them compared to Python, which has been my language of choice for many years. With Python's easy-to-read syntax, productivity, and all of its supporting libraries, it's really like having a superpower.

This is all to set up why I hope a new language called Mojo can take Python to an even higher level and maybe take me and others off the rollercoaster of changing programming paradigms - at least for a while.

So what is Mojo?

Mojo is being put out by a company called Modular, led by Chris Lattner. If that name sounds familiar, it's because he's the person who created Swift, LLVM, and Clang. LLVM has many parts, but at its core is a low-level, assembly-like representation called intermediate representation (IR) that many, many languages use or can use - including C#, Scala, Ruby, Julia, Kotlin, Rust, and the list goes on and on.

LLVM is an important part of the Mojo story because, even though Mojo is not based on LLVM itself, it is based on a newer technology that grew out of LLVM called MLIR, another project led by Chris Lattner. One of the many things MLIR does is abstract away the targeting of a variety of hardware - threads, vectors, GPUs, TPUs - things that are really important for modern AI programming.

All of this is to say that Mojo has some major history of years of technical expertise in compiler design behind it. 

But what does Mojo mean to the regular data engineer or scientist who doesn’t care about all of the details of how it compiles and just wants to get stuff done? 

The short answer is that Mojo is (or hopes to be) a superset of Python that is much faster and that makes it much easier to target typical machine learning hardware.

A "superset" of Python? Yes - that means all existing Python code and libraries should work without changing anything. As a superset language, it may bring to mind C++ and TypeScript as supersets of C and JavaScript respectively. I'm not sure the comparison will turn out to be completely accurate: TypeScript has idiosyncrasies that lead some to argue over whether it's correct, or even important, to call it a superset; and as for C++, I think the transition for Python programmers writing Mojo might be easier than for a pure C programmer introducing C++ into their C code. But this all remains to be seen.


Advantages of Mojo over Python:

Outside of the aforementioned advantage of being a superset of Python, the main advantage is speed, pure and simple. One of the advantages of Python is that it's interpreted, which increases productivity and ease of use, but that comes at the price of being really slow compared to languages like C/C++ and Rust.

Enter Mojo. As you can see in this demo, under the right circumstances Mojo can be up to 35,000x faster than Python. In the video, Jeremy Howard, an advisor to Mojo, steps through different optimizations that speed up a Python use case. But even without all of the optimizations, you can see (starting at 1:27) that taking Python code and changing nothing except compiling it with Mojo yields over an 8x speedup.

There are many speed-up opportunities, too many to list, but it's important to know that Mojo can do explicit parallelism in a way that Python simply can't, and, because of MLIR, it can elegantly take advantage of different hardware types like GPUs.

And also because of MLIR, it's not just targeting GPUs; it could potentially take advantage of any emerging hardware - which is why it could have real staying power.

Finally, another important advantage is that Mojo offers memory management similar to Rust's and allows strict type checking, which can eliminate many errors.


Reservations about Mojo:

Okay, this all sounds great, but what are the potential downsides? There are a few, none of which I believe are big enough to dissuade anyone from trying out Mojo.

  • It’s not open source – yet

This, I think, is the biggest concern. Their intention is to open source it "soon," once it reaches a certain level. Their rationale is that they want to iron out the fundamentals and that they can move faster with a dedicated team before open sourcing it, which I can understand. I'm not an open source absolutist, but the concern is that the best way for a new language to break through the noise and reach wide adoption is to be open source. Maybe it would have been better to wait until they were ready to open source it before making the announcement.

  • It’s not generally available yet 

You can sign up and be put on a waitlist for access to try it out on their notebook server. For a new language, it's important that anyone can download and test it locally, but even once you get access, you can't run it on your own machine yet. So again, it might have been better to hold the release announcement until this was possible.

  • Productionizing

I think with any new language, or even new versions of Python, there are always questions about how it will work in an existing pipeline and infrastructure. Not a huge problem, but something to note.


So what should you do now? 

If you are intrigued by Mojo, you should read through their documentation and definitely sign up for access here: https://www.modular.com/mojo. That doesn't mean the great speed-ups that have been happening in Python 3.1x should be ignored or discounted, or that efforts like calling Rust from Python or using Julia should be discarded.

But Mojo is something to keep on the radar. If they are able to accomplish what they are setting out to do, it could come to define not just AI programming but a new programming paradigm for a whole host of applications.

 

The Octopus, AI, and the Alien Mind: Rethinking Intelligence

I’ve long been fascinated by octopuses (and yes it's octopuses and not octopi). Their intelligence, forms, and alien-like behaviors ...