Tuesday, September 26, 2023

Top 10 Languages/Tools for Data Science in 2023

Let's talk top 10 languages/tools for data science in 2023. Obviously, this is going to be very opinionated, and everyone's list will be different depending on personal experiences in specific industries. But I want to keep this as general as possible: what I think someone should know in 2023. I'm also keeping this list focused on tools and software used in data science and not concepts or algorithms - I think that would make another great list ("Top 10 Data Science Concepts") and I'll try to do that in a later post.

So why 10? Why not 5 or 20 or even 11? I'm not a big fan of top 10 lists, because 10 is such an arbitrary cutoff, but we can blame top 10 lists on the fact that we have 10 fingers and that humans generally like round numbers. It has always fascinated me that if we had 8 fingers we would probably count in base 8, and I wonder how different society would be if we weren't base 10.

To go further down the rabbit hole, base 10 systems developed independently across different civilizations. But it was Indian mathematicians who perfected the decimal positional system, and from around the 7th century onward it spread, along with Arab traders and scholars, throughout the known world, enabling advances in commerce and science.

But back to my top 10 list:

1) Python

You need to know Python in 2023. It is where the most cutting-edge development in AI is happening. You need to know it deeply, not just superficially, and you need to be able to write not just Jupyter Notebooks but also Python (.py) scripts. It is THE language for machine learning. But it doesn't just excel at machine learning; it also excels at ETL and general process automation. It's even making inroads into front-end development with the advent of PyScript in the last year.

It routinely sits at the top of popularity lists and is currently #1 in the TIOBE index - this is due to its relative simplicity while still being able to accomplish complex tasks. As a general-purpose language it can be applied to tasks well beyond data science. Anyone with "Analyst" in their title, even if they aren't doing data science or machine learning, needs to know Python in 2023, if only to be able to automate parts of their job.

2) Pandas

Know the Python library Pandas. You won't be able to read through any machine learning code without knowing Pandas really well. Plus, knowing Pandas is like having a super power for data manipulation.
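
As a minimal sketch of the kind of everyday data manipulation Pandas makes easy (the file and column names here are purely hypothetical):

import pandas as pd

# Load hypothetical sales data with columns: region, product, revenue
df = pd.read_csv('sales.csv')

# Filter, group, aggregate, and sort in a few readable lines
top_regions = (
    df[df['revenue'] > 0]
    .groupby('region')['revenue']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)
print(top_regions)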

3) Matplotlib

You need to be able to wield a charting library and Matplotlib is the most popular. It is hard to find programs, articles, books, or examples that don't include some kind of Matplotlib chart. I find its syntax to not be the cleanest and there are other good libraries like Plotly and Seaborn to name a few, but it's hard to get by without being able to use Matplotlib.
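
For reference, a minimal Matplotlib sketch (the data is purely illustrative):

import matplotlib.pyplot as plt

# A basic labeled line chart - this covers a surprising share of day-to-day charting needs
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 8, 13]
plt.plot(x, y, marker='o')
plt.xlabel('x')
plt.ylabel('y')
plt.title('A basic Matplotlib line chart')
plt.show()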

4) SQL

You need to be able to get your data from somewhere, right? And most likely some form of SQL knowledge (whether that's Postgres, MS-SQL, MySQL, Oracle, etc.) is going to be required (more on NoSQL databases below).
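
As a small, hedged sketch of pulling data with SQL from Python (I'm using SQLite here just to keep it self-contained; the table and query are hypothetical, and the same pattern works with a Postgres/MySQL/etc. connection):

import sqlite3
import pandas as pd

# Connect to a local database (any DBAPI or SQLAlchemy connection works similarly with read_sql)
conn = sqlite3.connect('analytics.db')

# A hypothetical aggregation query - the real skill is expressing joins, filters, and group-bys in SQL
query = """
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spent
    FROM orders
    WHERE order_date >= '2023-01-01'
    GROUP BY customer_id
    ORDER BY total_spent DESC
"""
df = pd.read_sql(query, conn)
conn.close()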

5) Scikit-learn

Scikit-learn should be your bread and butter for what you use to solve most business related problems. Its rich library of machine learning and statistical models can be applied to a wide variety of everyday data analytics. Plus, the documentation is superb.
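
A minimal sketch of the typical Scikit-learn workflow, using one of its bundled toy datasets so it's fully self-contained:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy dataset, split it, fit a model, and score it - the same pattern covers most everyday problems
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))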

6) PyTorch

PyTorch has been moving fast up my list and has overtaken TensorFlow for me, as it has for many others. Its "pythonic" nature and object-oriented approach have real appeal. It's also arguably easier to debug, and the PyTorch Foundation does an excellent job of maintaining its ecosystem. But what is really driving its rise is that it's increasingly used in some of the most cutting-edge research at some of the most important companies in AI, especially in large language modeling.

7) TensorFlow

I personally love TensorFlow - you should know it for neural networks, and I like its overall structure - but it's losing ground in my mind for some of the reasons outlined above.

8) Streamlit

I'm cheating here a little bit, because I could easily put Gradio here (or Dash for that matter). But I think what's important here is that data scientists need a good way to communicate through applications quickly.
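
To show how quickly Streamlit gets you from analysis to a working app, here's a minimal sketch (the uploaded file and its columns are whatever the user provides; run it with "streamlit run app.py"):

import pandas as pd
import streamlit as st

# A tiny interactive app: upload a CSV, preview it, and chart a chosen (numeric) column
st.title('Quick data explorer')

uploaded = st.file_uploader('Upload a CSV file')
if uploaded is not None:
    df = pd.read_csv(uploaded)
    st.dataframe(df.head())
    column = st.selectbox('Column to plot', df.columns)
    st.line_chart(df[column])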

9) Selenium

Being a data scientist is about being able to get data and some of that data might not always be available in your database or through an API, so being able to scrape data is a valuable skill.
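
A minimal Selenium sketch (the URL and the elements grabbed are placeholders; this assumes Chrome is installed, and recent Selenium versions will fetch a matching driver for you):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser, load a page, and pull text out of elements
driver = webdriver.Chrome()
driver.get('https://example.com')

# Collect the text of every <h2> on the page - the selector is purely illustrative
headings = [el.text for el in driver.find_elements(By.TAG_NAME, 'h2')]
print(headings)

driver.quit()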

10) MongoDB

I'm cheating again here, because what I really think matters is being familiar with NoSQL databases in general, whether that's MongoDB or another one, and knowing their advantages (and disadvantages). Knowing MongoDB also shows that you are familiar with JSON-style data (BSON in this case) and can parse data with a key/value structure. The skills you use to work with MongoDB are largely the same ones you need to work with API data and Python dictionaries, and being able to work with key/value data and dictionaries in Python is a fundamental skill you should have.
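
A short sketch of what that key/value work looks like with PyMongo (the connection string, database, collection, and fields are all hypothetical):

from pymongo import MongoClient

# Connect and query a collection of JSON-like documents
client = MongoClient('mongodb://localhost:27017')
collection = client['research']['molecules']

# Each document comes back as a Python dictionary, so the same parsing skills carry over to API payloads
for doc in collection.find({'mol_wgt': {'$lt': 500}}).limit(5):
    print(doc.get('smiles'), doc.get('mol_wgt'))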

So that's it.

What's not on my list? Well, to mention a few: Julia, Mojo, R, Hadoop, Hive, Javascript, C/C++, Rust, Scala, SAS, SPSS, Matlab, Java, NLP packages such as NLTK, spaCy, Excel, and VBA.

I have a lot of respect for Julia as a language, so it only barely misses the list. Same goes for Mojo, partly because Mojo was just released for Linux only. It still needs to be rolled out to Mac and Windows, but it may make the list next year.

I almost put JavaScript on the list, primarily because as one of the most popular languages it is hard to escape, and believing JavaScript is just a front-end language is wrong. JavaScript is everywhere - Node.js, stored procedures, etc. Plus, everyone should know some JS (along with some HTML).

I also didn't single out NumPy as a specific Python package like I did with Pandas. I just thought it was obvious that you should know NumPy as a Python user and that it is really important. It's an arbitrary decision to not list it separately.

I could easily have put one or both of the NLP packages NLTK and spaCy on the list, and maybe next time I will.

As for C/C++ and Rust, I could make a strong case for one or both to be on the list for performance reasons. C/C++ is really important in the Python/machine learning ecosystem, and there are some great things happening with Rust, such as Polars. Rust and Python work really well together, and it's going to be interesting to see how the various performance initiatives in AI turn out, considering the current work to make Python itself faster, languages like Julia and Mojo, and interfacing directly with languages like C/C++ or Rust.

As for VBA, which is still used in some analytics, I think the writing is on the wall even though there's a lot of VBA code out there. One reason is that Microsoft's decision to include Python in Excel is going to supplant many of the VBA use cases. Plus, you can tell Microsoft has not been putting resources into it: the UI and the error messages are terrible, it looks like it was written in the 90's, and it doesn't seem to have improved or even changed in years.

Many people will be upset with this, but the trend for R is not good. Its raison d'être in the 90's was as a reaction to SAS and especially SPSS with their onerous licensing. It became very popular as many people jumped in to create some pretty great statistical packages for it - especially in academic circles. It will continue to have its specialized niches, but it has little future in cutting-edge AI research. As a side anecdote, in the data science class I teach, we used to do a very basic introduction to R towards the end, with most of the course being Python and other tools. But the university has now decided to remove R after evaluating the market, so none of the students coming out of this program will have even a basic familiarity with R.

Excel is not on the list because everyone knows Excel and knowing Excel doesn't make you a data scientist.

I'm leaving off Big Data solutions for now, but if I were to put some on the list they would be Spark-based ones or proprietary solutions from Snowflake, AWS, Google, Databricks, etc., and not MapReduce-based solutions that involve Hive or Hadoop.

I'm also leaving off proprietary machine learning solutions like AWS SageMaker for now.

I also didn't discuss editors or IDEs. I use VS Code primarily because I can have one IDE for any language, but there are a lot of good choices, and if you use PyCharm or any of the others, that is fine. I don't think it matters as long as you are comfortable with it. Or, if you want to be really cool, use Vim.

So that's the list for 2023. I'll revisit in 2024 to see what has changed.

Saturday, September 23, 2023

Processing Monster Datasets with Polars

Over the last year, I've been turning more and more to the Python library Polars for handling large datasets that can't fit into memory on a single machine. Before, I used Dask for these occasions, and what I liked most about Dask was its very close syntax to Pandas. But the bottom line is that Polars is just so incredibly fast, and it's probably a better fit for single-machine processing than Dask. The Polars syntax isn't difficult and is similar to Spark's. For single-machine processing where I don't need the overhead and complexity of the distributed processing that Dask and PySpark excel at, Polars is a great alternative.

You can find many speed comparisons between Pandas and Polars, such as here and here, but I've found it's commonly about 10-100 times faster. And that includes being faster than the new-ish 2.0 version of Pandas with its PyArrow backend.

In my work in medicinal chemistry, I've been processing the ADME (absorption, distribution, metabolism, excretion) characteristics of molecules, the synthesizability of molecules, and their patent availability which can be very compute intensive. Millions of these records can be processed in times that would not be practically achievable with Pandas.

Does that mean you should just write everything in Polars and not use Pandas anymore? No! First, Polars is going to be overkill when your data size is relatively small, and it may even be slower in some edge cases on small datasets. Second, Pandas dataframes are ubiquitous, and you can count on essentially every Python data library being able to work with them. That is not true of Polars dataframes, and in those cases you will have to convert the dataframe from Polars to Pandas.
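
The conversion itself is a one-liner; a small sketch (this assumes pandas and pyarrow are installed alongside Polars):

import polars as pl

pl_df = pl.DataFrame({'id': [1, 2, 3], 'value': [0.1, 0.2, 0.3]})

# Hand the data off to a library that only understands Pandas dataframes
pd_df = pl_df.to_pandas()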

But what exactly is Polars?

Polars is a high-performance dataframe library for Python written in Rust. And that's the key: it's written in Rust. It is designed to provide fast and efficient data processing, even for large datasets that might not fit into memory. Polars combines the flexibility and user-friendliness of Python with the speed and scalability of Rust, making it a compelling choice for a wide range of data processing tasks, including data wrangling and cleaning, data exploration and visualization, machine learning and data science, and data pipelines and ETL.

Polars offers a number of features:

  • Speed: Polars is one of the fastest dataframe libraries available, with performance comparable to Arrow on many tasks. Much of this is due to it being written in Rust and that it can use all of the machine cores in parallel.
  • Scalability: Polars can handle datasets much larger than your available RAM, thanks to its lazy execution model - so it will optionally not need to load all of the data into memory. I have used it to read in a massive dataset that contained all known patented molecules.
  • Ease of use: Polars has a familiar and intuitive API that is easy to learn for users with experience with other dataframe libraries, such as Pandas and PySpark.
  • Feature richness: Polars offers a comprehensive set of features for data processing that parallels those of Pandas or PySpark, including filtering, sorting, grouping, aggregation, and joining.

You can create a Polars DataFrame from a variety of data sources, such as CSV files, Parquet files, and Arrow buffers. Once you have a DataFrame, you can use Polars to perform a wide range of data processing tasks.

For drug discovery, I have used Polars on large data sets to analyze molecular interactions, predict pharmacokinetic properties, and optimize drug design. For example, I have used it to help calculate the binding energy of a molecule to its target where the data sets are large. This can be useful for identifying drug candidates that are likely to have high affinity for their targets.

In the examples below, which are common use cases in Pandas, I show the Polars code with the equivalent Pandas code as a comment underneath when it differs:

import polars as pl
import pandas as pd

# Load data about molecules from a CSV file
df = pl.read_csv('molecules.csv')
# df = pd.read_csv('molecules.csv')

Selecting columns:

# Select specific columns about a molecule
selected_df = df.select(['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors'])

# Equivalent Pandas code
# selected_df = df[['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors']]

selected_df.head()

Selecting rows:

# Filter the df
filtered_df = df.filter(pl.col('Mol_Wgt') < 500)

# Equivalent Pandas code using .loc
# filtered_df = df.loc[df['Mol_Wgt'] < 500]

Grouping data:

# Group on Protein and calculate the mean Mol Wgt by Protein
grouped_df = df.groupby(by='Protein').agg(pl.col('Mol_Wgt').mean())

# Equivalent Pandas code
# grouped_df = df.groupby(['Protein'])[['Mol_Wgt']].mean()

Sorting rows by a column:

# Sort the rows by Mol_Wgt
sorted_df = df.sort(by='Mol_Wgt')

# Equivalent Pandas code
# sorted_df = df.sort_values('Mol_Wgt')
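
And to illustrate the lazy execution model mentioned above, here's a sketch of the same kind of query run lazily (same hypothetical file and columns; I'm using the newer group_by spelling here):

# Lazily scan the CSV instead of reading it all into memory,
# let Polars optimize the filter and aggregation, then collect the result
lazy_result = (
    pl.scan_csv('molecules.csv')
    .filter(pl.col('Mol_Wgt') < 500)
    .group_by('Protein')
    .agg(pl.col('Mol_Wgt').mean())
    .collect()
)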

As you can see, the syntax is not that different, and thanks to its dataframe structure and expressive syntax, Polars works great for large data sets that can't fit into memory and need processing speed. With all of these considerations, Polars is a great tool for any data scientist to add to their toolkit.

"Superhuman" Forecasting?

This just came out from the Center for AI Safety  called Superhuman Automated Forecasting . This is very exciting to me, because I've be...