Saturday, September 23, 2023

Processing Monster Datasets with Polars

Over the last year, I've been turning more and more to the Python library Polars for handling large datasets that can't fit into memory on a single machine. I had previously used Dask for these jobs, and what I liked most about Dask was its syntax, which closely mirrors Pandas. The bottom line, though, is that Polars is so incredibly fast that it's probably a better fit for single-machine processing than Dask. The Polars syntax isn't difficult and resembles Spark's. For single-machine processing, where I don't need the overhead and complexity of the distributed processing that Dask and PySpark excel at, Polars is a great alternative.

You can find many published speed comparisons between Pandas and Polars, and in my experience Polars is commonly about 10-100 times faster. That includes being faster than the newish 2.0 version of Pandas with its PyArrow backend.

In my work in medicinal chemistry, I've been processing the ADME (absorption, distribution, metabolism, excretion) characteristics of molecules, the synthesizability of molecules, and their patent availability, all of which can be very compute-intensive. Millions of these records can be processed in times that would not be practically achievable with Pandas.

Does that mean you should just write everything in Polars and stop using Pandas? No! First, Polars is overkill when your data is relatively small, and it can even be slower in some edge cases on small datasets. Second, Pandas dataframes are ubiquitous, and you can count on virtually every Python data module to work with them. That is not true of Polars dataframes, and in those cases you will have to convert the dataframe from Polars to Pandas.

But what exactly is Polars?

Polars is a high-performance dataframe library for Python written in Rust. And that's the key - that it's written in Rust. It is designed to provide fast and efficient data processing capabilities, even for large datasets that might not fit into memory. Polars combines the flexibility and user-friendliness of Python with the speed and scalability of Rust, making it a compelling choice for a wide range of data processing tasks, including data wrangling and cleaning, data exploration and visualization, machine learning and data science, and data pipelines and ETL.

Polars offers a number of features:

  • Speed: Polars is one of the fastest dataframe libraries available. Much of this is due to it being written in Rust on top of the Apache Arrow columnar memory format, and to its ability to use all of a machine's cores in parallel.
  • Scalability: Polars can handle datasets much larger than your available RAM, thanks to its lazy, streaming execution model - it does not need to load all of the data into memory at once. I have used it to read in a massive dataset that contained all known patented molecules.
  • Ease of use: Polars has a familiar and intuitive API that is easy to learn for users with experience with other dataframe libraries, such as Pandas and PySpark.
  • Feature richness: Polars offers a comprehensive set of features for data processing that parallels those of Pandas or PySpark, including filtering, sorting, grouping, aggregation, and joining.

You can create a Polars DataFrame from a variety of data sources, such as CSV files, Parquet files, and Arrow buffers. Once you have a DataFrame, you can use Polars to perform a wide range of data processing tasks.

For drug discovery, I have used Polars on large data sets to analyze molecular interactions, predict pharmacokinetic properties, and optimize drug design. For example, I have used it to help calculate the binding energy of a molecule to its target across large data sets. This can be useful for identifying drug candidates that are likely to have high affinity for their targets.

In the examples below, which cover common Pandas use cases, I show the Polars code with the equivalent Pandas code as a comment underneath when it differs:

import polars as pl
import pandas as pd

# Load data about molecules from a CSV file
df = pl.read_csv('molecules.csv')
# df = pd.read_csv('molecules.csv')

Selecting columns:

# Select specific columns about a molecule
selected_df = df.select(['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors'])
# Equivalent Pandas code
# selected_df = df[['Id', 'SMILES', 'Mol_Wgt', 'LogP', 'NumHAcceptors', 'NumHDonors']]
selected_df.head()

Selecting rows:

# Filter the df
filtered_df = df.filter(pl.col('Mol_Wgt') < 500)

# Equivalent Pandas code using .loc
# filtered_df = df.loc[df['Mol_Wgt'] < 500]

Grouping data:

# Group on Protein and calculate the mean Mol_Wgt per Protein
grouped_df = df.groupby(by='Protein').agg(pl.col('Mol_Wgt').mean())

# Equivalent Pandas code
# grouped_df = df.groupby('Protein')[['Mol_Wgt']].mean()

Sorting rows by a column:

# Sort the rows by Mol_Wgt
sorted_df = df.sort(by='Mol_Wgt')

# Equivalent Pandas code
# sorted_df = df.sort_values('Mol_Wgt')

As you can see, the syntax is not that different, and thanks to its dataframe structure and expressive syntax, Polars works great for large datasets that can't fit into memory and need processing speed. With all of these considerations in mind, Polars is a great tool for any data scientist to add to their toolbox.
