Tuesday, September 26, 2023

Top 10 Languages/Tools for Data Science in 2023

Let's talk top 10 languages/tools for data science in 2023. Obviously, this is going to be very opinionated, and everyone's list will differ depending on personal experience in specific industries. But I want to keep this as generalized as possible: what I think someone should know in 2023. I'm also keeping this list focused on tools and software used in data science, not concepts or algorithms - I think that would make another great list ("Top 10 Data Science Concepts"), and I'll try to do that in a later post.

So why 10? Why not 5 or 20, or even 11? I'm not a big fan of top 10 lists, because 10 is such an arbitrary cutoff, but we can blame top 10 lists on the fact that we have 10 fingers and that humans generally like round numbers. It has always fascinated me that if we had 8 fingers we would probably count in base 8 - how different would society be if we weren't base 10?

To go further down the rabbit hole: base-10 systems developed independently across different civilizations, but it was Indian mathematicians who perfected the positional decimal system, and along with Arab traders spread it throughout the known world starting in the 7th century, enabling advances in commerce and science.

But back to my top 10 list:

1) Python

You need to know Python in 2023. It is where the most cutting-edge development in AI is happening. You need to know it deeply, not just superficially, and you need to be able to write not just Jupyter Notebooks but standalone Python (.py) scripts. It is THE language for machine learning. But it doesn't just excel at machine learning; it excels at ETL and general process automation. It's also making inroads into front-end development with the advent of PyScript in the last year.

It routinely sits at the top of popularity lists and right now is #1 in the TIOBE index - this is due to its relative simplicity while still being able to accomplish complex tasks. As a general-purpose language it can be applied to tasks well beyond data science. Anyone with "Analyst" in their title, even if they aren't doing data science or machine learning, needs to know Python in 2023, if only to be able to automate parts of their job.
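To make the automation point concrete, here's a minimal sketch using only the standard library - the data and function name are made up for illustration, but this is the kind of repetitive spreadsheet-style chore any analyst can script in a few lines of Python:

```python
import csv
import io

# Made-up sales data; in practice this would come from a file or export.
raw = """region,amount
East,120
West,80
East,40
"""

def totals_by_region(csv_text):
    """Return a dict mapping region -> summed amount."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] = totals.get(row["region"], 0) + float(row["amount"])
    return totals

print(totals_by_region(raw))  # {'East': 160.0, 'West': 80.0}
```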

2) Pandas

Know the Python library Pandas. You won't be able to read through any machine learning code without knowing Pandas really well. Plus, knowing Pandas is like having a super power for data manipulation.
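A quick taste of that super power, with hypothetical sales data: filtering into a derived column and a groupby aggregation, two moves you'll see in almost every machine learning notebook:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B"],
    "units": [10, 5, 3, 8, 2],
    "price": [2.0, 4.0, 2.0, 1.5, 4.0],
})
df["revenue"] = df["units"] * df["price"]  # vectorized column arithmetic

# Total revenue per product, largest first
summary = df.groupby("product")["revenue"].sum().sort_values(ascending=False)
print(summary)
```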

3) Matplotlib

You need to be able to wield a charting library, and Matplotlib is the most popular. It is hard to find programs, articles, books, or examples that don't include some kind of Matplotlib chart. I don't find its syntax to be the cleanest, and there are other good libraries like Plotly and Seaborn, to name a few, but it's hard to get by without being able to use Matplotlib.
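Here's a minimal example of the object-oriented Matplotlib pattern you'll run into everywhere (the numbers are made up; the "Agg" backend just renders off-screen so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A basic Matplotlib line chart")
ax.legend()
fig.savefig("example_chart.png")  # write the chart to an image file
```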

4) SQL

You need to be able to get your data from somewhere, right? And most likely some form of SQL knowledge (whether that's Postgres, MS-SQL, MySQL, Oracle, etc.) is going to be required (more on NoSQL databases below).
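A self-contained sketch using Python's built-in sqlite3 module and an in-memory database - the table and values are invented, but the SELECT/GROUP BY pattern carries over directly to Postgres, MySQL, and the rest:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)

# Aggregate spend per customer, biggest spender first
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)  # [('alice', 37.5), ('bob', 12.5)]
conn.close()
```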

5) Scikit-learn

Scikit-learn should be your bread and butter for what you use to solve most business related problems. Its rich library of machine learning and statistical models can be applied to a wide variety of everyday data analytics. Plus, the documentation is superb.
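The reason it works as a bread-and-butter tool is the uniform fit/predict/score API shared across most of its estimators. A typical workflow on the bundled iris dataset, as a sketch:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a toy dataset, hold out a test split, fit, and score.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Swap `LogisticRegression` for almost any other scikit-learn estimator and the rest of the code stays the same.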

6) PyTorch

PyTorch has been moving up my list fast and has overtaken TensorFlow for me, as it has for many others. Its "pythonic" nature and object-oriented approach have real appeal. It's also arguably easier to debug, and the PyTorch Foundation does an excellent job of maintaining its ecosystem. But what is really driving its rise is that it powers some of the most cutting-edge research at some of the most important companies in AI, especially in large language modeling.

7) TensorFlow

I personally love TensorFlow, and you should know it for neural networks - I like its overall structure - but it's losing ground in my mind for some of the reasons outlined above.

8) Streamlit

I'm cheating here a little bit, because I could easily put Gradio here (or Dash, for that matter). But I think what's important is that data scientists need a good way to communicate through applications quickly.

9) Selenium

Being a data scientist is about being able to get data, and some of that data might not be available in your database or through an API, so being able to scrape it is a valuable skill.

10) MongoDB

I'm cheating again here, because what I think is important is familiarity with NoSQL databases in general - whether that's MongoDB or another one - and knowing their advantages (and disadvantages). Knowing MongoDB also shows that you are familiar with JSON-style (BSON, in this case) data and can parse data with a key/value structure. The skills you use with MongoDB are largely the same ones you need to work with API data and Python dictionaries, and being able to work with key/value data and dictionaries in Python is a fundamental skill you should have.
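Querying a live MongoDB needs a server, but the underlying skill - navigating nested key/value data - can be sketched with the standard library's json module on a made-up document, exactly as you would with a BSON document or an API response:

```python
import json

# A made-up document with nested objects and arrays
doc_json = """
{
  "user": "ada",
  "orders": [
    {"item": "book", "qty": 2},
    {"item": "pen", "qty": 10}
  ],
  "address": {"city": "London", "zip": "NW1"}
}
"""

doc = json.loads(doc_json)  # parses into nested dicts and lists
city = doc["address"]["city"]                       # drill into a nested key
total_qty = sum(order["qty"] for order in doc["orders"])  # iterate an array field
print(city, total_qty)  # London 12
```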

So that's it.

What's not on my list? Well, to mention a few: Julia, Mojo, R, Hadoop, Hive, JavaScript, C/C++, Rust, Scala, SAS, SPSS, Matlab, Java, NLP packages such as NLTK and spaCy, Excel, and VBA.

I have a lot of respect for Julia as a language, so it only barely misses the list. The same goes for Mojo - partly because Mojo was just released, and only for Linux. It still needs to be rolled out for Mac and Windows, but it may make the list next year.

I almost put JavaScript on the list, primarily because as one of the most popular languages it is hard to escape, and believing JavaScript is just a front-end language is wrong. JavaScript is everywhere - Node.js, stored procedures, etc. Plus, everyone should know some JS (along with some HTML).

I also didn't single out NumPy as a specific Python package like I did with Pandas. I just thought it was obvious that as a Python user you should know NumPy and that it is really important. Not listing it separately is an arbitrary decision.

I could have easily put one or both NLP packages of NLTK or spaCy on the list and maybe next time I will. 

As for C/C++ and Rust, I could make a strong case for one or both to be on the list for performance reasons. C/C++ is really important in the Python/machine learning ecosystem, and there are some great things happening with Rust, such as Polars. Rust and Python work really well together, and it's going to be interesting to see how the various performance initiatives in AI turn out, considering the current work to make Python faster, languages like Julia and Mojo, and interfacing directly with languages like C/C++ or Rust.

As for VBA, I think the writing is on the wall for its use in analytics, even though there's a lot of VBA code out there. One reason is that Microsoft's decision to include Python in Excel is going to supplant many of the VBA use cases. Plus, you can tell Microsoft has not been putting resources into it: the UI and the error messages are terrible. It looks like it was written in the '90s and seems like it hasn't improved, or even changed, in years.

Many people will be upset by this, but the trend for R is not good. Its raison d'être in the '90s was as a reaction to SAS and especially SPSS, with their onerous licensing. It became very popular as many people jumped into creating some pretty great statistical packages for it, especially in academic circles. It will continue to have its specialized niches, but it has little future in cutting-edge AI research. As a side anecdote: in the data science class I teach, we used to do a very basic introduction to R towards the end, with most of the course being Python and other tools. But the university has now decided to remove R after evaluating the market, so none of the students coming out of this program will have even a basic familiarity with it.

Excel is not on the list because everyone knows Excel and knowing Excel doesn't make you a data scientist.

I'm leaving off Big Data solutions for now, but if I were to put some on the list, they would be Spark-based ones or proprietary solutions from Snowflake, AWS, Google, Databricks, etc., and not MapReduce-based solutions that involve Hive or Hadoop.

I'm also leaving off proprietary machine learning solutions like AWS SageMaker for now.

I also didn't discuss editors or IDEs. I use VS Code, primarily because I can have one IDE for every language, but there are a lot of good choices, and if you use PyCharm or any of the others, that's fine. I don't think it matters as long as you are comfortable with it. Or, if you want to be really cool, use Vim.

So that's the list for 2023. I'll revisit in 2024 to see what has changed.
