Wednesday, January 1, 2025

Machine Learning Classification: Scikit-Learn, PyTorch, and TensorFlow Examples

Although machine learning examples for scikit-learn, PyTorch, and TensorFlow are easy to find separately, it is hard to find a side-by-side comparison of these frameworks on a standard dataset in one place. I think it is instructive to see the same variables from the same dataset used for the same prediction task, so that is what we're going to do in this post. We'll use a standard diabetes dataset to build basic classification models that use seven variables to predict whether someone is likely to be diabetic. The GitHub repository is here, the full notebook used to build the models is here, and the deployed Streamlit app can be accessed here.

The seven predictor variables are:

  • Number of pregnancies (all respondents in the dataset are female)
  • Glucose
  • Skin thickness
  • Age
  • BMI
  • Blood Pressure
  • Insulin

The dataset contains an "Outcome" column that indicates the presence or absence of diabetes. We'll build four separate models:

  1. scikit-learn (Random Forest)
  2. scikit-learn (Gradient Boosting)
  3. PyTorch (Neural Network)
  4. TensorFlow (Neural Network)

The models in this post are basic implementations that emphasize the mechanics rather than topics like data cleaning, optimization, or fine-tuning, although those are also important.

First let's read in the data and create the train/test splits. We'll use the same splits for all four models.

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('diabetes.csv')
X = df.drop("Outcome", axis=1)
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
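
Because `train_test_split` is called without a `test_size`, scikit-learn holds out 25% of the rows by default, which matches the 192 test rows in the reports further down. The sketch below mimics the idea of a seeded hold-out split with only the standard library. It is a simplified stand-in and not scikit-learn's actual shuffling algorithm; `holdout_split` is a hypothetical helper, and the 768-row dataset size is an assumption.

```python
import math
import random

def holdout_split(n_rows, test_fraction=0.25, seed=78):
    """Return (train_idx, test_idx) for a seeded, shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)        # same seed -> same shuffle order
    n_test = math.ceil(n_rows * test_fraction)  # size of the held-out test set
    return indices[n_test:], indices[:n_test]

# 768 rows is an assumed dataset size, used only for illustration.
train_idx, test_idx = holdout_split(768)
print(len(train_idx), len(test_idx))
```

Reusing the same seed is what lets all four models train and test on identical rows.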

Random Forest Classifier:

Create the random forest classifier instance:

from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=78)

Fit the model:

rf_model = rf_model.fit(X_train, y_train)

Make predictions using the testing data:

predictions = rf_model.predict(X_test)

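Save the model for the Streamlit app:

import pickle

filename = 'rf.sav'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(rf_model, f)

Create classification report:

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))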


              precision    recall  f1-score   support

           0       0.80      0.86      0.83       129
           1       0.66      0.56      0.60        63

    accuracy                           0.76       192
   macro avg       0.73      0.71      0.72       192
weighted avg       0.75      0.76      0.75       192
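
The per-class numbers in a report like this follow directly from the counts in a confusion matrix. As a minimal sketch, here is how precision, recall, F1, and accuracy are computed for the positive class. The counts are made up for illustration and are not the notebook's actual confusion matrix; `binary_report` is a hypothetical helper.

```python
def binary_report(tp, fp, fn, tn):
    """Compute positive-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)   # of the predicted positives, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Illustrative counts only (not taken from the models in this post)
metrics = binary_report(tp=30, fp=10, fn=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```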

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

As you can see, the scikit-learn implementation is straightforward and follows a "model, fit, predict" pattern. Gradient boosting follows the same pattern.

Gradient Boosting Classifier:

from sklearn.ensemble import GradientBoostingClassifier

# Create the gradient boosting classifier instance
gb_model = GradientBoostingClassifier(random_state=78)
# Fit the model
gb_model = gb_model.fit(X_train, y_train)
# Make predictions using the testing data
predictions = gb_model.predict(X_test)
# Save the model for the Streamlit app
filename = 'gb.sav'
with open(filename, 'wb') as f:
    pickle.dump(gb_model, f)
# Create classification report
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.79      0.87      0.83       129
           1       0.66      0.52      0.58        63

    accuracy                           0.76       192
   macro avg       0.72      0.70      0.71       192
weighted avg       0.75      0.76      0.75       192  

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

Neural Network (PyTorch):

For the PyTorch and TensorFlow neural networks, we will scale the data. Scaling isn't really necessary for the tree-based models, but it matters for the neural networks. We use the same number of layers and nodes per layer for both models: an input layer for the 7 features, 2 hidden layers (16 and 8 nodes with ReLU activation functions), and 1 output node with a sigmoid activation function. Each network trains for 100 epochs, and both use binary cross-entropy as the loss function and Adam as the optimizer. Each framework offers some flexibility in how binary cross-entropy is combined with the sigmoid function, but the way we set it up here is effectively the same in both.
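
To make the point about binary cross-entropy and the sigmoid concrete: the loss can be computed from a probability (sigmoid applied first, as we do in this post) or directly from the raw logit, which is what options like PyTorch's `nn.BCEWithLogitsLoss` or Keras's `from_logits=True` are for. The standard-library sketch below shows that the two formulations give the same value. The helper names are hypothetical, and this illustrates the math rather than either framework's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_probability(p, y):
    """Binary cross-entropy when the model already outputs a probability."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    """Same loss computed from the raw logit, in a numerically stable form."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z in (-3.0, -0.5, 0.0, 2.0):
    for y in (0, 1):
        a = bce_from_probability(sigmoid(z), y)
        b = bce_from_logit(z, y)
        print(f"z={z:+.1f} y={y}  prob form={a:.6f}  logit form={b:.6f}")
```

The logit form is usually preferred for numerical stability with extreme outputs, but for a small example like this one either choice trains the same model.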

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.preprocessing import StandardScaler

Standardize the features:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
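
Note that the scaler is fit on the training data only and then reused on the test data. A minimal standard-library sketch of what that means for a single column is below. The values are made up, and `fit_scaler` and `transform` are hypothetical helpers that mirror the idea behind `StandardScaler` (subtract the training mean, divide by the training population standard deviation).

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize values using statistics learned from the training data."""
    return [(value - mean) / std for value in column]

train_glucose = [90.0, 110.0, 130.0, 150.0]  # made-up training values
test_glucose = [100.0, 160.0]                # made-up test values

mean, std = fit_scaler(train_glucose)
train_scaled = transform(train_glucose, mean, std)
test_scaled = transform(test_glucose, mean, std)  # reuses the training statistics
print(train_scaled)
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, and that is expected.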

Now we'll convert the data to PyTorch tensors:

X_train_scaled = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_scaled = torch.tensor(X_test_scaled, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.float32).unsqueeze(1)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.float32).unsqueeze(1)

Define the neural network model:

class DiabetesPTModel(nn.Module):
    def __init__(self):
        super(DiabetesPTModel, self).__init__()
        self.fc1 = nn.Linear(7, 16)
        self.fc2 = nn.Linear(16, 8)
        self.fc3 = nn.Linear(8, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x
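
As a quick sanity check on the size of this network, each fully connected layer has one weight per input-output pair plus one bias per output. The short sketch below counts the trainable parameters for the 7-16-8-1 architecture; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus biases (n_out) of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [7, 16, 8, 1]  # input features, two hidden layers, output node
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [128, 136, 9] 273
```

The total should agree with `sum(p.numel() for p in pt_model.parameters())` in PyTorch, and the same count applies to the TensorFlow model since the layers are identical.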

Initialize the model, loss function, and optimizer:

pt_model = DiabetesPTModel()
criterion = nn.BCELoss()
optimizer = optim.Adam(pt_model.parameters(), lr=0.001)

Train the model:

num_epochs = 100
for epoch in range(num_epochs):
    pt_model.train()
    optimizer.zero_grad()
    outputs = pt_model(X_train_scaled)
    loss = criterion(outputs, y_train_tensor)
    loss.backward()
    optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

Evaluate the model:

pt_model.eval()
with torch.no_grad():
    predictions = pt_model(X_test_scaled)
    predictions = predictions.round()
    accuracy = (predictions.eq(y_test_tensor).sum() / float(y_test_tensor.shape[0])).item()
    print(f'Accuracy: {accuracy:.4f}')

Save the model and scaler - we will need them for the Streamlit app:

torch.save(pt_model.state_dict(), 'diabetes_model_pt.pth')
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

Get predictions for test set for classification report:

with torch.no_grad():
    predictions = pt_model(X_test_scaled)
    predictions = predictions.round()

print(classification_report(y_test_tensor, predictions))

              precision    recall  f1-score   support

           0       0.79      0.88      0.83       129
           1       0.67      0.52      0.59        63

    accuracy                           0.76       192
   macro avg       0.73      0.70      0.71       192
weighted avg       0.75      0.76      0.75       192

cm = confusion_matrix(y_test_tensor, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

Neural Network (TensorFlow):

Define the neural network model:

import tensorflow as tf

tf_model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(7,)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Compile the model:

tf_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Train the model. Use the same scaled data used with the PyTorch model:

tf_model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, validation_split=0.2)

Evaluate the model:

loss, accuracy = tf_model.evaluate(X_test_scaled, y_test)
print(f'Accuracy: {accuracy:.4f}')

Save the model for the Streamlit app:

tf_model.save('diabetes_model_tf.h5')

Get predictions for test set for classification report:

predictions = tf_model.predict(X_test_scaled).round()
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.83      0.82       129
           1       0.63      0.60      0.62        63

    accuracy                           0.76       192
   macro avg       0.72      0.72      0.72       192
weighted avg       0.75      0.76      0.75       192

cm = confusion_matrix(y_test, predictions)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

Model Comparison:

These examples show basic implementations of scikit-learn, PyTorch, and TensorFlow models. They are not meant to demonstrate the huge number of options that are available, such as fine-tuning, data loading in PyTorch, or the properties of tensors in general. We could also have computed loss curves and adjusted for class imbalance. But from what we did do, we can see the fundamental structure of each of the models.

In comparing the output, we can see that the confusion matrices are very similar to each other, which is most likely a function of the data itself or its small size. On other datasets, the performance of these models can differ significantly.

Before moving on to deploy these models to a Streamlit app, we can look at one more characteristic of the models: which variables are most important to the model? We'll answer that for one of the models, the random forest.

We can get and display the most important features like this:

import matplotlib.pyplot as plt

importances = rf_model.feature_importances_
importances_sorted = sorted(zip(importances, X.columns), reverse=True)

# Plot the feature importances
features = sorted(zip(X.columns, importances), key = lambda x: x[1])
cols = [f[0] for f in features]
width = [f[1] for f in features]

fig, ax = plt.subplots()

fig.set_size_inches(8,6)
plt.margins(y=0.001)

ax.barh(y=cols, width=width)

plt.show()



As the chart shows, glucose, BMI, and age are the most important variables, at least for the random forest model.

Streamlit Application:

For the Streamlit app, we let the user enter values for any of the seven predictor variables, and we use default values for any they don't change. They can then select any of the four models we built (and saved) and get a prediction. Very importantly, for the two neural network models, which were trained on scaled data, we load the saved scaler and apply it to the user's inputs before predicting.
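
The input handling described above can be sketched without Streamlit itself. Everything in this sketch is an assumption made for illustration: the column names and their order, the default values, the model labels, and the scaler statistics are all hypothetical, and `build_model_input` is not a function from the actual app. The point is only the flow: fill in defaults, keep the training column order, and scale for the neural networks only.

```python
# Assumed column names and order; they must match the columns used in training.
FEATURE_ORDER = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "Age"]
# Hypothetical defaults shown when the user leaves a field unchanged.
DEFAULTS = {"Pregnancies": 1, "Glucose": 120, "BloodPressure": 70,
            "SkinThickness": 20, "Insulin": 80, "BMI": 30.0, "Age": 35}
# Hypothetical labels for the models that were trained on scaled data.
SCALED_MODELS = {"PyTorch", "TensorFlow"}

def build_model_input(user_values, model_name, means, stds):
    """Merge user input with defaults and scale it if the model needs scaling."""
    row = [float(user_values.get(name, DEFAULTS[name])) for name in FEATURE_ORDER]
    if model_name in SCALED_MODELS:
        row = [(v - m) / s for v, m, s in zip(row, means, stds)]
    return row

# Made-up scaler statistics; the real app would load the saved scaler instead.
means = [float(DEFAULTS[name]) for name in FEATURE_ORDER]
stds = [1.0] * len(FEATURE_ORDER)

raw_row = build_model_input({"Glucose": 150}, "Random Forest", means, stds)
scaled_row = build_model_input({"Glucose": 150}, "PyTorch", means, stds)
print(raw_row)
print(scaled_row)
```

Skipping the scaling step for the neural networks would feed them inputs on a completely different scale from what they saw in training, which is why the scaler is saved alongside the models.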

And that's it! We have a deployed Streamlit app for the four models.


