- For classification: A majority vote among the trees determines the class. In scikit-learn, this is provided by the RandomForestClassifier class.
- For regression: The average of the trees’ predictions is taken. In scikit-learn, this is provided by the RandomForestRegressor class.
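As a quick illustration of the regression variant, here is a minimal sketch using RandomForestRegressor on a small made-up numeric target (the feature values and target below are purely illustrative):

from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Tiny illustrative regression dataset (made-up values)
df = pd.DataFrame({
    'Feature1': [2, 3, 4, 5, 6, 7, 8, 9],
    'Feature2': [10, 15, 10, 20, 15, 25, 10, 20],
    'Target':   [1.5, 2.0, 2.2, 3.1, 2.8, 4.0, 3.5, 4.4]
})

reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(df[['Feature1', 'Feature2']], df['Target'])

# The forest's prediction is the average of the individual trees' predictions
print(reg.predict(pd.DataFrame({'Feature1': [5], 'Feature2': [15]})))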
- Numerical Features (Continuous and Discrete): Random Forest is naturally suited to numerical features since decision trees split on thresholds. Examples include age, salary, temperature, and stock prices.
- Categorical Features (Low to Medium Cardinality): Random Forest works well when categories are few and meaningful, such as “Gender” or “Day of the Week” (Monday-Sunday). If a categorical variable has high cardinality (e.g., ZIP codes, product IDs), proper encoding is necessary.
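Since scikit-learn's tree models expect numeric input, categorical columns are typically encoded before training. Below is a minimal sketch of one-hot encoding a low-cardinality column with pandas; the column names are illustrative, not from the example dataset:

import pandas as pd

# Illustrative frame with one low-cardinality categorical column
df = pd.DataFrame({
    'day_of_week': ['Mon', 'Tue', 'Mon', 'Sun'],
    'amount': [12.0, 7.5, 3.2, 9.9]
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=['day_of_week'])
print(encoded.columns.tolist())  # 'amount' plus one indicator column per day present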
How Does a Random Forest Work?
Random Forest has three main characteristics (a short scikit-learn sketch mapping each one to a parameter follows this list):
- Bootstrapping (Bagging): Random subsets of the training data are selected with replacement to build each individual tree. This process, known as bootstrapping, ensures that every tree sees a slightly different version of the dataset, which increases the diversity among the trees and helps reduce the variance of the final model.
- Random Feature Selection: At each split, instead of considering every feature, a random subset of features is chosen, and the best split is determined only from this subset. This prevents any single strong predictor from dominating the model and increases overall robustness.
- Tree Aggregation (Voting or Averaging): Once all trees are built, their individual predictions are aggregated. For a classification task, the class that receives the most votes is chosen; in regression, the mean prediction is used.
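These three characteristics map directly onto constructor parameters of scikit-learn's RandomForestClassifier. The snippet below is a minimal sketch; the specific values are illustrative, not tuned recommendations:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # number of trees whose predictions are aggregated
    bootstrap=True,       # each tree is trained on a bootstrap sample (bagging)
    max_features='sqrt',  # random subset of features considered at each split
    random_state=42       # for reproducibility
)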
The Math Behind Random Forests
The most important mathematical concept behind Random Forests is how the individual decision trees decide where to split.
Gini Impurity
A common metric used to evaluate splits in a decision tree is the Gini impurity. It measures how often a randomly chosen element from the node would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the node. The Gini impurity is given by: \( G = 1 - \sum_{i=1}^{C} p_i^2 \) Where:
- \( C \) is the number of classes.
- \( p_i \) is the proportion of samples of class \( i \) in the node: \( p_i = \frac{\text{Number of samples in class } i}{\text{Total samples in the node}} \)
For example, suppose a node contains 10 samples:
- 6 samples belong to Class 1: \( p_1 = 0.6 \)
- 4 samples belong to Class 2: \( p_2 = 0.4 \)
The Gini impurity is \( G = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48 \).
- If a node contains only one class, then the probability for that class is 1 and for all other classes is 0.
- Example: Suppose a node contains only class “A,” meaning \( p_A = 1 \) and all other \( p_i = 0 \). The Gini impurity is then \( G = 1 - 1^2 = 0 \), the lowest possible value, indicating a perfectly pure node.
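As a quick sanity check, here is a small helper function (not part of scikit-learn, just a hypothetical illustration) that computes the Gini impurity from class counts and reproduces both examples above:

def gini_impurity(class_counts):
    """Gini impurity G = 1 - sum(p_i^2) for the class counts in a node."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini_impurity([6, 4]))   # 0.48 -> mixed node
print(gini_impurity([10, 0]))  # 0.0  -> perfectly pure node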
Random Forest: An Example
Let’s put these ideas together and walk through a simplified example. Imagine a small dataset with two features and a binary label:

\[ \begin{array}{|c|c|c|c|} \hline \textbf{Sample} & \textbf{Feature1 (X)} & \textbf{Feature2 (Y)} & \textbf{Label} \\ \hline 1 & 2 & 10 & 0 \\ 2 & 3 & 15 & 0 \\ 3 & 4 & 10 & 1 \\ 4 & 5 & 20 & 1 \\ 5 & 6 & 15 & 0 \\ 6 & 7 & 25 & 1 \\ 7 & 8 & 10 & 0 \\ 8 & 9 & 20 & 1 \\ \hline \end{array} \]

Step 1: Bootstrapping
Randomly sample the dataset with replacement to create a bootstrap sample. For example, one such sample might draw the indices [2, 3, 3, 5, 7, 8, 8, 1], giving the bootstrap sample:

\[ \begin{array}{|c|c|c|c|} \hline \textbf{Sample} & \textbf{Feature1 (X)} & \textbf{Feature2 (Y)} & \textbf{Label} \\ \hline 2 & 3 & 15 & 0 \\ 3 & 4 & 10 & 1 \\ 3 & 4 & 10 & 1 \\ 5 & 6 & 15 & 0 \\ 7 & 8 & 10 & 0 \\ 8 & 9 & 20 & 1 \\ 8 & 9 & 20 & 1 \\ 1 & 2 & 10 & 0 \\ \hline \end{array} \]

Notice that some samples appear multiple times (e.g., Samples 3 and 8), while some samples from the original dataset (e.g., Samples 4 and 6) don’t appear at all.
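In code, drawing a bootstrap sample is a single sampling-with-replacement call. Here is a minimal sketch with NumPy and pandas; the drawn indices depend on the random seed and will generally differ from the hand-picked ones above:

import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Feature1': [2, 3, 4, 5, 6, 7, 8, 9],
    'Feature2': [10, 15, 10, 20, 15, 25, 10, 20],
    'Label':    [0, 0, 1, 1, 0, 1, 0, 1]
}, index=range(1, 9))  # index matches the Sample column above

rng = np.random.default_rng(42)
boot_idx = rng.choice(data.index, size=len(data), replace=True)  # sample with replacement
bootstrap_sample = data.loc[boot_idx]
print(bootstrap_sample)  # some rows repeated, some original rows missing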
Step 2: Building a Single Decision Tree
For the bootstrap sample, a decision tree is built by considering various candidate splits. Suppose we first consider a split on Feature1 at X = 5:
- Left Node (X ≤ 5): Contains samples 2, 3, 3, 1. Corresponding data: \[ \begin{array}{|c|c|c|c|} \hline \textbf{Sample} & \textbf{Feature1 (X)} & \textbf{Feature2 (Y)} & \textbf{Label} \\ \hline 2 & 3 & 15 & 0 \\ 3 & 4 & 10 & 1 \\ 3 & 4 & 10 & 1 \\ 1 & 2 & 10 & 0 \\ \hline \end{array} \] Labels in the left node: {0, 1, 1, 0}. Calculate the Gini impurity for this node: \( \text{Gini} = 1 - \left(P(0)^2 + P(1)^2\right) \)
- Total samples = 4
- Class 0 count = 2, so \( P(0) = 0.5 \)
- Class 1 count = 2, so \( P(1) = 0.5 \)
- \( \text{Gini}_{\text{left}} = 1 - (0.5^2 + 0.5^2) = 0.5 \)
- Right Node (X > 5): Contains samples 5, 7, 8, 8. Corresponding data: \[ \begin{array}{|c|c|c|c|} \hline \textbf{Sample} & \textbf{Feature1 (X)} & \textbf{Feature2 (Y)} & \textbf{Label} \\ \hline 5 & 6 & 15 & 0 \\ 7 & 8 & 10 & 0 \\ 8 & 9 & 20 & 1 \\ 8 & 9 & 20 & 1 \\ \hline \end{array} \] Labels in the right node: {0, 0, 1, 1}.
- Total samples = 4
- Class 0 count = 2, so \( P(0) = 0.5 \)
- Class 1 count = 2, so \( P(1) = 0.5 \)
- \( \text{Gini}_{\text{right}} = 1 - (0.5^2 + 0.5^2) = 0.5 \)
The weighted Gini impurity of this split is \( \frac{4}{8} \times 0.5 + \frac{4}{8} \times 0.5 = 0.5 \), so this particular split does not reduce impurity; in practice the tree compares it against other candidate splits (different thresholds, or splits on Feature2) and keeps the one with the lowest weighted impurity.
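To make the split evaluation concrete, here is a small self-contained sketch (label arrays copied from the tables above) that scores the X ≤ 5 split the same way a tree would, by weighting each child node's Gini impurity by its share of samples:

import numpy as np

# Labels of the bootstrap sample after splitting on Feature1 at X = 5 (from the tables above)
left_labels  = np.array([0, 1, 1, 0])  # samples 2, 3, 3, 1 (X <= 5)
right_labels = np.array([0, 0, 1, 1])  # samples 5, 7, 8, 8 (X > 5)

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

g_left, g_right = gini(left_labels), gini(right_labels)
n_left, n_right = len(left_labels), len(right_labels)
weighted = (n_left * g_left + n_right * g_right) / (n_left + n_right)
print(g_left, g_right, weighted)  # 0.5 0.5 0.5 for this particular split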
Step 3: Aggregation of Trees
In a Random Forest model, this process is repeated multiple times with different bootstrap samples. After multiple decision trees are created:
- Classification: Each tree independently votes for a class label, with majority voting deciding the final prediction. For example, if 5 trees predict class “1” and 2 trees predict class “0,” the final prediction is class 1. If there’s a tie, the class with the lower numerical label (e.g., 0) might be chosen by convention, or another tie-breaking rule applied.
- Regression: The average of all tree predictions is taken.
Suppose a forest of 5 trees casts the following votes for a new sample:
- Class 0: 2 votes
- Class 1: 3 votes
Class 1 wins the majority vote, so the forest predicts class 1.
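A minimal sketch of the aggregation step itself, assuming we already have each tree's prediction collected in a plain Python list (the votes and values below are hypothetical):

from collections import Counter

# Classification: majority vote over the per-tree predictions
tree_votes = [1, 0, 1, 1, 0]                              # hypothetical votes from 5 trees
vote_counts = Counter(tree_votes)                         # Counter({1: 3, 0: 2})
print("Final class:", vote_counts.most_common(1)[0][0])   # 1

# Regression: simply average the per-tree predictions
tree_predictions = [10.2, 11.0, 9.8, 10.5, 10.1]          # hypothetical numeric outputs
print("Final value:", sum(tree_predictions) / len(tree_predictions))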
Python Example Using scikit-learn
Now that we have covered the concepts, let's put this example into Python code and train a Random Forest classifier on our small dataset:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Define the dataset
data = pd.DataFrame({
    'Feature1': [2, 3, 4, 5, 6, 7, 8, 9],
    'Feature2': [10, 15, 10, 20, 15, 25, 10, 20],
    'Label': [0, 0, 1, 1, 0, 1, 0, 1]
})

X = data[['Feature1', 'Feature2']]
y = data['Label']

# Train a Random Forest with 5 trees
clf = RandomForestClassifier(n_estimators=5, random_state=42)
clf.fit(X, y)

# Predict for a new sample
new_sample = [[5, 15]]
prediction = clf.predict(new_sample)
print("Predicted Label:", prediction[0])
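To connect this back to Step 3, you can inspect how the five individual trees vote on the new sample through the fitted forest's estimators_ attribute. The sketch below continues from the clf and new_sample defined above; the sub-estimators are trained on label-encoded targets, so their outputs are mapped back through clf.classes_ (with 0/1 labels this mapping is the identity):

# Continues from the snippet above (clf fitted, new_sample defined)
import numpy as np

votes = [clf.classes_[int(tree.predict(new_sample)[0])] for tree in clf.estimators_]
print("Per-tree votes:", votes)                    # a list of five 0/1 labels
print("Majority vote:", np.bincount(votes).argmax())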