Here’s a step-by-step explanation of how Random Forest works:
Step 1: Understand the Data
You need a dataset with:
- Features (X): The input variables or predictors.
- Target (Y): The output variable or label you want to predict.
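As a concrete starting point, here is a minimal sketch using scikit-learn's bundled iris dataset (any feature matrix `X` and label vector `y` of your own would work the same way); the held-out test set is used again in Step 5:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Features (X): a 2-D array of shape (n_samples, n_features).
# Target (y): a 1-D array of labels, one per sample.
X, y = load_iris(return_X_y=True)

# Hold out a test set so Step 5 has unseen data to evaluate on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```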
Step 2: Create Bootstrap Samples
Random Forest builds multiple decision trees using different subsets of the training data. These subsets are created by bootstrapping:
- Bootstrap Sampling: Randomly sample from the dataset with replacement to create multiple training subsets. Each subset is the same size as the original dataset but may contain duplicate records.
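To see bootstrapping in isolation, here is a minimal NumPy sketch (the helper name `bootstrap_sample` is illustrative; libraries like scikit-learn handle this step internally):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def bootstrap_sample(X, y):
    """Draw n rows with replacement, where n = len(X)."""
    n = len(X)
    # Sampling with replacement means some rows appear multiple
    # times in the subset while others are left out entirely.
    idx = rng.integers(0, n, size=n)
    return X[idx], y[idx]
```

Because sampling is with replacement, each bootstrap sample leaves out roughly 37% of the original rows on average; these "out-of-bag" rows can later serve as a free validation set.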
Step 3: Build Decision Trees
For each bootstrap sample:
- Train a Decision Tree: Construct a decision tree using the bootstrap sample.
- Feature Randomness: When splitting nodes in each decision tree, randomly select a subset of features rather than considering all features. This helps to ensure that the trees are diverse and reduces correlation between them.
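Putting Steps 2 and 3 together, here is a simplified sketch of forest construction, reusing the `bootstrap_sample` helper from above (scikit-learn's `RandomForestClassifier` performs all of this internally, so this is purely illustrative):

```python
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100):
    """Train n_trees decision trees, each on its own bootstrap sample."""
    trees = []
    for _ in range(n_trees):
        X_boot, y_boot = bootstrap_sample(X, y)   # Step 2: bootstrap sample
        # max_features="sqrt" restricts each split to a random subset
        # of about sqrt(n_features) candidate features (Step 3).
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X_boot, y_boot)
        trees.append(tree)
    return trees
```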
Step 4: Aggregate the Trees
Once all the trees are built:
- For Classification: Each tree in the forest votes for a class label. The class with the majority vote across all trees is the final prediction (a code sketch of both aggregation rules follows the examples below).
Example:
- Tree 1: Class A
- Tree 2: Class B
- Tree 3: Class A
- Majority vote: Class A
- For Regression: The prediction is the average of all the trees' predictions.
Example:
- Tree 1 predicts 3.0
- Tree 2 predicts 3.5
- Tree 3 predicts 2.8
- Average prediction: (3.0 + 3.5 + 2.8) / 3 = 3.1
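As promised, a minimal sketch of both aggregation rules, assuming `trees` is a list of fitted scikit-learn trees such as the one returned by `build_forest` above (`predict_average` assumes regression trees, e.g. `DecisionTreeRegressor`):

```python
import numpy as np

def predict_majority(trees, X):
    """Classification: each tree votes; the most common label wins."""
    votes = np.array([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
    final = []
    for sample_votes in votes.T:                     # one column per sample
        labels, counts = np.unique(sample_votes, return_counts=True)
        final.append(labels[counts.argmax()])        # majority label
    return np.array(final)

def predict_average(trees, X):
    """Regression: the forest's prediction is the mean over all trees."""
    preds = np.array([t.predict(X) for t in trees])
    return preds.mean(axis=0)
```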
Step 5: Evaluate the Model
Evaluate the Random Forest model using metrics such as:
- Accuracy: For classification, the proportion of correctly classified samples.
- Confusion Matrix: For classification, details of true positives, true negatives, false positives, and false negatives.
- Mean Absolute Error (MAE): For regression, the average absolute error between predicted and actual values.
- R-squared (R²): For regression, the proportion of variance in the dependent variable that is predictable from the independent variables.
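With scikit-learn, these metrics are one-liners; a sketch continuing from the train/test split in Step 1:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))

# For a regression forest (RandomForestRegressor), the analogous
# calls are mean_absolute_error(y_test, y_pred) and
# r2_score(y_test, y_pred), both from sklearn.metrics.
```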
Step 6: Tune Hyperparameters (Optional)
Fine-tune the performance of the Random Forest by adjusting hyperparameters:
- Number of Trees (n_estimators): The number of decision trees in the forest. More trees generally make predictions more stable, with diminishing returns, at the cost of extra computation time.
- Maximum Depth (max_depth): The maximum depth of each tree. Limiting depth can prevent overfitting.
- Minimum Samples Split (min_samples_split): The minimum number of samples required to split an internal node.
- Minimum Samples Leaf (min_samples_leaf): The minimum number of samples required to be at a leaf node.
- Number of Features (max_features): The number of features to consider when looking for the best split.
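A common way to tune these is a cross-validated grid search; here is a sketch with an illustrative (not prescriptive) grid, again assuming `X_train` and `y_train` from Step 1:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values to try; adjust to your dataset and time budget.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                  # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
```

For larger grids, scikit-learn's RandomizedSearchCV is usually a cheaper alternative, since it samples a fixed number of configurations instead of trying them all.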