Random Forest Classifier

Here, we discuss the random forest classifier with deep intuition and a Python implementation that covers its various hyperparameters.

Deep Patel
6 min read · Oct 23, 2021

Just as a real-life forest is made of thousands of trees, a random forest classifier is built by combining multiple decision trees (though usually not thousands).

Random forest is a supervised machine learning algorithm. The “forest” it builds is an ensemble of decision trees, usually trained with the “bagging” method.

First, we need to know what ensemble and bagging methods are.

Just like you might decide to buy a car by reading multiple reviews and opinions, in machine learning, too, you can combine the decisions from multiple models to improve overall performance. This technique of combining multiple machine learning models is called ensemble learning. Ensemble learning is one of the most effective ways to build an efficient machine learning model. You can build an ensemble from simple models and still get scores on par with resource-hungry models like neural networks.
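
To make the idea concrete, here is a small sketch of an ensemble written just for this post: three simple scikit-learn models are combined with a VotingClassifier and vote on each prediction. The dataset and the choice of models are only assumptions for illustration.

# A minimal ensemble sketch: three simple models vote on every prediction.
# The dataset and model choices are illustrative assumptions, not from this post.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(estimators=[
    ('lr', make_pipeline(StandardScaler(), LogisticRegression())),
    ('dt', DecisionTreeClassifier(random_state=42)),
    ('knn', make_pipeline(StandardScaler(), KNeighborsClassifier())),
])  # hard voting: the class predicted by the majority of the models wins
ensemble.fit(train_x, train_y)
print('Ensemble accuracy on test data :', ensemble.score(test_x, test_y))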

BAGGING

The idea behind bagging (or Bootstrap Aggregating) is to combine the results of multiple models (for instance, many decision trees) to get a more generalized result. Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of the subsets created for bagging may be smaller than the original set. The bagging technique uses these subsets (bags) to get a fair idea of the distribution of the complete set.
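
As a tiny illustration (the numbers are made up), bootstrapping is nothing more than drawing row indices with replacement:

import numpy as np

# Illustration only: 10 training rows, one bootstrap "bag" of 7 rows.
rng = np.random.default_rng(0)
row_indices = np.arange(10)                           # stand-in for 10 rows
bag = rng.choice(row_indices, size=7, replace=True)   # sampling WITH replacement
print(bag)  # some rows appear more than once, others are left out of this bag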

Bagging = Decision Tree + Row Sampling

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

In simple words, random forest builds multiple decision trees and merges them to get a more accurate and stable prediction.

FEATURE IMPORTANCE

Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature on the prediction. Sklearn provides a great tool for this that measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the sum of all importances is equal to one.

If you don’t know how a decision tree works or what a leaf or node is, then read my previous blog. I have already explained the complete working of the decision tree over there.

By looking at the feature importances, you can decide which features to drop because they don’t contribute enough (or sometimes nothing at all) to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting, and vice versa.
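
For example, after training, scikit-learn exposes these scores through the feature_importances_ attribute of the fitted model; the dataset below is just an assumption for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Fit a forest and inspect its impurity-based feature importances (they sum to 1).
data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(data.data, data.target)

ranked = sorted(zip(data.feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f'{name}: {score:.3f}')   # the five most important features
print('Sum of all importances :', forest.feature_importances_.sum())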

Looking at it step-by-step, this is what a random forest model does:

  1. Random subsets are created from the original dataset (bootstrapping).
  2. A decision tree is fitted on each of these subsets.
  3. At each node of a tree, only a random subset of features is considered to decide the best split.
  4. The final prediction is calculated by averaging (for regression) or taking the majority vote (for classification) of the predictions from all trees.

Note: The decision trees in a random forest can be built on subsets of both the data and the features. In particular, scikit-learn’s random forest gives every tree access to all features, but a random subset of features is selected for splitting at each node.
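
To see the four steps in code, here is a minimal from-scratch sketch built on scikit-learn’s DecisionTreeClassifier. The dataset, the number of trees, and the variable names are assumptions for illustration; this is not how the library implements it internally.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative dataset; any labelled classification data would do.
X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(42)
n_trees, n_rows = 25, train_x.shape[0]
trees = []
for _ in range(n_trees):
    # Step 1: bootstrap a random subset of rows (sampling with replacement).
    rows = rng.choice(n_rows, size=n_rows, replace=True)
    # Steps 2-3: fit a tree on the subset, letting it consider only a random
    # subset of features at each split (max_features='sqrt').
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=int(rng.integers(1 << 31)))
    tree.fit(train_x[rows], train_y[rows])
    trees.append(tree)

# Step 4: aggregate the trees' predictions by majority vote.
votes = np.stack([tree.predict(test_x) for tree in trees]).astype(int)
y_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print('Hand-rolled forest accuracy :', (y_pred == test_y).mean())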

IMPORTANT HYPERPARAMETERS

The hyperparameters in the random forest are used either to increase the predictive power of the model or to make the model faster. Let’s look at the hyperparameters of scikit-learn’s built-in random forest function.

  1. Increasing Predictive power

n_estimators:

  • It defines the number of decision trees to be created in a random forest.
  • Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher training time.

max_features:

  • It defines the maximum number of features considered when looking for the best split at each node.
  • Increasing max_features usually improves performance, but a very high value can decrease the diversity among the trees.

min_samples_leaf:

  • This defines the minimum number of samples required to be at a leaf node.
  • A smaller leaf size makes the model more prone to capturing noise in the training data.

criterion:

  • It defines the function used to measure the quality of a split; common options in scikit-learn are 'gini' and 'entropy'.
  • The split that scores best under this criterion is chosen at each node.

  2. Increasing the model’s speed

max_depth:

  • The random forest has multiple decision trees. This parameter defines the maximum depth of the trees.

min_samples_split:

  • It defines the minimum number of samples a node must contain before a split is attempted.
  • If a node has fewer samples than this, it is not split.

max_leaf_nodes:

  • This parameter specifies the maximum number of leaf nodes for each tree.
  • The tree stops splitting when the number of leaf nodes reaches max_leaf_nodes.

n_jobs:

  • This indicates the number of jobs to run in parallel.
  • Set the value to -1 if you want it to run on all cores in the system.

random_state:

  • This parameter controls the randomness of the bootstrapping and feature selection, making the results reproducible.
  • Fixing it allows a fair comparison between different runs and models.
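
Putting the list together, here is a hedged sketch of how these hyperparameters are passed to scikit-learn’s RandomForestClassifier; the values are placeholders, not recommendations.

from sklearn.ensemble import RandomForestClassifier

# The values below are placeholders, not recommendations; tune them on your data.
model = RandomForestClassifier(
    n_estimators=200,       # number of decision trees in the forest
    max_features='sqrt',    # features considered at each split
    min_samples_leaf=2,     # minimum samples allowed in a leaf node
    criterion='gini',       # split-quality measure ('entropy' is another option)
    max_depth=10,           # maximum depth of each tree
    min_samples_split=5,    # samples a node needs before a split is attempted
    max_leaf_nodes=None,    # no cap on leaf nodes; set an integer to limit tree size
    n_jobs=-1,              # run training on all CPU cores
    random_state=42,        # fix the randomness for reproducible results
)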

Now, let’s look at the code using Scikit learn.

# importing required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# The original snippet assumes the train/test data are already loaded;
# here the breast cancer dataset is loaded and split so the example runs end to end.
X, y = load_breast_cancer(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=42)

# Create the object of the Random Forest model.
# You can also add other parameters and test your code here,
# e.g. n_estimators and max_depth.
model = RandomForestClassifier()

# fit the model with the training data
model.fit(train_x, train_y)

# number of trees used
print('Number of Trees used : ', model.n_estimators)

# predict the target on the train dataset
predict_train = model.predict(train_x)
print('\nTarget on train data', predict_train)

# Accuracy score on the train dataset
accuracy_train = accuracy_score(train_y, predict_train)
print('\naccuracy_score on train dataset : ', accuracy_train)

# predict the target on the test dataset
predict_test = model.predict(test_x)
print('\nTarget on test data', predict_test)

# Accuracy score on the test dataset
accuracy_test = accuracy_score(test_y, predict_test)
print('\naccuracy_score on test dataset : ', accuracy_test)

ADVANTAGES AND DISADVANTAGES OF THE RANDOM FOREST ALGORITHM

One of the biggest advantages of random forest is its versatility. It can be used for both regression and classification tasks, and it’s also easy to view the relative importance it assigns to the input features.

Random forest is also a very handy algorithm because the default hyperparameters it uses often produce a good prediction result. Understanding the hyperparameters is pretty straightforward, and there are also not that many of them.

One of the biggest problems in machine learning is overfitting, but most of the time the random forest classifier avoids it. If there are enough trees in the forest, averaging over them keeps the model from overfitting the way a single deep tree can.

The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model. In most real-world applications, the random forest algorithm is fast enough but there can certainly be situations where run-time performance is important and other approaches would be preferred.

And, of course, random forest is a predictive modeling tool and not a descriptive tool, meaning if you’re looking for a description of the relationships in your data, other approaches would be better.

SUMMARY

Random forest is a great algorithm to train early in the model development process, to see how it performs. The algorithm is also a great choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features.

Random forests are also very hard to beat performance-wise. Of course, you can probably always find a model that performs better, a neural network for example, but such models usually take more time to develop, though they can handle a lot of different feature types, like binary, categorical, and numerical.

Overall, random forest is a (mostly) fast, simple and flexible tool, but not without some limitations.

Thank you for reading my blog. Hope you understood it well. For more such blogs, make sure that you follow me. Cheers! Have a great day.

Deep Patel

Learning and exploring this beautiful world with amazing tech.