A Simple Introduction to The Random Forest Method

Intuition, motivation, and how the Random Forest method works

1. Introduction

Related Fundamental Knowledge

Having some basic knowledge of the topics above (the decision tree and the bootstrap) will help in grasping the basic ideas behind Random Forest. We will discuss the bagging method first, since bagging, alongside the decision tree, is an important building block of Random Forest.

2. Bagging Method

Decision trees often suffer from high variance: if we randomly split the training data into two halves and fit a tree to each half, the two trees can turn out quite different, and their predictions on new data can disagree substantially; a deep tree also tends to overfit its half of the data. Bagging overcomes this problem by reducing the variance, which in turn helps to avoid overfitting.

As explained before for the bootstrap method, bagging differs only in the final "further inference" step, where the bootstrap estimates are averaged, as follows:

  • start from a sample of size n drawn from the population
  • create a set of B bootstrap samples, each of size n: draw a random sample of size n with replacement from the original sample, and repeat this B times, giving B bootstrap samples in total
  • compute the estimate of interest on each of the B bootstrap samples
  • further inference: average the B bootstrap estimates, i.e., sum them over the B bootstrap samples and divide by B

In other words, averaging over a set of B bootstrap estimates reduces the variance. Mathematically, we can write the above process as follows.
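A sketch of the idea in formulas, following the notation of [2]: if we could fit the model on B separate training sets and obtain predictions $\hat{f}^1(x), \hat{f}^2(x), \dots, \hat{f}^B(x)$, each with variance $\sigma^2$, we could average them,

$$\hat{f}_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x),$$

and, assuming the B estimates are roughly independent, the variance of this average is only $\sigma^2 / B$, i.e., averaging reduces the variance.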

This is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, by taking repeated samples from the (single) training data set. In this approach we generate B different bootstrapped training data sets, train the method on the b-th bootstrapped training set to get $\hat{f}^{*b}(x)$, and finally average all the predictions, to obtain
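$$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$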

This is called bagging [2].
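To make the procedure concrete, here is a minimal sketch in Python using scikit-learn's BaggingClassifier; the synthetic data set and all parameter values are invented for the example and are not from the original article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a training set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single deep tree (high variance) vs. an average of B = 200 bootstrapped trees
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=200, random_state=0)

print("single tree CV accuracy :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```

The bagged ensemble typically scores higher than the single tree, reflecting the variance reduction described above.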

2.1 Out-of-Bag Error Estimation

Out-of-Bag Error from Introduction to Statistical Learning

As we discussed, each bagged tree is fit on a bootstrap sample, which on average contains about two-thirds of the observations; the remaining one-third are the out-of-bag (OOB) observations for that tree. To predict a single observation, we therefore use only the trees for which it was out of bag, taking the majority vote in the classification case and the average in the regression case. The resulting error over all observations is the OOB error, a convenient estimate of the test error.
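For reference, scikit-learn exposes this estimate directly through the oob_score option; a minimal sketch on synthetic data (none of the values below come from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data set
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# oob_score=True evaluates every observation using only the trees
# for which that observation was out of bag
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0)
forest.fit(X, y)
print("OOB accuracy estimate:", forest.oob_score_)
```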

3. Random Forest Method

In the splitting process, a random forest is not even allowed to consider a majority of the available predictors. To see why this matters, suppose every split were allowed to consider one very strong predictor, as in plain bagging: most or all of the bagged trees would use that predictor near the top, so the trees would look quite similar to each other and their predictions would be highly correlated. Averaging highly correlated trees does not lead to as large a reduction in variance.

From the paragraphs above, we can see that a random forest works by

  • forcing each split to consider only a random subset of m of the p predictors
  • therefore, on average (p − m)/p of the splits will not even consider the strong predictor, so the other predictors get more of a chance

Using a small value of m in building a random forest (a common choice is m ≈ √p) will typically be helpful when we have a large number of correlated predictors.
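In scikit-learn, m corresponds to the max_features parameter; here is a small sketch comparing bagging (m = p) with a random forest (m ≈ √p) on synthetic data with redundant, correlated predictors (the data set and all numbers are invented for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data with informative plus redundant (correlated) predictors
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           n_redundant=20, random_state=0)

p = X.shape[1]
for m in (p, int(np.sqrt(p))):   # m = p is plain bagging; m = sqrt(p) is the usual forest choice
    model = RandomForestClassifier(n_estimators=300, max_features=m, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"m = {m:2d}: CV accuracy = {score:.3f}")
```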

Random Forest from Introduction to Statistical Learning

Results from random forests for the 15-class gene expression data set with p = 500 predictors. The test error is displayed as a function of the number of trees. Each colored line corresponds to a different value of m, the number of predictors available for splitting at each interior tree node. Random forests (m < p) lead to a slight improvement over bagging (m = p). A single classification tree has an error rate of 45.7% [2].

3.1 How Does Random Forest Work?

original data set

Step-by-step intuition for building a random forest:

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement):

In this simulation, we create 3 bootstrap data sets. The important detail is that we are allowed to select the same sample more than once.
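A minimal sketch of this sampling step; the small table below is a made-up stand-in for the article's data set, reusing its variable names:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Made-up stand-in for the original data set (variable names from the article)
data = pd.DataFrame({
    "sex":         [0, 1, 1, 0, 1, 0],
    "cholesterol": [230, 180, 260, 200, 290, 210],
    "Fbs":         [1, 0, 1, 0, 1, 0],
    "Exang":       [0, 0, 1, 1, 0, 1],
    "target":      [1, 0, 1, 0, 1, 0],
})

n = len(data)
bootstrap_sets = []
for b in range(3):                      # the simulation above builds 3 bootstrap data sets
    idx = rng.integers(0, n, size=n)    # sampling WITH replacement: rows may repeat
    bootstrap_sets.append(data.iloc[idx].reset_index(drop=True))

print(bootstrap_sets[0])                # some rows appear more than once, some not at all
```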

2. Grow a decision tree from the bootstrap sample :

  • Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain.

1st bootstrap data set: for the root, we randomly select 2 variables as candidates, cholesterol and sex. For instance, suppose cholesterol gives the better split and becomes the root; sex, the candidate that was not chosen, is then excluded from this tree. The tree therefore uses cholesterol, Fbs, and Exang as its variables.

1st bootstrap data set

2nd bootstrap data set: for the root, we randomly select 2 variables as candidates, Fbs and sex. For instance, suppose sex gives the better split and becomes the root; Fbs, the candidate that was not chosen, is then excluded from this tree. The tree therefore uses sex, cholesterol, and Exang as its variables.

2nd bootstrap data set

3rd bootstrap data set: for the root, we randomly select 2 variables as candidates, Fbs and Exang. For instance, suppose Fbs gives the better split and becomes the root; Exang, the candidate that was not chosen, is then excluded from this tree. The tree therefore uses sex, cholesterol, and Fbs as its variables.

3rd bootstrap data set
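The random choice of split candidates illustrated in the three cases above can be sketched as follows, using the article's feature names (the random seed and the resulting picks are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
features = ["sex", "cholesterol", "Fbs", "Exang"]

for b in range(3):
    # At each node, only a random subset of the features is
    # considered as split candidates (2 out of 4 here)
    candidates = rng.choice(features, size=2, replace=False)
    print(f"bootstrap set {b + 1}: root split candidates = {list(candidates)}")
```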

3. Repeat the steps 1 to 2 k times.

4. Aggregate the prediction by each tree to assign the class label by majority vote or average.

1st tree

2nd tree

3rd tree

Prediction: in this case, the final result is obtained by majority vote over the three trees.
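The vote itself can be sketched in a couple of lines; the three per-tree predictions below are hypothetical, standing in for the outputs of the trees above:

```python
from collections import Counter

# Hypothetical predictions of the 3 trees for one new patient
tree_predictions = ["heart disease", "heart disease", "no heart disease"]

# Majority vote: the most common class label wins
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)   # -> heart disease
```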

The random forest algorithm can be summarized in four simple steps:

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).

2. Grow a decision tree from the bootstrap sample. At each node:

  • Randomly select d features without replacement.
  • Split the node using the feature that provides the best split according to the objective function, for instance, by maximizing the information gain.

3. Repeat the steps 1 to 2 k times.

4. Aggregate the prediction by each tree to assign the class label by majority vote or average.
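Putting the four steps together, here is a compact from-scratch sketch in Python; the random selection of d features per node is delegated to the decision tree's max_features option, and the data set, k, and d values are all invented for the example:

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data set
X, y = make_classification(n_samples=600, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k = 100                                  # step 3: number of trees to grow
n = len(X_train)
d = int(np.sqrt(X_train.shape[1]))       # step 2: d features considered at each node
rng = np.random.default_rng(0)
forest = []

for _ in range(k):
    idx = rng.integers(0, n, size=n)                   # step 1: bootstrap sample of size n
    tree = DecisionTreeClassifier(max_features=d)      # step 2: random d-feature subset per node
    tree.fit(X_train[idx], y_train[idx])
    forest.append(tree)

# Step 4: aggregate the predictions of all trees by majority vote
all_preds = np.stack([t.predict(X_test) for t in forest])
y_pred = np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])
print("test accuracy:", (y_pred == y_test).mean())
```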

Continue Learning — A Simple Introduction to The AdaBoost for Decision Tree


Reference

  1. Wikipedia, "Bootstrap aggregating"
  2. James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning
  3. Raschka, Sebastian, Python Machine Learning, 1st Edition
