The main idea of random forest is as follows. First, draw N training subsets from the original sample by sampling with replacement (the same sampling as in bagging; because bootstrap sampling is used, no cross-validation is needed) and use them to build N decision trees. When constructing each tree, at each node m attributes are randomly selected and the best split among those m is chosen, rather than considering all attributes. These decision trees together form the forest, and the trees are mutually independent. Once the forest is built, each new sample is passed to every decision tree in the forest, each tree votes for the class it predicts, and the class receiving the most votes is taken as the prediction for that sample.
A random forest is composed of many random decision trees. It is a bagging-style ensemble algorithm: the base learners are classification decision trees that are trained in parallel without interfering with one another. Random forests involve two sources of randomness: (1) the training samples are chosen at random, i.e. N training subsets are randomly drawn from the original training set to build N decision trees; (2) the attributes used by each decision tree are chosen at random, i.e. m attributes are randomly selected while constructing the tree for each sample subset.
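The training and voting procedure above can be sketched with scikit-learn (an assumed library choice; the original text names no implementation). `n_estimators` plays the role of N and `max_features` the role of m; note that scikit-learn draws the m random attributes at every split rather than once per tree:

```python
# Minimal random-forest sketch: N bootstrap samples -> N trees,
# m random attributes considered at each split, majority vote at predict time.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A small synthetic dataset stands in for the original training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # N: number of bootstrap subsets / decision trees
    max_features="sqrt",  # m: size of the random attribute subset per split
    bootstrap=True,       # sample the training set with replacement
    random_state=0,
)
forest.fit(X, y)

# A new sample is judged by every tree; the majority class is returned.
pred = forest.predict(X[:1])
```

Because the trees are independent given the data, scikit-learn can also train them in parallel via the `n_jobs` parameter.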
Advantages of random forests:
* Because the samples are chosen randomly (giving the model strong generalization ability and making it insensitive to missing values and outliers) and the attributes are chosen randomly (allowing it to handle high-dimensional data), overfitting is avoided;
* The trees are independent of one another, so training can be parallelized and the model trains quickly;
* The model can handle imbalanced data and balance out the error;
* After training, the features can be ranked so that the more important ones can be selected;
* Sampling in the random forest algorithm is bootstrap sampling, which yields an out-of-bag (OOB) set (some data points may never be selected), so neither cross-validation nor a separate held-out test set is needed to estimate test performance.

Disadvantages of random forests:
* When the data are very noisy, overfitting can still occur;
* Because of the two sources of randomness, the internal workings of the model are almost impossible to control; it behaves like a black box, and one can only experiment with different parameters and random seeds.
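Two of the advantages listed above, OOB evaluation and feature ranking, can be demonstrated concretely. A sketch with scikit-learn (again an assumed library choice): setting `oob_score=True` scores each tree on the bootstrap-excluded samples, and `feature_importances_` gives the per-feature ranking.

```python
# OOB accuracy as a built-in validation estimate, plus feature ranking.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # evaluate each tree on the samples it never saw
    random_state=0,
)
forest.fit(X, y)

# OOB accuracy: an internal test estimate, no separate test set required.
oob_accuracy = forest.oob_score_

# Impurity-based importances: one score per feature, summing to 1,
# usable for ranking and selecting the more important features.
ranking = sorted(enumerate(forest.feature_importances_),
                 key=lambda t: t[1], reverse=True)
```

Note that the impurity-based importances shown here are computed on the training data; for noisy datasets, permutation importance on held-out data is a common alternative.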