Random Forests are a general class of ensemble methods that use decision trees as the base classifier. The Random Forest classifier is a variation of the Bagging (Bootstrap Aggregating) classifier. Bagging generates a set of weak individual classifiers using bootstrap sampling: each classifier is trained on a random resample of the training set drawn with replacement, so many of the original examples may appear several times in each classifier's training sample, while others may not appear at all.
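As a concrete illustration of the bootstrap sampling step, the following standard-library C++ sketch draws a sample of the same size as the training set, with replacement. The function name is ours, for illustration only; it is not part of OpenCV:

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw a bootstrap sample: n indices chosen uniformly with replacement,
// so some training examples are repeated and others are left out.
std::vector<std::size_t> bootstrapSample(std::size_t n, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<std::size_t> sample(n);
    for (std::size_t i = 0; i < n; ++i)
        sample[i] = pick(rng);
    return sample;
}
```

Each tree in the forest is then grown on one such sample of indices into the training set.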
The principal difference between Bagging and Random Forest is that Bagging considers all the features at each tree node, while Random Forest selects a random subset of the features. A suitable number of randomized features is the square root of the total number of features. For prediction, a new sample is pushed down each tree and assigned the class of the terminal (or leaf) node it reaches. This is repeated over all the trees, and the majority vote of all the tree predictions is taken as the final prediction. The following diagram shows the Random Forest algorithm:
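The voting step just described can be sketched with the standard library alone (the function name is an assumption for illustration, not an OpenCV API):

```cpp
#include <map>
#include <vector>

// Majority vote over the class labels predicted by the individual trees:
// the class collecting the most votes across the forest wins.
int majorityVote(const std::vector<int>& treePredictions) {
    std::map<int, int> votes;
    for (int c : treePredictions)
        ++votes[c];
    int best = treePredictions.front();
    int bestCount = 0;
    for (const auto& kv : votes) {
        if (kv.second > bestCount) {
            best = kv.first;
            bestCount = kv.second;
        }
    }
    return best;
}
```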
Random Forests are currently among the best classifiers available, in terms of both recognition power and efficiency. In our example, RFClassifier, we use the OpenCV Random Forest classifier together with the OpenCV CvMLData class. Machine learning problems typically handle large amounts of information, so it is convenient to store the dataset in a .csv file. The CvMLData class is used to load the training set information from such a file as follows:
//… (omitted for simplicity)
int main(int argc, char *argv[]){
    CvMLData mlData;
    mlData.read_csv("iris.csv");
    mlData.set_response_idx(4);

    //Select 75% samples as training set and 25% as test set
    CvTrainTestSplit cvtts(0.75f, true);
    //Split the iris dataset
    mlData.set_train_test_split(&cvtts);

    //Get training set
    Mat trainsindex = mlData.get_train_sample_idx();
    cout << "Number of samples in the training set:" << trainsindex.cols << endl;
    //Get test set
    Mat testindex = mlData.get_test_sample_idx();
    cout << "Number of samples in the test set:" << testindex.cols << endl;
    cout << endl;

    //Random Forest parameters
    CvRTParams params = CvRTParams(3, 1, 0, false, 2, 0, false, 0, 100, 0,
                                   CV_TERMCRIT_ITER | CV_TERMCRIT_EPS);

    CvRTrees classifierRF;
    //Training phase
    classifierRF.train(&mlData, params);
    std::vector<float> train_responses, test_responses;

    //Calculate train error
    cout << "Error on train samples:" << endl;
    cout << (float)classifierRF.calc_error(&mlData, CV_TRAIN_ERROR, &train_responses) << endl;
    //Print train responses
    cout << "Train responses:" << endl;
    for(int i = 0; i < (int)train_responses.size(); i++)
        cout << i+1 << ":" << (float)train_responses.at(i) << " ";
    cout << endl << endl;

    //Calculate test error
    cout << "Error on test samples:" << endl;
    cout << (float)classifierRF.calc_error(&mlData, CV_TEST_ERROR, &test_responses) << endl;
    //Print test responses
    cout << "Test responses:" << endl;
    for(int i = 0; i < (int)test_responses.size(); i++)
        cout << i+1 << ":" << (float)test_responses.at(i) << " ";
    cout << endl << endl;
    return 0;
}
The dataset has been provided by the UC Irvine Machine Learning Repository, available at http://archive.ics.uci.edu/ml/. For this code sample, the Iris dataset was used.
As we mentioned previously, the CvMLData class allows you to load the dataset from a .csv file using the read_csv function and to indicate the class column with the set_response_idx function. In this case, we use this dataset for both the training and test phases, so we split it into two disjoint sets. For this, we use the CvTrainTestSplit struct and the void CvMLData::set_train_test_split(const CvTrainTestSplit* spl) function. In the CvTrainTestSplit struct, we indicate the proportion of samples to be used as the training set (75 percent in our case) and whether the indices of the training and test samples should be mixed randomly. The set_train_test_split function performs the split. Then, we can store the indices of each set in a Mat with the get_train_sample_idx() and get_test_sample_idx() functions.
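Conceptually, the split amounts to shuffling the sample indices and cutting the list at the chosen proportion. A minimal standard-library sketch of this idea (an illustrative reimplementation, not the actual OpenCV code):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffle sample indices and split them into train/test subsets,
// mirroring what a split like CvTrainTestSplit(0.75f, true) produces.
std::pair<std::vector<int>, std::vector<int>>
trainTestSplit(int nSamples, float trainRatio, std::mt19937& rng) {
    std::vector<int> idx(nSamples);
    std::iota(idx.begin(), idx.end(), 0);       // 0, 1, ..., nSamples-1
    std::shuffle(idx.begin(), idx.end(), rng);  // mix the indices
    std::size_t nTrain = static_cast<std::size_t>(nSamples * trainRatio);
    std::vector<int> train(idx.begin(), idx.begin() + nTrain);
    std::vector<int> test(idx.begin() + nTrain, idx.end());
    return {train, test};
}
```

With the 150-sample Iris dataset and a 0.75 ratio, this yields 112 training and 38 test indices.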
The Random Forest classifier is created using the CvRTrees class, and its parameters are defined by the CvRTParams::CvRTParams(int max_depth, int min_sample_count, float regression_accuracy, bool use_surrogates, int max_categories, const float* priors, bool calc_var_importance, int nactive_vars, int max_num_of_trees_in_the_forest, float forest_accuracy, int termcrit_type) constructor. Some of the most important input parameters are the maximum depth of the trees (max_depth), which has a value of 3 in our sample; the number of randomized features considered at each node (nactive_vars); and the maximum number of trees in the forest (max_num_of_trees_in_the_forest). If we set the nactive_vars parameter to 0, the number of randomized features defaults to the square root of the total number of features.
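The per-node feature randomization can be sketched as follows; the helper name is an assumption for illustration, and a value of 0 falls back to the square-root default, as nactive_vars does:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Pick the random subset of features examined at one tree node.
// nActive <= 0 defaults to sqrt of the total number of features.
std::vector<int> activeFeatures(int nFeatures, int nActive, std::mt19937& rng) {
    if (nActive <= 0)
        nActive = static_cast<int>(std::sqrt(static_cast<double>(nFeatures)));
    std::vector<int> all(nFeatures);
    std::iota(all.begin(), all.end(), 0);
    std::shuffle(all.begin(), all.end(), rng);
    all.resize(nActive);  // keep the first nActive shuffled feature indices
    return all;
}
```

For the Iris dataset (4 features), the default is sqrt(4) = 2 features per node.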
Finally, once the classifier is trained with the train function, we can obtain the percentage of misclassified samples using the float CvRTrees::calc_error(CvMLData* data, int type, std::vector<float>* resp=0) method. The type parameter selects the source of the error: CV_TRAIN_ERROR (the error on the training samples) or CV_TEST_ERROR (the error on the test samples).
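The quantity reported is simply the percentage of samples whose predicted class differs from the true class, which can be sketched as (an illustrative reimplementation, not the OpenCV internals):

```cpp
#include <cstddef>
#include <vector>

// Percentage of misclassified samples: compare each predicted label
// against the ground truth and report the error rate as a percentage.
float misclassificationPercent(const std::vector<float>& predicted,
                               const std::vector<float>& truth) {
    if (predicted.empty())
        return 0.f;
    int wrong = 0;
    for (std::size_t i = 0; i < predicted.size(); ++i)
        if (predicted[i] != truth[i])
            ++wrong;
    return 100.f * wrong / predicted.size();
}
```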
The following screenshot shows the training and test errors and the classifier responses on both sets: