The Random Forest classifier

Random Forests are a general class of ensemble methods that use a decision tree as the base classifier. The Random Forest classifier is a variation of the Bagging (Bootstrap Aggregating) classifier. The Bagging algorithm generates a set of weak individual classifiers by means of bootstrap sampling: each classifier is trained on a random redistribution of the training set, drawn with replacement, so that many of the original examples may appear several times in an individual training set while others are left out.
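To make bootstrap sampling concrete, here is a minimal sketch (the helper name is ours, not part of any library) that draws one bootstrap sample of training indices; repeated indices are expected:

#include <vector>
#include <cstdlib>

//Draw n indices with replacement from the range [0, n)
std::vector<int> bootstrapSample(int n){
    std::vector<int> indices(n);
    for(int i = 0; i < n; i++)
        indices[i] = std::rand() % n; //repeats are allowed
    return indices;
}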

The principal difference between Bagging and Random Forest is that Bagging uses all the features at each tree node, whereas Random Forest selects a random subset of the features at each node. A suitable number of randomized features is typically the square root of the total number of features. For prediction, a new sample is pushed down each tree and assigned the class of the terminal (or leaf) node it reaches. This process is repeated over all the trees and, finally, the majority vote over all the tree predictions is taken as the prediction result. The following diagram shows the Random Forest algorithm:

[Figure: The Random Forest algorithm (the RF classifier)]
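To illustrate the voting step outside OpenCV, the following is a self-contained sketch; the Tree interface is hypothetical and exists purely for illustration:

#include <vector>
#include <map>

//Hypothetical tree interface, for illustration only
struct Tree {
    virtual int predict(const std::vector<float>& sample) const = 0;
    virtual ~Tree() {}
};

//Majority vote: each tree casts one vote, the most voted class wins
int forestPredict(const std::vector<Tree*>& trees, const std::vector<float>& sample){
    std::map<int,int> votes;
    for(size_t i = 0; i < trees.size(); i++)
        votes[trees[i]->predict(sample)]++; //one vote per tree
    int bestClass = -1, bestVotes = -1;
    for(std::map<int,int>::const_iterator it = votes.begin(); it != votes.end(); ++it)
        if(it->second > bestVotes){ bestClass = it->first; bestVotes = it->second; }
    return bestClass;
}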

Random Forests are currently among the best classifiers available, both in recognition power and efficiency. In our example, RFClassifier, we use the OpenCV Random Forest classifier together with the OpenCV CvMLData class. Machine learning problems typically handle large amounts of information, so it is convenient to keep the data in a .csv file. The CvMLData class is used to load the training set information from such a file as follows:

//Headers needed by this sample
#include "opencv2/core/core.hpp"
#include "opencv2/ml/ml.hpp"
#include <iostream>

using namespace std;
using namespace cv;

int main(int argc, char *argv[]){

    CvMLData mlData;
    mlData.read_csv("iris.csv");
    mlData.set_response_idx(4);
    //Select 75% samples as training set and 25% as test set
    CvTrainTestSplit cvtts(0.75f, true);
    //Split the iris dataset
    mlData.set_train_test_split(&cvtts);

    //Get training set
    Mat trainsindex= mlData.get_train_sample_idx();
    cout<<"Number of samples in the training set:"<<trainsindex.cols<<endl;
    //Get test set
    Mat testindex=mlData.get_test_sample_idx();
    cout<<"Number of samples in the test set:"<<testindex.cols<<endl;
    cout<<endl;

    //Random Forest parameters
    CvRTParams params = CvRTParams(3, 1, 0, false, 2, 0, false, 0, 100, 0, CV_TERMCRIT_ITER | CV_TERMCRIT_EPS);

    CvRTrees classifierRF;
    //Training phase
    classifierRF.train(&mlData,params);
    std::vector<float> train_responses, test_responses;

    //Calculate train error
    cout<<"Error on train samples:"<<endl;
    cout<<(float)classifierRF.calc_error(&mlData, CV_TRAIN_ERROR, &train_responses)<<endl;

    //Print train responses
    cout<<"Train responses:"<<endl;
    for(int i=0;i<(int)train_responses.size();i++)
        cout<<i+1<<":"<<(float)train_responses.at(i)<<"  ";
    cout<<endl<<endl;

    //Calculate test error
    cout<<"Error on test samples:"<<endl;
    cout<<(float)classifierRF.calc_error(&mlData, CV_TEST_ERROR, &test_responses)<<endl;

    //Print test responses
    cout<<"Test responses:"<<endl;
    for(int i=0;i<(int)test_responses.size();i++)
        cout<<i+1<<":"<<(float)test_responses.at(i)<<"  ";
    cout<<endl<<endl;

    return 0;
}
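Assuming an OpenCV 2.x installation that registers itself with pkg-config, the sample can be built with something like the following (the source filename is our choice):

g++ RFClassifier.cpp -o RFClassifier `pkg-config --cflags --libs opencv`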

Tip

The dataset has been provided by the UC Irvine Machine Learning Repository, available at http://archive.ics.uci.edu/ml/. For this code sample, the Iris dataset was used.
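In the UCI distribution of the Iris dataset, each row contains four numeric measurements followed by the class label in the fifth column (index 4), which matches the set_response_idx(4) call in the sample. The first rows look like this:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor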

As we mentioned previously, the CvMLData class allows you to load the dataset from a .csv file using the read_csv function and to indicate the class column with the set_response_idx function. In this case, we use this dataset to perform both the training and test phases. It is possible to split the dataset into two disjoint sets for training and testing. For this, we use the CvTrainTestSplit struct and the void CvMLData::set_train_test_split(const CvTrainTestSplit* spl) function. In the CvTrainTestSplit struct, we indicate the portion of samples to be used as the training set (75 percent in our case) and whether we want to mix the indices of the training and test samples from the dataset. The set_train_test_split function performs the split. Then, we can store the indices of each set in a Mat with the get_train_sample_idx() and get_test_sample_idx() functions.
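As a side note, CvTrainTestSplit also has a constructor that takes an absolute number of training samples rather than a portion; as a brief sketch (assuming the 150-sample Iris set), the split above could instead be written as:

//Alternative: fix the training set size by sample count (100 of 150)
CvTrainTestSplit cvtts(100, true);
mlData.set_train_test_split(&cvtts);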

The Random Forest classifier is created using the CvRTrees class, and its parameters are defined by the CvRTParams::CvRTParams(int max_depth, int min_sample_count, float regression_accuracy, bool use_surrogates, int max_categories, const float* priors, bool calc_var_importance, int nactive_vars, int max_num_of_trees_in_the_forest, float forest_accuracy, int termcrit_type) constructor. Some of the most important input parameters refer to the maximum depth of the trees (max_depth)—in our sample, it has a value of 3—the number of randomized features in each node (nactive_vars), and the maximum number of trees in the forest (max_num_of_trees_in_the_forest). If we set the nactive_vars parameter to 0, the number of randomized features will be the square root of the total number of features.
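To relate the constructor signature to the values used in the sample, the same call can be written with one argument per line; the comments paraphrase the parameter meanings:

CvRTParams params = CvRTParams(
    3,      //max_depth: maximum depth of each tree
    1,      //min_sample_count: do not split a node with fewer samples
    0,      //regression_accuracy: not used for classification
    false,  //use_surrogates: no surrogate splits
    2,      //max_categories: cluster categorical values above this limit
    0,      //priors: no class priors
    false,  //calc_var_importance: skip variable importance computation
    0,      //nactive_vars: 0 means sqrt(total number of features)
    100,    //max_num_of_trees_in_the_forest
    0,      //forest_accuracy: sufficient out-of-bag error to stop
    CV_TERMCRIT_ITER | CV_TERMCRIT_EPS); //termcrit_type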

Finally, once the classifier is trained with the train function, we can obtain the percentage of misclassified samples using the float CvRTrees::calc_error(CvMLData* data, int type, std::vector<float>* resp=0) method. The type parameter selects the source of the error: CV_TRAIN_ERROR (the error on the training samples) or CV_TEST_ERROR (the error on the test samples).
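Beyond the aggregate error, a trained CvRTrees can also classify individual samples with its predict method. As a minimal sketch (the feature values here are an arbitrary, made-up Iris-like measurement):

//Classify a single new sample: one row of four CV_32F features
Mat newSample = (Mat_<float>(1, 4) << 5.9f, 3.0f, 5.1f, 1.8f);
float predictedClass = classifierRF.predict(newSample);
cout<<"Predicted class index:"<<predictedClass<<endl;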

The following screenshot shows the training and test errors and the classifier responses in both the sets:

[Figure: The RF classifier sample results]
