Implementing a gradient boosting machine for disease risk prediction using scikit-learn

Gradient boosting is a machine learning technique based on the principle of boosting: an ensemble of weak learners, typically decision trees, is built iteratively, with each new learner shifting its focus toward the observations that were difficult to predict in previous iterations.

Gradient boosting trains models in a sequential manner and involves the following steps:

  1. Fitting an initial model to the data
  2. Fitting a new model to the residuals (the errors of the current ensemble)
  3. Adding the new model to the ensemble and repeating the process

While AdaBoost identifies errors by reweighting the data points, gradient boosting does so by computing the gradients of the loss function. The loss function measures how well a model fits the data on which it is trained and generally depends on the type of problem being solved: for regression problems, the mean squared error may be used, while for classification problems, the logarithmic loss is common. A gradient descent procedure is used to minimize the loss as trees are added one at a time; the trees already in the model remain unchanged.
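
The loop described above can be written out by hand for the squared-error case, in which the negative gradient of the loss is simply the residual. The following sketch is illustrative only: the synthetic dataset, tree depth, learning rate, and number of rounds are assumptions chosen for demonstration, not values taken from this chapter.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Toy regression data (illustrative stand-in for a real dataset).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    learning_rate = 0.1
    n_rounds = 50

    # Step 1: start from a constant prediction (the mean of the targets).
    prediction = np.full_like(y, y.mean())
    trees = []

    for _ in range(n_rounds):
        # Step 2: fit a shallow tree to the residuals of the current ensemble
        # (for squared-error loss, the residuals are the negative gradients).
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, residuals)

        # Step 3: add the new tree's shrunken predictions to the ensemble;
        # the trees fitted earlier are left unchanged.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)

    print("training MSE:", np.mean((y - prediction) ** 2))

Each round removes a little more of the remaining error; the learning rate scales how much of each tree's correction is applied, which is why low learning rates are usually paired with a larger number of trees.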

There are a handful of hyperparameters that may be tuned for this:

    • n_estimators: This represents the number of trees in the model. Usually, the higher it is, the better the model learns the data, although the gains diminish and training takes longer.
    • max_depth: This sets the maximum depth of each tree and is used to control overfitting.
    • min_samples_split: This is the minimum number of samples required to split an internal node. Values that are too high can prevent the model from learning relations.
    • learning_rate: This controls the magnitude of change in the estimates. Lower values with a higher number of trees are generally preferred.
    • loss: This refers to the loss function to be optimized. The default for classification is log_loss (called deviance in older scikit-learn releases), while the alternative is exponential, which recovers AdaBoost.
    • max_features: This represents the number of features to consider when looking for the best split.
    • criterion: This function measures the quality of a candidate split; friedman_mse is the default, and squared_error is also supported (older releases additionally accepted mae).
    • subsample: This represents the fraction of samples to be used for fitting the individual base learners. Choosing a subsample that is less than 1.0 leads to a reduction of variance and an increase in bias.
    • min_impurity_split: This is a threshold on node impurity used to stop tree growth early. Note that this parameter has been deprecated and removed in newer scikit-learn releases in favor of min_impurity_decrease.
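
Putting these hyperparameters together, a disease risk classifier can be set up along the following lines. This is a sketch under assumed conditions: it uses a synthetic dataset from make_classification in place of real patient records, and the parameter values are illustrative starting points rather than tuned settings.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for a disease risk dataset (rows = patients).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    gbm = GradientBoostingClassifier(
        n_estimators=300,       # number of boosting stages (trees)
        learning_rate=0.05,     # shrinkage applied to each tree's contribution
        max_depth=3,            # depth of each individual tree
        min_samples_split=10,   # minimum samples needed to split an internal node
        subsample=0.8,          # fraction of samples used to fit each tree
        max_features="sqrt",    # features considered when searching for a split
        random_state=42,
    )
    gbm.fit(X_train, y_train)

    # Predicted probability of the positive class, evaluated with ROC AUC.
    probs = gbm.predict_proba(X_test)[:, 1]
    print("test ROC AUC:", roc_auc_score(y_test, probs))

From here, utilities such as GridSearchCV or RandomizedSearchCV from sklearn.model_selection can be used to tune n_estimators, learning_rate, and max_depth jointly.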