LightGBM

When your dataset contains a large number of cases or variables, even XGBoost, despite being compiled from C++, takes a long time to train. Therefore, in spite of the success of XGBoost (whose first appearance is dated March 2015), in January 2017 there was room for another algorithm to appear: the high-performance LightGBM, capable of being distributed and of handling large amounts of data quickly, developed by a team at Microsoft as an open source project.

Here is its GitHub page: https://github.com/Microsoft/LightGBM. And, here is the academic paper illustrating the idea behind the algorithm: https://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree.

LightGBM is based on decision trees, just like XGBoost, yet it follows a different strategy. Whereas XGBoost uses decision trees to split on a variable, exploring different cuts at that variable (the level-wise tree growth strategy), LightGBM concentrates on one split and keeps on splitting from there in order to achieve a better fit (this is the leaf-wise tree growth strategy). This allows LightGBM to reach a good fit of the data quickly and to generate solutions that are alternatives to those of XGBoost (which is good if you expect to blend, that is, average, the two solutions together in order to reduce the variance of the estimates).

Algorithmically speaking, if you figure the structure of cuts operated by a decision tree as a graph, XGBoost pursues a breadth-first search (BFS), whereas LightGBM pursues a depth-first search (DFS).
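
To make the difference concrete at the parameter level, here is a small, hypothetical sketch (the values are arbitrary placeholders): XGBoost bounds its level-wise trees mainly by max_depth, and with the histogram tree method it can even approximate leaf-wise growth through grow_policy='lossguide', whereas LightGBM bounds its leaf-wise trees mainly by num_leaves:

# Arbitrary values, for illustration only

# XGBoost: level-wise growth, complexity bounded mainly by tree depth
xgb_params = {'max_depth': 6,              # grow level by level, up to depth 6
              'tree_method': 'hist',
              'grow_policy': 'depthwise'}  # the default, level-wise policy

# LightGBM: leaf-wise growth, complexity bounded mainly by leaf count
lgb_params = {'num_leaves': 31,            # stop when 31 leaves have been grown
              'max_depth': -1}             # -1 means no explicit depth limit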

Here are other highlights of the algorithm:

  1. It grows more complex trees due to the leaf-wise strategy, leading to higher accuracy in prediction but also to a higher risk of overfitting; therefore, it is particularly ineffective with small datasets (use it on datasets with more than 10,000 examples).
  2. It is faster on larger datasets.
  3. It can leverage parallelization and GPU usage; therefore, it can be scaled to even larger problems (actually, it is still a GBM, a sequential algorithm; what is parallelized is the Find Best Split part of the decision tree).
  4. It is memory parsimonious because it doesn’t store and handle continuous variables as they are, but turns them into discrete bins of values (a histogram-based algorithm); a short sketch of this idea follows the list.
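
To give an intuition of the histogram-based approach mentioned in point 4, here is a minimal sketch using NumPy only (it is not LightGBM's actual implementation): a continuous feature is reduced to a limited number of discrete bins, which is exactly what the max_bin parameter controls:

import numpy as np

rng = np.random.RandomState(0)
feature = rng.normal(size=100000)      # a continuous feature

max_bin = 255                          # LightGBM's default value for max_bin
# Approximate the idea: cut the feature into at most max_bin buckets using
# quantile boundaries, then keep only the small integer index of each bin
edges = np.quantile(feature, np.linspace(0, 1, max_bin + 1)[1:-1])
binned = np.digitize(feature, edges).astype(np.uint8)

print(binned.min(), binned.max())      # the bin indices fit in a single byte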

Tuning LightGBM may appear daunting, with more than a hundred parameters to fix (you can find them all here: https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst), but, actually, you can tune just a few of them and get away with excellent results. Parameters in LightGBM are distinguished in terms of the following:

  • Core parameters, specifying the task to be done on data
  • Control parameters, dictating how the decision trees behave
  • Metric parameters, defining your error measures (and there is really a large list to choose from, apart from the classical errors for classification and regression)
  • IO parameters, mostly ruling how inputs are dealt with

Here is a quick overview of the principal parameters for each category.

As far as core parameters are concerned, you can operate your key choices by means of the following (a minimal sketch follows the list):

  • task: The task you want to achieve with your model; it could be train, predict, convert_model (to get it as a series of if-then statements), or refit (for updating a model with new data).
  • application: By default, the expected model is a regression, but it could be regression, binary, multiclass, and many others (it is also available as lambdarank for ranking tasks such as in search engine optimization).
  • boosting: LightGBM can use different algorithms for its learning iterations. The default is gbdt, the classical gradient boosted decision tree, but it could be rf (random forest), dart (Dropouts meet Multiple Additive Regression Trees), or goss (Gradient-based One-Side Sampling).
  • device: It is cpu by default, but you can use gpu if you have one available on your system.
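
As a minimal, hypothetical illustration (the values are arbitrary placeholders, and in the Python package the application parameter is usually spelled through its alias objective), a core configuration could look like this:

core_params = {'task': 'train',        # we want to train a model
               'objective': 'binary',  # alias of application: binary classification
               'boosting': 'gbdt',     # the default gradient boosted decision tree
               'device': 'cpu'}        # switch to 'gpu' if one is available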

IO parameters define how data is loaded (and even stored by your model):

  • max_bin: The maximum number of bins for feature values to be bucketed in (the more bins, the less approximation when dealing with numeric variables, but also the more memory and computation time)
  • categorical_feature: The index of categorical features
  • ignore_column: The index of features to be ignored
  • save_binary: Whether to save the dataset on disk in binary format to speed up loading and save memory (a short sketch using these parameters follows the list)
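
Here is a short, hypothetical sketch (the data and file name are made up) of how some of these IO parameters are typically passed when building a LightGBM Dataset:

import lightgbm as lgb
import numpy as np

# Made-up data: five numeric columns, the third one treated as categorical
X = np.random.rand(1000, 5)
X[:, 2] = np.random.randint(0, 4, size=1000)   # categories coded as integers
y = np.random.randint(0, 2, size=1000)

dataset = lgb.Dataset(data=X,
                      label=y,
                      categorical_feature=[2],   # index of the categorical feature
                      params={'max_bin': 63})    # fewer bins: less memory, more approximation

# save_binary stores the pre-binned dataset on disk for much faster reloading
dataset.save_binary('toy_dataset.bin')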

Finally, by setting control parameters, you instead decide more specifically how the model has to learn from data:

  • num_boost_round: The number of boosting iterations to be done.
  • learning_rate: The weight each boosting iteration has in the construction of the resulting model.
  • num_leaves: The maximum number of leaves in a tree, which is 31 by default.
  • max_depth: The maximum depth that a tree can reach.
  • min_data_in_leaf: The minimum number of examples for a leaf to be created.
  • bagging_fraction: The fraction of data to be randomly used at each iteration.
  • feature_fraction: When your boosting is rf, this parameter dictates the fraction of total features to be randomly considered for a split.
  • early_stopping_round: When you fix this parameter, if your model doesn’t improve for that number of rounds, it will stop training. It helps reduce overfitting and training time.
  • lambda_l1 or lambda_l2: Regularization parameters, taking values of 0 or greater (the higher the value, the stronger the regularization).
  • min_gain_to_split: This parameter dictates the minimum gain required to create a split on the tree. It limits the complexity of the tree by not developing splits that do not contribute much to the model.
  • max_cat_group: When dealing with categorical variables with high cardinality (a large number of categories), this parameter puts a limit on the number of category groups that a variable can be split into by aggregating the less important ones. The default value of this parameter is 64.
  • is_unbalance: For unbalanced datasets in binary classification, setting it to True lets the algorithm adjust for the unbalanced classes.
  • scale_pos_weight: Also for unbalanced datasets in binary classification, it sets a weight for the positive class (a short sketch using these last two options follows this list).
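
As an example of the last two options, here is a minimal, hypothetical sketch on made-up labels (only one of the two parameters should be set at a time); a common heuristic for scale_pos_weight is the ratio between the number of negative and positive examples:

import numpy as np

# Made-up, heavily unbalanced binary labels
y = np.array([0] * 950 + [1] * 50)

# Option 1: let LightGBM adjust the class weights automatically
params_auto = {'objective': 'binary', 'is_unbalance': True}

# Option 2: set the positive class weight yourself
pos_weight = float(np.sum(y == 0)) / np.sum(y == 1)   # 19.0 in this example
params_manual = {'objective': 'binary', 'scale_pos_weight': pos_weight}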

We have actually quoted just a small part of all the possible parameters of a LightGBM model, yet the most essential and important ones. By browsing the documentation, you can find many more parameters that fit even more specific situations and projects of yours.

How do we tune all these parameters? Actually, you can effectively operate on just a few of them. If you want to achieve faster computations, use save_binary and set a small max_bin. You can also use bagging_fraction and feature_fraction with low values to reduce the size of the training set and speed up the learning process (at the price of increasing the variance of your solution, because it will learn from less data).

If you want to achieve higher accuracy with your error measure, you should instead use a larger max_bin (implying more accuracy when working with numeric variables), use a smaller learning_rate and more num_iterations (necessary because the algorithm will converge more slowly), and use a larger num_leaves (though it may lead to overfitting).

In the case of overfitting, you can try setting lambda_l1, lambda_l2, and min_gain_to_split to get some more regularization. You can also limit max_depth to avoid growing trees that are too deep. The sketch below summarizes these tuning directions.
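
Here are the three tuning directions gathered into hypothetical parameter presets; the values are only indicative starting points, not prescriptions:

# 1. Faster computations: fewer bins and row/column subsampling
speed_params = {'max_bin': 63,
                'bagging_fraction': 0.5,
                'bagging_freq': 1,       # bagging only happens if bagging_freq > 0
                'feature_fraction': 0.5,
                'save_binary': True}

# 2. Higher accuracy: finer bins, slower learning, more and larger trees
accuracy_params = {'max_bin': 511,
                   'learning_rate': 0.005,
                   'num_iterations': 5000,
                   'num_leaves': 127}

# 3. Fighting overfitting: regularization and capped tree growth
regularization_params = {'lambda_l1': 1.0,
                         'lambda_l2': 1.0,
                         'min_gain_to_split': 0.1,
                         'max_depth': 8}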

In our example, we take on the same task as before, to classify the Forest Covertype dataset. We start by importing the necessary packages.
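
Since the loading and splitting of the Forest Covertype data was done in a previous section, here is only a hedged reconstruction of those steps using scikit-learn's fetch_covtype; the subsample size and split proportions are assumptions, so the numbers you obtain may differ slightly from those printed below:

import numpy as np
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

# Labels in the original data go from 1 to 7; shift them to 0..6
covertype = fetch_covtype(random_state=101, shuffle=True)
X, y = covertype.data[:30000], covertype.target[:30000] - 1

# Assumed splits: train / validation / test
covertype_X, covertype_test_X, covertype_y, covertype_test_y = \
    train_test_split(X, y, test_size=0.2, random_state=101)
covertype_X, covertype_val_X, covertype_y, covertype_val_y = \
    train_test_split(covertype_X, covertype_y, test_size=0.2, random_state=101)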

Our next steps are then to set the parameters for this boosting algorithm to work properly. We define the objective (‘multiclass’), set a low learning rate (0.01), and allow its branches to spread out almost completely, like a random forest would do: its trees’ maximum depth is set to 128 and the number of resulting leaves to 256. In doing so, we also set a random sampling of both cases and features (sampling 90% of them every time):

In: import lightgbm as lgb
    import numpy as np
    params = {'task': 'train',
              'boosting_type': 'gbdt',
              'objective': 'multiclass',
              'num_class': len(np.unique(covertype_y)),
              'metric': 'multi_logloss',
              'learning_rate': 0.01,
              'max_depth': 128,
              'num_leaves': 256,
              'feature_fraction': 0.9,
              'bagging_fraction': 0.9,
              'bagging_freq': 10}

Then, we set the dataset for train, validation, and test using the Dataset command from the LightGBM package:

In: train_data = lgb.Dataset(data=covertype_X, label=covertype_y)
    val_data = lgb.Dataset(data=covertype_val_X, label=covertype_val_y)

Finally, we set up the training instance by feeding in the previously set parameters, deciding on a maximum of 2,500 iterations, setting a validation set, and requiring early stopping if the error measure doesn’t improve on the validation set for 25 iterations (this will allow us to avoid any overfitting due to too many iterations, that is, too many boosted trees added):

In: bst = lgb.train(params,
                    train_data,
                    num_boost_round=2500,
                    valid_sets=val_data,
                    verbose_eval=500,
                    early_stopping_rounds=25)

After a while, the training stops, pointing out a log-loss on the validation set of about 0.40 and 851 iterations as the best number to pick:

Out: Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is: [851]
valid_0's multi_logloss: 0.400478

Instead of using a validation set, we could also test for the best number of iterations by cross-validation, that is, on the same train set:

In: lgb_cv = lgb.cv(params,
                    train_data,
                    num_boost_round=2500,
                    nfold=3,
                    shuffle=True,
                    stratified=True,
                    verbose_eval=500,
                    early_stopping_rounds=25)
    # lgb.cv returns a dict of lists with the mean (and standard deviation)
    # of the metric at each boosting round; the position of the minimum
    # mean log-loss tells us the best number of rounds
    nround = lgb_cv['multi_logloss-mean'].index(
                 np.min(lgb_cv['multi_logloss-mean']))
    print("Best number of rounds: %i" % nround)

Out: cv_agg's multi_logloss: 0.468806 + 0.0124661
Best number of rounds: 782

The result is not as brilliant as with the validation set, but the number of rounds is not that far from what we found before. We will stick with the initial training with early stopping, anyway. First, we get the probability for each class using the predict method and the best iteration, and then we pick as our prediction the class with the highest probability.

After doing so, we will check for accuracy and plot a confusion matrix. The obtained score is analogous to XGBoost but obtained in a shorter training time:

In: y_probs = bst.predict(covertype_test_X,
                          num_iteration=bst.best_iteration)
    y_preds = np.argmax(y_probs, axis=1)
    from sklearn.metrics import accuracy_score, confusion_matrix
    print('test accuracy:', accuracy_score(covertype_test_y, y_preds))
    print(confusion_matrix(covertype_test_y, y_preds))

Out: test accuracy: 0.8444
[[1495  309    0    0    0    2   14]
 [ 221 2196   17    0    5    9    0]
 [   0   20  258    5    0   18    0]
 [   0    0    3   19    0    5    0]
 [   1   51    4    0   21    0    0]
 [   0   14   43    0    0   87    0]
 [  36    1    0    0    0    0  146]]
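
The text above mentions plotting the confusion matrix, while the snippet only prints it; a minimal sketch with Matplotlib (reusing the y_preds and covertype_test_y arrays from above) could be the following:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(covertype_test_y, y_preds)
plt.imshow(cm, interpolation='nearest', cmap='Blues')   # heatmap of the counts
plt.title('LightGBM on Forest Covertype (test set)')
plt.xlabel('predicted class')
plt.ylabel('true class')
plt.colorbar()
plt.show()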