Appendix A. Machine Learning Quick Reference: Best Practices

For each topic: common challenges, followed by suggested best practice.

Data Preparation

Data collection

Common challenges:
  • Biased data
  • Incomplete data
  • The curse of dimensionality
  • Sparsity

Suggested best practice:
  • Take time to understand the business problem and its context
  • Enrich the data
  • Dimension-reduction techniques
  • Change representation of data (e.g., COO)
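As an illustration of changing the data representation, the sketch below stores a sparse matrix in coordinate (COO) form, keeping only the nonzero entries as (row, column, value) triples. The function names and the toy matrix are assumptions made for this example; in practice a sparse-matrix library would normally be used instead.

```python
def dense_to_coo(matrix):
    """Return (rows, cols, values, shape), keeping only nonzero entries."""
    rows, cols, values = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:
                rows.append(i)
                cols.append(j)
                values.append(value)
    shape = (len(matrix), len(matrix[0]) if matrix else 0)
    return rows, cols, values, shape


def coo_to_dense(rows, cols, values, shape):
    """Rebuild the dense matrix from its COO triples."""
    dense = [[0] * shape[1] for _ in range(shape[0])]
    for i, j, v in zip(rows, cols, values):
        dense[i][j] = v
    return dense


# Toy data (hypothetical): 9 cells, only 2 of them nonzero.
dense = [
    [0, 0, 3],
    [0, 0, 0],
    [5, 0, 0],
]
rows, cols, values, shape = dense_to_coo(dense)
```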

“Untidy” data

Common challenges:
  • Value ranges as columns
  • Multiple variables in the same column
  • Variables in both rows and columns

Suggested best practice:
  • Restructure the data to be “tidy” by using the melt and cast process
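A minimal sketch of the melt step, assuming each row is a dictionary and the value ranges appear as column names. The `melt` helper, its parameters, and the sample records are hypothetical; data-frame libraries offer equivalent operations.

```python
def melt(records, id_keys, var_name="variable", value_name="value"):
    """Turn wide rows into long rows: one output row per non-id column."""
    long_rows = []
    for record in records:
        ids = {key: record[key] for key in id_keys}
        for key, value in record.items():
            if key in id_keys:
                continue
            long_rows.append({**ids, var_name: key, value_name: value})
    return long_rows


# Hypothetical "untidy" table: income ranges stored as columns.
wide = [
    {"region": "north", "<$10k": 12, "$10k-$50k": 30},
    {"region": "south", "<$10k": 7, "$10k-$50k": 41},
]
long_rows = melt(wide, id_keys=["region"],
                 var_name="income_range", value_name="count")
```

After melting, each row holds one observation (region, income range, count), which is the shape a cast step can then pivot as needed.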

Outliers

Common challenges:
  • Out-of-range numeric values and unknown categorical values in score data
  • Undue influence on squared loss functions (e.g., regression, GBM, and k-means)

Suggested best practice:
  • Robust methods (e.g., Huber loss function)
  • Discretization (binning)
  • Winsorizing
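Winsorizing can be sketched as clamping values to chosen lower and upper percentiles. The percentile rule below (a simple rank lookup on the sorted values) and the cutoffs are assumptions for illustration; other percentile definitions give slightly different limits.

```python
def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values to the given lower and upper percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    # Simple rank-based percentile lookup (an assumption of this sketch).
    low = ordered[round(lower_pct * (n - 1))]
    high = ordered[round(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]


# Hypothetical column with one extreme value at each end.
raw = [-500, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
clipped = winsorize(raw, lower_pct=0.1, upper_pct=0.9)
```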

Sparse target variables

Common challenges:
  • Low primary event occurrence rate
  • Overwhelming preponderance of zero or missing values in target

Suggested best practice:
  • Proportional oversampling
  • Inverse prior probabilities
  • Mixture models
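When a rare event has been oversampled for training, predicted probabilities reflect the sample's event rate rather than the population's. One common correction, sketched here with hypothetical names and rates, reweights each class by the ratio of its true prior to its sample prior.

```python
def adjust_for_oversampling(p, true_rate, sample_rate):
    """Rescale a predicted event probability from the oversampled
    training prior back to the population prior."""
    event = p * true_rate / sample_rate
    non_event = (1 - p) * (1 - true_rate) / (1 - sample_rate)
    return event / (event + non_event)


# Hypothetical rates: events are 2% of the population,
# but were oversampled to 50% of the training data.
adjusted = adjust_for_oversampling(0.5, true_rate=0.02, sample_rate=0.5)
```

A score of 0.5 from the balanced sample maps back to the 2% base rate, and a model trained at the true rate is left unchanged.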

Variables of disparate magnitudes

Common challenges:
  • Misleading variable importance
  • Distance measure imbalance
  • Gradient dominance

Suggested best practice:
  • Standardization
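Standardization can be sketched as converting each column to z-scores so that variables measured on very different scales contribute comparably. The column names and values below are hypothetical, and the population standard deviation is an arbitrary choice for this example.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no scale information.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical columns with very different magnitudes.
income = [30000, 45000, 60000, 75000]
age = [25, 35, 45, 55]
z_income = standardize(income)
z_age = standardize(age)
```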

High-cardinality variables

Common challenges:
  • Overfitting
  • Unknown categorical values in holdout data

Suggested best practice:
  • Discretization (binning)
  • Weight of evidence
  • Leave-one-out event rate
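A leave-one-out event rate replaces each categorical level with the event rate of the other rows sharing that level, so a row's own target does not leak into its encoding. The sketch below assumes a binary 0/1 target and falls back to the overall event rate for levels seen only once; both the fallback and the sample data are assumptions.

```python
from collections import defaultdict


def leave_one_out_event_rate(levels, targets):
    """Encode each row by the event rate of its level, excluding itself."""
    counts = defaultdict(int)
    events = defaultdict(int)
    for level, y in zip(levels, targets):
        counts[level] += 1
        events[level] += y
    overall = sum(targets) / len(targets)
    encoded = []
    for level, y in zip(levels, targets):
        if counts[level] > 1:
            encoded.append((events[level] - y) / (counts[level] - 1))
        else:
            # Singleton level: no other rows to learn from.
            encoded.append(overall)
    return encoded


# Hypothetical data.
levels = ["a", "a", "a", "b", "b", "c"]
targets = [1, 0, 1, 0, 0, 1]
encoded = leave_one_out_event_rate(levels, targets)
```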

Missing data

Common challenges:
  • Information loss
  • Bias

Suggested best practice:
  • Discretization (binning)
  • Imputation
  • Tree-based modeling techniques
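A minimal imputation sketch: fill missing values with the median of the observed values and keep an indicator of which entries were filled, so that missingness itself remains available to the model. Representing missing values as `None`, and adding the indicator column, are assumptions of this example.

```python
from statistics import median


def impute_median(column):
    """Fill None entries with the observed median; also return 0/1 flags
    marking which entries were originally missing."""
    observed = [x for x in column if x is not None]
    fill = median(observed)
    imputed = [fill if x is None else x for x in column]
    was_missing = [1 if x is None else 0 for x in column]
    return imputed, was_missing


# Hypothetical column with two missing entries.
column = [1.0, None, 3.0, None, 8.0]
imputed, was_missing = impute_median(column)
```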

Strong multicollinearity

Common challenges:
  • Unstable parameter estimates

Suggested best practice:
  • Regularization
  • Dimension reduction
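To illustrate how regularization stabilizes estimates under collinearity, the sketch below solves a two-feature ridge regression in closed form. It assumes no intercept and exactly two features for brevity; the function name and data are hypothetical.

```python
def ridge_two_features(x1, x2, y, penalty):
    """Solve (X'X + penalty * I) b = X'y for two features, no intercept."""
    a11 = sum(a * a for a in x1) + penalty
    a22 = sum(b * b for b in x2) + penalty
    a12 = sum(a * b for a, b in zip(x1, x2))
    c1 = sum(a * t for a, t in zip(x1, y))
    c2 = sum(b * t for b, t in zip(x2, y))
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("singular system; increase the penalty")
    # Cramer's rule for the 2x2 system.
    b1 = (c1 * a22 - a12 * c2) / det
    b2 = (a11 * c2 - a12 * c1) / det
    return b1, b2


# Hypothetical, perfectly collinear features: x2 duplicates x1.
x1 = [1, 2, 3]
x2 = [1, 2, 3]
y = [2, 4, 6]
b1, b2 = ridge_two_features(x1, x2, y, penalty=1)
```

With a penalty of zero the system is singular and has no unique solution; with a positive penalty the effect is split evenly between the two duplicated features.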

Training

Overfitting

Common challenges:
  • High-variance and low-bias models that fail to generalize well

Suggested best practice:
  • Regularization
  • Noise injection
  • Partitioning or cross validation
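Cross validation can be sketched as splitting row indices into k folds and holding out each fold once. This version assigns rows to folds by position without shuffling, which is an assumption of the example; real data usually needs shuffling or stratification first.

```python
def k_fold_indices(n_rows, k):
    """Yield (training, validation) index lists for k-fold cross validation."""
    indices = list(range(n_rows))
    # Fold i receives rows i, i + k, i + 2k, ...
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=10, k=5))
```

Each row appears in exactly one validation fold, so every observation is used for both fitting and assessment without overlap in any single split.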

Hyperparameter tuning

Common challenges:
  • Combinatorial explosion of hyperparameters in conventional algorithms (e.g., deep neural networks, Super Learners)

Suggested best practice:
  • Local search optimization, including genetic algorithms
  • Grid search, random search
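The difference between grid search and random search can be sketched as follows. The `validation_error` function is a hypothetical stand-in for training a model and measuring its holdout error, and the hyperparameter names and values are invented for illustration.

```python
import itertools
import random


def validation_error(learning_rate, depth):
    """Hypothetical stand-in for 'train a model, return holdout error'."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2


grid = {"learning_rate": [0.01, 0.1, 0.3], "depth": [2, 4, 6]}


def grid_search(grid, score):
    """Evaluate every combination in the grid and keep the best one."""
    names = list(grid)
    candidates = (dict(zip(names, combo))
                  for combo in itertools.product(*(grid[n] for n in names)))
    return min(candidates, key=lambda params: score(**params))


def random_search(grid, score, n_trials, seed=0):
    """Evaluate a fixed number of randomly drawn combinations."""
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_trials)]
    return min(candidates, key=lambda params: score(**params))


best_grid = grid_search(grid, validation_error)
best_random = random_search(grid, validation_error, n_trials=4)
```

Grid search costs one evaluation per combination, which grows multiplicatively with each added hyperparameter; random search fixes the budget in advance at the price of possibly missing the best cell.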

Ensemble models

Common challenges:
  • Single models that fail to provide adequate accuracy
  • High-variance and low-bias models that fail to generalize well

Suggested best practice:
  • Established ensemble methods (e.g., bagging, boosting, stacking)
  • Custom or manual combinations of predictions
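A manual combination of predictions can be as simple as a weighted average across models. The helper below and the sample scores are hypothetical; established methods such as bagging, boosting, and stacking derive the member models or the weights more systematically.

```python
def combine_predictions(prediction_sets, weights=None):
    """Weighted average of per-row predictions from several models."""
    if weights is None:
        weights = [1.0] * len(prediction_sets)
    total = sum(weights)
    n_rows = len(prediction_sets[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, prediction_sets)) / total
        for i in range(n_rows)
    ]


# Hypothetical scores from two models on the same two rows.
model_a = [0.2, 0.8]
model_b = [0.4, 0.6]
simple = combine_predictions([model_a, model_b])
weighted = combine_predictions([model_a, model_b], weights=[3, 1])
```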

Model Interpretation

Common challenges:
  • Large number of parameters, rules, or other complexity obscures model interpretation

Suggested best practice:
  • Variable selection by regularization (e.g., L1)
  • Surrogate models
  • Partial dependence plots, variable importance measures
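The values behind a partial dependence plot can be sketched by fixing one feature at each grid value for every row and averaging the model's predictions. The `predict` function here is a hypothetical stand-in for any trained model, and the rows are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def predict(row):
    """Hypothetical black-box model."""
    return 2.0 * row["x1"] + row["x2"] ** 2


rows = [{"x1": 0.0, "x2": 1.0}, {"x1": 1.0, "x2": 3.0}]
curve = partial_dependence(predict, rows, "x1", [0.0, 1.0, 2.0])
```

The resulting curve rises by 2 per unit of `x1`, recovering that feature's average effect without inspecting the model's internals.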

Computational resource exploitation

Common challenges:
  • Single-threaded algorithm implementations
  • Heavy reliance on interpreted languages

Suggested best practice:
  • Train many single-threaded models in parallel
  • Hardware acceleration (e.g., SSD, GPU)
  • Low-level, native libraries
  • Distributed computing, when appropriate

Deployment

Model deployment

Common challenges:
  • Trained model logic must be transferred from a development environment to an operational computing system to assist in organizational decision-making processes

Suggested best practice:
  • Portable scoring code or scoring executables
  • In-database scoring
  • Web service scoring
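One way to picture portable scoring code: export the fitted parameters to a plain format and score with a small function that needs no modeling library. The sketch assumes a logistic regression; the JSON layout, feature names, and coefficients are all hypothetical.

```python
import json
import math

# Hypothetical exported model: a logistic regression's parameters as JSON.
model_json = json.dumps({
    "intercept": -1.0,
    "coefficients": {"age": 0.02, "balance": 0.0001},
})


def score(model_json, record):
    """Score one record using only the exported parameters."""
    model = json.loads(model_json)
    z = model["intercept"] + sum(
        coef * record[name] for name, coef in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


probability = score(model_json, {"age": 50, "balance": 0})
```

The same function could sit behind a web service or be translated into a database expression, since it depends only on arithmetic and the exported numbers.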

Model decay

Common challenges:
  • Business problem or market conditions have changed since the model was created
  • New observations fall outside domain of training data

Suggested best practice:
  • Monitor models for decreasing accuracy
  • Update/retrain models regularly
  • Champion-challenger tests
  • Online updates
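Monitoring for decay can be sketched as tracking accuracy over a sliding window of recent scored observations and flagging the model once accuracy drops below a threshold. The class name, window size, and threshold are hypothetical choices for this example.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent `window` predictions."""

    def __init__(self, window, threshold):
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        # Only judge once the window is full, to avoid noisy early alarms.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(predicted, actual)
healthy = monitor.needs_retraining()

for predicted, actual in [(1, 0), (0, 1)]:
    monitor.record(predicted, actual)
decayed = monitor.needs_retraining()
```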