

  1. Actual vs. predicted plot linear model

  2. Actual vs. predicted plot quadratic polynomial model

  3. Amazon Food Review

  4. American Statistical Association (ASA)

  5. An Exploratory Technique for Investigating Large Quantities of Categorical Data

  6. Apache Pig

  7. Apriori

  8. Area Under the Curve (AUC)

  9. Artificial intelligence (AI)

  10. Artificial neural networks (ANN)

    1. architecture

      1. components

      2. linear separability

      3. MLP

    2. attribute importance

      1. by Garson method

      2. by Olden method

    3. deep learning

      1. applications

      2. architecture

      3. darch for classification

      4. guidelines

      5. hidden layers

      6. multi-layer

      7. multiple linear and non-linear transformations

      8. mxNet image classification

      9. mxNet package

      10. normalized image

      11. volcano picture, image recognition exercise

    4. evolutionary methods

    5. expectation maximization

    6. feed-forward back-propagation

    7. GEP

    8. hidden layer

    9. human cognitive learning

    10. learning algorithms

    11. machine learning

    12. non-parametric methods

    13. particle swarm optimization

    14. perceptron

    15. purchase prediction

    16. sigmoid neuron

    17. simulated annealing

    18. supervised vs. unsupervised neural nets

  11. Association rule mining (ARM)

    1. algorithms

    2. apriori

    3. confidence

    4. Eclat

    5. IBCF

    6. item frequency plot

    7. lift

    8. Market Basket data

    9. POS

    10. sparsity visualization

    11. support

    12. transactional data

    13. UBCF

  12. Autocorrelation

  13. Auto-correlation function (ACF)

  14. Automatic grid search optimization


  1. Back-propagation learning

  2. Back-propagation method

  3. Back-propagation of errors

  4. Bagging

    1. bootstrap aggregating

    2. CART

    3. random forest

  5. Bayes formula

  6. Bayesian algorithms

  7. Bayesian optimization, machine learning models

    1. black box function

    2. Gaussian processes

    3. parameters

    4. random tuning

    5. RMSE, cost and Sigma space

    6. sample t-test

  8. Bayes rule

  9. Bayes theorem

  10. Bias and variance tradeoff

    1. boosting

    2. bootstrap aggregation

    3. bulls eye plot

    4. components

    5. definition

    6. graphical representation

    7. model performance improvements

    8. plot function

    9. random variable

    10. real model prototype

  11. Bias-variance decomposition

  12. Bivariate plots

    1. actual probability

    2. actual vs. predicted plot

      1. CustomerPropensity

      2. IncomeClass

      3. MembershipPoints

    3. frequency

    4. predicted probability

  13. Boosting

  14. Bootstrap aggregation

  15. Bootstrap sampling

    1. advantages

    2. arguments

    3. coefficient

    4. confidence band

    5. density function

    6. disadvantages

    7. histogram

    8. hypothesis testing

    9. jackknife

    10. jackknife estimate

    11. linear regression model

    12. mean and variance

    13. metric estimation

    14. normal distribution

    15. QQ plot

    16. sampling distribution

    17. t.test()

  16. Boxplots

    1. interquartile range

    2. outliers

    3. population

  17. Breusch-Pagan test

  18. Bubble charts

    1. fertility rate vs. life expectancy

    2. GDP per capita vs. life expectancy

  19. Business implications of sampling

    1. deciding factors

    2. features

    3. machine learning

    4. methods and interpretation

    5. shortcomings


  1. C5.0 algorithm

    1. attribute-value description

    2. discrete classes

    3. evaluation

    4. Hunt’s approach

    5. logical classification models

    6. model building

    7. model summary

    8. predefined classes

    9. pruning

    10. purchase prediction dataset

    11. Ross Quinlan’s web page

    12. sufficient data

  2. caretEnsemble() function

  3. Caret package

    1. complex regression and classification problems

    2. function/tools

    3. trainControl() function

    4. train() function algorithm

  4. CART

See Classification and Regression Tree (CART)
  1. Central Limit Theorem

  2. Centroid-based clustering

  3. Chi-Square Automated Interaction Detection (CHAID)

    1. algorithm

    2. building the model

    3. decision tree

    4. model evaluation

    5. R code

    6. splitting

    7. stopping

  4. Classification and Regression Tree (CART)

    1. building the model

    2. cp (complexity parameter)

    3. Gini-Index

    4. model evaluation

    5. pseudo code

    6. regression tree-based approach

    7. rpart function

  5. Classification matrix

  6. Classification tree

  7. Class imbalance

  8. Cluster dendrogram

  9. Cluster sampling

    1. advantages

    2. conditional statement

    3. disadvantages

    4. international transactions

    5. International transactions

    6. k-means function

    7. outstanding balance

    8. population data

    9. single-stage sampling

    10. stratum variable

    11. stratified() function

    12. subsets

    13. two-stage sampling

    14. t.test()

    15. two-stage

  10. Clustering algorithms

  11. Clustering analysis

    1. algorithms

    2. applications

    3. centroid-based clustering

    4. centroid models

    5. connectivity models

    6. definition

    7. density-based clustering

    8. density models

    9. distribution-based clustering

    10. distribution models

    11. Dunn index

    12. external evaluation

    13. hierarchical

    14. internal evaluation

    15. Jaccard index

    16. k-means

    17. machine learning

    18. principle

    19. rand measure

    20. silhouette coefficient

    21. types

    22. unsupervised learning algorithm

  12. Cohort diagrams

    1. active credit cards volume

    2. credit example

    3. definition

  13. Collaborative filtering-based approach

  14. Comma-separated values (CSV)

  15. Computational savings

    1. linear regression model

    2. population dataset

    3. sys.time()

  16. Conditional independence

  17. Confidence interval

  18. Continuous variables

  19. Convenience sampling

  20. Cook’s distance

  21. Correlation, definition

  22. Correlation analysis

    1. features

    2. observations

    3. Pearson correlation

    4. population correlation coefficient

    5. scatter plot, HousePrice vs. StoreArea

    6. statistical relationship

  23. Correlation plots

    1. description

    2. positive or negative correlation

    3. world development indicators

  24. Credit card fraud

    1. data description

    2. data exploration

    3. data import

    4. data transformation

    5. pooled mean and variance

    6. population mean

    7. population variance

    8. sampling plan

    9. statistical measures

  25. Credit risk modeling

  26. Custom search algorithms


  1. Data formats

  2. Data frames

  3. Data mining

  4. Data preparation and exploration

    1. categorical variables

    2. data and visualization

    3. date variable

    4. derived variables

    5. markup language

    6. model building

    7. n-day averages

    8. reshaping

    9. semi-structured

    10. structured

    11. unstructured

    12. variables types

  5. Data science

  6. Dataset

    1. house sale prices prediction

    2. purchase preference prediction

  7. Data visualization, R

  8. Data visualization, R

    1. benefits

    2. boxplots

    3. bubble charts

    4. cohort diagrams

    5. correlation plots

    6. definition

    7. dendrograms

    8. elements, data presentation

    9. ggplot2 package

    10. heatmaps

    11. histograms and density plots

    12. line chart

    13. pie charts

    14. Sankey plots

    15. scatterplot

    16. spatial maps

    17. stacked column charts

    18. time series graphs

    19. waterfall chart

    20. wordclouds

    21. world development indicators

  9. Dates and times

  10. Daylight saving time (DST)

  11. Decision trees

    1. algorithms

    2. bagging

    3. boosting

    4. classification

    5. decision nodes

    6. ensemble models

    7. ID3

    8. leaf nodes

    9. learning methods

    10. measures

      1. entropy

      2. Gini Index

      3. information gain

    11. non-parametric model

    12. regression

  12. Deep learning algorithms

  13. Dendrograms

    1. clusters, species classification

    2. definition

    3. distance/height

    4. ggdendro() and dendextend()

    5. x-axis

    6. y-axis

  14. Density-based clustering

    1. border points

    2. core points

    3. DBSCAN

    4. EM algorithm

    5. outliers

    6. parameters

  15. Density-based spatial clustering of applications with noise (DBSCAN)

  16. Density plot

  17. Dimensionality reduction

    1. algorithms

    2. description

    3. orthogonality, principal components

    4. PCA

    5. principal component analysis

  18. Directed Acyclic Graph (DAG)

  19. Distance-based/event-based algorithms

  20. Distributed processing and storage

    1. GFS

    2. MapReduce

    3. parallel execution in R

      1. cores setting

      2. problem statement

      3. random forest model

      4. stopping clusters

  21. Distribution-based clustering

  22. Distribution of studentized residuals

  23. dplyr

  24. Dunn Index

  25. Durbin-Watson statistics bounds

  26. Durbin-Watson test


  1. Eclat

  2. EM algorithm

  3. Empirical Distribution Function (EDF)

  4. Ensemble learning

    1. methods

      1. bagging

      2. boosting

    2. model performance improvement

    3. supervised learning algorithm

    4. voting ensembles

  5. Ensemble models

  6. Ensemble techniques illustration, R

    1. algorithms, purchase prediction data

    2. bagging trees

    3. blending KNN and Rpart

    4. C5.0 decision tree model

    5. Caret package

    6. caretStack() function

    7. GBM model

    8. resamples() function

    9. stacking, caretEnsemble

  7. Entropy

  8. Exploratory Data Analysis (EDA)

  9. Exposure at Default (EAD)

  10. Extensible Markup Language (XML)


  1. Factor variables

  2. False positive rate (FPR)

  3. Feature engineering

    1. checklist

    2. dimensionality reduction

See Dimensionality reduction
  1. embedded methods

  2. feature ranking

  3. filter methods

  4. selection problem checklist

  5. variable subset selection

See Variable subset selection
  1. working data

    1. continuous/categorical features

    2. EAD

    3. LGD

    4. PD

    5. willingness to pay and ability to pay

  2. wrapper methods

  1. Feature ranking

  2. Feedforward Neural Networks (FFNN)

  3. Fine needle aspirate (FNA)

  4. Fuzzy C-means clustering


  1. Gains charts, AUC

  2. Gauss-Markov theorem

  3. Gene expression programming (GEP)

  4. Generalized Linear Model (GLM)

  5. GFS

See Google File System (GFS)
  1. ggplot2 Package

    1. description

    2. R documentation

  2. Gini-Index

  3. Google File System (GFS)

  4. Gradient Boosting Machine (GBM)


  1. H2O, machine learning in R

    1. clusters initialization

    2. deep learning demo

    3. documented materials

    4. Java virtual machine

    5. package installation

    6. running demo

    7. testing data

  2. Hadoop ecosystem

    1. Apache Pig

      1. command pig -x local connects

      2. count and sort

      3. flattening tokens

      4. group words

      5. load data into A1

      6. tokenize each line

    2. components and tools

    3. Hadoop Distributed File System

    4. Hadoop YARN

    5. HBase

      1. create and put data

      2. data scanning

      3. starting HBase

    6. Hive

      1. Apache

      2. creating tables

      3. data loading, Hive table

      4. describing tables

      5. generating data and storing

      6. HDFS

      7. large-scale data processing

      8. query selection

      9. SQL queries

    7. MapReduce

      1. code snippet

      2. libraries rmr2 and rhdfs

      3. procedures

      4. shuffle

      5. Word Count

      6. wordcount function

    8. Spark

  3. Heat maps

    1. description

    2. regions vs. world development indicators

  4. Hierarchical clustering

  5. Hinge loss

  6. Histogram

    1. construction

    2. description

    3. GDP and population

  7. Homoscedasticity

  8. House sale price dataset

  9. Human cognitive learning

  10. Hyper-parameters

    1. Bayesian approach

    2. decision points

    3. “higher-level” properties

    4. optimization

      1. automatic grid search

      2. custom search algorithms

      3. manual grid search

      4. manual search

      5. optimal search

      6. random search

    5. properties

    6. random forest algorithm

    7. random forest models

  11. Hypertext Markup Language (HTML)

  12. Hypothesis testing


  1. Independent events

  2. Influence plot

  3. Infographics

  4. Information gain

  5. Initial data analysis (IDA)

    1. description

    2. dplyr

    3. multiple sources

    4. naming convention

    5. str() function

    6. table(): pattern

  6. Item-Based Collaborative Filtering (IBCF)

    1. cosine/Pearson correlation

    2. creation rating matrix

    3. data preparation

    4. distribution of ratings

    5. evaluation

    6. exploring, rating matrix

    7. loading data

    8. raw ratings by users

    9. true positive ratio vs. false positive ratio

    10. UBCF recommendation model

  7. Iteration error

  8. Iterative Dichotomizer 3 (ID3)

    1. algorithm

    2. commands

    3. model building

    4. model evaluation

    5. RWeka

    6. RWekajars


  1. Jaccard index

  2. JSON file


  1. Kappa error metric

  2. K-fold cross validation

  3. K-Means Clustering Algorithm

  4. Knowledge Discovery and Data Mining (KDD)

  5. Kolmogorov-Smirnov tests (KS test)

  6. Kurtosis


  1. Law of Large Numbers (LLN)

    1. strong law

    2. weak law

  2. Learning Vector Quantization (LVQ)

  3. Least Absolute Shrinkage and Selection Operator (LASSO)

  4. LGD

See Loss Given Default (LGD)
  1. Lift chart

  2. Linear predictors

    1. bias of estimator

    2. consistent estimator

    3. efficient estimator

    4. OLS

  3. Linear regression

    1. actual vs. predicted

    2. affine function

    3. definition

    4. dependent and independent variable

    5. diagnostics

    6. estimated equation

    7. estimation

    8. Gauss-Markov theorem

    9. lm() package

    10. minimization problem

    11. model diagnostics

      1. homoscedasticity

      2. influential point analysis

      3. multicollinearity

      4. normality of residuals

      5. outliers

      6. residual autocorrelation

    12. OLS

    13. parametric method

    14. predicted values

    15. residuals

    16. standard error

    17. t-value and p-value

  4. Line chart

    1. description

    2. GDP growth, countries

    3. melt() function

  5. Link function

  6. List

  7. Logistic regression

    1. analysis

    2. binomial

    3. binomially distributed

    4. logit transformation

    5. model diagnostics

      1. bivariate plots

      2. concordance and discordant ratios

      3. cumulative gains and lift charts

      4. deviance

      5. log likelihoods

      6. pseudo R-Square

      7. Wald test

    6. multinomial

    7. odds ratio

    8. ordered

    9. predictor variables

  8. Logit function

  9. Logit transformation

  10. Loss Given Default (LGD)

  11. LOWESS plot (Locally Weighted Scatterplot Smoothing)


  1. Machine learning (ML)

    1. abstraction layer

    2. algorithms

      1. ANN

      2. association rule mining

      3. Bayesian algorithms

      4. clustering algorithms

      5. deep learning

      6. dimensionality reduction

      7. distance-based/event-based algorithms

      8. ensemble learning

      9. regression-based methods

      10. regularization methods

      11. text mining

      12. tree-based algorithms

    3. case study

    4. computer vision

    5. 3D approach

      1. demo in R

      2. real-world use case

      3. statistical background

    6. distributions

    7. evaluation

    8. exploration

    9. feature engineering

See Feature engineering
  1. friction-less pipeline

  2. intelligent personal assistant/machines

  3. PEBE framework

  4. phase forms

  5. plethora of algorithms

  6. predictive models

  7. process flow

  8. probability

    1. conditional independence

    2. counting

    3. independent events

    4. notation

    5. statistics

  9. randomness

  10. R-package

  11. statistical concepts

  12. statistical learning

  13. statistical modeling

  14. statistics and computer science

  15. types

    1. factors

    2. reinforcement learning

    3. semi-supervised learning

    4. supervised learning

    5. unsupervised learning

  1. Manual grid search optimization

  2. MapReduce

  3. Market Basket Data

  4. Matrix

  5. Maximum likelihood estimation (MLE)

  6. Mean

  7. Mean absolute error

  8. Mean Absolute Percentage Error (MAPE)

  9. Mean Absolute Scaled Error (MASE)

  10. Microsoft Excel

  11. Model building checklist

  12. Model evaluation

    1. continuous output

      1. mean absolute error

      2. model performance metrics

      3. RMSE

      4. R-square

    2. discrete output

      1. classification matrix

      2. ROC curve

      3. sensitivity and specificity

    3. kappa error metric

    4. population stability index

See Population stability index
  1. probabilistic techniques

See Probabilistic techniques
  1. statistical methods

  1. Model performance

    1. Bayesian optimization

    2. bias and variance tradeoff

See Bias and variance tradeoff
  1. Caret package

  2. continuous output

  3. discrete output

  4. ensemble learning

See Ensemble learning
  1. evaluation

  2. hyper-parameters

  1. machine learning and statistical modeling

  2. testing data

  3. training data

  4. validation data

  1. Model performance

See Model evaluation
  1. Model sampling

  2. Model-selection process

  3. Model suffering

    1. from bias

    2. from variance

  4. Moment

  5. Monte Carlo method

    1. acceptance-rejection methods

    2. beta density

    3. EDF

    4. random sampling techniques

    5. stochastic calculus

  6. Multicollinearity

  7. Multi-Layer Perceptron (MLP)

  8. Multinomial logistic regression

    1. classifier

    2. class imbalance

    3. estimation process

    4. multinom() function

    5. probability/proportion


  1. Naive Bayes method

    1. Bayes theorem

    2. chain rule

    3. conditional probability

    4. data preparation

    5. likelihood and marginal likelihood

    6. model

    7. model evaluation

    8. posterior probability

    9. prior probability

    10. purchase prediction dataset

  2. National Sample Survey Organization (NSSO)

  3. Natural Language Processing (NLP)

  4. Neuron anatomy

  5. Nonparametric Multiplicative Regression (NPMR)

  6. Non-probability sampling

  7. Not Available (NAs)


  1. Online machine learning algorithms

    1. benefits and challenges

    2. fuzzy C-means clustering

    3. tackling

  2. Optimal search optimization

  3. Ordinary Least Square (OLS)


  1. Particle swarm optimization

  2. Part-of-speech (POS)

    1. categorization

    2. extraction

    3. frequency

    4. mapping

    5. pre-processing

  3. Pearson Product-Moment Correlation Coefficient

  4. Perceptron

  5. Performance evaluation metrics

  6. Permutation

  7. Pie charts

  8. Point-of-sale (POS)

  9. Polynomial regression

  10. Pooled mean

  11. Pooled variance

  12. Population stability index

    1. continuous distribution

    2. discrete cases

    3. discrete distributions

    4. ECDF plots, Set_1 and Set_2

    5. Empirical Cumulative Distribution Function (ECDF)

    6. KS test

    7. threshold values

  13. Principal component analysis (PCA)

    1. advantages

    2. orthogonality

    3. steps

  14. Probabilistic techniques

    1. bootstrap sampling

    2. K-fold cross validation

  15. Probability

    1. vs. non-probability sampling

    2. sampling technique

      1. data dimensions

      2. histogram

      3. population mean

      4. population variance

      5. sampling methods

  16. Probability of default (PD)

  17. Pseudo R-Square

  18. Purposive sampling


  1. Quantile

  2. Quota sampling


  1. R

    1. building blocks

    2. calculations

    3. data frames

    4. data structures

    5. functions

    6. GNU S

    7. lists

    8. matrixes

    9. packages

    10. statistics

    11. subsetting

    12. vectors

  2. Radial basis function (RBF)

  3. Rand index

  4. Random Forest

  5. Random search algorithms

  6. Random search optimization

  7. rbinom()

  8. R code

  9. Receiver operating characteristic (ROC) curve

  10. Recommendation algorithm

  11. Recursive binary split

  12. Recursive partitioning

  13. Regression analysis

    1. causation

    2. distributional assumptions

    3. linear model

    4. non-parametric methods

    5. notation

    6. parametric methods

    7. prediction/forecasting

    8. statistical learning and machine learning space

    9. statistical model

    10. variables

  14. Regression-based methods

  15. Regression trees

  16. Regularization algorithms

  17. Reinforcement learning

  18. Relational Database Management Systems (RDBMS)

  19. Residual Sum of Squares (RSS)

  20. Residuals vs. fitted plot

  21. River plots

See Sankey plots
  1. RMSE

See Root mean square error (RMSE)
  1. ROC curve

See Receiver operating characteristic (ROC) curve
  1. Root mean square error (RMSE)

  2. Root node


  1. Sample point

  2. Sampling

    1. bias

    2. classification

    3. description

    4. distribution

    5. error

    6. fraction

    7. objectives

    8. population mean

    9. population statistics

    10. sources and storing

    11. technological advancement

    12. test statistics

    13. variance

  3. Sampling without replacement (SWOR)

  4. Sampling with replacement (SWR)

  5. Sankey plots

  6. Scatterplots

    1. description

    2. higher dimensional

    3. population vs. GDP relationship

  7. Semi-supervised learning

  8. Serial correlation

  9. Shapiro-Wilk test

  10. Sigmoid function

  11. Sigmoid neurons

  12. Silhouette coefficient

  13. Simple random sampling

    1. distribution of data

    2. function

    3. histograms

    4. hypothesis

    5. KS test

    6. population

    7. population average

    8. population sampling

    9. population size

    10. p-value of t.test

    11. replacement

    12. sample and population

    13. sample() function

    14. summarise function

    15. without replacement

  14. Simulated annealing

  15. Simulation

  16. Skewness

  17. Spark’s machine learning

    1. algorithms

    2. build, ML model

    3. MLlib

    4. preprocessing

    5. SparkDataFrame creation

    6. SparkR session, initializing

    7. sparkR.stop()

    8. system properties, setting

    9. test dataset

    10. tools

  18. Spatial maps

    1. data frame creation

    2. ggmap()

    3. ggplot() function

    4. India map, robbery counts

  19. Specialization vs. generalization

  20. Squared Euclidean distance

  21. Stacked column charts

    1. age dependency ratio

    2. contribution, sectors

    3. description

    4. working age ratio

  22. Stacking

  23. Statistical learning

  24. Stratified random sampling

    1. disadvantages

    2. histograms

    3. KS test

    4. population

    5. proportion

    6. sample() function

    7. stratified function

    8. stratified sampling

    9. stratum variables

    10. sub-populations

    11. summarise() function

    12. t.test()

  25. Summary statistics

  26. Supervised learning

  27. Supervised vs. unsupervised learning

  28. Support vector machine (SVM)

    1. binary classifier

      1. data preparation

      2. data summary

      3. model building

      4. model evaluation

    2. classification

    3. class separation

    4. hard margins

    5. linear

    6. multi-class

    7. nonlinearity

    8. overlapping classes

    9. soft margins

  29. Systematic random sampling

    1. business and computational capacity

    2. circular sampling frame

    3. EDF

    4. formula

    5. homogeneous sets

    6. KS test

    7. population variance

    8. sample distribution

    9. sample frame

    10. skip factor

    11. subsetting


  1. Term Frequency/Inverse Document Frequency (TF-IDF)

  2. Text mining algorithms

  3. Text-mining approaches

    1. consumer behavior/product performance

    2. data preparation

    3. data summary

    4. Microsoft Cognitive Services

      1. analytics features

      2. language detection

      3. mscstexta4r

      4. Project Oxford

      5. sentiment analysis

      6. summarization

      7. third-party API

      8. topic detection

      9. twitteR package

    5. NLP

    6. POS tagging

    7. summarization

    8. text analysis

    9. text data

    10. TF-IDF

    11. Twitter statistics

    12. word cloud

  4. Time series graphs

    1. GDP growth, countries

    2. GDP growth, recession

  5. Torsten Hothorn

  6. True Negative Rate (TNR)

  7. True positive rate (TPR)

  8. Twitter feeds and article


  1. UCI Machine Learning Repository

  2. Unsupervised Fuzzy Competitive Learning

  3. Unsupervised learning

  4. User-Based Collaborative Filtering (UBCF)


  1. Variable subset selection

    1. definition

    2. embedded method

      1. fit model

      2. fitted Cross Validated Linear Model

      3. glmnet fit model

      4. logistic regression

      5. misclassification error and log of penalization factor (lambda)

      6. regularization

      7. statistical approaches

    3. filter method

      1. CoV

      2. Gini coefficient

      3. statistical approaches

      4. variance threshold

    4. wrapper method

  2. Variance

  3. Variance inflation factor (VIF)

  4. Vectors


  1. Wald test

  2. Waterfall charts

  3. Within cluster sum of squares (WCSS)

  4. Wordclouds

  5. World development indicators (WDI)

X, Y, Z

  1. XML

See Extensible Markup Language (XML)
