Index

A

Actual vs. predicted plot linear model
Actual vs. predicted plot quadratic polynomial model
Amazon Food Review
American Statistical Association (ASA)
An Exploratory Technique for Investigating Large Quantities of Categorical Data
Apache Pig
Apriori
Area Under the Curve (AUC)
Artificial intelligence (AI)
Artificial neural networks (ANN)
1. architecture
  1. components
  2. linear separability
  3. MLP
2. attribute importance
  1. by Garson method
  2. by Olden method
3. deep learning
  1. applications
  2. architecture
  3. darch for classification
  4. guidelines
  5. hidden layers
  6. multi-layer
  7. multiple linear and non-linear transformations
  8. mxNet image classification
  9. mxNet package
  10. normalized image
  11. volcano picture, image recognition exercise
4. evolutionary methods
5. expectation maximization
6. feed-forward back-propagation
7. GEP
8. hidden layer
9. human cognitive learning
10. learning algorithms
11. machine learning
12. non-parametric methods
13. particle swarm optimization
14. perceptron
15. purchase prediction
16. sigmoid neuron
17. simulated annealing
18. supervised vs. unsupervised neural nets
Association rule mining (ARM)
1. algorithms
2. apriori
3. confidence
4. Eclat
5. IBCF
6. item frequency plot
7. lift
8. Market Basket data
9. POS
10. scarcity visualization
11. support
12. transactional data
13. UBCF
Autocorrelation
Auto-correlation function (ACF)
Automatic grid search optimization

B

Back-propagation learning
Back-propagation method
Back-propagation of errors
Bagging
1. bootstrap aggregating
2. CART
3. random forest
Bayes formula
Bayesian algorithms
Bayesian optimization, machine learning models
1. black box function
2. Gaussian processes
3. parameters
4. random tuning
5. RMSE, cost and Sigma space
6. sample t-test
Bayes rule
Bayes theorem
Bias and variance tradeoff
1. boosting
2. bootstrap aggregation
3. bulls eye plot
4. components
5. definition
6. graphical representation
7. model performance improvements
8. plot function
9. random variable
10. real model prototype
Bias-variance decomposition
Bivariate plots
1. actual probability
2. actual vs. predicted plot
  1. CustomerPropensity
  2. IncomeClass
  3. MembershipPoints
3. frequency
4. predicted probability
Boosting
Bootstrap aggregation
Bootstrap sampling
1. advantages
2. arguments
3. coefficient
4. confidence band
5. density function
6. disadvantages
7. histogram
8. hypothesis testing
9. jackknife
10. jackknife estimate
11. linear regression model
12. mean and variance
13. metric estimation
14. normal distribution
15. QQ plot
16. sampling distribution
17. t.test()
Boxplots
1. interquartile range
2. outliers
3. population
Breusch-Pagan test
Bubble charts
1. fertility rate vs. life expectancy
2. GDP per capita vs. life expectancy
Business implications of sampling
1. deciding factors
2. features
3. machine learning
4. methods and interpretation
5. shortcomings

C

C5.0 algorithm
1. attribute-value description
2. discrete classes
3. evaluation
4. Hunt’s approach
5. logical classification models
6. model building
7. model summary
8. predefined classes
9. pruning
10. purchase prediction dataset
11. Ross Quinlan’s web page
12. sufficient data
caretEnsemble() function
Caret package
1. complex regression and classification problems
2. function/tools
3. trainControl() function
4. train() function algorithm
CART

See Classification and Regression Tree (CART)

Central Limit Theorem
Centroid-based clustering
Chi-Square Automated Interaction Detection (CHAID)
1. algorithm
2. building the model
3. decision tree
4. model evaluation
5. R code
6. splitting
7. stopping
Classification and Regression Tree (CART)
1. building the model
2. cp (complexity parameter)
3. Gini-Index
4. model evaluation
5. pseudo code
6. regression tree-based approach
7. rpart function
Classification matrix
Classification tree
Class imbalance
Cluster dendrogram
Cluster sampling
1. advantages
2. conditional statement
3. disadvantages
4. international transactions
5. k-means function
6. outstanding balance
7. population data
8. single-stage sampling
9. stratum variable
10. stratified() function
11. subsets
12. two-stage sampling
13. t.test()
14. two-stage
Clustering algorithms
Clustering analysis
1. algorithms
2. applications
3. centroid-based clustering
4. centroid models
5. connectivity models
6. definition
7. density-based clustering
8. density models
9. distribution-based clustering
10. distribution models
11. Dunn index
12. external evaluation
13. hierarchical
14. internal evaluation
15. Jaccard index
16. k-means
17. machine learning
18. principle
19. rand measure
20. silhouette coefficient
21. types
22. unsupervised learning algorithm
Cohort diagrams
1. active credit cards volume
2. credit example
3. definition
Collaborative filtering-based approach
Comma-separated values (CSV)
Computational savings
1. linear regression model
2. population dataset
3. sys.time()
Conditional independence
Confidence interval
Continuous variables
Convenience sampling
Cook’s distance
Correlation, definition
Correlation analysis
1. features
2. observations
3. Pearson correlation
4. population correlation coefficient
5. scatter plot, HousePrice vs. StoreArea
6. statistical relationship
Correlation plots
1. description
2. positive or negative correlation
3. world development indicators
Credit card fraud
1. data description
2. data exploration
3. data import
4. data transformation
5. pooled mean and variance
6. population mean
7. population variance
8. sampling plan
9. statistical measures
Credit risk modeling
Custom search algorithms

D

Data formats
Data frames
Data mining
Data preparation and exploration
1. categorical variables
2. data and visualization
3. date variable
4. derived variables
5. markup language
6. model building
7. n-day averages
8. reshaping
9. semi-structured
10. structured
11. unstructured
12. variables types
Data science
Dataset
1. house sale prices prediction
2. purchase preference prediction
Data visualization, R
1. benefits
2. boxplots
3. bubble charts
4. cohort diagrams
5. correlation plots
6. definition
7. dendrograms
8. elements, data presentation
9. ggplot2 package
10. heatmaps
11. histograms and density plots
12. line chart
13. pie charts
14. Sankey plots
15. scatterplot
16. spatial maps
17. stacked column charts
18. time series graphs
19. waterfall chart
20. wordclouds
21. world development indicators
Dates and times
Daylight saving time (DST)
Decision trees
1. algorithms
2. bagging
3. boosting
4. classification
5. decision nodes
6. ensemble models
7. ID3
8. leaf nodes
9. learning methods
10. measures
  1. entropy
  2. Gini Index
  3. information gain
11. non-parametric model
12. regression
Deep learning algorithms
Dendrograms
1. clusters, species classification
2. definition
3. distance/height
4. ggdendro() and dendextend()
5. x-axis
6. y-axis
Density-based clustering
1. border points
2. core points
3. DBSCAN
4. EM algorithm
5. outliers
6. parameters
Density-based spatial clustering of applications with noise (DBSCAN)
Density plot
Dimensionality reduction
1. algorithms
2. description
3. orthogonality, principal components
4. PCA
5. principal component analysis
Directed Acyclic Graph (DAG)
Distance-based/event-based algorithms
Distributed processing and storage
1. GFS
2. MapReduce
3. parallel execution in R
  1. cores setting
  2. problem statement
  3. random forest model
  4. stopping clusters
Distribution-based clustering
Distribution of studentized residuals
dplyr
Dunn Index
Durbin-Watson statistic bounds
Durbin-Watson test

E

Eclat
EM algorithm
Empirical Distribution Function (EDF)
Ensemble learning
1. methods
  1. bagging
  2. boosting
2. model performance improvement
3. supervised learning algorithm
4. voting ensembles
Ensemble models
Ensemble techniques illustration, R
1. algorithms, purchase prediction data
2. bagging trees
3. blending KNN and Rpart
4. C5.0 decision tree model
5. Caret package
6. caretStack() function
7. GBM model
8. resamples() function
9. stacking, caretEnsemble
Entropy
Exploratory Data Analysis (EDA)
Exposure at Default (EAD)
Extensible Markup Language (XML)

F

Factor variables
False positive rate (FPR)
Feature engineering
1. checklist
2. dimensionality reduction. See Dimensionality reduction
3. embedded methods
4. feature ranking
5. filter methods
6. selection problem checklist
7. variable subset selection. See Variable subset selection
8. working data
  1. continuous/categorical features
  2. EAD
  3. LGD
  4. PD
  5. willingness to pay and ability to pay
9. wrapper methods

Feature ranking
Feedforward Neural Networks (FFNN)
Fine needle aspirate (FNA)
Fuzzy C-means clustering

G

Gains charts, AUC
Gauss-Markov theorem
Gene expression programming (GEP)
Generalized Linear Model (GLM)
GFS

See Google File System (GFS)

ggplot2 Package
1. description
2. R documentation
Gini-Index
Google File System (GFS)
Gradient Boosting Machine (GBM)

H

H2O, machine learning in R
1. clusters initialization
2. deep learning demo
3. documented materials
4. java virtual machine
5. package installation
6. running demo
7. testing data
Hadoop ecosystem
1. Apache Pig
  1. command pig-x local connects
  2. count and sort
  3. flattening tokens
  4. group words
  5. load data into A1
  6. tokenize each line
2. components and tools
3. hadoop distributed file system
4. Hadoop YARN
5. HBase
  1. create and put data
  2. data scanning
  3. starting HBase
6. Hive
  1. Apache
  2. creating tables
  3. data loading, Hive table
  4. describing tables
  5. generating data and storing
  6. HDFS
  7. large-scale data processing
  8. query selection
  9. SQL queries
7. MapReduce
  1. code snippet
  2. libraries rmr2 and rhdfs
  3. procedures
  4. shuffle
  5. Word Count
  6. wordcount function
8. spark
Heat maps
1. description
2. regions vs. world development indicators
Hierarchical clustering
Hinge loss
Histogram
1. construction
2. description
3. GDP and population
Homoscedasticity
House sale price dataset
Human cognitive learning
Hyper-parameters
1. Bayesian approach
2. decision points
3. “higher-level” properties
4. optimization
  1. automatic grid search
  2. custom search algorithms
  3. manual grid search
  4. manual search
  5. optimal search
  6. random search
5. properties
6. random forest algorithm
7. random forest models
Hypertext Markup Language (HTML)
Hypothesis testing

I

Independent events
Influence plot
Infographics
Information gain
Initial data analysis (IDA)
1. description
2. dplyr
3. multiple sources
4. naming convention
5. str() function
6. table(): pattern
Item-Based Collaborative Filtering (IBCF)
1. cosine/Pearson correlation
2. creation rating matrix
3. data preparation
4. distribution of ratings
5. evaluation
6. exploring, rating matrix
7. loading data
8. raw ratings by users
9. true positive ratio vs. false positive ratio
10. UBCF recommendation model
Iteration error
Iterative Dichotomizer 3 (ID3)
1. algorithm
2. commands
3. model building
4. model evaluation
5. RWeka
6. RWekajars

J

Jaccard index
JSON file

K

Kappa error metric
K-fold cross validation
K-Means Clustering Algorithm
Knowledge Discovery and Data Mining (KDD)
Kolmogorov-Smirnov tests (KS test)
Kurtosis

L

Law of Large Numbers (LLN)
1. strong law
2. weak law
Learning Vector Quantization (LVQ)
Least Absolute Shrinkage and Selection Operator (LASSO)
LGD

See Loss Given Default (LGD)

Lift chart
Linear predictors
1. bias of estimator
2. consistent estimator
3. efficient estimator
4. OLS
Linear regression
1. actual vs. predicted
2. affine function
3. definition
4. dependent and independent variable
5. diagnostics
6. estimated equation
7. estimation
8. Gauss-Markov theorem
9. lm() package
10. minimization problem
11. model diagnostics
  1. homoscedasticity
  2. influential point analysis
  3. multicollinearity
  4. normality of residuals
  5. outliers
  6. residual autocorrelation
12. OLS
13. parametric method
14. predicted values
15. residuals
16. standard error
17. t-value and p-value
Line chart
1. description
2. GDP growth, countries
3. melt() function
Link function
List
Logistic regression
1. analysis
2. binomial
3. binomially distributed
4. logit transformation
5. model diagnostics
  1. bivariate plots
  2. concordance and discordant ratios
  3. cumulative gains and lift charts
  4. deviance
  5. log likelihoods
  6. pseudo R-Square
  7. wald test
6. multinomial
7. odds ratio
8. ordered
9. predictor variables
Logit function
Logit transformation
Loss Given Default (LGD)
LOWESS plot (Locally Weighted Scatterplot Smoothing)

M

Machine learning (ML)
1. abstraction layer
2. algorithms
  1. ANN
  2. association rule mining
  3. Bayesian algorithms
  4. clustering algorithms
  5. deep learning
  6. dimensionality reduction
  7. distance-based/event-based algorithms
  8. ensemble learning
  9. regression-based methods
  10. regularization methods
  11. text mining
  12. tree-based algorithms
3. case study
4. computer vision
5. 3D approach
  1. demo in R
  2. real-world use case
  3. statistical background
6. distributions
7. evaluation
8. exploration
9. feature engineering

SeeFeature engineering

friction-less pipeline
intelligent personal assistant/machines
PEBE framework
phase forms
plethora of algorithms
predictive models
process flow
probability
1. conditional independence
2. counting
3. independent events
4. notation
5. statistics
randomness
R-package
statistical concepts
statistical learning
statistical modeling
statistics and computer science
types
1. factors
2. reinforcement learning
3. semi-supervised learning
4. supervised learning
5. unsupervised learning

Manual grid search optimization
MapReduce
Market Basket Data
Matrix
Maximum likelihood estimation (MLE)
Mean
Mean absolute error
Mean Absolute Percentage Error (MAPE)
Mean Absolute Scaled Error (MASE)
Microsoft Excel
Model building checklist
Model evaluation
1. continuous output
  1. mean absolute error
  2. model performance metrics
  3. RMSE
  4. R-square
2. discrete output
  1. classification matrix
  2. ROC curve
  3. sensitivity and specificity
3. kappa error metric
4. population stability index. See Population stability index
5. probabilistic techniques. See Probabilistic techniques
6. statistical methods

Model performance
1. Bayesian optimization
2. bias and variance tradeoff. See Bias and variance tradeoff
3. Caret package
4. continuous output
5. discrete output
6. ensemble learning. See Ensemble learning
7. evaluation
8. hyper-parameters. See Hyper-parameters
9. machine learning and statistical modeling
10. testing data
11. training data
12. validation data

Model performance

See Model evaluation

Model sampling
Model-selection process
Model suffering
1. from bias
2. from variance
Moment
Monte Carlo method
1. acceptance-rejection methods
2. beta density
3. EDF
4. random sampling techniques
5. stochastic calculus
Multicollinearity
Multi-Layer Perceptron (MLP)
Multinomial logistic regression
1. classifier
2. class imbalance
3. estimation process
4. multinom() function
5. probability/proportion

N

Naive Bayes method
1. Bayes theorem
2. chain rule
3. conditional probability
4. data preparation
5. likelihood and marginal likelihood
6. model
7. model evaluation
8. posterior probability
9. prior probability
10. purchase prediction dataset
National Sample Survey Organization (NSSO)
Natural Language Processing (NLP)
Neuron anatomy
Nonparametric Multiplicative Regression (NPMR)
Non-probability sampling
Not Available (NAs)

O

Online machine learning algorithms
1. benefits and challenges
2. fuzzy C-means clustering
3. tackling
Optimal search optimization
Ordinary Least Squares (OLS)

P

Particle swarm optimization
Part-of-speech (POS)
1. categorization
2. extraction
3. frequency
4. mapping
5. pre-processing
Pearson Product-Moment Correlation Coefficient
Perceptron
Performance evaluation metrics
Permutation
Pie charts
Point-of-sale (POS)
Polynomial regression
Pooled mean
Pooled variance
Population stability index
1. continuous distribution
2. discrete cases
3. discrete distributions
4. ECDF plots, Set_1 and Set_2
5. Empirical Cumulative Distribution Function (ECDF)
6. KS test
7. threshold values
Principal component analysis (PCA)
1. advantages
2. orthogonality
3. steps
Probabilistic techniques
1. bootstrap sampling
2. K-fold cross validation
Probability
1. vs. non-probability sampling
2. sampling technique
  1. data dimensions
  2. histogram
  3. population mean
  4. population variance
  5. sampling methods
Probability of default (PD)
Pseudo R-Square
Purposive sampling

Q

Quantile
Quota sampling

R

R
1. building blocks
2. calculations
3. data frames
4. data structures
5. functions
6. GNU S
7. lists
8. matrixes
9. packages
10. statistics
11. subsetting
12. vectors
Radial basis function (RBF)
Rand index
Random Forest
Random search algorithms
Random search optimization
rbinom()
R code
Receiver operating characteristic (ROC) curve
Recommendation algorithm
Recursive binary split
Recursive partitioning
Regression analysis
1. causation
2. distributional assumptions
3. linear model
4. non-parametric methods
5. notation
6. parametric methods
7. prediction/forecasting
8. statistical learning and machine learning space
9. statistical model
10. variables
Regression-based methods
Regression trees
Regularization algorithms
Reinforcement learning
Relational Database Management Systems (RDBMS)
Residual Sum of Squares (RSS)
Residuals vs. fitted plot
River plots

See Sankey plots

RMSE

See Root mean square error (RMSE)

ROC curve

See Receiver operating characteristic (ROC) curve

Root mean square error (RMSE)
Root node

S

Sample point
Sampling
1. bias
2. classification
3. description
4. distribution
5. error
6. fraction
7. objectives
8. population mean
9. population statistics
10. sources and storing
11. technological advancement
12. test statistics
13. variance
Sampling without replacement (SWOR)
Sampling with replacement (SWR)
Sankey plots
Scatterplots
1. description
2. higher dimensional
3. population vs. GDP relationship
Semi-supervised learning
Serial correlation
Shapiro-Wilk test
Sigmoid function
Sigmoid neurons
Silhouette coefficient
Simple random sampling
1. distribution of data
2. function
3. histograms
4. hypothesis
5. KS test
6. population
7. population average
8. population sampling
9. population size
10. p-value of t.test
11. replacement
12. sample and population
13. sample() function
14. summarise function
15. without replacement
Simulated annealing
Simulation
Skewness
Spark’s machine learning
1. algorithms
2. build, ML model
3. MLlib
4. preprocessing
5. SparkDataFrame creation
6. SparkR session, initializing
7. sparkR.stop()
8. system properties, setting
9. test dataset
10. tools
Spatial maps
1. data frame creation
2. ggmap()
3. ggplot() function
4. India map, robbery counts
Specialization vs. generalization
Squared Euclidean distance
Stacked column charts
1. age dependency ratio
2. contribution, sectors
3. description
4. working age ratio
Stacking
Statistical learning
Stratified random sampling
1. disadvantages
2. histograms
3. KS test
4. population
5. proportion
6. sample() function
7. stratified function
8. stratified sampling
9. stratum variables
10. sub-populations
11. summarise() function
12. t.test()
Summary statistics
Supervised learning
Supervised vs. unsupervised learning
Support vector machine (SVM)
1. binary classifier
  1. data preparation
  2. data summary
  3. model building
  4. model evaluation
2. classification
3. class separation
4. hard margins
5. linear
6. multi-class
7. nonlinearity
8. overlapping classes
9. soft margins
Systematic random sampling
1. business and computational capacity
2. circular sampling frame
3. EDF
4. formula
5. homogeneous sets
6. KS test
7. population variance
8. sample distribution
9. sample frame
10. skip factor
11. subsetting

T

Term Frequency-Inverse Document Frequency (TF-IDF)
Text mining algorithms
Text-mining approaches
1. consumer behavior/product performance
2. data preparation
3. data summary
4. Microsoft Cognitive Services
  1. analytics features
  2. language detection
  3. mscstexta4r
  4. Project Oxford
  5. sentiment analysis
  6. summarization
  7. third-party API
  8. topic detection
  9. twitteR package
5. NLP
6. POS tagging
7. summarization
8. text analysis
9. text data
10. TF-IDF
11. Twitter statistics
12. word cloud
Time series graphs
1. GDP growth, countries
2. GDP growth, recession
Torsten Hothorn
True Negative Rate (TNR)
True positive rate (TPR)
Twitter feeds and article

U

UCI Machine Learning Repository
Unsupervised Fuzzy Competitive Learning
Unsupervised learning
User-Based Collaborative Filtering (UBCF)

V

Variable subset selection
1. definition
2. embedded method
  1. fit model
  2. fitted Cross Validated Linear Model
  3. glmnet fit model
  4. logistic regression
  5. misclassification error and log of penalization factor (lambda)
  6. regularization
  7. statistical approaches
3. filter method
  1. CoV
  2. Gini coefficient
  3. statistical approaches
  4. variance threshold
4. wrapper method
Variance
Variance inflation factor (VIF)
Vectors

W

Wald test
Waterfall charts
Within cluster sum of squares (WCSS)
Wordclouds
World development indicators (WDI)

X, Y, Z

XML

See Extensible Markup Language (XML)
