1-of-N pseudo variables, 97
accuracy
of classification models, 107-112
of time-series forecasting methods, assessing, 176-177
Acxiom, 58
advanced analytics, 17
affinity analysis. See association rule mining
affordability of analytics, 5-6
aggregating data, 74
AIC (Akaike information criterion), 118
algorithms
for association rule mining, 127
for decision tree creation, 116
genetic algorithms, 114
k-means clustering algorithm, 122-123
k-nearest neighbor algorithm, 113, 142-143
Minkowski distance, finding, 144-145
parameter selection, 145
variables, 97
analytics
advanced analytics, 17
Big Data, 14
cancer survivability studies, 87-90
data mining, 4
descriptive analytics, 17
hierarchy of, 16
ERP, 10
ERP systems, 12
executive information systems, 12
rule-based ESs, 11
in-database analytics, 242
in-memory analytics, 242
prescriptive analytics, 18
reasons for popularity
culture change, 6
need, 5
data, 8
privacy, 9
return on investment, 8
security, 9
talent, 7
technology, 9
terminology, 2
ANNs (artificial neural networks), 147-158. See also neural networks
back-propagation algorithm, 155-157
and biological neural networks, 149-152
feed-forward neural networks, 154
neurons, 152
processing elements, 153
appliances, 242
application examples
Big Data for political campaigns, 261-263
data mining for complex medical procedures, 177-180
data mining for Hollywood managers, 60-65
methodology, 62
results, 63
predicting NCAA bowl game results, 130-139
text mining of research literature, 209-213
text-based deception detection, 225-227
area under the ROC curve, 112
association rule mining, 123-127
assumptions in linear regression, 171-173
attributes, 114
automated data collection, 5-6
availability of analytics, 5-6
back-propagation algorithm, 155-157
banking, data mining applications, 40
Bayesian classifiers, 114
BI (business intelligence), 17
BIC (Bayesian information criterion), 118
Big Data
challenges to Big Data analytics implementation, 242-243
characteristics of
value proposition, 238
variety, 236
veracity, 237
critical success factors, 240-241
educational requirements, 256
skill requirements, 257
high-performance computing, 242
problems addressed by, 244
NoSQL, 254
terminology, 2
binary frequencies, 205
biological neural networks, 148-152
BioText, 199
bootstrapping, 111
branches, 114
brokerages, data mining applications, 41
building
algorithms, 116
splitting indices, 116
linear regression models, 167-171
model deployment, 164
model development, 163
business applications for analytics, 6-7
business intelligence, 2
OLAP, 38
cancer survivability, analytic studies in, 87-90
case-based reasoning, 113
challenges of analytics adoption, 7-9
Big Data analytics adoption, 242-243
data, 8
privacy, 9
return on investment, 8
security, 9
talent, 7
technology, 9
characteristics of Big Data
value proposition, 238
variety, 236
veracity, 237
classification, 47-49, 105-114. See also cluster analysis
Bayesian classifiers, 114
case-based reasoning, 113
attributes, 114
branches, 114
nodes, 114
genetic algorithms, 114
models, 107
accuracy, 107
estimating accuracy of, 107-112
interpretability, 107
robustness, 107
scalability, 107
speed, 107
nearest-neighbor algorithm, 113
N-P polarity classification, 222-223
rough sets, 114
statistical analysis, 113
SVMs, 113
ClearForest, 214
cluster analysis
determining number of clusters, 118
distance measures, 119
examples of, 117
external assessment methods, 121-122
hierarchical clustering methods, 120
internal assessment methods, 121
k-means clustering algorithm, 122-123
partitive clustering methods, 120
coefficients (logistic regression), 175
commercial text mining software, 213-214
comparing
commercial and free data mining tools, 52
correlation and regression, 166-167
data mining and statistics, 39
data mining methodologies, 86-89
SEMMA and CRISP-DM, 82
concepts, 187
confusion matrices, 108
contingency tables, 108
corpora, 187
correlation versus regression, 166-167
CRISP-DM (Cross-Industry Standard Process for Data Mining), 69-77
comparing with SEMMA, 82
deployment, 77
testing and evaluation, 76
critical success factors for Big Data, 240-241
CRM (customer relationship management), 40
cross-validation methodologies, 132
k-nearest neighbor algorithm, 145-147
cubes, 38
Cutting, Doug, 248
discrete data, 95
interval data, 96
nominal data, 95
numeric data, 95
ordinal data, 95
ratio data, 96
structured data, 94
traditional data, 232
unstructured data, 94
Data and Text Analytics Toolkit, 214
data mining
algorithms, 141
nearest-neighbor algorithm, 142-143
applications
banking, 40
brokerages, 41
CRM, 40
entertainment industry, 43
finance, 40
government, 42
health care industry, 43
insurance, 41
law enforcement, 44
manufacturing, 41
marketing, 40
retailing and logistics, 41
association rule mining, 49, 123-127
algorithms, 127
cancer survivability, analytic studies in, 87-90
classification, 47-49, 105-114
Bayesian classifiers, 114
case-based reasoning, 113
genetic algorithms, 114
models, 107
nearest-neighbor algorithm, 113
neural networks, 113
rough sets, 114
statistical analysis, 113
SVMs, 113
determining number of clusters, 118
distance measures, 119
examples of, 117
external assessment methods, 121-122
hierarchical clustering methods, 120
internal assessment methods, 121
k-means clustering algorithm, 122-123
partitive clustering methods, 120
unstructured data, 94
data preprocessing
data consolidation phase, 99
data scrubbing phase, 99
data transformation phase, 100
defining, 36
methodology, 62
link analysis, 49
methodologies, comparing, 86-89
neural networks, 48
OLAP, 38
associations, 45
clusters, 46
predictive analytics, 105
selling customer data, 58
rule induction, 49
sequence mining, 49
standardized processes, 67
and statistics, 39
structured data, 94
initiatives, 199
tools, 52
KNIME, 52
RapidMiner, 52
top 10 tools, 55
vendors, 51
Weka, 52
visualization, 51
data consolidation phase, 99
data scrubbing phase, 99
data transformation phase, 100
in text mining process, 202-204
educational requirements, 256
experimental physicists, 256
skill requirements, 257
data scrubbing, 99
databases
HBase, 254
in-database analytics, 242
OLAP, 38
deception detection, 197
text-based deception detection, application example, 225-227
decision trees
attributes, 114
branches, 114
algorithms, 116
splitting indices, 116
nodes, 114
Deep Blue, 20
defining data mining, 36
de-identified customer records, 57
descriptive analytics, 17
detecting objectivity, 222
developing SVM models, 163
dimensional reduction, 101
discrete data, 95
distance measures, 119
DMAIC (Define, Measure, Analyze, Improve, and Control) methodology, 83-86
EB (exabyte), 235
ECHELON, 196
educational requirements for data scientists, 256
entertainment industry, data mining applications, 43
ERP (enterprise resource planning), 10, 12
ESs (expert systems), rule-based, 11
estimating accuracy of classification models, 107-112
k-fold cross-validation, 110-111
simple split methodology, 109-110
Euclidean distance, 119
Big Data, 14
ERP, 10
ERP systems, 12
executive information systems, 12
rule-based ESs, 11
examples
application examples
Big Data for political campaigns, 261-263
data mining for complex medical procedures, 177-180
data mining for Hollywood managers, 60-65
predicting NCAA bowl game results, 130-139
text mining of research literature, 209-213
of cluster analysis, 117
executive information systems, 12
experimental physicists, 256
explanatory variable, relationship to response variable, 169
explicit sentiment, 217
external assessment methods, 121-122
feed-forward neural networks, 154
finance
data mining applications, 40
sentiment analysis applications, 219-220
Ford, Henry, 10
free software tools
for data mining, 52
in education, 27
in finance, 26
in research, 28
GATE, 215
GeB (gegobyte), 235
genetic algorithms, 114
GIGO (garbage in, garbage out) rule, 97-98
Gini index, 116
government
data mining applications, 42
sentiment analysis applications, 220
grid computing, 242
HBase, 254
HDFS (Hadoop Distributed File System), 248
health care industry, data mining applications, 43
hierarchical clustering methods, 120
hierarchy of analytics, 16
descriptive analytics, 17
predictive analytics, 18
prescriptive analytics, 18
high-performance computing, 242
Big Data, 14
ERP, 10
ERP systems, 12
executive information systems, 12
rule-based ESs, 11
Hollywood, data mining in motion picture industry, 60-64
homeland security, data mining applications, 44
homonyms, 188
Human Genome Project, 198
human-generated data, 232
hyperplanes, 160
in education, 27
in finance, 26
in research, 28
associations, 45
clusters, 46
predictions, 45
targets of expressed sentiment, 223-224
in-database analytics, 242
indices, representing, 204-205
information, 32
information gain, 116
INFORMS (Institute for Operations Research and the Management Sciences), 15-16
initiatives in text mining, 199
in-memory analytics, 242
insurance industry, data mining applications, 41
internal assessment methods, 121
interpretability of classification models, 107
interval data, 96
inverse document frequencies, 205
jackknifing, 112
Jackman, Simon, 263
Jennings, Ken, 21
job tracker nodes, 250
KDD (knowledge discovery in databases) process, 67-68
KDnuggets.com, 54
k-fold cross-validation, 109-111
k-means clustering algorithm, 122-123
k-nearest neighbor algorithm, 142-147
KNIME, 52
KXEN Text Coder, 214
law enforcement, data mining applications, 44
leave-one-out methodology, 111
linear regression
numeric assessment of model, 169-171
OLS method, 168
LingPipe, 215
link analysis, 49
location-prediction systems, 198
log frequencies, 205
logistic regression
coefficients, 175
logistic function, 174
models, 174
logistics, data mining applications, 41
machine-learning techniques, 141
nearest-neighbor algorithm, 142-143
hyperplanes, 160
machine-generated data, 232
Manhattan distance, 119
manufacturing, data mining applications, 41
market-basket analysis, 123-127
marketing
data mining applications, 40
text mining applications, 195
medicine
data mining applications, 43
text mining applications, 197-199
Megaputer Text Analyst, 214
Microsoft Enterprise Consortium, 53-54
Minkowski distance, finding, 144-145
misconceptions about data mining, 129-130
MLP (multilayered perceptron) architecture, 154-155
models, 45
classification models, 107
accuracy, 107
estimating accuracy of, 107-112
interpretability, 107
robustness, 107
scalability, 107
speed, 107
linear regression models, building, 167-171
logistic regression models, 174
model deployment, 164
model development, 163
morphology, 188
motion picture industry, data mining in, 60-64
methodology, 62
Movie Forecast Guru, 64
multicollinearity, 172
multiple regression, 167
name nodes, 249
National Centre for Text Mining, 199
nearest-neighbor algorithm, 113, 142-143
Minkowski distance, finding, 144-145
parameter selection, 145
NER (named entity recognition), 198
network structure of ANNs, 153-154
neural networks, 48, 113, 147-158
ANNs
back-propagation algorithm, 155-157
processing elements, 153
biological neural networks, 148
feed-forward neural networks, 154
neurons, 152
neurons, 152
NLP (natural language processing), 189-194. See also text mining
challenges associated with, 191-192
nodes, 114
normalization methods, 205
NoSQL, 254
N-P polarity classification, 222-223
numeric data, 95
OASIS (Overall Analysis System for Intelligence Support), 196
objectivity, detecting, 222
OLAP (online analytical processing), 38
OLS (ordinary least squares) method, 168
Open Calais, 215
OR (operations research), 10
OTMI (Open Text Mining Interface), 199
partitive clustering methods, 120
part-of-speech tagging, 188
Patil, D. J., 255
associations, 45
clusters, 46
predictions, 45
phases of data preprocessing, 102-103
data consolidation phase, 99
data scrubbing phase, 99
data transformation phase, 100
polarity, identifying, 224-225
politics
sentiment analysis applications, 220
polysemes, 188
popularity of analytics, reasons for
culture change, 6
need, 5
predicting NCAA bowl game results, 130-139
in motion picture industry, 60-64
methodology, 62
time-series forecasting, 175-180
accuracy of methods, assessing, 176-177
averaging methods, 176
prescriptive analytics, 18
and predictive analytics, 58-59
as roadblock to analytics adoption, 9
problems addressed by Big Data, 244
processing elements of ANNs, 153
qualitative data, 73
quantitative data, 73
Quinlan, Ross, 116
ratio data, 96
linear regression
OLS method, 168
coefficients, 175
logistic function, 174
models, 174
multiple regression, 167
simple regression, 167
response variable, relationship to explanatory variable, 169
retail, data mining applications, 41
RMSE (root mean square error), 169
roadblocks to analytics adoption, 7-9
data, 8
privacy, 9
return on investment, 8
security, 9
talent, 7
technology, 9
robustness of classification models, 107
rough sets, 114
rule induction, 49
rule-based ESs, 11
Rutter, Brad, 21
SAS Text Miner, 214
scalability of classification models, 107
scatter plots, 167
secondary nodes, 250
securities trading, data mining applications, 41
security
as roadblock to analytics adoption, 9
text mining applications, 196-197
selling customer data, privacy issues, 58
SEMMA (sample, explore, modify, model, assess), 78-82
sensitivity analysis in ANNs, 157-158
sentiment analysis
applications
government intelligence, 220
politics, 220
VOC, 218
VOE, 219
explicit sentiment, 217
implicit sentiment, 217
multistep process
collection and aggregation, 224
N-P polarity classification, 222-223
sentiment detection, 222
target identification, 223-224
polarity, 217
sequence mining, 49
Silver, Nate, 263
simple regression, 167
simple split methodology, 109-110
singular-value decomposition, 189
skills required for data scientists, 257
slave nodes, 250
speed of classification models, 107
splitting indices, 116
sports, data mining applications, 44-45
SPSS Modeler, 214
Spy-EM, 215
standardized data mining processes, 67
comparing with SEMMA, 82
deployment, 77
testing and evaluation, 76
Statistica Text Mining engine, 214
statistical analysis, 113
statistics, 39
stemming, 187
stop words, 187
stratified k-fold cross-validation, 137
structured data, 94
survivability of cancer, analytic studies in, 87-90
SVMs (support vector machines), 113, 159-165
hyperplanes, 160
model deployment, 164
model development, 163
synonyms, 188
synthesis, 3
Target, use of predictive analytics, 58-59
target of expressed sentiment, identifying, 223-224
descriptive analytics, 17
predictive analytics, 18
prescriptive analytics, 18
Taylor, Frederick Winslow, 10
TDM (term-document matrix), establishing, 202-204
reducing dimensionality, 206
technical components in Hadoop, 249-250
term dictionaries, 188
terms, 187
test sets, 110
text mining
applications, 186
marketing, 195
initiatives, 199
challenges associated with, 191-192
three-task process
establishing the corpus, 202
tools
commercial software tools, 213-214
free text mining tools, 214-215
text-based deception detection, 225-227
three-task text mining process
establishing the corpus, 202
time-series forecasting, 51, 175-180
accuracy of methods, assessing, 176-177
averaging methods, 176
tokenizing, 188
top 10 data mining tools, 55
Torch Concepts, 58
traditional data, 232
training sets, 110
travel industry, data mining applications, 42
tuples, 259
VantagePoint, 214
variables, 97
1-of-N pseudo variables, 97
relationship between response and explanatory variables, 169
scatter plots, 167
vendors of data mining tools, 51
visual analytics, 51
visualization, 51
Vivisimo/Clusty, 215
VOC (voice of the customer), 218
VOE (voice of the employee), 219
VOM (voice of the market), 218-219
in education, 27
in finance, 26
in research, 28
Websites, KDnuggets.com, 54
Weka, 52
word counting, 191
word frequency, 188
WordStat analysis module, 214
ZB (zettabyte), 235