Index

Numerics

1-of-N pseudo variables, 97

A

accuracy

of classification models, 107-112

of time-series forecasting methods, assessing, 176-177

Acxiom, 58

advanced analytics, 17

affinity analysis. See association rule mining

affordability of analytics, 5-6

aggregating data, 74

AIC (Akaike information criterion), 118

algorithms, 104, 141

Apriori algorithm, 127-128

for association rule mining, 127

back-propagation, 155-157

for decision tree creation, 116

genetic algorithms, 114

k-means clustering algorithm, 122-123

k-nearest neighbor algorithm, 113, 142-143

cross-validation, 145-147

Minkowski distance, finding, 144-145

parameter selection, 145

similarity measure, 144-147

learning algorithms, 156-157

logistic regression, 173-175

variables, 97

analytics

advanced analytics, 17

versus analysis, 3-4

Big Data, 14

business analytics, 1-2

business applications, 6-7

cancer survivability studies, 87-90

data mining, 4

descriptive analytics, 17

hierarchy of, 16

history of, 10-14

data warehouses, 12-13

ERP, 10

ERP systems, 12

executive information systems, 12

rule-based ESs, 11

IBM Watson, 20-28

DeepQA, 22-23

future of, 23-28

Jeopardy! challenge, 21-23

in-database analytics, 242

in-memory analytics, 242

predictive analytics, 18, 105

privacy issues, 58-59

prescriptive analytics, 18

reasons for popularity

affordability, 5-6

availability, 5-6

culture change, 6

need, 5

roadblocks to adoption, 7-9

culture, 7-8

data, 8

privacy, 9

return on investment, 8

security, 9

talent, 7

technology, 9

stream analytics, 257-259

taxonomy for, 15-19

terminology, 2

text analytics, 183-184

ANNs (artificial neural networks), 147-158. See also neural networks

back-propagation algorithm, 155-157

and biological neural networks, 149-152

feed-forward neural networks, 154

MLP architecture, 154-155

network structure, 153-154

neurons, 152

processing elements, 153

sensitivity analysis, 157-158

versus SVMs, 164-165

appliances, 242

application examples

Big Data for political campaigns, 261-263

data mining for complex medical procedures, 177-180

data mining for Hollywood managers, 60-65

methodology, 62

results, 63

sample data, 60-61

predicting NCAA bowl game results, 130-139

evaluation, 136-137

methodology, 131-132

results, 137-139

sample data, 132-136

text mining of research literature, 209-213

text-based deception detection, 225-227

Apriori algorithm, 127-128

area under the ROC curve, 112

association rule mining, 123-127

applications, 124-125

Apriori algorithm, 127-128

associations, 45, 49

assumptions in linear regression, 171-173

attributes, 114

automated data collection, 5-6

availability of analytics, 5-6

B

back-propagation algorithm, 155-157

bag-of-words, 189-190

banking, data mining applications, 40

Bayesian classifiers, 114

benefits of Hadoop, 250-251

BI (business intelligence), 17

BIC (Bayesian information criterion), 118

Big Data, 14, 231, 238-243

challenges to Big Data analytics implementation, 242-243

characteristics of

value proposition, 238

variability, 237-238

variety, 236

velocity, 236-237

veracity, 237

volume, 234-236

critical success factors, 240-241

data scientists, 254-257

educational requirements, 256

skill requirements, 257

high-performance computing, 242

in politics, 261-263

problems addressed by, 244

sources of, 232-233

stream analytics, 257-259

technologies, 244-254

Hadoop, 247-253

MapReduce, 245-247

NoSQL, 254

terminology, 2

binary frequencies, 205

biological neural networks, 148-152

BioText, 199

bootstrapping, 111

branches, 114

brokerages, data mining applications, 41

building

decision trees, 115-116

algorithms, 116

splitting indices, 116

linear regression models, 167-171

SVM models, 161-164

data preprocessing, 162-163

model deployment, 164

model development, 163

business analytics, 1-2, 19

business applications for analytics, 6-7

business intelligence, 2

OLAP, 38

C

cancer survivability, analytic studies in, 87-90

Capgemini, 15-16

case-based reasoning, 113

categorical data, 94-95

challenges of analytics adoption, 7-9

Big Data analytics adoption, 242-243

culture, 7-8

data, 8

privacy, 9

return on investment, 8

security, 9

talent, 7

technology, 9

characteristics of Big Data

value proposition, 238

variability, 237-238

variety, 236

velocity, 236-237

veracity, 237

volume, 234-236

classification, 47-49, 105-114. See also cluster analysis

Bayesian classifiers, 114

case-based reasoning, 113

decision trees, 49, 113-116

attributes, 114

branches, 114

building, 115-116

nodes, 114

genetic algorithms, 114

models, 107

accuracy, 107

estimating accuracy of, 107-112

interpretability, 107

robustness, 107

scalability, 107

speed, 107

nearest-neighbor algorithm, 113

neural networks, 48, 113

N-P polarity classification, 222-223

rough sets, 114

statistical analysis, 113

SVMs, 113

ClearForest, 214

cluster analysis, 50, 117-122

applications, 117-118

determining number of clusters, 118

distance measures, 119

examples of, 117

external assessment methods, 121-122

hierarchical clustering methods, 120

internal assessment methods, 121

k-means clustering algorithm, 122-123

partitive clustering methods, 120

clustering, 46, 50

coefficients (logistic regression), 175

commercial text mining software, 213-214

comparing

analytics and analysis, 3-4

ANNs and SVMs, 164-165

commercial and free data mining tools, 52

correlation and regression, 166-167

data mining and statistics, 39

data mining methodologies, 86-89

SEMMA and CRISP-DM, 82

concepts, 187

confidence, 126-127

confusion matrices, 108

contingency tables, 108

corpora, 187

correlation versus regression, 166-167

CRISP-DM (Cross-Industry Standard Process for Data Mining), 69-77

business understanding, 70-71

comparing with SEMMA, 82

data preparation, 73-74

data understanding, 71-73

deployment, 77

model building, 74-75

testing and evaluation, 76

critical success factors for Big Data, 240-241

CRM (customer relationship management), 40

cross-validation methodologies, 132

k-nearest neighbor algorithm, 145-147

cubes, 38

Cutting, Doug, 248

D

data, 32, 35

categorical data, 94-95

discrete data, 95

interval data, 96

nominal data, 95

numeric data, 95

ordinal data, 95

ratio data, 96

structured data, 94

traditional data, 232

unstructured data, 94

Data and Text Analytics Toolkit, 214

data mining, 4, 31-39

algorithms, 141

ANNs, 147-158

nearest-neighbor algorithm, 142-143

applications

banking, 40

brokerages, 41

CRM, 40

entertainment industry, 43

finance, 40

government, 42

health care industry, 43

insurance, 41

law enforcement, 44

manufacturing, 41

marketing, 40

retailing and logistics, 41

sports, 44-45

association rule mining, 49, 123-127

algorithms, 127

cancer survivability, analytic studies in, 87-90

classification, 47-49, 105-114

Bayesian classifiers, 114

case-based reasoning, 113

decision trees, 49, 113-116

genetic algorithms, 114

models, 107

nearest-neighbor algorithm, 113

neural networks, 113

rough sets, 114

statistical analysis, 113

SVMs, 113

cluster analysis, 50, 117-122

applications, 117-118

determining number of clusters, 118

distance measures, 119

examples of, 117

external assessment methods, 121-122

hierarchical clustering methods, 120

internal assessment methods, 121

k-means clustering algorithm, 122-123

partitive clustering methods, 120

data, 35, 93-97

unstructured data, 94

data preprocessing

data consolidation phase, 99

data reduction phase, 100-102

data scrubbing phase, 99

data transformation phase, 100

data stream mining, 260-261

defining, 36

GIGO rule, 97-98

for Hollywood managers, 60-64

data, 60-61

methodology, 62

knowledge, 32-33

link analysis, 49

methodologies, comparing, 86-89

misconceptions of, 129-130

neural networks, 48

OLAP, 38

patterns, identifying, 45-51

associations, 45

clusters, 46

predictions, 45, 47

predictive analytics, 105

privacy issues, 57-65

selling customer data, 58

reasons for popularity, 33-34

rule induction, 49

sequence mining, 49

standardized processes, 67

CRISP-DM, 69-77

KDD process, 67-68

SEMMA, 78-81

Six Sigma, 83-86

and statistics, 39

structured data, 94

text mining, 185-189

applications, 186, 195-199

bag-of-words, 189-190

initiatives, 199

NLP, 189-194

tools, 52

KNIME, 52

Microsoft SQL Server, 53-54

RapidMiner, 52

top 10 tools, 55

vendors, 51

Weka, 52

visualization, 51

data preprocessing, 73-74

data consolidation phase, 99

data reduction phase, 100-102

data scrubbing phase, 99

data transformation phase, 100

SVM model building, 162-163

in text mining process, 202-204

data scientists, 254-257

educational requirements, 256

experimental physicists, 256

skill requirements, 257

data scrubbing, 99

data stream mining, 260-261

data warehouses, 12-13, 68

databases

HBase, 254

in-database analytics, 242

OLAP, 38

Davenport, Thomas, 32, 255

deception detection, 197

text-based deception detection, application example, 225-227

decision trees, 49, 113-116

attributes, 114

branches, 114

building, 115-116

algorithms, 116

splitting indices, 116

nodes, 114

Deep Blue, 20

DeepQA, 22-23

defining data mining, 36

de-identified customer records, 57

descriptive analytics, 17

detecting objectivity, 222

developing SVM models, 163

dimensional reduction, 101

discrete data, 95

distance measures, 119

DMAIC (Define, Measure, Analyze, Improve, and Control) methodology, 83-86

E

EB (exabyte), 235

ECHELON, 196

educational requirements for data scientists, 256

entertainment industry, data mining applications, 43

ERP (enterprise resource planning), 10, 12

ESs (expert systems), rule-based, 11

estimating accuracy of classification models, 107-112

k-fold cross-validation, 110-111

simple split methodology, 109-110

Euclidean distance, 119

evolution of analytics, 10-14

Big Data, 14

data warehouses, 12-13

ERP, 10

ERP systems, 12

executive information systems, 12

rule-based ESs, 11

examples

application examples

Big Data for political campaigns, 261-263

data mining for complex medical procedures, 177-180

data mining for Hollywood managers, 60-65

predicting NCAA bowl game results, 130-139

text mining of research literature, 209-213

of cluster analysis, 117

executive information systems, 12

experimental physicists, 256

explanatory variable, relationship to response variable, 169

explicit sentiment, 217

external assessment methods, 121-122

extracting knowledge, 206-209

F

feed-forward neural networks, 154

finance

data mining applications, 40

sentiment analysis applications, 219-220

Ford, Henry, 10

free software tools

for data mining, 52

for text mining, 214-215

future of Watson, 23-28

in education, 27

in finance, 26

in government, 27-28

in health care, 24-25

in research, 28

security systems, 25-26

G

GATE, 215

GeB (gegobyte), 235

genetic algorithms, 114

GIGO (garbage in, garbage out) rule, 97-98

Gini index, 116

government

data mining applications, 42

sentiment analysis applications, 220

grid computing, 242

H

Hadoop, 247-253

benefits of, 250-251

misconceptions, 251-253

technical components, 249-250

HBase, 254

HDFS (Hadoop Distributed File System), 248

health care industry, data mining applications, 43

hierarchical clustering methods, 120

hierarchy of analytics, 16

descriptive analytics, 17

predictive analytics, 18

prescriptive analytics, 18

high-performance computing, 242

history of analytics, 10-14

Big Data, 14

data warehouses, 12-13

ERP, 10

ERP systems, 12

executive information systems, 12

rule-based ESs, 11

Hollywood, data mining in motion picture industry, 60-64

homeland security, data mining applications, 44

homonyms, 188

Human Genome Project, 198

human-generated data, 232

hyperplanes, 160

I

IBM Watson, 20-28

DeepQA, 22-23

future of, 23-28

in education, 27

in finance, 26

in government, 27-28

in health care, 24-25

in research, 28

security systems, 25-26

Jeopardy! challenge, 21-23

identifying

patterns in data sets, 45-51

associations, 45

clusters, 46

predictions, 45

polarity, 224-225

targets of expressed sentiment, 223-224

in-database analytics, 242

indices, representing, 204-205

information, 32

information gain, 116

INFORMS (Institute for Operations Research and the Management Sciences), 15-16

initiatives in text mining, 199

in-memory analytics, 242

insurance industry, data mining applications, 41

internal assessment methods, 121

interpretability of classification models, 107

interval data, 96

inverse document frequencies, 205

J

jackknifing, 112

Jackman, Simon, 263

Jennings, Ken, 21

Jeopardy! challenge, 21-23

job tracker nodes, 250

K

KDD (knowledge discovery in databases) process, 67-68

KDnuggets.com, 54

k-fold cross-validation, 109-111

k-means clustering algorithm, 122-123

k-nearest neighbor algorithm, 142-147

KNIME, 52

knowledge, 32-33

extracting, 206-209

KXEN Text Coder, 214

L

law enforcement, data mining applications, 44

learning algorithms, 156-157

leave-one-out methodology, 111

lift, 126-127

linear regression, 165-173

assumptions, 171-173

model building, 167-171

numeric assessment of model, 169-171

OLS method, 168

LingPipe, 215

link analysis, 49

location-prediction systems, 198

log frequencies, 205

logistic regression, 173-175

coefficients, 175

logistic function, 174

models, 174

logistics, data mining applications, 41

M

machine-learning techniques, 141

nearest-neighbor algorithm, 142-143

SVMs, 159-165

versus ANNs, 164-165

hyperplanes, 160

machine-generated data, 232

Manhattan distance, 119

manufacturing, data mining applications, 41

MapReduce, 245-247

market-basket analysis, 123-127

applications, 124-125

marketing

data mining applications, 40

text mining applications, 195

medicine

data mining applications, 43

text mining applications, 197-199

Megaputer Text Analyst, 214

Microsoft Enterprise Consortium, 53-54

Microsoft SQL Server, 53-54

Minkowski distance, finding, 144-145

misconceptions about data mining, 129-130

MLP (multilayered perceptron) architecture, 154-155

models, 45

classification models, 107

accuracy, 107

estimating accuracy of, 107-112

interpretability, 107

robustness, 107

scalability, 107

speed, 107

linear regression models, building, 167-171

logistic regression models, 174

SVM models, building, 161-164

data preprocessing, 162-163

model deployment, 164

model development, 163

morphology, 188

motion picture industry, data mining in, 60-64

data, 60-61

methodology, 62

Movie Forecast Guru, 64

multicollinearity, 172

multiple regression, 167

N

name nodes, 249

National Centre for Text Mining, 199

nearest-neighbor algorithm, 113, 142-143

cross-validation, 145-147

Minkowski distance, finding, 144-145

parameter selection, 145

similarity measure, 144-147

NER (named entity recognition), 198

network structure of ANNs, 153-154

neural networks, 48, 113, 147-158

ANNs

back-propagation algorithm, 155-157

MLP architecture, 154-155

network structure, 153-154

processing elements, 153

sensitivity analysis, 157-158

versus SVMs, 164-165

biological neural networks, 148

feed-forward neural networks, 154

neurons, 152

neurons, 152

NLP (natural language processing), 189-194. See also text mining

applications, 193-194

challenges associated with, 191-192

WordNet, 192-193

nodes, 114

nominal data, 73, 95

normalization methods, 205

NoSQL, 254

N-P polarity classification, 222-223

numeric data, 95

O

OASIS (Overall Analysis System for Intelligence Support), 196

objectivity, detecting, 222

OLAP (online analytical processing), 38

OLS (ordinary least squares) method, 168

Open Calais, 215

OR (operations research), 10

ordinal data, 73, 95

OTMI (Open Text Mining Interface), 199

P

partitive clustering methods, 120

part-of-speech tagging, 188

Patil, D. J., 255

patterns, identifying, 45-51

associations, 45

clusters, 46

predictions, 45

phases of data preprocessing, 102-103

data consolidation phase, 99

data reduction phase, 100-102

data scrubbing phase, 99

data transformation phase, 100

polarity, identifying, 224-225

politics

Big Data, 261-263

sentiment analysis applications, 220

polysemes, 188

popularity of analytics, reasons for

affordability, 5-6

availability, 5-6

culture change, 6

need, 5

predicting NCAA bowl game results, 130-139

evaluation, 136-137

methodology, 131-132

results, 137-139

sample data, 132-136

prediction, 45, 47

predictive analytics, 18, 105

in motion picture industry, 60-64

data, 60-61

methodology, 62

privacy issues, 58-59

time-series forecasting, 175-180

accuracy of methods, assessing, 176-177

averaging methods, 176

prescriptive analytics, 18

privacy, 57-65

and predictive analytics, 58-59

as roadblock to analytics adoption, 9

problems addressed by Big Data, 244

processing elements of ANNs, 153

Q-R

qualitative data, 73

quantitative data, 73

Quinlan, Ross, 116

RapidMiner, 52, 214

ratio data, 96

regression analysis, 165-167

versus correlation, 166-167

linear regression

assumptions, 171-173

OLS method, 168

logistic regression, 173-175

coefficients, 175

logistic function, 174

models, 174

multiple regression, 167

simple regression, 167

representing indices, 204-205

response variable, relationship to explanatory variable, 169

retail, data mining applications, 41

RMSE (root mean square error), 169

roadblocks to analytics adoption, 7-9

culture, 7-8

data, 8

privacy, 9

return on investment, 8

security, 9

talent, 7

technology, 9

robustness of classification models, 107

rough sets, 114

rule induction, 49

rule-based ESs, 11

Rutter, Brad, 21

S

SAS Text Miner, 214

scalability of classification models, 107

scatter plots, 167

secondary nodes, 250

securities trading, data mining applications, 41

security

as roadblock to analytics adoption, 9

text mining applications, 196-197

selling customer data, privacy issues, 58

SEMMA (sample, explore, modify, model, assess), 78-82

sensitivity analysis in ANNs, 157-158

sentiment analysis, 215-227

applications

finance, 219-220

government intelligence, 220

politics, 220

VOC, 218

VOE, 219

VOM, 218-219

explicit sentiment, 217

identifying polarity, 224-225

implicit sentiment, 217

multistep process

collection and aggregation, 224

N-P polarity classification, 222-223

sentiment detection, 222

target identification, 223-224

polarity, 217

sequence mining, 49

Silver, Nate, 263

similarity measure, 144-147

simple regression, 167

simple split methodology, 109-110

singular-value decomposition, 189

Six Sigma, 83-86

skills required for data scientists, 257

slave nodes, 250

sources of Big Data, 232-233

speed of classification models, 107

splitting indices, 116

sports, data mining applications, 44-45

SPSS Modeler, 214

Spy-EM, 215

standardized data mining processes, 67

CRISP-DM, 69-77

business understanding, 70-71

comparing with SEMMA, 82

data preparation, 73-74

data understanding, 71-73

deployment, 77

model building, 74-75

testing and evaluation, 76

KDD process, 67-68

SEMMA, 78-81

Six Sigma, 83-86

Statistica Text Mining engine, 214

statistical analysis, 113

linear regression, 165-173

statistics, 39

stemming, 187

stop words, 187

stratified k-fold cross-validation, 137

stream analytics, 257-259

structured data, 94

support, 126-127

survivability of cancer, analytic studies in, 87-90

SVMs (support vector machines), 113, 159-165

versus ANNs, 164-165

hyperplanes, 160

model building, 161-164

data preprocessing, 162-163

model deployment, 164

model development, 163

synonyms, 188

synthesis, 3

T

Target, use of predictive analytics, 58-59

target of expressed sentiment, identifying, 223-224

taxonomy for analytics, 15-19

descriptive analytics, 17

predictive analytics, 18

prescriptive analytics, 18

Taylor, Frederick Winslow, 10

TDM (term-document matrix), establishing, 202-204

reducing dimensionality, 206

representing indices, 204-205

technical components in Hadoop, 249-250

term dictionaries, 188

terms, 187

test sets, 110

text analytics, 183-184

text mining, 185-189

applications, 186

marketing, 195

medicine, 197-199

security, 196-197

bag-of-words, 189-190

initiatives, 199

NLP, 189-194

applications, 193-194

challenges associated with, 191-192

WordNet, 192-193

representing indices, 204-205

three-task process

data preprocessing, 202-204

establishing the corpus, 202

extracting knowledge, 206-209

tools

commercial software tools, 213-214

free text mining tools, 214-215

text-based deception detection, 225-227

three-task text mining process

data preprocessing, 202-204

establishing the corpus, 202

extracting knowledge, 206-209

time-series forecasting, 51, 175-180

accuracy of methods, assessing, 176-177

averaging methods, 176

tokenizing, 188

top 10 data mining tools, 55

Torch Concepts, 58

traditional data, 232

training sets, 110

travel industry, data mining applications, 42

trend analysis, 208-209

tuples, 259

U-V

unstructured data, 14, 94

text mining, 185-189

VantagePoint, 214

variables, 97

1-of-N pseudo variables, 97

relationship between response and explanatory variables, 169

scatter plots, 167

vendors of data mining tools, 51

visual analytics, 51

visualization, 51

Vivisimo/Clusty, 215

VOC (voice of the customer), 218

VOE (voice of the employee), 219

VOM (voice of the market), 218-219

W

Watson, 20-28

DeepQA, 22-23

future of, 23-28

in education, 27

in finance, 26

in government, 27-28

in health care, 24-25

in research, 28

security systems, 25-26

Jeopardy! challenge, 21-23

Websites, KDnuggets.com, 54

Weka, 52

word counting, 191

word frequency, 188

WordNet, 192-193

WordStat analysis module, 214

X-Y-Z

ZB (zettabyte), 235
