` (backtick)
: (colon), 2nd
[[]] (double square braces), 2nd
[] (square braces), 2nd
@ (at symbol), 2nd
& vectorized logic operator
# (hash symbol)
%in% operation
+ operator
<- assignment operator
<<- assignment operator
= assignment operator
== vectorized logic operator
-> assignment operator
->> assignment operator
| vectorized logic operator
$ (dollar sign)
absolute error
academic presentations
accuracyMeasures() function
adaptive learning
add command, 2nd
additive process
adjusted R-squared
AdWords
aesthetics
anonymous functions
Apgar test
Apriori
apriori() function
arcsinh
area under the curve.
See AUC.
arules package
as.formula() function
assignment operators
at symbol ( @ ), 2nd
AUC (area under the curve)
defined
scoring categorical variables by
audience for presentations
average silhouette width
averaging to reduce variance
backtick ( ` )
backups and version control
bagging
classifiers and
overview, 2nd
bag-of-k-grams model
bag-of-words model
bar charts
checking distributions for single variable
checking relationships between two variables
base error rate
baskets
batch model
Bayesian inference
Bayesian information criterion.
See BIC.
Bayesian methods
Bayesian posterior estimate
beta regression
betas, defined
between sum of squares.
See BSS.
bias
model problems
variance decomposition
BIC (Bayesian information criterion)
big data tools
bimodal distribution
binomial classification
binwidth parameter
blame command
block declaration format for knitr
bookstore example
boosting technique
bounded predictions
branches vs. commits (Git)
BSS (between sum of squares)
business rules
buzz dataset
overview
product names in
c() command
cache knitr option
Calinski-Harabasz index
cluster analysis
kmeansruns() function
call-by-value semantics, 2nd
CART (classification and regression trees)
causal variables
categorization
accuracy
single-variable models
variables
CDC 2010 natality public-use data file
central limit theorem
centroid
change history for Git
characterization
checkout command
checkpoint documentation
chi-squared test
chooseCRANmirror() command
churn, defined
city block distance.
See Manhattan distance.
class() command
classification and regression trees.
See CART.
classifiers and bagging
client role
clusterboot() function
assessing clusters
k-means algorithm
clustering
defined
models
clusters as classifications or scores
distance comparisons
overview
coefficients
defined
for linear regression
overview
table of
for logistic regression
interpreting values
overview
table of
negative
collinearity, 2nd
colon ( : ), 2nd
comments
commit command, 2nd
comparing files with Git
Comprehensive R Archive Network.
See CRAN.
computer science machine learning
conditional entropy
confidence intervals
confidence parameter
contingency table
continuous variables
coord_flip command
correlation
cos() function
cosine similarity
distances
kernels
mathematical definition
Cover’s theorem
coverage, defined
CRAN (Comprehensive R Archive Network)
installing
online resources
credible intervals
Cromwell’s rule
cross-language linkage
cross-validation
estimating overfitting effects using
performing using function
cumulative distribution function
cut() function, 2nd
cutree() function
Cygwin
data architect
data collection
data cuts
data dictionary
Data directory
data frame
defined
overview
dbinom() function
decision trees
classification methods
data cuts for
problem-to-method mapping
training variance and
workings of
declarative language
definitional kernels
dendrogram
density estimation
density plots
dependent variables, 2nd, 3rd
Derived directory
deviance
probability models
residuals, logistic regression
diff command, 2nd
difference parameter
dim() command
discrete variables
dissimilarity
dissolved clusters
dist() function
distances
clustering models
cosine similarity
Euclidean distance
Hamming distance
Manhattan distance
distribution function
distribution shape
distribution tail bound
dlnorm() function
dnorm() function
document classification
dollar sign ( $ )
domain knowledge
dot plot
dot product
mathematical definition
similarity
using kernel
double-precision floating-point numbers
Dremel
Drill
dropping records for missing values
dynamic language
echo knitr option
end users, presentations for
overview, 2nd
showing model usage
summarizing goals
workflow and model
enrichment rate
ensemble learning
entropy
equal sign ( = )
Euclidean distance
eval knitr option
exchangeability
Executive Summary slide
experimental design, statistics attempt to correct
explanatory variables
explicit kernels
defined
mathematical definition
transforms
linear regression example
using
export, deployment by
Extensible Markup Language.
See XML.
F1
faceting graph
factor
defined
making sure levels are consistent
overview
summary command
factor variable
factor() command
false positive rate.
See FPR.
faulty sensor
filled bar chart
Fisher scoring iterations
fitdistr() function
floating-point numbers
for loops
forecasting vs. prediction
formats, data files
fpc package
FPR (false positive rate), 2nd
frequentist inference
frequentist significance test
F-statistic
full normal form database
functional language
gam package
gam() function, 2nd, 3rd
gap statistic
Gaussian distributions, 2nd
Gaussian kernels
defined
example using
mathematical definition
gbm package
gdata package
generalization error, 2nd, 3rd
generalized additive models.
See GAMs.
generalized linear models
generic language
geom layers
ggplot2
glm() function
beta regression
logistic regression
separation and quasi-separation
two-category classification
weights argument
glmnet package
goal
defining for project
in presentations
for end users
for project sponsor
Greenplum
grouped data
grouping records
.gz extension
H2 database
defined
driver for
overview
Hadoop, 2nd
hair clusters
Hamming distance
hash symbol ( # )
hash, file
hclust() function
HDF5 (Hierarchical Data Format 5)
held-out data
help() command, 2nd, 3rd, 4th
heteroscedastic errors
heteroscedastic, defined
hexbin plots
hierarchical clustering
defined
with hclust() function
Hierarchical Data Format 5.
See HDF5.
histogram
checking distributions for single variable
defined
Hive
hold-out set
homoscedastic errors
homoscedastic, defined
household grouping
HTML (Hypertext Markup Language)
HTTP service, R-based
HTTPS (Hypertext Transfer Protocol Secure)
hyperellipsoid
Hypertext Markup Language.
See HTML.
Hypertext Transfer Protocol Secure.
See HTTPS.
hypothesis testing
Impala
importance() function
in keyword
independent variables, 2nd, 3rd
indicator variables
defined
overview
init command
inner product
input variables
inspect() function
interaction terms
interestMeasure() function
invalid values
itemset
J language
Jaccard coefficient
Java
JavaScript Object Notation.
See JSON.
JDBC (Java Database Connectivity)
join statement, 2nd
joint probability of the evidence
Julia language
kernel, machine learning definition
kernlab library
k-fold cross-validation
k-nearest neighbor.
See KNN.
KNN (k-nearest neighbor).
See also nearest neighbor methods.
Knowledge Discovery and Data Mining.
See KDD.
L1/L2 distance
languages, alternative
Laplace smoothing
lazy evaluation
leaf node
least squares method
less-than symbol ( < )
levels
lhs() function
library() function
lift concept
line plots
linear relationships
linear transformation kernels
defined
mathematical definition
linearly inseparable data
list label operators
lists
loess function
log command, 2nd, 3rd
log transformations
log, Git
logarithmic scale
density plot
when to use
logit
log-odds
lowess function
Mahout
maintenance
Manhattan distance
margin, defined
Markdown
best cases for using
knitr example
masking variable
MASS package
master branch
matrices
max command
maxnodes parameter
mean command
mean value, and lognormal population
median command
Mercer’s theorem, 2nd
message knitr option
mgcv package
milestones
documenting
knitr
min command
mining, restricting items for
mirrors, CRAN
MongoDB
motivation for project
multicategory classification
multiline commands
multimodal distribution
multinomial classification
multiplicative process
MySQL
Mythical Man-Month
NA data type
Naive Bayes
classification methods
document classification and
multiple-variable models
Naive Bayes assumption
problem-to-method mapping
smoothing
naming knitr blocks
narrow data ranges
NB (nota bene) notes
negative coefficients
negative correlation
newborn baby weight example
nonlinear relationships
non-monotone relationships
defined
extracting nonlinear relationships
logistic regression using
one-dimensional regression example
overview, 2nd
predicting newborn baby weight
nonsignificance
normal probability function
normalization
organizing data for analysis
overview
standard deviation and
normalized form
nota bene notes.
See NB notes.
null classifiers
NULL data type
null deviance
null hypothesis
number sequences
numeric accuracy, 2nd
object-oriented language
odds, defined
OLTP (online transaction processing)
online transaction processing.
See OLTP.
operations role
operators, assignment
organizing data for analysis
origin repository
outcome variables
outliers
out-of-bag samples
overfitting
common model problems
estimating effects of using cross-validation
pseudo R-squared and
random forests
package system.
See CRAN.
pbeta() function
pbinom() function
Pearson coefficient
performance
permutation test
phi() function, 2nd, 3rd
Pig
pipe-separated values, 2nd
pivot table
plnorm() function
plot() function
PMML (Predictive Model Markup Language)
point estimate
Poisson distribution
polynomial kernels
defined
mathematical definition
posterior estimate
PostgreSQL
prcomp() function
Predictive Model Markup Language.
See PMML.
Presto
primalizing
print() function
prior distribution
probability distribution function
procedural language
production environment
promise-based argument evaluation
pseudo R-squared
defined
logistic regression
p-value and
pull command, 2nd
PUMS American Community Survey data
push command, 2nd
Python
qbinom() function
qlnorm() function
qnorm() function
quantile() function, 2nd, 3rd
quasi-separation
R in Action, 2nd
radial kernels
defined
example using
mathematical definition
RAND command
random sample, reproducing
randomForest() function, 2nd, 3rd
randomization
randomly missing values
ranking
defined
models
R-based HTTP service
rbinom() function, 2nd
read.table() function
gzip compression
structured data
read.transactions() function
rebasing, 2nd, 3rd
receiver operating characteristic curve.
See ROC curve.
reference level
defined
SCHL coefficient
regression
defined, 2nd
problem-to-method mapping
technical definition.
See also linear regression; logistic regression.
relational databases.
See databases.
relationships
data science tasks
visually checking
bar charts
hexbin plots
line plots
scatter plots
remote repository for Git
replicate() function
reproducing results
documentation
random sample
rescaling
reshaping data
residual standard error
residuals
defined
deviance, logistic regression
predictions on graph
response variables
Results directory
results knitr option
rlnorm() function
rm() function
rnorm() function
ROC (receiver operating characteristic) curve
root mean square error.
See RMSE.
root node
rpart() command
RSQLite package
RStudio IDE, 2nd
rug, defined
runif function
running documentation
S language
sample function
saturated model
scale() function
scaling
scatter plot
SCHL coefficient
scientific honesty
Screwdriver tool
Scripts directory
select statement
sensitivity
separable data
separation, logistic regression
sequences of numbers
shape of distribution
shasum program
sigmoid function
signed logarithm
sign-off by project sponsor
sin() function
size() function
slots
smoothing curves
soft margin optimization
soundness of model
spam, identifying
Spambase dataset
applying SVM
comparing results
SVMs
specificity
splines
SQL Screwdriver
sqldf package, 2nd
square braces, 2nd
SQuirreL SQL, 2nd, 3rd
Stack Overflow
stacked bar chart
standard deviation
star workflow
stat layers
statistical learning
statistical test power, 2nd
status command
Storm
structured values
subsets
sufficient statistic
summary() function
checking data for errors
data ranges
invalid values
missing values
outliers
overview
units
linear regression
coefficients table
original model call
producing
quality statistics
residuals summary
logistic regression
AIC
coefficients table
deviance residuals
Fisher scoring iterations
glm() function
null deviance
producing
pseudo R-squared
quasi-separation
residual deviance
separation
overview
support vector machines.
See SVMs.
support vectors
defined
overview
SVMs (support vector machines)
classification methods
defined
overview, 2nd
problem-to-method mapping
Spambase example
applying SVM
comparing results
overview
spiral example
good kernel
overview
wrong kernel
support vectors
synchronizing with Git
synthetic variables
system() function
systematically missing values
table() command
tag command
targetRate parameter
technical debt
terminology, and model quality
test set
theta angle
tidy knitr option
time series analysis
TODO notes
total sum of squares.
See TSS.
total WSS (within sum of squares)
TPR (true positive rate)
training error
transforming data
trial and error
true negative rate
true outcome
true positive rate.
See TPR.
TSS (total sum of squares)
two-by-two confusion matrix
two-category classification
UCI car dataset
uncommitted changes
unexplainable variance
ungrouped data
uniform resource locator.
See URL.
unimodal distribution
units
checking data using summary command
cluster analysis
unsupervised learning
upselling
URL (uniform resource locator)
variance
variance command
varImpPlot() function
vectorized operations
vectorized, defined
vectors
venue shopping
views, in R
waste clusters
workflow of end user, and model