Appendix B. Glossary

ACF

See autocovariance function

AIC

See Akaike Information Criterion

AICC

See Akaike Information Criterion Corrected

Akaike Information Criterion

Measure of model fit quality that penalizes model complexity

Akaike Information Criterion Corrected

Version of AIC with greater penalty for model complexity

Analysis of variance

See ANOVA

Andersen-Gill

Survival analysis for modeling time to multiple events

ANOVA

Test for comparing the means of multiple groups; the test can only detect whether any two groups differ, and it cannot tell which groups differ from which others

Ansari-Bradley test

Nonparametric test for the equality of variances between two groups

AR

See autoregressive

ARIMA

Like an ARMA model but it includes a parameter for the number of differences of the time series data

ARMA

See Autoregressive Moving Average

array

Object that holds data in multiple dimensions

autocorrelation

When observations in a single variable are correlated with previous observations

Autocovariance function

The covariance of a time series with lags of itself; scaled by the variance it becomes the autocorrelation, which is what ACF more commonly abbreviates

Autoregressive

Time series model that is a linear regression of the current value of a time series against previous values

Autoregressive Moving Average

Combination of AR and MA models

average

While generally held to be the arithmetic mean, average is actually a generic term that can mean any number of measures of centrality such as the mean, median or mode

Bartlett test

Parametric test for the equality of variances between two or more groups

BASH

A command line processor in the same vein as DOS; mainly used on Linux and Mac OS X, though there is an emulator for Windows

basis functions

Functions whose linear combinations make up other functions

basis splines

Basis functions used to compose splines

Bayesian

Type of statistics where prior information is used to inform the model

Bayesian Information Criterion

Similar to AIC but with an even greater penalty for model complexity
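
The three criteria differ only in the penalty added to the model's log-likelihood. The following is a minimal Python sketch using the standard textbook formulas; the function names and the example numbers are assumptions made for illustration, not values from any real model.

```python
import math

def aic(log_lik, k):
    """AIC = 2k - 2 ln(L), where k is the number of estimated parameters."""
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    """AIC plus a small-sample correction that grows as k approaches n."""
    if n - k - 1 <= 0:
        raise ValueError("AICC requires n > k + 1")
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(log_lik, k, n):
    """BIC = k ln(n) - 2 ln(L); the penalty exceeds AIC's once n >= 8."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fit: log-likelihood -120.5, 3 parameters, 50 observations.
print(aic(-120.5, 3))       # 247.0
print(aicc(-120.5, 3, 50))
print(bic(-120.5, 3, 50))
```

For all three measures a lower value indicates a better trade-off between fit and complexity.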

Beamer

LaTeX document class for producing slide shows

Bernoulli Distribution

Probability distribution for modeling the success or failure of an event

Beta Distribution

Probability distribution for modeling a set of possible values on a finite interval

BIC

See Bayesian Information Criterion

Binomial Distribution

Probability distribution for modeling the number of successful independent trials with identical probabilities of success

Bioconductor

Repository of R packages for the analysis of genomic data

BitBucket

Online Git repository

Boost

Fast C++ library

Bootstrap

A process in which data are resampled repeatedly, and a statistic is calculated for each resampling to form an empirical distribution for that statistic
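
The resampling described above can be sketched in a few lines of Python. This is an illustrative sketch only: the helper name `bootstrap`, the sample values and the seed are all assumptions, and a real analysis would typically use many more resamples.

```python
import random
import statistics

def bootstrap(data, statistic, n_resamples=1000, seed=None):
    """Return the statistic computed on each of n_resamples resamples.

    Every resample draws len(data) values from data with replacement.
    """
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample, used only for illustration.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
boot_means = bootstrap(sample, statistics.mean, n_resamples=2000, seed=42)

# The spread of the bootstrap means estimates the standard error of the mean.
print(statistics.stdev(boot_means))
```

The list `boot_means` is the empirical distribution of the statistic; its quantiles can be used to form a confidence interval.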

Boxplot

A graphical display of one variable where the middle 50% of the data are in a box, and there are lines reaching out to 1.5 times the Interquartile Range and dots representing outliers

BUGS

Probabilistic programming language specializing in Bayesian computations

byte-compilation

The process of turning human readable code into byte code, an intermediate representation that runs faster

C

A fast, low-level programming language; R is written primarily in C

C++

A fast, low-level programming language that is similar to C

Cauchy Distribution

Probability distribution for the ratio of two independent standard Normal random variables

censored data

Data with unknown information, such as the occurrence of an event after a cutoff time

character

Data type for storing text

Chi-squared Distribution

Probability distribution of the sum of the squares of k independent standard Normal random variables

chunk

Piece of R code inside a LaTeX or Markdown document

class

Type of an R object

Classification

Determining the class membership of data

Clustering

Partitioning data into groups

Coefficient

A multiplier associated with a variable in an equation; in statistics this is typically what is being estimated by a regression

Coefficient plot

A visual display of the coefficients and standard errors from a regression

Comprehensive R Archive Network

See CRAN

Confidence Interval

A range, computed from the data, that should contain the true value of the quantity being estimated a certain percent of the time under repeated sampling

correlation

The strength of the association between two variables

covariance

A measure of the association between two variables; the strength of the relationship is not necessarily indicated

Cox proportional hazards

Model for survival analysis where predictors have a multiplicative effect on the hazard rate

CRAN

The central repository for all things R

cross-validation

A modern form of model assessment where the data are split into k discrete folds, and a model is repeatedly fitted on all but one and used to make predictions on the holdout fold
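
The fold bookkeeping behind this procedure can be sketched as follows. This Python sketch only generates the index sets; the function name `k_fold_indices` is an assumption, and in practice the observations are usually shuffled before being split.

```python
def k_fold_indices(n, k):
    """Yield (train, holdout) index lists for k folds over n observations."""
    indices = list(range(n))
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size

# Ten observations split into three folds.
for train, holdout in k_fold_indices(10, 3):
    print(holdout)
```

A model would be fitted on each `train` set and evaluated on the matching `holdout` set, with the k error measurements then averaged.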

Data Gotham

Data science conference in New York

data munging

The process of cleaning, correcting, aggregating, joining and manipulating data to prepare it for analysis

Data Science

The confluence of statistics, machine learning, computer engineering, visualization and social skills

data.frame

The main data type in R, similar to a spreadsheet with tabular rows and columns

data.table

A high speed extension of data.frames

database

Store of data, usually in relational tables

Date

Data type for storing dates

DB2

Enterprise level database from IBM

Debian

Linux Distribution

decision tree

Modern technique for performing nonlinear regression or classification by iteratively splitting predictors

Degrees of freedom

For some statistic or distribution, this is the number of observations minus the number of parameters being estimated

density plot

Display showing the probability of observations falling within a sliding window along a variable of interest

deviance

A measure of error for generalized linear models

drop-in deviance

The amount by which deviance drops when adding a variable to a model; a general rule of thumb is that deviance should drop by two for each term added

DSN

Data source name; describes the connection to a data source, often a database

dzslides

HTML5 slide show format

EDA

See Exploratory Data Analysis

Elastic Net

New algorithm that is a dynamic blending of lasso and ridge regressions, which is great for predictions and dealing with high dimensional datasets

Emacs

Text editor popular among programmers

ensemble

Method of combining multiple models to get an average prediction

Excel

The most commonly used data analysis tool in the world

expected value

Mean of a random variable, where each possible value is weighted by its probability

Exploratory Data Analysis

Visually and numerically exploring data to get a sense of it before performing rigorous analysis

Exponential Distribution

Probability distribution often used to model the amount of time until an event occurs

F-test

Statistical test often used for comparing models, as with the ANOVA

F Distribution

Probability distribution of the ratio of two independent Chi-Squared random variables, each divided by its degrees of freedom; often used as the null distribution in analysis of variance

factor

Special data type for handling character data as an integer value with character labels; important for including categorical data in models

fitted values

Values predicted by a model, mostly used to denote predictions made on the same data used to fit the model

formula

Novel interface in R that allows specification of a model using convenient mathematical notation

FORTRAN

High-speed, low-level language; much of R is written in FORTRAN

FRED

Federal Reserve Economic Data

FTP

File Transfer Protocol; a standard for moving files between computers

g++

Open source compiler for C++

GAM

See Generalized Additive Models

Gamma Distribution

Probability distribution for the time one has to wait for n events to occur

gamma regression

GLM for response data that are continuous, positive and skewed, such as auto insurance claims

Gap statistic

Measure of clustering quality, which compares the within-cluster dissimilarity for a clustering of the data with that of a bootstrapped sample of data

GARCH

See Generalized Autoregressive Conditional Heteroskedasticity

Gaussian Distribution

See Normal Distribution

gcc

Family of open-source compilers

Generalized Additive Models

Models that are formed by adding a series of smoother functions fitted on individual variables

Generalized Autoregressive Conditional Heteroskedasticity

Time series model in which the variance of the errors changes over time as a function of past errors and past variances; better suited to data with extreme values

Generalized Linear Models

Family of regression models that model non-normal response data such as binary and count data

Geometric Distribution

Probability distribution for the number of Bernoulli trials required before the first success occurs

Git

Popular version control standard

GitHub

Online Git repository

GLM

See Generalized Linear Models

Hadoop

Framework for distributing data and computations across a grid of computers

Hartigan’s Rule

Measure of clustering quality, which compares the within-cluster sum of squares for a clustering of k clusters and one with k + 1 clusters

heatmap

Visual display where the relationship between two variables is visualized as a mix of colors

Hierarchical Clustering

Form of clustering where each observation belongs to a cluster, which in turn belongs to a larger cluster and so on until the whole dataset is represented

histogram

Display of the counts of observations falling in discrete buckets of a variable of interest

HTML

Hypertext Markup Language; used for creating Web pages

Hypergeometric Distribution

Probability distribution for drawing k successes in n draws, without replacement, from N items, of which K are considered successes

hypothesis test

Test for the significance of a statistic that is being estimated

IDE

See Integrated Development Environment

indicator variables

Binary variables representing one level of a categorical variable; also called dummy variables

inference

Drawing conclusions on how predictors affect a response

integer

Data type that is only whole numbers, either positive, negative or zero

Integrated Development Environment

Software with features to make programming easier

Intel Matrix Kernel Library

Optimized matrix algebra library from Intel; officially named the Math Kernel Library

interaction

The combined effect of two or more variables in a regression

intercept

Constant term in a regression; literally, the point where the best fit line passes through the y-axis; it is generalized for higher dimensions

Interquartile Range

The third quartile minus the first quartile
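
As a small illustration, the calculation can be sketched in Python with the standard library. Quartile conventions vary between packages; the "inclusive" method chosen here is an assumption, and the helper name `iqr` is made up for this sketch.

```python
import statistics

def iqr(data):
    """Return the third quartile minus the first quartile."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

print(iqr([1, 2, 3, 4, 5, 6, 7, 8, 9]))  # 4.0 (quartiles are 3.0 and 7.0)
```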

inverse link function

Function that transforms linear predictors to the original scale of the response data

inverse logit

Transformation needed to interpret logistic regression on the 0/1 scale; scales any number to be between 0 and 1
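
The transformation and its inverse, the logit, can be written out directly. This is a minimal Python sketch; the two-branch form of `inv_logit` is a choice made here to avoid overflow for large negative inputs.

```python
import math

def inv_logit(x):
    """Map any real number to a value between 0 and 1."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)

def logit(p):
    """Map a probability strictly between 0 and 1 to the real line."""
    return math.log(p / (1.0 - p))

print(inv_logit(0.0))         # 0.5
print(logit(inv_logit(2.0)))  # approximately 2.0
```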

IQR

See Interquartile Range

Java

General-purpose, object-oriented programming language that runs on a virtual machine

Joint Statistical Meetings

Conference for statisticians

JSM

See Joint Statistical Meetings

K-means

Clustering that divides the data into k discrete groups as defined by some distance measurement

K-medoids

Similar to K-means except it handles categorical data and is more robust to outliers

knitr

Modern package for interweaving R code with LaTeX or Markdown

Lasso Regression

Modern regression using an L1 penalty to perform variable selection and dimension reduction

LaTeX

High-quality typesetting program especially well suited for mathematical and scientific documents and books

level

A unique value in a factor variable

linear model

Model that is linear in the coefficients

link function

Function that transforms response data so it can be modeled with a GLM

Linux

Open source operating system

list

Robust data type that can hold any arbitrary data types

log

The inverse of exponentiation; typically the natural log in statistics

Log-normal Distribution

Probability distribution of a variable whose log is Normally distributed

logical

Data type that takes on the values TRUE or FALSE

Logistic Distribution

Probability distribution used primarily for logistic regression

Logistic Regression

Regression for modeling a binary response

logit

The opposite of the inverse logit; transforms numbers between 0 and 1 to the real numbers

loop

Code that iterates through some index

MA

See Moving Average

Mac OS X

Apple’s proprietary operating system

Machine Learning

Modern, computationally heavy statistics

MapReduce

Paradigm where data are split into discrete sets, computed on, and then recombined in some fashion

Markdown

Simplified formatting syntax used to produce elegant HTML documents in a simple fashion

Matlab

Expensive commercial software for mathematical programming

matrix

Two-dimensional data type

matrix algebra

Algebra performed on matrices, which greatly simplifies the math

maximum

Largest value in a set of data

mean

Mathematical average; typically either arithmetic (traditional average) or weighted

mean squared error

Quality measure for an estimator; the average of the squares of the differences between an estimator and the true value

median

Middle number of an ordered set of numbers; when there is an even number of values, the median is the mean of the middle two

Meetup

A Web site that facilitates real-life social interaction for any number of interests; particularly popular in the data field

memory

Also referred to as RAM, this is where the data that R analyzes is stored while being processed; this is typically the limiting factor on the size of data that R can handle

Microsoft Access

Lightweight database from Microsoft

Microsoft SQL Server

Enterprise-level database from Microsoft

minimum

Smallest value in a set of data

Minitab

GUI based statistical package

missing data

A big problem in statistics, this is data that is not available to compute for any one of a number of reasons

MKL

See Intel Matrix Kernel Library

model complexity

Primarily how many variables are included in the model; overly complex models can be problematic

model selection

Process of choosing the best model from a set of candidate models

Moving Average

Time series model that is a linear regression of the current value of a time series against current and previous residuals

multicollinearity

When one column in a matrix is a linear combination of any other columns

multidimensional scaling

Projecting multiple dimensions into a smaller dimensionality

Multinomial Distribution

Probability distribution for discrete data that can take on any of k classes

Multinomial Regression

Regression for discrete response that can take on any of k classes

multiple comparisons

Doing repeated tests on multiple groups

multiple imputation

Advanced process to fill in missing data using repeated regressions

Multiple Regression

Regression with more than one predictor

MySQL

Open source database

NA

Value that indicates missing data

namespace

Convention where functions belong to specific packages; helps solve conflicts when multiple functions have the same name

natural cubic spline

Smoothing function with smooth transitions at interior breakpoints and linear behavior beyond the endpoints of the input data

Negative Binomial Distribution

Probability distribution for the number of trials required to obtain r successes; this is often used as the approximate distribution for quasipoisson regression

nonlinear least squares

Least squares regression (squared error loss) where the model is nonlinear in its parameters

nonlinear model

Model where the variables do not necessarily have a linear relationship, such as decision trees and GAMs

nonparametric model

Model where the response does not necessarily follow the regular GLM distributions such as Normal, Logistic or Poisson

Normal Distribution

The most common probability distribution, used for a wide array of phenomena; the familiar bell curve

NULL

A data concept that represents nothingness

null hypothesis

The assumed true value in hypothesis tests

numeric

Data type for storing numeric values

NYC Data Mafia

Informal term for the growing prevalence of data scientists in New York City

NYC Open Data

Initiative to make New York City government data transparent and available

Octave

Open-source version of Matlab

ODBC

See Open Database Connectivity

Open Database Connectivity

Industry standard for communicating data to and from a database

ordered factor

Character data where one level can be said to be greater or less than another level

overdispersion

When data show more variability than indicated by the theoretical probability distribution

p-value

The probability, if the null hypothesis were correct, of getting as extreme, or more extreme, a result

PACF

See partial autocovariance function

paired t-test

Two-sample t-test where every member of one sample is paired with a member of a second sample

PAM

See Partitioning Around Medoids

pandoc

Software for easy conversion of documents among various formats such as Markdown, HTML, LaTeX and Microsoft Word

parallel

In computational context, the running of multiple instructions simultaneously to speed computation

parallelization

The process of writing code to run in parallel

partial autocovariance function

The amount of correlation between a time series and lags of itself that is not explained by previous lags

Partitioning Around Medoids

Most common algorithm for K-medoids clustering

PDF

Common document format most often opened with Adobe Acrobat Reader

Penalized Regression

Form of regression where a penalty term prevents the coefficients from growing too large

Perl

Scripting language commonly used for text parsing

Poisson Distribution

Probability Distribution for count data

Poisson Regression

GLM for response data that are counts, such as number of accidents, number of touchdowns or number of ratings for a pizzeria

POSIXct

Date-time data type

prediction

Finding the expected value of response data for given values of predictors

predictor

Data that are used as inputs into a model and explain and/or predict the response

prior

Bayesian statistics use prior information, in the form of distributions for the coefficients of predictors, to improve the model fit

Python

Scripting language that is popular for data munging

Q-Q plot

Visual means of comparing two distributions by seeing if the quantiles of the two fall on a diagonal line

quantile

Value, corresponding to a specified percentage, for a set of numbers, below which that percent of numbers falls

quartile

One of the three values that split ordered data into four equal parts; the first quartile is the 25th percentile

Quasipoisson Distribution

Distribution (actually the Negative Binomial) used for estimating count data that are overdispersed

R-Bloggers

Popular site from Tal Galili that aggregates blogs about R

R Console

Where R commands are entered and results are shown

R Core Team

Group of 20 prime contributors to R who are responsible for its maintenance and direction

R Enthusiasts

Popular R blog by Romain François

R in Finance

Conference in Chicago about using R for finance

RAM

See memory

Random Forest

Ensemble method that builds multiple decision trees, each with a random subset of predictors, and combines the results to make predictions

Rcmdr

Graphical user interface (GUI) for R

Rcpp Gallery

Online collection of Rcpp examples

Rdata

File format for storing R objects on disk

regression

Method that analyzes the relationship between predictors and a response; the bedrock of statistics

regression tree

See decision tree

Regular Expressions

String pattern matching paradigm

regularization

Method to prevent overfitting of a model, usually by introducing a penalty term

residual sum of squares

Summation of the squared residuals

residuals

Differences between the actual response values and the fitted values from a model

response

Data that are the outcome of a model and are predicted and/or explained by the predictors

Revolution R

Commercial distribution of R developed by Revolution Analytics designed to be faster and more stable and scale better

Ridge Regression

Modern regression using an L2 penalty to shrink coefficients for more stable predictions

RSS

See residual sum of squares

RStudio

Powerful and popular open-source IDE for R

RTools

Set of tools needed in Windows for integrating C++, and other compiled code, into R

S

Statistical language developed at Bell Labs that was the precursor to R

S3

Basic object type in R

S4

Advanced object type in R

s5

HTML5 slide show format

SAS

Expensive commercial scripting software for statistical analysis

scatterplot

Two-dimensional display of data where each point represents a unique combination of two variables

shapefile

Common file format for map data

shrinkage

Reducing the size of coefficients to prevent overfitting

Simple Regression

Regression with one predictor, not including the intercept

slideous

HTML5 slide show format

slidy

HTML5 slide show format

slope

Ratio of a line’s rise and run; in regression this is represented by the coefficients

smoothing spline

Spline used for fitting a smooth trend to data

spline

Function f that is a linear combination of N functions (one for each unique data point) that are transformations of the variable x

SPSS

Expensive point-and-click commercial software for statistical analysis

SQL

Database language for accessing or inserting data

Stack Overflow

Online resource for programming questions

STAN

Next generation probabilistic programming language specializing in Bayesian computations

standard deviation

Square root of the variance; roughly how far, on average, each point is from the mean

standard error

Measure of the uncertainty for a parameter estimate

Stata

Commercial scripting language for statistical analysis

stationarity

When the mean and variance of a time series are constant for the whole series

stepwise selection

Process of choosing model variables by systematically fitting different models and adding or eliminating variables at each step

Strata

Large data conference

survival analysis

Analysis of time to event, such as death or failure

SUSE

Linux Distribution

SVN

Older version control standard

Sweave

Framework for interweaving R code with LaTeX; has been superseded by knitr

Systat

Commercial statistical package

t-statistic

Ratio where the numerator is the difference between the estimated mean and the hypothesized mean, and the denominator is the standard error of the estimated mean
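
The ratio can be computed by hand for a single sample. The following Python sketch assumes a one-sample setting; the function name `t_statistic` and the tiny data set are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
import statistics

def t_statistic(sample, hypothesized_mean):
    """Return (sample mean - hypothesized mean) / standard error of the mean."""
    n = len(sample)
    # Standard error: sample standard deviation divided by sqrt(n).
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - hypothesized_mean) / standard_error

# Mean is 5 and standard deviation is 1, so t = (5 - 4) / (1 / sqrt(3)).
print(t_statistic([4, 5, 6], 4))
```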

t-test

Test for the value of the mean of a group or the difference between the means of two groups

t Distribution

Probability distribution used for testing a mean with Student's t-test

tensor product

A way of representing transformation functions of predictors, possibly measured on different units

text editor

Program for editing code that preserves the structure of the text

TextPad

Popular text editor

time series

Data where the order and time of the data are important to its analysis

ts

Data type for storing time series data

Two Sample t-test

Test for the difference of means between two samples

Ubuntu

Linux Distribution

UltraEdit

Popular text editor

Uniform Distribution

Probability distribution where every value is equally likely to be drawn

USAID Open Government

Initiative to make U.S. Aid data transparent and available

useR!

Conference for R users

VAR

See Vector Autoregressive Model

variable

R object; can be data, functions, any object

variance

Measure of the variability, or spread, of the data

vector

A collection of data elements, all of the same type

Vector Autoregressive Model

Multivariate time series model

version control

Means of saving snapshots of code at different time periods for easy maintenance and collaboration

vim

Text editor popular among programmers

violin plot

Similar to a boxplot except that the box is curved, giving a sense of the density of the data

Visual Basic

Programming language for building macros, mostly associated with Excel

Visual Studio

IDE produced by Microsoft

Wald test

Test for comparing models

Weibull Distribution

Probability distribution for the lifetime of an object

weighted mean

Mean where each value carries a weight, allowing the numbers to have different effects on the mean
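
A short worked example makes the effect of the weights concrete. This is a minimal Python sketch; the function name `weighted_mean` and the numbers are assumptions chosen for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(value * weight) / sum(weight)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# The last value counts twice as much as each of the others.
print(weighted_mean([80, 90, 100], [1, 1, 2]))  # 92.5
```

With equal weights the result reduces to the ordinary arithmetic mean, which for these three values is 90.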

weights

Importance given to observations in data so that one observation can be valued more or less than another

Welch t-test

Test for the difference in means between two samples where the variances of each sample can be different

white noise

Essentially random data

Windows Live Writer

Desktop blog publishing application from Microsoft

Xcode

Apple’s IDE

xkcd

Web comic by Randall Munroe, beloved by statisticians, physicists and mathematicians

XML

Extensible Markup Language; often used to descriptively store and transport data

xts

Advanced data type for storing time series data
