See Akaike Information Criterion
See Akaike Information Criterion Corrected
Measure of model fit quality that penalizes model complexity
Akaike Information Criterion Corrected
Version of AIC with greater penalty for model complexity
See ANOVA
Survival analysis for modeling time to multiple events
Test for comparing the means of multiple groups; the test can only detect whether there is a difference between any two groups and cannot tell which groups differ from which others
Nonparametric test for the equality of variances between two groups
See autoregressive
Like an ARMA model but it includes a parameter for the number of differences of the time series data
See Autoregressive Moving Average
Object that holds data in multiple dimensions
When observations in a single variable are correlated with previous observations
The correlation of a time series with lags of itself
Time series model that is a linear regression of the current value of a time series against previous values
Combination of AR and MA models
While generally held to be the arithmetic mean, average is actually a generic term that can mean any number of measures of centrality such as the mean, median or mode
Parametric test for the equality of variances between two groups
A command line processor in the same vein as DOS; mainly used on Linux and Mac OS X, though there is an emulator for Windows
Functions whose linear combination make up other functions
Basis functions used to compose splines
Type of statistics where prior information is used to inform the model
Bayesian Information Criterion
Similar to AIC but with an even greater penalty for model complexity
LaTeX document class for producing slide shows
Probability distribution for modeling the success or failure of an event
Probability distribution for modeling a set of possible values on a finite interval
See Bayesian Information Criterion
Probability distribution for modeling the number of successful independent trials with identical probabilities of success
Repository of R packages for the analysis of genomic data
Online Git repository
Fast C++ library
A process in which data are resampled repeatedly, and a statistic is calculated for each resampling to form an empirical distribution for that statistic
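The resampling process described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `bootstrap_means`, the sample values, and the choice of the mean as the statistic are assumptions made for the example, not part of the glossary.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=42):
    """Resample data with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical sample, used only to exercise the function
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]
means = bootstrap_means(sample)

# The spread of the bootstrap means approximates the standard error of the mean
print(statistics.stdev(means))
```

The list `means` is the empirical distribution of the statistic; its standard deviation or percentiles can then be used as rough uncertainty estimates.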
A graphical display of one variable where the middle 50% of the data are in a box, and there are lines reaching out to 1.5 times the Interquartile Range and dots representing outliers
Probabilistic programming language specializing in Bayesian computations
The process of turning human readable code into machine code that runs faster
A fast, low-level programming language; R is written primarily in C
A fast, low-level programming language that is similar to C
Probability distribution for the ratio of two Normal random variables
Data with unknown information, such as the occurrence of an event after a cutoff time
Data type for storing text
The distribution of the sum of k squared standard normal random variables
Piece of R code inside a LaTeX or Markdown document
Type of an R object
Determining the class membership of data
Partitioning data into groups
A multiplier associated with a variable in an equation; in statistics this is typically what is being estimated by a regression
A visual display of the coefficients and standard errors from a regression
Comprehensive R Archive Network
See CRAN
A range within which an estimate should fall a certain percent of time
The strength of the association between two variables
A measure of the association between two variables; the strength of the relationship is not necessarily indicated
Model for survival analysis where predictors have a multiplicative effect on the hazard rate
The central repository for all things R
A modern form of model assessment where the data are split into k discrete folds, and a model is repeatedly fitted on all but one and used to make predictions on the holdout fold
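The k-fold procedure described above can be sketched as follows. This is a hedged illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_validate_mean` are hypothetical, and the "model" is deliberately trivial (it predicts the mean of the training responses) so that the fold handling stays in focus.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate_mean(ys, k=5):
    """Mean squared error of a mean-only model, scored on each holdout fold."""
    errors = []
    for holdout in k_fold_indices(len(ys), k):
        held = set(holdout)
        # Fit on all folds but one: here "fitting" is just taking the mean
        train = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train) / len(train)
        # Predict on the holdout fold and record the squared errors
        errors.extend((ys[i] - prediction) ** 2 for i in holdout)
    return sum(errors) / len(errors)

# Hypothetical response values
print(cross_validate_mean([3.0, 4.5, 2.2, 5.1, 3.9, 4.4, 2.8, 3.3, 4.0, 3.6]))
```

Every observation lands in exactly one holdout fold, so each prediction is made by a model that never saw that observation.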
Data science conference in New York
The process of cleaning, correcting, aggregating, joining and manipulating data to prepare it for analysis
The confluence of statistics, machine learning, computer engineering, visualization and social skills
The main data type in R, similar to a spreadsheet with tabular rows and columns
A high speed extension of data.frames
Store of data, usually in relational tables
Data type for storing dates
Enterprise level database from IBM
Linux Distribution
Modern technique for performing nonlinear regression or classification by iteratively splitting predictors
For some statistic or distribution, this is the number of observations minus the number of parameters being estimated
Display showing the probability of observations falling within a sliding window along a variable of interest
A measure of error for generalized linear models
The amount by which deviance drops when adding a variable to a model; a general rule of thumb is that deviance should drop by two for each term added
Data source connection used to describe communication to a data source, often a database
HTML5 slide show format
New algorithm that is a dynamic blending of lasso and ridge regressions, which is great for predictions and dealing with high dimensional datasets
Text editor popular among programmers
Method of combining multiple models to get an average prediction
The most commonly used data analysis tool in the world
Weighted mean
Visually and numerically exploring data to get a sense of it before performing rigorous analysis
Probability distribution often used to model the amount of time until an event occurs
Statistical test often used for comparing models, as with the ANOVA
The ratio of two Chi-Squared Distributions, often used as the null distribution in analysis of variance
Special data type for handling character data as an integer value with character labels; important for including categorical data in models
Values predicted by a model, mostly used to denote predictions made on the same data used to fit the model
Novel interface in R that allows specification of a model using convenient mathematical notation
High-speed, low-level language; much of R is written in FORTRAN
Federal Reserve Economic Data
File Transfer Protocol
Open source compiler for C++
See Generalized Additive Models
Probability distribution for the time one has to wait for n events to occur
GLM for response data that are continuous, positive and skewed, such as auto insurance claims
Measure of clustering quality, which compares the within-cluster dissimilarity for a clustering of the data with that of a bootstrapped sample of data
See Generalized Autoregressive Conditional Heteroskedasticity
Family of open-source compilers
Models that are formed by adding a series of smoother functions fitted on individual variables
Generalized Autoregressive Conditional Heteroskedasticity
Time series method that is more robust to extreme values of data
Family of regression models that model non-normal response data such as binary and count data
Probability distribution for the number of Bernoulli trials required before the first success occurs
Popular version control standard
Online Git repository
Framework for distributing data and computations across a grid of computers
Measure of clustering quality, which compares the within-cluster sum of squares for a clustering of k clusters and one with k + 1 clusters
Visual display where the relationship between two variables is visualized as a mix of colors
Form of clustering where each observation belongs to a cluster, which in turn belongs to a larger cluster and so on until the whole dataset is represented
Display of the counts of observations falling in discrete buckets of a variable of interest
Hypertext Markup Language; used for creating Web pages
Probability distribution for drawing k successes out of a possible N items, of which K are considered successes
Test for the significance of a statistic that is being estimated
See Integrated Development Environment
Binary variables representing one level of a categorical variable; also called dummy variables
Drawing conclusions on how predictors affect a response
Data type that is only whole numbers, either positive, negative or zero
Integrated Development Environment
Software with features to make programming easier
Optimized matrix algebra library
The combined effect of two or more variables in a regression
Constant term in a regression; literally, the point where the best fit line passes through the y-axis; it is generalized for higher dimensions
The third quartile minus the first quartile
Function that transforms linear predictors to the original scale of the response data
Transformation needed to interpret logistic regression on the 0/1 scale; scales any number to be between 0 and 1
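As a small sketch of this transformation and its opposite, the logit, the two functions below use the standard formulas 1 / (1 + exp(-x)) and log(p / (1 - p)). The function names are chosen for the example, and the branch on the sign of `x` is only an implementation detail to avoid floating-point overflow.

```python
import math

def inv_logit(x):
    """Map any real number to the interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form to avoid overflow in exp(-x)
    e = math.exp(x)
    return e / (1.0 + e)

def logit(p):
    """Map a probability strictly between 0 and 1 to the real line."""
    return math.log(p / (1.0 - p))

print(inv_logit(0))              # 0.5
print(logit(inv_logit(2.0)))     # approximately 2.0
```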
Low-level programming language
Conference for statisticians
See Joint Statistical Meetings
Clustering that divides the data into k discrete groups as defined by some distance measurement
Similar to K-means except it handles categorical data and is more robust to outliers
Modern package for interweaving R code with LaTeX or Markdown
Modern regression using an L1 penalty to perform variable selection and dimension reduction
High-quality typesetting program especially well suited for mathematical and scientific documents and books
A unique value in a factor variable
Model that is linear in the coefficients
Function that transforms response data so it can be modeled with a GLM
Open source operating system
Robust data type that can hold any arbitrary data types
The inverse of an exponent; typically the natural log in statistics
Probability distribution whose log is Normally distributed
Data type that takes on the values TRUE or FALSE
Probability distribution used primarily for logistic regression
Regression for modeling a binary response
The opposite of the inverse logit; transforms numbers between 0 and 1 to the real numbers
Code that iterates through some index
See Moving Average
Apple’s proprietary operating system
Modern, computationally heavy statistics
Paradigm where data are split into discrete sets, computed on, and then recombined in some fashion
Simplified formatting syntax used to produce elegant HTML documents in a simple fashion
Expensive commercial software for mathematical programming
Two-dimensional data type
Algebra performed on matrices, which greatly simplifies the math
Largest value in a set of data
Mathematical average; typically either arithmetic (traditional average) or weighted
Quality measure for an estimator; the average of the squares of the differences between an estimator and the true value
Middle number of an ordered set of numbers; when there are an even number of numbers, the median is the mean of the middle two numbers
A Web site that facilitates real-life social interaction for any number of interests; particularly popular in the data field
Also referred to as RAM, this is where the data that R analyzes is stored while being processed; this is typically the limiting factor on the size of data that R can handle
Lightweight database from Microsoft
Enterprise-level database from Microsoft
Smallest value in a set of data
GUI based statistical package
A big problem in statistics, this is data that is not available to compute for any one of a number of reasons
See Intel Matrix Kernel Library
Primarily how many variables are included in the model; overly complex models can be problematic
Process of fitting the optimal model
Time series model that is a linear regression of the current value of a time series against current and previous residuals
When one column in a matrix is a linear combination of any other columns
Projecting multiple dimensions into a smaller dimensionality
Probability distribution for discrete data that can take on any of k classes
Regression for discrete response that can take on any of k classes
Doing repeated tests on multiple groups
Advanced process to fill in missing data using repeated regressions
Regression with more than one predictor
Open source database
Value that indicates missing data
Convention where functions belong to specific packages; helps solve conflicts when multiple functions have the same name
Smoothing function with smooth transitions at interior breakpoints and linear behavior beyond the endpoints of the input data
Negative Binomial Distribution
Probability distribution for the number of trials required to obtain r successes; this is often used as the approximate distribution for quasi-Poisson regression
Least squares regression (squared error loss) with nonlinear parameters
Model where the variables do not necessarily have a linear relationship, such as decision trees and GAMs
Model where the response does not necessarily follow the regular GLM distributions such as Normal, Logistic or Poisson
The most common probability distribution, used for a wide array of phenomena; the familiar bell curve
A data concept that represents nothingness
The assumed true value in hypothesis tests
Data type for storing numeric values
Informal term for the growing prevalence of data scientists in New York City
Initiative to make New York City government data transparent and available
Open-source version of Matlab
See Open Database Connectivity
Industry standard for communicating data to and from a database
Character data where one level can be said to be greater or less than another level
When data show more variability than indicated by the theoretical probability distribution
The probability, if the null hypothesis were correct, of getting as extreme, or more extreme, a result
See partial autocorrelation function
Two-sample t-test where every member of one sample is paired with a member of a second sample
See Partitioning Around Medoids
Software for easy conversion of documents among various formats such as Markdown, HTML, LaTeX and Microsoft Word
In computational context, the running of multiple instructions simultaneously to speed computation
The process of writing code to run in parallel
partial autocorrelation function
The amount of correlation between a time series and lags of itself that is not explained by previous lags
Most common algorithm for K-medoids clustering
Common document format most often opened with Adobe Acrobat Reader
Form of regression where a penalty term prevents the coefficients from growing too large
Scripting language commonly used for text parsing
Probability Distribution for count data
GLM for response data that are counts, such as number of accidents, number of touchdowns or number of ratings for a pizzeria
Date-time data type
Finding the expected value of response data for given values of predictors
Data that are used as inputs into a model and explain and/or predict the response
Bayesian statistics use prior information, in the form of distributions for the coefficients of predictors, to improve the model fit
Scripted language that is popular for data munging
Visual means of comparing two distributions by seeing if the quantiles of the two fall on a diagonal line
Value, corresponding to a specified percentage, for a set of numbers, below which that percent of numbers falls
The 25th percentile
Distribution (actually the Negative Binomial) used for estimating count data that are overdispersed
Popular site from Tal Galili that aggregates blogs about R
Where R commands are entered and results are shown
Group of 20 prime contributors to R who are responsible for its maintenance and direction
Popular R blog by Romain François
Conference in Chicago about using R for finance
See memory
Ensemble method that builds multiple decision trees, each with a random subset of predictors, and combines the results to make predictions
GUI interface to R
Online collection of Rcpp examples
File format for storing R objects on disk
Method that analyzes the relationship between predictors and a response; the bedrock of statistics
See decision tree
String pattern matching paradigm
Method to prevent overfitting of a model, usually by introducing a penalty term
Summation of the squared residuals
Difference between fitted values from a model and the actual response values
Data that are the outcome of a model and are predicted and/or explained by the predictors
Commercial distribution of R developed by Revolution Analytics, designed to be faster, more stable and more scalable
Modern regression using an L2 penalty to shrink coefficients for more stable predictions
Powerful and popular open-source IDE for R
Set of tools needed in Windows for integrating C++, and other compiled code, into R
Statistical language developed at Bell Labs that was the precursor to R
Basic object type in R
Advanced object type in R
HTML5 slide show format
Expensive commercial scripting software for statistical analysis
Two-dimensional display of data where each point represents a unique combination of two variables
Common file format for map data
Reducing the size of coefficients to prevent overfitting
Regression with one predictor, not including the intercept
HTML5 slide show format
HTML5 slide show format
Ratio of a line’s rise and run; in regression this is represented by the coefficients
Spline used for fitting a smooth trend to data
Function f that is a linear combination of N functions (one for each unique data point) that are transformations of the variable x
Expensive point-and-click commercial software for statistical analysis
Database language for accessing or inserting data
Online resource for programming questions
Next generation probabilistic programming language specializing in Bayesian computations
How far, on average, each point is from the mean
Measure of the uncertainty for a parameter estimate
Commercial scripting language for statistical analysis
When the mean and variance of a time series are constant for the whole series
Process of choosing model variables by systematically fitting different models and adding or eliminating variables at each step
Large data conference
Analysis of time to event, such as death or failure
Linux Distribution
Older version control standard
Framework for interweaving R code with LaTeX; has been superseded by knitr
Commercial statistical package
Ratio where the numerator is the difference between the estimated mean and the hypothesized mean, and the denominator is the standard error of the estimated mean
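The ratio described above can be written out directly for the one-sample case. The sketch below is illustrative only: the function name `t_statistic` and the sample values are assumptions, and the standard error is taken as the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

def t_statistic(sample, mu0):
    """(estimated mean - hypothesized mean) / standard error of the mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Sample standard deviation (n - 1 in the denominator)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return (mean - mu0) / std_err

# Hypothetical data: is the mean different from 2?
print(t_statistic([1, 2, 3, 4, 5], mu0=2))
```

The resulting value would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.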
Test for the value of the mean of a group or the difference between the means of two groups
Probability distribution used for testing a mean with a Student's t-test
A way of representing transformation functions of predictors, possibly measured on different units
Program for editing code that preserves the structure of the text
Popular text editor
Data where the order and time of the data are important to its analysis
Data type for storing time series data
Test for the difference of means between two samples
Linux Distribution
Popular text editor
Probability distribution where every value is equally likely to be drawn
Initiative to make U.S. Aid data transparent and available
Conference for R users
See Vector Autoregressive Model
R object; can be data, functions, any object
Measure of the variability, or spread, of the data
A collection of data elements, all of the same type
Multivariate time series model
Means of saving snapshots of code at different time periods for easy maintenance and collaboration
Text editor popular among programmers
Similar to a boxplot except that the box is curved, giving a sense of the density of the data
Programming language for building macros, mostly associated with Excel
IDE produced by Microsoft
Test for comparing models
Probability distribution for the lifetime of an object
Mean where each value carries a weight, allowing the numbers to have different effects on the mean
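As a brief sketch of this definition, a weighted mean is the sum of each value times its weight, divided by the sum of the weights. The function name and the example numbers below are assumptions for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# The value 10 counts three times as much as the value 20
print(weighted_mean([10, 20], [3, 1]))  # 12.5
```

With equal weights the result reduces to the ordinary arithmetic mean.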
Importance given to observations in data so that one observation can be valued more or less than another
Test for the difference in means between two samples where the variances of each sample can be different
Essentially random data
Desktop blog publishing application from Microsoft
Apple’s IDE
Web comic by Randall Munroe, beloved by statisticians, physicists and mathematicians
Extensible Markup Language; often used to descriptively store and transport data
Advanced data type for storing time series data