Using statistical functions

R is highly productive for statistical computing and modeling because it provides a wide variety of functions, ranging from random sampling to statistical testing. Functions in the same category share a common interface. In this section, I will demonstrate a number of examples so that you can infer the usage of other, similar functions.

Sampling from a vector

In statistics, the study of a population often begins with a random sample from it. The sample() function draws a random sample from a given vector or list. By default, sample() draws without replacement. For example, the following code draws a sample of five values from a numeric vector without replacement:

sample(1:6, size = 5)
## [1] 2 6 3 1 4 

With replace = TRUE, the sampling is done with replacement:

sample(1:6, size = 5, replace = TRUE)
## [1] 3 5 3 4 2 
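Because each call to sample() produces a new random draw, the results above will differ from run to run. A quick way to make such examples reproducible is to fix the random seed with set.seed() before sampling; the following is a small sketch (the seed value 42 is arbitrary):

```r
# Fixing the seed makes the pseudo-random sequence, and hence the
# sample, reproducible across runs
set.seed(42)
s1 <- sample(1:6, size = 5)
set.seed(42)  # resetting the seed replays the same sequence
s2 <- sample(1:6, size = 5)
identical(s1, s2)
## [1] TRUE
```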

Although sample() is often used to draw samples from a numeric vector, it also works with other types of vectors:

sample(letters, size = 3)
## [1] "q" "w" "g" 

It even works with lists:

sample(list(a = 1, b = c(2, 3), c = c(3, 4, 5)), size = 2)
## $b
## [1] 2 3
##
## $c
## [1] 3 4 5 

In fact, sample() is capable of sampling from any object that supports subsetting with brackets ([]). In addition, it supports weighted sampling, that is, you can specify a probability for each element:

grades <- sample(c("A", "B", "C"), size = 20, replace = TRUE,
                 prob = c(0.25, 0.5, 0.25))
grades
## [1] "C" "B" "B" "B" "C" "C" "C" "C" "C" "B" "B" "A" "A" "C"
## [15] "B" "B" "A" "B" "A" "C" 

We can use table() to see the number of occurrences of each value:

table(grades)
## grades
## A B C
## 4 8 8 
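Since the grades were drawn with prob = c(0.25, 0.5, 0.25), the observed relative frequencies should be close to those weights once the sample is large enough. As a sketch (using a larger, illustrative sample size of 2,000), prop.table() converts the counts into proportions:

```r
set.seed(1)  # fixed seed so the sketch is reproducible
grades2 <- sample(c("A", "B", "C"), size = 2000, replace = TRUE,
                  prob = c(0.25, 0.5, 0.25))
# prop.table() turns a table of counts into relative frequencies,
# which should lie near the sampling weights 0.25, 0.5, 0.25
prop.table(table(grades2))
```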

Working with random distributions

In numeric simulations, we more often need to draw samples from a random distribution than from a given vector. R provides a wide variety of built-in functions for working with popular probability distributions; most of these tools operate on numeric vectors.

In R, it is very easy to generate random numbers following a statistical distribution. The two most commonly used distributions are the uniform distribution and the normal distribution.

In a statistical sense, it is equally probable to draw any value from a uniform distribution within a given range. We can call runif(n) to generate n random numbers from a uniform distribution over [0,1]:

runif(5)
## [1] 0.8894535 0.1804072 0.6293909 0.9895641 0.1302889 

To generate random numbers within a non-default interval, specify min and max:

runif(5, min = -1, max = 1)
## [1] -0.3386789  0.7302411  0.5551689  0.6546069  0.2066487 

If we generate 1,000 random numbers using runif(1000) and draw the points, we get a scatter plot (a plot showing X-Y points) as follows:

(Figure: plot of 1,000 draws from runif())

The histogram shows that the random numbers we generated are distributed almost evenly across each interval from 0 to 1, which is consistent with a uniform distribution.

Another distribution commonly seen in the real world is the normal distribution. Similar to runif(), we can use rnorm() to generate random numbers following a standard normal distribution:

rnorm(5)
## [1]  0.7857579  1.1820321 -0.9558760 -1.0316165  0.4336838 

You may notice that the random generator functions share the same interface. The first argument of both runif() and rnorm() is n, the number of values to generate, and the rest of the arguments are the parameters of the random distribution itself. As for a normal distribution, its parameters are mean and standard deviation (sd):

rnorm(5, mean = 2, sd = 0.5)
## [1] 1.597106 1.971534 2.374846 3.023233 2.033357 

The plot generated is as shown:

(Figure: plot of draws from rnorm())

From the preceding graphics, it is obvious that the points are not evenly distributed but are instead concentrated around the mean. As we know, statistical distributions can be described by certain formulas. To make these formulas accessible, R provides a family of functions for each built-in random distribution. More specifically, for the uniform distribution, R provides its probability density function dunif(), cumulative distribution function punif(), quantile function qunif(), and random generator runif(). For the normal distribution, the corresponding names are dnorm(), pnorm(), qnorm(), and rnorm(). The same naming scheme of density function, cumulative distribution function, quantile function, and random generator also applies to the other distributions R supports.
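To see how these functions fit together, here is a small sketch for the normal distribution: qnorm() inverts pnorm(), and dnorm() gives the height of the density curve (the inputs 1.96 and 0 below are just illustrative values):

```r
p <- pnorm(1.96)   # cumulative probability P(X <= 1.96), about 0.975
q <- qnorm(p)      # the quantile function inverts pnorm(), giving 1.96 back
d <- dnorm(0)      # density at 0, equal to 1 / sqrt(2 * pi)
c(p = p, q = q, d = d)
```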

In addition to these two most commonly used distributions, R also provides functions for discrete distributions, such as the binomial distribution, and other continuous distributions, such as the exponential distribution. You can run ?Distributions to see a full list of supported distributions. Their properties are beyond the scope of this book; if you are interested, you may consult any textbook on probability theory or visit Wikipedia (https://en.wikipedia.org/wiki/Probability_distribution) for more details.

R supports many distributions, and each of them has corresponding functions. Fortunately, we don't need to remember many different function names because they all follow the same naming convention.
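For example, the binomial distribution uses the stem binom (rbinom(), dbinom(), pbinom(), qbinom()) and the exponential distribution uses exp (rexp(), dexp(), pexp(), qexp()); the parameter values below are arbitrary illustrations:

```r
rbinom(5, size = 10, prob = 0.5)  # 5 draws of "successes out of 10 trials"
dbinom(5, size = 10, prob = 0.5)  # P(exactly 5 successes) = 252/1024
pexp(1, rate = 1)                 # P(X <= 1) = 1 - exp(-1)
```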

Computing summary statistics

For a given dataset, we often need some summary statistics to get an initial impression on it. R provides a set of functions to compute summary statistics for a numeric vector, including mean, median, standard deviation, variance, maximum, minimum, range, and quantiles. For multiple numeric vectors, we can compute the covariance matrix and correlation matrix.

The following examples show how we use the built-in functions to compute these summary statistics. First, we generate a random numeric vector of length 50 from the standard normal distribution:

x <- rnorm(50) 

To compute the arithmetic sample mean value of x, we call mean():

mean(x)
## [1] -0.1051295 

This is equivalent to:

sum(x) / length(x)
## [1] -0.1051295 

Additionally, mean() supports trimming a fraction of observations from each end of the sorted input data:

mean(x, trim = 0.05)
## [1] -0.141455 

If x contains a few outliers far from the other values, the trimmed mean obtained from the preceding call should be more robust, since the outliers are omitted from the computation.

An alternative measure of the central location of sample data is the sample median. For a given sample, half of the observations are above the median and the other half are below it. The median is a robust measure when there are a few extreme values in the data. For x, the sample median is:

median(x)
## [1] -0.2312157 
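To make the robustness claims concrete, here is a small sketch with made-up data: adding a single extreme outlier drags the plain mean far away, while the trimmed mean and the median barely move:

```r
x0 <- c(1, 2, 3, 4, 5)
x1 <- c(x0, 1000)     # the same data plus one extreme outlier
mean(x1)              # pulled far toward the outlier
## [1] 169.1667
mean(x1, trim = 0.2)  # drops one observation from each end here
## [1] 3.5
median(x1)
## [1] 3.5
```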

In addition to location measures such as mean and median, variation measures are important too. To compute the standard deviation of a numeric vector, we use sd():

sd(x)
## [1] 0.8477752 

To compute the variance, we use var():

var(x)
## [1] 0.7187228 

To get the extreme values in the data, we simply use min() and max():

c(min = min(x), max = max(x))
##       min       max 
## -1.753655  2.587579 

Alternatively, you can use range() to directly get both:

range(x)
## [1] -1.753655  2.587579 

Sometimes, the data is not regularly distributed. In this case, the location measures and variation measures suffer from such irregularity and may produce misleading results. Here, we should probably take a look at the values at critical quantiles of the data:

quantile(x)
##         0%        25%        50%        75%       100% 
## -1.7536547 -0.6774037 -0.2312157  0.2974412  2.5875789 

To see more quantiles, specify more values for the probs argument:

quantile(x, probs = seq(0, 1, 0.1))
##           0%          10%          20%          30% 
## -1.753654706 -1.116231750 -0.891186551 -0.504630513 
##          40%          50%          60%          70% 
## -0.412239924 -0.231215699  0.009806393  0.177344522 
##          80%          90%         100% 
##  0.550510144  0.968607716  2.587578887 

If the data is not regularly distributed, the gap between two adjacent quantiles can be very large or very small compared to the others. A shortcut is summary(), which directly gives the most commonly used summary statistics, including four quantiles, the median, and the mean:

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.7540 -0.6774 -0.2312 -0.1051  0.2974  2.5880 

Note that the minimum and maximum values are the 0 percent and 100 percent quantiles, respectively.
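One way to check this (with a fresh random sample, so the numbers themselves are not meaningful) is to compare the 0 percent and 100 percent quantiles against range():

```r
v <- rnorm(50)
# the extreme quantiles coincide exactly with the minimum and maximum
all(quantile(v, probs = c(0, 1)) == range(v))
## [1] TRUE
```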

In fact, summary() is a generic function that works for many types of objects and has different behaviors. For example, summary() works with data frames too:

df <- data.frame(score = round(rnorm(100, 80, 10)),
                 grade = sample(letters[1:3], 100, replace = TRUE))
summary(df)
##      score        grade 
##  Min.   : 60.00   a:34  
##  1st Qu.: 73.00   b:38  
##  Median : 79.00   c:28  
##  Mean   : 79.65         
##  3rd Qu.: 86.00         
##  Max.   :107.00 

It can be seen that for a numeric column, summary() shows the summary statistics, while for columns of other types, it simply shows a table of value occurrences.

Computing covariance and correlation matrix

The preceding examples introduced the most commonly used summary statistics for one vector. For two or more vectors, we can compute the covariance matrix and the correlation matrix.

The following code generates another vector that is correlated with x:

y <- 2 * x + 0.5 * rnorm(length(x)) 

We can compute the covariance between x and y:

cov(x, y)
## [1] 1.419859 

We can also compute the correlation coefficient:

cor(x, y)
## [1] 0.9625964 

These two functions also work with more than two vectors. To compute the covariance or correlation matrix of more than two vectors, we supply a matrix or a data frame. In the following example, we generate another random vector z of the same length as x. This time, z follows a uniform distribution and does not depend on either x or y. We use cbind() to create a three-column matrix and compute its covariance matrix:

z <- runif(length(x))
m1 <- cbind(x, y, z)
cov(m1)
##           x          y          z
## x 0.7187228 1.41985899 0.04229950
## y 1.4198590 3.02719645 0.07299981
## z 0.0422995 0.07299981 0.08005535 

Similarly, we can call cor() directly with the matrix to compute the correlation matrix.

cor(m1)
##           x         y         z
## x 1.0000000 0.9625964 0.1763434
## y 0.9625964 1.0000000 0.1482881
## z 0.1763434 0.1482881 1.0000000 
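The correlation matrix is just the covariance matrix rescaled by the standard deviations on its diagonal. Base R exposes this normalization as cov2cor(); the following sketch (regenerating the data so it is self-contained) checks that cov2cor(cov(m1)) agrees with cor(m1):

```r
x <- rnorm(50)
y <- 2 * x + 0.5 * rnorm(length(x))
z <- runif(length(x))
m1 <- cbind(x, y, z)
# cov2cor() divides each covariance by the product of the two
# corresponding standard deviations, yielding the correlation matrix
all.equal(cov2cor(cov(m1)), cor(m1))
## [1] TRUE
```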

Since y is generated by a linear relationship with x, plus some noise, we should expect that x and y are highly correlated, but the same thing should not happen with z. The correlation matrix looks consistent with our expectation. To draw such a conclusion in a statistical sense, we need to perform rigorous statistical tests, which is beyond the scope of this book.
