Using statistical functions

R is highly productive for statistical computing and modeling because it provides a wide variety of functions, ranging from random sampling to statistical testing. Functions in the same category share a common interface. In this section, I will demonstrate a number of examples so that you can infer the usage of other, similar functions.

Sampling from a vector

In statistics, the study of a population often begins with a random sample from it. The sample() function draws a random sample from a given vector or list. By default, sample() draws without replacement. For example, the following code draws a sample of five values from a numeric vector without replacement:

sample(1:6, size = 5)
## [1] 2 6 3 1 4 

With replace = TRUE, the sampling is done with replacement:

sample(1:6, size = 5, replace = TRUE)
## [1] 3 5 3 4 2 
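Because each call to sample() produces a new random draw, the results above will differ from run to run. A quick way to make such examples reproducible is to fix the random seed with set.seed() before sampling; the following is a small sketch (the seed value 42 is arbitrary):

```r
# Fixing the seed makes the pseudo-random sequence, and hence the
# sample, reproducible across runs
set.seed(42)
s1 <- sample(1:6, size = 5)
set.seed(42)  # resetting the seed replays the same sequence
s2 <- sample(1:6, size = 5)
identical(s1, s2)
## [1] TRUE
```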

Although sample() is often used to draw samples from a numeric vector, it also works with other types of vectors:

sample(letters, size = 3)
## [1] "q" "w" "g" 

It even works with lists:

sample(list(a = 1, b = c(2, 3), c = c(3, 4, 5)), size = 2)
## $b
## [1] 2 3
##
## $c
## [1] 3 4 5 

In fact, sample() is capable of sampling from any object that supports subsetting with brackets ([]). In addition, it supports weighted sampling, that is, you can specify a probability for each element:

grades <- sample(c("A", "B", "C"), size = 20, replace = TRUE,
                 prob = c(0.25, 0.5, 0.25))
grades
## [1] "C" "B" "B" "B" "C" "C" "C" "C" "C" "B" "B" "A" "A" "C"
## [15] "B" "B" "A" "B" "A" "C" 

We can use table() to see the number of occurrences of each value:

table(grades)
## grades
## A B C
## 4 8 8 
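Since the grades were drawn with prob = c(0.25, 0.5, 0.25), the observed relative frequencies should be close to those weights once the sample is large enough. As a sketch (using a larger, illustrative sample size of 2,000), prop.table() converts the counts into proportions:

```r
set.seed(1)  # fixed seed so the sketch is reproducible
grades2 <- sample(c("A", "B", "C"), size = 2000, replace = TRUE,
                  prob = c(0.25, 0.5, 0.25))
# prop.table() turns a table of counts into relative frequencies,
# which should lie near the sampling weights 0.25, 0.5, 0.25
prop.table(table(grades2))
```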

Working with random distributions

In numeric simulations, we more often need to draw samples from a random distribution than from a given vector. R provides a wide variety of built-in functions for working with popular probability distributions; most of these tools operate on numeric vectors.

In R, it is very easy to generate random numbers following a statistical distribution. The two most commonly used distributions are the uniform distribution and the normal distribution.

In a statistical sense, it is equally probable to draw any value from a uniform distribution within a given range. We can call runif(n) to generate n random numbers from a uniform distribution over [0,1]:

runif(5)
## [1] 0.8894535 0.1804072 0.6293909 0.9895641 0.1302889 

To generate random numbers within a non-default interval, specify min and max:

runif(5, min = -1, max = 1)
## [1] -0.3386789  0.7302411  0.5551689  0.6546069  0.2066487 

If we generate 1,000 random numbers using runif(1000) and draw the points, we get a scatter plot (a plot showing X-Y points) as follows:

(Figure: plot of 1,000 draws from runif())

The histogram shows that the random numbers we generated are distributed almost evenly across each interval from 0 to 1, which is consistent with a uniform distribution.

Another distribution commonly seen in the real world is the normal distribution. Similar to runif(), we can use rnorm() to generate random numbers following a standard normal distribution:

rnorm(5)
## [1]  0.7857579  1.1820321 -0.9558760 -1.0316165  0.4336838 

You may notice that the random generator functions share the same interface. The first argument of both runif() and rnorm() is n, the number of values to generate, and the rest of the arguments are the parameters of the random distribution itself. As for a normal distribution, its parameters are mean and standard deviation (sd):

rnorm(5, mean = 2, sd = 0.5)
## [1] 1.597106 1.971534 2.374846 3.023233 2.033357 

The plot generated is as shown:

(Figure: plot of draws from rnorm())

From the preceding graphics, it is obvious that the points are not evenly distributed but are instead concentrated around the mean. As we know, statistical distributions can be described by certain formulas. To make these formulas accessible, R provides a family of functions for each built-in random distribution. More specifically, for the uniform distribution, R provides its probability density function dunif(), cumulative distribution function punif(), quantile function qunif(), and random generator runif(). For the normal distribution, the corresponding names are dnorm(), pnorm(), qnorm(), and rnorm(). The same naming scheme of density function, cumulative distribution function, quantile function, and random generator also applies to the other distributions R supports.
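To see how these functions fit together, here is a small sketch for the normal distribution: qnorm() inverts pnorm(), and dnorm() gives the height of the density curve (the inputs 1.96 and 0 below are just illustrative values):

```r
p <- pnorm(1.96)   # cumulative probability P(X <= 1.96), about 0.975
q <- qnorm(p)      # the quantile function inverts pnorm(), giving 1.96 back
d <- dnorm(0)      # density at 0, equal to 1 / sqrt(2 * pi)
c(p = p, q = q, d = d)
```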

In addition to these two most commonly used distributions, R also provides functions for discrete distributions, such as the binomial distribution, and other continuous distributions, such as the exponential distribution. You can run ?Distributions to see a full list of supported distributions. Their properties are beyond the scope of this book; if you are interested, you may consult any textbook on probability theory or visit Wikipedia (https://en.wikipedia.org/wiki/Probability_distribution) for more details.

R supports many distributions, and each of them has corresponding functions. Fortunately, we don't need to remember many different function names because they all follow the same naming convention.
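For example, the binomial distribution uses the stem binom (rbinom(), dbinom(), pbinom(), qbinom()) and the exponential distribution uses exp (rexp(), dexp(), pexp(), qexp()); the parameter values below are arbitrary illustrations:

```r
rbinom(5, size = 10, prob = 0.5)  # 5 draws of "successes out of 10 trials"
dbinom(5, size = 10, prob = 0.5)  # P(exactly 5 successes) = 252/1024
pexp(1, rate = 1)                 # P(X <= 1) = 1 - exp(-1)
```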

Computing summary statistics

For a given dataset, we often need some summary statistics to get an initial impression on it. R provides a set of functions to compute summary statistics for a numeric vector, including mean, median, standard deviation, variance, maximum, minimum, range, and quantiles. For multiple numeric vectors, we can compute the covariance matrix and correlation matrix.

The following examples show how we use the built-in functions to compute these summary statistics. First, we generate a random numeric vector of length 50 from the standard normal distribution:

x <- rnorm(50) 

To compute the arithmetic sample mean value of x, we call mean():

mean(x)
## [1] -0.1051295 

This is equivalent to:

sum(x) / length(x)
## [1] -0.1051295 

Additionally, mean() supports trimming a fraction of observations from each end of the sorted input data:

mean(x, trim = 0.05)
## [1] -0.141455 

If x contains a few outliers far from the other values, the trimmed mean obtained from the preceding call should be more robust, since the outliers are omitted from the computation.

An alternative measure of the central location of sample data is the sample median. For a given sample, half of the observations are above the median and the other half are below it. The median is a robust measure when there are a few extreme values in the data. For x, the sample median is:

median(x)
## [1] -0.2312157 
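To make the robustness claims concrete, here is a small sketch with made-up data: adding a single extreme outlier drags the plain mean far away, while the trimmed mean and the median barely move:

```r
x0 <- c(1, 2, 3, 4, 5)
x1 <- c(x0, 1000)     # the same data plus one extreme outlier
mean(x1)              # pulled far toward the outlier
## [1] 169.1667
mean(x1, trim = 0.2)  # drops one observation from each end here
## [1] 3.5
median(x1)
## [1] 3.5
```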

In addition to location measures such as mean and median, variation measures are important too. To compute the standard deviation of a numeric vector, we use sd():

sd(x)
## [1] 0.8477752 

To compute the variance, we use var():

var(x)
## [1] 0.7187228 

To get the extreme values in the data, we simply use min() and max():

c(min = min(x), max = max(x))
##       min       max 
## -1.753655  2.587579 

Alternatively, you can use range() to directly get both:

range(x)
## [1] -1.753655  2.587579 

Sometimes, the data is not regularly distributed. In this case, the location measures and variation measures suffer from such irregularity and may produce misleading results. Here, we should probably take a look at the values at critical quantiles of the data:

quantile(x)
##         0%        25%        50%        75%       100% 
## -1.7536547 -0.6774037 -0.2312157  0.2974412  2.5875789 

To see more quantiles, specify more values for the probs argument:

quantile(x, probs = seq(0, 1, 0.1))
##           0%          10%          20%          30% 
## -1.753654706 -1.116231750 -0.891186551 -0.504630513 
##          40%          50%          60%          70% 
## -0.412239924 -0.231215699  0.009806393  0.177344522 
##          80%          90%         100% 
##  0.550510144  0.968607716  2.587578887 

If the data is not regularly distributed, the gap between two adjacent quantiles can be very large or very small compared to the others. A shortcut is summary(), which directly gives the most commonly used summary statistics, including four quantiles, the median, and the mean:

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.7540 -0.6774 -0.2312 -0.1051  0.2974  2.5880 

Note that the minimum and maximum values are the 0 percent and 100 percent quantiles, respectively.
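One way to check this (with a fresh random sample, so the numbers themselves are not meaningful) is to compare the 0 percent and 100 percent quantiles against range():

```r
v <- rnorm(50)
# the extreme quantiles coincide exactly with the minimum and maximum
all(quantile(v, probs = c(0, 1)) == range(v))
## [1] TRUE
```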

In fact, summary() is a generic function that works for many types of objects and has different behaviors. For example, summary() works with data frames too:

df <- data.frame(score = round(rnorm(100, 80, 10)),
                 grade = sample(letters[1:3], 100, replace = TRUE))
summary(df)
##      score        grade 
##  Min.   : 60.00   a:34  
##  1st Qu.: 73.00   b:38  
##  Median : 79.00   c:28  
##  Mean   : 79.65         
##  3rd Qu.: 86.00         
##  Max.   :107.00 

It can be seen that for a numeric column, summary() shows the summary statistics, while for columns of other types, it simply shows a table of value occurrences.

Computing covariance and correlation matrix

The preceding examples introduced the most commonly used summary statistics for one vector. For two or more vectors, we can compute the covariance matrix and the correlation matrix.

The following code generates another vector that is correlated with x:

y <- 2 * x + 0.5 * rnorm(length(x)) 

We can compute the covariance between x and y:

cov(x, y)
## [1] 1.419859 

We can also compute the correlation coefficient:

cor(x, y)
## [1] 0.9625964 

These two functions also work with more than two vectors. To compute the covariance or correlation matrix of more than two vectors, we supply a matrix or a data frame. In the following example, we generate another random vector z of the same length as x. This time, z follows a uniform distribution and does not depend on either x or y. We use cbind() to create a three-column matrix and compute its covariance matrix:

z <- runif(length(x))
m1 <- cbind(x, y, z)
cov(m1)
##           x          y          z
## x 0.7187228 1.41985899 0.04229950
## y 1.4198590 3.02719645 0.07299981
## z 0.0422995 0.07299981 0.08005535 

Similarly, we can call cor() directly with the matrix to compute the correlation matrix.

cor(m1)
##           x         y         z
## x 1.0000000 0.9625964 0.1763434
## y 0.9625964 1.0000000 0.1482881
## z 0.1763434 0.1482881 1.0000000 
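The correlation matrix is just the covariance matrix rescaled by the standard deviations on its diagonal. Base R exposes this normalization as cov2cor(); the following sketch (regenerating the data so it is self-contained) checks that cov2cor(cov(m1)) agrees with cor(m1):

```r
x <- rnorm(50)
y <- 2 * x + 0.5 * rnorm(length(x))
z <- runif(length(x))
m1 <- cbind(x, y, z)
# cov2cor() divides each covariance by the product of the two
# corresponding standard deviations, yielding the correlation matrix
all.equal(cov2cor(cov(m1)), cor(m1))
## [1] TRUE
```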

Since y is generated by a linear relationship with x, plus some noise, we should expect that x and y are highly correlated, but the same thing should not happen with z. The correlation matrix looks consistent with our expectation. To draw such a conclusion in a statistical sense, we need to perform rigorous statistical tests, which is beyond the scope of this book.
