Creating summary statistics

We will now cover some basic measures of central tendency and dispersion, along with simple plots. The first question that we will address is: how does R handle missing values in calculations? To see what happens, create a vector with a missing value (NA in the R language), then sum the values of the vector with sum():

    > a <- c(1, 2, 3, NA)
    
    > sum(a)
    [1] NA

Unlike SAS, which would sum the non-missing values, R simply returns NA, indicating that at least one value is missing. We could create a new vector with the missing value deleted, but you can also tell sum() to exclude any missing values with na.rm = TRUE:

    > sum(a, na.rm = TRUE)
    [1] 6
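The other option mentioned above, building a new vector without the missing value, can be sketched with is.na(). This is a minimal illustration; the names is_missing and a_complete are hypothetical and do not appear in the original example.

```r
a <- c(1, 2, 3, NA)

# is.na() returns TRUE for each missing element
is_missing <- is.na(a)

# Keep only the non-missing elements (a_complete is an illustrative name)
a_complete <- a[!is_missing]

# Same total as sum(a, na.rm = TRUE)
sum(a_complete)

# na.rm is accepted by other summary functions as well, such as mean()
mean(a, na.rm = TRUE)
```

Passing na.rm = TRUE is usually the more convenient choice because it leaves the original vector untouched.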

Functions exist to compute measures of central tendency and dispersion for a vector:

    > data <- c(4, 3, 2, 5.5, 7.8, 9, 14, 20)
    
    > mean(data)
    [1] 8.1625
    
    > median(data)
    [1] 6.65
    
    > sd(data)
    [1] 6.142112
    
    > max(data)
    [1] 20
    
    > min(data)
    [1] 2
    
    > range(data)
    [1]  2 20
    
    > quantile(data)
       0%   25%   50%   75%  100% 
     2.00  3.75  6.65 10.25 20.00
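As a rough check on what these dispersion functions compute, the sketch below rebuilds the standard deviation by hand and shows two related base R functions, var() and IQR(), plus the probs argument of quantile(). The names n, manual_var, and manual_sd are illustrative only.

```r
data <- c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

# sd() is the sample standard deviation, so the denominator is n - 1
n <- length(data)
manual_var <- sum((data - mean(data))^2) / (n - 1)
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(data))   # TRUE
all.equal(manual_sd, sd(data))     # TRUE

# Interquartile range: the 75% quantile minus the 25% quantile
IQR(data)

# quantile() accepts arbitrary probabilities through probs
quantile(data, probs = c(0.1, 0.9))
```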

The summary() function reports the minimum, maximum, mean, median, and first and third quartiles in a single call:

    > summary(data)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      2.000   3.750   6.650   8.162  10.250  20.000
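The result of summary() on a numeric vector is a named object, so individual statistics can be pulled out by name. The sketch below also shows that summary() counts missing values rather than returning NA; the variable names s and s_na are illustrative.

```r
data <- c(4, 3, 2, 5.5, 7.8, 9, 14, 20)

s <- summary(data)

# The labels shown in the printed output are the element names
names(s)

# Extract single statistics by name
s[["Median"]]
s[["3rd Qu."]] - s[["1st Qu."]]   # interquartile range

# With a missing value, summary() adds an NA's count
a <- c(1, 2, 3, NA)
s_na <- summary(a)
s_na
```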

We can use plots to visualize the data. The base plot here will be barplot(), and we will use abline() to add horizontal lines at the mean and median. As the default line is solid, we will draw the median as a dashed line with lty = 2 to distinguish it from the mean:

    > barplot(data)
    
    > abline(h = mean(data))
    
    > abline(h = median(data), lty = 2)

The output of the preceding commands is as follows:

A number of functions are available to generate data from different distributions. Here, we use rnorm() to draw 100 data points from a normal distribution with a mean of 0 and a standard deviation of 1. We will then plot the values and also plot a histogram. To reproduce the results shown, set the same random seed with set.seed() first:

    > set.seed(1)
    
    > norm <- rnorm(100)
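To see why the seed matters, the following sketch draws the sample twice with the same seed and once more without resetting it. The names first_draw, second_draw, and third_draw are illustrative; the mean and sd arguments are written out only to make the defaults of rnorm() explicit.

```r
set.seed(1)
first_draw <- rnorm(100, mean = 0, sd = 1)

# Resetting the seed reproduces exactly the same values
set.seed(1)
second_draw <- rnorm(100, mean = 0, sd = 1)
identical(first_draw, second_draw)   # TRUE

# Without resetting the seed, the generator continues and the values differ
third_draw <- rnorm(100, mean = 0, sd = 1)
identical(first_draw, third_draw)    # FALSE

# The sample statistics are close to, but not exactly, 0 and 1
mean(first_draw)
sd(first_draw)
```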

This is the plot of the 100 data points:

    > plot(norm)

The output of the preceding command is as follows:

Finally, produce a histogram with hist(norm):

    > hist(norm)

The following is the output of the preceding command:
