Time for action - deriving summary statistics

A sound way to initiate a deep data analysis is by deriving summary, or descriptive, statistics. These include simple, although highly informative, calculations such as means, standard deviations, and ranges, amongst others. Summary statistics are excellent for revealing overarching trends and patterns in a dataset. They provide us with a global understanding of our data.

For all calculations, we will store our summary statistics in new variables. For the time being, we will continue to focus on our head to head combat data.

  1. Calculate the means, as shown in the following example:
    > #use mean(data) to calculate the mean of a given dataset
    > #what was the mean number of Shu soldiers engaged in past head to head conflicts?
    > meanShuSoldiersHeadToHead <- mean(subsetHeadToHead$ShuSoldiersEngaged)
    > #what was the mean number of Wei soldiers engaged in past head to head conflicts?
    > meanWeiSoldiersHeadToHead <- mean(subsetHeadToHead$WeiSoldiersEngaged)
    > #what was the mean duration (in days) of past head to head conflicts?
    > meanDurationHeadToHead <- mean(subsetHeadToHead$DurationInDays)
    
  2. Display each of your mean variables in the R console:
    > #display the calculated means
    > meanShuSoldiersHeadToHead
    [1] 31341.67
    > meanWeiSoldiersHeadToHead
    [1] 33833.33
    > meanDurationHeadToHead
    [1] 77.93333
    
  3. Calculate the standard deviations, and consider the following:
    > #use sd(data) to calculate the standard deviation of a given dataset
    > #what was the standard deviation of Shu soldiers engaged in past head to head conflicts?
    > sdShuSoldiersHeadToHead <- sd(subsetHeadToHead$ShuSoldiersEngaged)
    > #what was the standard deviation of Wei soldiers engaged in past head to head conflicts?
    > sdWeiSoldiersHeadToHead <- sd(subsetHeadToHead$WeiSoldiersEngaged)
    > #what was the standard deviation of duration (in days) in past head to head conflicts?
    > sdDurationHeadToHead <- sd(subsetHeadToHead$DurationInDays)
    
  4. Display each of your standard deviation variables in the R console:
    > #display the calculated standard deviations
    > sdShuSoldiersHeadToHead
    [1] 31320.13
    > sdWeiSoldiersHeadToHead
    [1] 41192.22
    > #displays the standard deviation of DurationInDays
    > sdDurationHeadToHead
    
  5. Calculate the ranges, as shown in the following:
    > #use range(data, ...) to calculate the range of a given dataset
    > #what was the range of Shu soldiers engaged in past head to head conflicts?
    > rangeShuSoldiersHeadToHead <- range(subsetHeadToHead$ShuSoldiersEngaged)
    > #what was the range of Wei soldiers engaged in past head to head conflicts?
    > rangeWeiSoldiersHeadToHead <- range(subsetHeadToHead$WeiSoldiersEngaged)
    > #what was the range of duration (in days) of past head to head conflicts?
    > rangeDurationHeadToHead <- range(subsetHeadToHead$DurationInDays)
    
  6. Display each of your range variables in the R console:
    > #display the calculated ranges
    > rangeShuSoldiersHeadToHead
    [1] 250 100000
    > rangeWeiSoldiersHeadToHead
    [1] 500 200000
    > rangeDurationHeadToHead
    [1] 30 120
    
  7. Display a general summary of the data:
    > #use the summary(object) function to generate a summary of a given object
    > #general summary of our head to head combat data
    > summaryHeadToHead <- summary(subsetHeadToHead)
    
  8. Display your summary variable in the R console. Your values should match the ones pictured in the following screenshot:
    > #display the head to head subset summary
    > summaryHeadToHead
    
    [Screenshot: output of summaryHeadToHead in the R console]
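
The steps above repeat the same three calculations for each column. As a minimal sketch of the same idea, the following block applies mean(), sd(), and range() to every column of a small data frame in one call each, using sapply(). The data frame exampleBattles and all of its values are invented purely for illustration and are not the book's battle data; only the column names are borrowed from the text.

```r
# Hypothetical stand-in for subsetHeadToHead; all values are made up for illustration
exampleBattles <- data.frame(
  ShuSoldiersEngaged = c(1000, 5000, 20000, 40000),
  WeiSoldiersEngaged = c(2000, 8000, 15000, 60000),
  DurationInDays     = c(30, 60, 90, 120)
)

# Apply each summary function to every column of the data frame
exampleMeans  <- sapply(exampleBattles, mean)
exampleSds    <- sapply(exampleBattles, sd)
exampleRanges <- sapply(exampleBattles, range)  # 2-row matrix: minimum, then maximum

# Display the results
exampleMeans
exampleSds
exampleRanges
```

Storing each statistic in its own named variable, as we did in the steps, is more verbose but keeps every value easy to refer to later.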

What just happened?

Through summary statistics, we have gained insights on the overall patterns in our data. Let us take a moment to discuss each one individually.

Means

You are already familiar with calculating means from our previous chapter. Here, we looked specifically at the mean soldier engagement and battle durations for past head to head conflicts. Again, we see that the Wei forces tend to outnumber the Shu in battle. The average head to head battle lasted about 78 days.

Standard deviations

A standard deviation describes the amount of variability present in a collection of data. The sd(data) function can be used to calculate the standard deviation of a given dataset. In our soldier engagement data, the Wei army had a higher standard deviation than the Shu army. This indicates that the Wei forces tended to enter battle with a more variable number of soldiers than the Shu forces. Because the standard deviation is measured in the same units as the data, and the Wei army usually outnumbered the Shu in past battles, it is not surprising that its standard deviation is larger.
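
To make the calculation behind sd(data) concrete, here is a minimal sketch that computes a sample standard deviation by hand and compares it with R's result. It relies on the fact that sd() divides by n - 1 rather than n. The soldiers vector is hypothetical and its values are invented for illustration only.

```r
# Hypothetical soldier counts, invented for illustration only
soldiers <- c(500, 2000, 10000, 45000)

# Sample standard deviation computed by hand:
# the square root of the sum of squared deviations divided by (n - 1)
deviations <- soldiers - mean(soldiers)
manualSd <- sqrt(sum(deviations^2) / (length(soldiers) - 1))

# Display both values; they should agree
manualSd
sd(soldiers)
```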

Ranges

The range of a dataset is composed of its minimum and maximum values. By using the range(data) function in R, we can list the minimum and maximum values of our data in a single command. As with the standard deviations, the Wei have a wider range of soldiers engaged than the Shu. This is a predictable outcome considering the Wei forces' larger numbers. The duration of past head to head conflicts ranged from 30 to 120 days.

Note

Note that individual minimums and maximums can also be calculated using the min(data) and max(data) functions.
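
As a brief sketch of how these functions relate, the following block shows that range() returns the same two values as min() and max(), and that diff() on the result gives the distance between them. The durations vector is hypothetical and exists only for this illustration.

```r
# Hypothetical battle durations in days, invented for illustration only
durations <- c(45, 30, 120, 90)

durationRange  <- range(durations)      # minimum and maximum together
durationMin    <- min(durations)        # minimum alone
durationMax    <- max(durations)        # maximum alone
durationSpread <- diff(durationRange)   # maximum minus minimum

# Display the range and the spread between its two values
durationRange
durationSpread
```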

summary(object)

You also employed one of the most useful and versatile functions available in the R language. The summary(object) function generates descriptive statistics and other relevant calculations for an object automatically. In our case, the object was a dataset and our descriptive statistics included means, medians, quartiles, minimums, and maximums. The wonderful thing about R's summary function is that it can be used on nearly any object. Depending on the type of object, the summary function will yield output that is relevant to that object. Therefore, it is not only a fast way to get an overall picture of your data, but it can be used in numerous situations. You should use summary(object) often, especially when you are beginning to analyze a dataset or want to inspect a newly created object.
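
To illustrate how the output of summary(object) adapts to the object it receives, the sketch below calls it on a numeric vector, a factor, and a data frame. All three objects and their values are hypothetical, created only for this example.

```r
# Hypothetical data, invented for illustration only
exampleDurations <- c(30, 45, 90, 120)
exampleMethods   <- factor(c("ambush", "fire", "fire", "surround"))
exampleFrame     <- data.frame(Method = exampleMethods,
                               DurationInDays = exampleDurations)

# Numeric vector: minimum, quartiles, median, mean, and maximum
summary(exampleDurations)

# Factor: a count for each level
summary(exampleMethods)

# Data frame: a summary of each column, suited to that column's type
summary(exampleFrame)
```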

Why use summary statistics?

You probably noticed that some of our summary statistic calculations yielded unsurprising and predictable results. This is not, however, reason to discount their value or an argument for abandoning them. In fact, using summary statistics to confirm that our data look as expected is an essential early step in the data analysis process. In contrast, any value that stands out as peculiar in our summary statistics warrants further inspection. When this occurs, we may have discovered erroneous or outlying data points, or possibly counterintuitive or unforeseen trends and patterns.

For instance, the median duration of head to head battles (91 days) is noticeably higher than the mean duration (78 days). This may indicate that most battles tend to last on the longer side of our 30 to 120 day duration range and that our mean is being skewed downward by a small number of brief battles. By looking back at our head to head subset, we can confirm or deny this observation.
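
The effect described here can be reproduced on a small scale. In the following sketch, a few brief battles pull the mean below the median. The exampleDurations vector is hypothetical; its values were chosen for illustration and are not taken from our head to head subset.

```r
# Hypothetical durations in days: mostly long battles plus two brief ones
exampleDurations <- c(30, 35, 90, 95, 100, 110, 120)

exampleMean   <- mean(exampleDurations)
exampleMedian <- median(exampleDurations)

# The median sits above the mean because the brief battles drag the mean down
exampleMean
exampleMedian

# Sorting the values makes the brief battles at the low end easy to spot
sort(exampleDurations)
```

Sorting or otherwise inspecting the underlying values in the same way is one approach to checking whether a gap between the mean and median is caused by a handful of unusual observations.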

Pop quiz

  1. What is the major purpose of the summary(object) function in R?

    a. To provide summary statistics relevant to a given variable.

    b. To provide summary statistics relevant to a given dataset.

    c. To provide summary statistics relevant to a given object.

    d. To provide summary statistics relevant to a given subset.

  2. Which of the following is not a benefit of summary statistics?

    a. Summary statistics help provide overview information on a dataset.

    b. Summary statistics help answer very detailed questions about a dataset.

    c. Summary statistics help to validate a dataset.

    d. Summary statistics help to expose potential areas of concern and interest within a dataset.

Have a go hero

Now that you are familiar with deriving summary statistics, calculate the means, standard deviations, and ranges for each of the remaining battle methods: surround, ambush, and fire. Also generate a summary of each subset. Follow the same console structure and naming convention that we used with our head to head combat data. For example, you should create the following variables using your ambush data:

  • meanShuSoldiersAmbush, meanWeiSoldiersAmbush, meanDurationAmbush
  • sdShuSoldiersAmbush, sdWeiSoldiersAmbush, sdDurationAmbush
  • rangeShuSoldiersAmbush, rangeWeiSoldiersAmbush, rangeDurationAmbush
  • summaryAmbush