Data understanding and preparation

To begin, we will load the packages needed to plot the data and conduct the analysis. Please ensure that you have these packages installed before loading them:

    > library(ggplot2) # support scatterplot
    > library(psych) # PCA package
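If you are not sure whether both packages are present, a quick check before loading can save a confusing error. This is a minimal sketch in base R; the names required.pkgs and not.installed are my own and not part of either package:

```r
# Packages this section relies on
required.pkgs <- c("ggplot2", "psych")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
is.available <- vapply(required.pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need to be installed
not.installed <- required.pkgs[!is.available]

if (length(not.installed) > 0) {
  message("Please install: ", paste(not.installed, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

Anything reported as missing can be installed with install.packages() before you continue.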

Let's also assume you've put the two .csv files into your working directory, so read the training data using the read.csv() function:

    > train <- read.csv("NHLtrain.csv")

Examine the data using the structure function, str(). For brevity, I've included only the first few lines of the output of the command:

    > str(train)
    'data.frame': 30 obs. of 15 variables:
     $ Team : Factor w/ 30 levels "Anaheim","Arizona",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ ppg : num 1.26 0.95 1.13 0.99 0.94 1.05 1.26 1 0.93 1.33 ...
     $ Goals_For : num 2.62 2.54 2.88 2.43 2.79 2.39 2.85 2.59 2.6 3.23 ...
     $ Goals_Against: num 2.29 2.98 2.78 2.62 3.13 2.7 2.52 2.93 3.02 2.78 ...
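Note that whether Team shows up as a Factor depends on how read.csv() treats strings. Since R 4.0.0 the default is stringsAsFactors = FALSE, so on a recent installation you would likely see chr instead unless you ask for factors. The following sketch illustrates the difference with a tiny made-up file (the two rows and the temporary path are for illustration only and are not the NHL training data):

```r
# Write a small illustrative CSV to a temporary file
csv.path <- tempfile(fileext = ".csv")
writeLines(c("Team,ppg", "Anaheim,1.26", "Arizona,0.95"), csv.path)

# Strings kept as character vectors (the default from R 4.0.0 onward)
toy.chr <- read.csv(csv.path, stringsAsFactors = FALSE)

# Strings converted to factors (the default in older versions of R)
toy.fct <- read.csv(csv.path, stringsAsFactors = TRUE)

class(toy.chr$Team)
class(toy.fct$Team)
```

Either form works for what follows, since the Team column is dropped before scaling.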

Next, let's look at the variable names:

    > names(train)
     [1] "Team" "ppg" "Goals_For" "Goals_Against" "Shots_For"
     [6] "Shots_Against" "PP_perc" "PK_perc" "CF60_pp" "CA60_sh"
    [11] "OZFOperc_pp" "Give" "Take" "hits" "blks"

Let's go over what they mean:

  • Team: This is the team's city
  • ppg: The average points per game per the point calculation discussed earlier
  • Goals_For: The average goals the team scores per game
  • Goals_Against: The average goals the team allows per game
  • Shots_For: The team's average shots on goal per game
  • Shots_Against: The opponent's average shots on goal per game
  • PP_perc: The percentage of power play opportunities on which the team scores a goal
  • PK_perc: The percentage of opponent power plays on which the team does not allow a goal
  • CF60_pp: The team's Corsi Score per 60 minutes of power play time; Corsi Score is the sum of shots on goal (Shots_For), shot attempts that miss the net, and shots blocked by the opponent
  • CA60_sh: The opponent's Corsi Score per 60 minutes of opponent power play time, that is, while the team is shorthanded
  • OZFOperc_pp: The percentage of face-offs that took place in the offensive zone while the team was on the power play
  • Give: The average number of times per game that the team gives away the puck
  • Take: The average number of times per game that the team gains control of the puck
  • hits: The average number of the team's bodychecks per game
  • blks: The average number of opponent shots on goal that the team blocks per game
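To make the Corsi definition concrete, here is a small worked example of a per-60-minute rate. The function name corsi.per.60 and all of the counts are hypothetical; they only illustrate the arithmetic and do not come from the dataset:

```r
# Corsi per 60 minutes: all shot attempts (on goal, missed, blocked),
# divided by the minutes played and rescaled to a 60-minute rate
corsi.per.60 <- function(shots.on.goal, missed, blocked, minutes) {
  (shots.on.goal + missed + blocked) / minutes * 60
}

# Hypothetical power play totals: 75 shot attempts in 50 minutes
cf60 <- corsi.per.60(shots.on.goal = 40, missed = 15, blocked = 20, minutes = 50)
cf60  # 90
```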

We'll need to standardize the data so that each feature has a mean of 0 and a standard deviation of 1. Once we do that, we can compute the correlations of the input features and plot them using the cor.plot() function available in the psych package:

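    > train.scale <- scale(train[, -1:-2]) # drop Team and ppg, then standardize
    > nhl.cor <- cor(train.scale) # correlation matrix of the input features
    > cor.plot(nhl.cor) # plot the correlation matrix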

The following is the output of the preceding command:

A couple of things are of interest. Notice that Shots_For is correlated with Goals_For and, likewise, Shots_Against with Goals_Against. PP_perc and PK_perc also show some negative correlation with Goals_Against.
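If you prefer to read relationships like these as numbers rather than from a plot, you can list the feature pairs ordered by the strength of their correlation. The following is only a sketch on simulated data; the toy data frame, its column names, and cor.pairs are all made up for illustration, and you could apply the same steps to nhl.cor:

```r
# Simulated data: Goals is built to depend on Shots, Hits is independent
set.seed(42)
shots <- rnorm(30, mean = 30, sd = 2)
toy <- data.frame(Shots = shots,
                  Goals = 0.08 * shots + rnorm(30, mean = 0, sd = 0.1),
                  Hits = rnorm(30, mean = 25, sd = 4))
toy.cor <- cor(toy)

# Row and column positions of the upper triangle, so each pair appears once
upper <- which(upper.tri(toy.cor), arr.ind = TRUE)

# One row per pair of features, with its correlation
cor.pairs <- data.frame(var1 = rownames(toy.cor)[upper[, "row"]],
                        var2 = colnames(toy.cor)[upper[, "col"]],
                        r = toy.cor[upper])

# Strongest relationships first, regardless of sign
cor.pairs <- cor.pairs[order(-abs(cor.pairs$r)), ]
cor.pairs
```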

As such, this should be an adequate dataset to extract several principal components. 
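One point worth keeping in mind about the standardization step: correlations do not change when the features are centered and scaled, so scaling does not alter the correlation matrix. Its value is in putting the features on a common footing for the component extraction. A quick sanity check on simulated data (the toy columns below are invented for illustration) might look like this:

```r
# Simulated features on very different scales
set.seed(123)
toy <- data.frame(x = rnorm(30, mean = 3, sd = 0.3),
                  y = rnorm(30, mean = 30, sd = 2),
                  z = rnorm(30, mean = 80, sd = 4))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy.scale <- scale(toy)

col.means <- colMeans(toy.scale)    # should be 0 up to rounding error
col.sds <- apply(toy.scale, 2, sd)  # should be 1

# The correlation matrix is the same before and after scaling
same.cor <- all.equal(cor(toy), cor(toy.scale))
same.cor
```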

Please note that I selected these features/variables based on my own interest. There are many other statistics that you can gather on your own to see whether they improve the predictive power.
