To begin with, we will load the packages needed for the analysis. Ensure that these packages are installed before loading them:
> library(ggplot2) #support scatterplot
> library(psych) #PCA package
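If you are not sure whether both packages are present, a quick check along the following lines can help. This is only a sketch: the names pkgs, installed and missing_pkgs are my own, and the install.packages() call is left commented out so that nothing is installed without your say-so.

```r
# Packages used in this section
pkgs <- c("ggplot2", "psych")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All packages are available.")
}
```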
We'll assume you've put the two .csv files into your working directory. Read the training data using the read.csv() function:
> train <- read.csv("NHLtrain.csv")
Examine the data using the structure function, str(). For brevity, I've included only the first few lines of the output of the command:
> str(train)
'data.frame': 30 obs. of 15 variables:
$ Team : Factor w/ 30 levels "Anaheim","Arizona",..: 1 2 3 4 5 6 7 8 9 10 ...
$ ppg : num 1.26 0.95 1.13 0.99 0.94 1.05 1.26 1 0.93 1.33 ...
$ Goals_For : num 2.62 2.54 2.88 2.43 2.79 2.39 2.85 2.59 2.6 3.23 ...
$ Goals_Against: num 2.29 2.98 2.78 2.62 3.13 2.7 2.52 2.93 3.02 2.78 ...
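Note that Team appears as a Factor in the output above. From R 4.0.0 onward, read.csv() leaves text columns as character vectors unless told otherwise, so your own str() output may show chr instead. The sketch below illustrates the difference using a small temporary file; the file and its two rows are made up purely for illustration and are not part of the NHL data.

```r
# Write a tiny, made-up CSV to a temporary file (illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Team,ppg", "TeamA,1.00", "TeamB,1.10"), tmp)

# Request factors explicitly, matching the str() output shown in the text
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
# Request character columns explicitly (the default from R 4.0.0)
as_char <- read.csv(tmp, stringsAsFactors = FALSE)

class(as_factor$Team)  # "factor"
class(as_char$Team)    # "character"

unlink(tmp)  # remove the temporary file
```

Either form works for the steps that follow, since the Team column is dropped before the correlations are computed.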
Next, we need to look at the variable names:
> names(train)
[1] "Team" "ppg" "Goals_For" "Goals_Against" "Shots_For"
[6] "Shots_Against" "PP_perc" "PK_perc" "CF60_pp" "CA60_sh"
[11] "OZFOperc_pp" "Give" "Take" "hits" "blks"
Let's go over what they mean:
- Team: This is the team's city
- ppg: The average points per game per the point calculation discussed earlier
- Goals_For: The average goals the team scores per game
- Goals_Against: The goals allowed per game
- Shots_For: Shots on goal per game
- Shots_Against: Opponent shots on goal per game
- PP_perc: The percentage of power play opportunities on which the team scores a goal
- PK_perc: The percentage of opponent power plays on which the team does not allow a goal
- CF60_pp: The team's Corsi Score per 60 minutes of power play time; the Corsi Score is the sum of shots for (Shots_For), shot attempts that miss the net, and shots blocked by the opponent
- CA60_sh: The opponent's Corsi Score per 60 minutes of opponent power play time, that is, while the team is shorthanded
- OZFOperc_pp: The percentage of face offs that took place in the offensive zone while the team was on the power play
- Give: The average number of times per game that the team gives away the puck
- Take: The average number of times per game that the team gains control of the puck
- hits: The average number of bodychecks by the team per game
- blks: The average number of opponent shots on goal that the team blocks per game
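To make the Corsi definition concrete, here is a small worked example. The function name corsi_per60 and the numbers are hypothetical, chosen only to show the arithmetic. I'm also assuming that the per-60 rate is a simple rescaling of the raw count by minutes played, which may differ from how the data provider computes it.

```r
# Corsi = shots on goal + attempts that miss the net + attempts blocked,
# rescaled to a rate per 60 minutes of the relevant ice time
corsi_per60 <- function(shots_on_goal, missed, blocked, minutes) {
  (shots_on_goal + missed + blocked) / minutes * 60
}

# Hypothetical: 40 shots on goal, 15 misses, 20 blocked in 50 power play minutes
corsi_per60(shots_on_goal = 40, missed = 15, blocked = 20, minutes = 50)
# (40 + 15 + 20) / 50 * 60 = 90
```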
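We'll need to standardize the data so that each feature has a mean of 0 and a standard deviation of 1. The scale() call that follows does this after dropping the first two columns (Team and ppg), which leaves only the input features. Once we do that, we can create and plot the correlations of the input features using the cor.plot() function available in the psych package: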
> train.scale <- scale(train[, -1:-2])
> nhl.cor <- cor(train.scale)
> cor.plot(nhl.cor)
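As a sanity check on what scale() does, the sketch below uses R's built-in mtcars data as a stand-in (an assumption made so that the example runs without NHLtrain.csv). It also shows that Pearson correlations are unaffected by standardization, so cor() returns the same matrix for the raw and the scaled features.

```r
# Stand-in data: three numeric columns from the built-in mtcars data set
x <- mtcars[, c("mpg", "hp", "wt")]
x.scale <- scale(x)  # center each column to mean 0, rescale to sd 1

round(colMeans(x.scale), 10)  # all 0 (up to floating-point error)
apply(x.scale, 2, sd)         # all 1

# Pearson correlation is invariant to centering and scaling
all.equal(cor(x), cor(x.scale))  # TRUE
```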
The following is the output of the preceding command:
A couple of things are of interest. Notice that Shots_For is correlated with Goals_For and, likewise, Shots_Against with Goals_Against. PP_perc and PK_perc also show some negative correlation with Goals_Against.
As such, this should be an adequate dataset to extract several principal components.
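Reading correlated pairs off a plot gets harder as the number of features grows. One option is to list the pairs sorted by absolute correlation. The sketch below again uses mtcars as a stand-in, and cor.mat, idx and cor.pairs are names I've made up; with the NHL data you would start from nhl.cor instead.

```r
# Stand-in correlation matrix from four columns of the built-in mtcars data
cor.mat <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])

# Row/column positions of the upper triangle, so each pair appears once
idx <- which(upper.tri(cor.mat), arr.ind = TRUE)

cor.pairs <- data.frame(
  var1 = rownames(cor.mat)[idx[, "row"]],
  var2 = colnames(cor.mat)[idx[, "col"]],
  r    = cor.mat[idx]  # a two-column index matrix picks out single cells
)

# Strongest relationships first, regardless of sign
cor.pairs <- cor.pairs[order(-abs(cor.pairs$r)), ]
cor.pairs
```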
Please note that I selected these features/variables based on my own interest. There are many other statistics that you could gather on your own to see whether they improve the predictive power.