Data understanding and preparation

To begin, we will load the dataset named water and define the structure of the str() function as follows:

    > data(water)
    
    > str(water)
    'data.frame':   43 obs. of  8 variables:
    $ Year   : int  1948 1949 1950 1951 1952 1953 1954 
      1955 1956 1957 ...
    $ APMAM  : num  9.13 5.28 4.2 4.6 7.15 9.7 5.02 6.7 
       10.5 9.1 ...
    $ APSAB  : num  3.58 4.82 3.77 4.46 4.99 5.65 1.45 
       7.44 5.85 6.13 ...
    $ APSLAKE: num  3.91 5.2 3.67 3.93 4.88 4.91 1.77 
       6.51 3.38 4.08 ...
    $ OPBPC  : num  4.1 7.55 9.52 11.14 16.34 ...
    $ OPRC   : num  7.43 11.11 12.2 15.15 20.05 ...
    $ OPSLAKE: num  6.47 10.26 11.35 11.13 22.81 ...
    $ BSAAM  : int  54235 67567 66161 68094 107080 
       67594 65356 67909 92715 70024 ...

Here we have eight features and one response variable, BSAAM. The observations start in 1943 and run for 43 consecutive years. Since for this exercise we are not concerned with what year the observations occurred in, it makes sense to create a new data frame excluding the year vector. This is quite easy to do. With one line of code, we can create the new data frame, and then verify that it works with the head() function:

    > socal.water <- water[ ,-1] #new dataframe with 
      the deletion of 
      column 1
    
    > head(socal.water)
      APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
    1  9.13  3.58    3.91  4.10  7.43    6.47  54235
    2  5.28  4.82    5.20  7.55 11.11   10.26  67567
    3  4.20  3.77    3.67  9.52 12.20   11.35  66161
    4  4.60  4.46    3.93 11.14 15.15   11.13  68094
    5  7.15  4.99    4.88 16.34 20.05   22.81 107080
    6  9.70  5.65    4.91  8.88  8.15    7.41  67594

With all the features being quantitative, it makes sense to look at the correlation statistics and then produce a matrix of scatterplots. The correlation coefficient or Pearson's r, is a measure of both the strength and direction of the linear relationship between two variables. The statistic will be a number between -1 and 1, where -1 is the total negative correlation and +1 is the total positive correlation. The calculation of the coefficient is the covariance of the two variables divided by the product of their standard deviations. As previously discussed, if you square the correlation coefficient, you will end up with R-squared.

There are a number of ways to produce a matrix of correlation plots. Some prefer to produce heatmaps, but I am a big fan of what is produced with the corrplot package. It can produce a number of different variations including ellipse, circle, square, number, shade, color, and pie. I like the ellipse method, but feel free to experiment with the others. Let's load the corrplot package, create a correlation object using the base cor() function, and examine the following results:

    > library(corrplot)
    
    > water.cor <- cor(socal.water)
    
    > water.cor
    APMAM      APSAB    APSLAKE      OPBPC      
    APMAM   1.0000000 0.82768637 0.81607595 0.12238567 
    APSAB   0.8276864 1.00000000 0.90030474 0.03954211 
    APSLAKE 0.8160760 0.90030474 1.00000000 0.09344773 
    OPBPC   0.1223857 0.03954211 0.09344773 1.00000000 
    OPRC    0.1544155 0.10563959 0.10638359 0.86470733 
    OPSLAKE 0.1075421 0.02961175 0.10058669 0.94334741 
    BSAAM   0.2385695 0.18329499 0.24934094 0.88574778 
             OPRC    OPSLAKE     BSAAM
    APMAM   0.1544155 0.10754212 0.2385695
    APSAB   0.1056396 0.02961175 0.1832950
    APSLAKE 0.1063836 0.10058669 0.2493409
    OPBPC   0.8647073 0.94334741 0.8857478
    OPRC    1.0000000 0.91914467 0.9196270
    OPSLAKE 0.9191447 1.00000000 0.9384360
    BSAAM   0.9196270 0.93843604 1.0000000

So, what does this tell us? First of all, the response variable is highly and positively correlated with the OP features with OPBPC as 0.8857, OPRC as 0.9196, and OPSLAKE as 0.9384. Also note that the AP features are highly correlated with each other and the OP features as well. The implication is that we may run into the issue of multi-collinearity. The correlation plot matrix provides a nice visual of the correlations as follows:

    > corrplot(water.cor, method = "ellipse")

The output of the preceding code snippet is as follows:

Another popular visual is a scatterplot matrix. This can be called with the pairs() function. It reinforces what we saw in the correlation plot in the previous output:

    > pairs(~ ., data = socal.water)

The output of the preceding code snippet is as follows:

Table of Contents for Data understanding and preparation

Create new playlist

Sign In

Sign Up

Table of Contents for
Data understanding and preparation