PCA using H2O

One of the greatest difficulties in multivariate statistical analysis is displaying a dataset with many variables. Fortunately, in such datasets some variables are often closely related to each other, because they effectively carry the same information: they measure the same underlying quantity that governs the behavior of the system. These are redundant variables that add nothing to the model we want to build, so we can simplify the problem by replacing a group of them with a single new variable that captures their shared information content.

PCA generates a new set of mutually uncorrelated variables, called principal components; each principal component is a linear combination of the original variables. All principal components are orthogonal to each other, so there is no redundant information, and together they constitute an orthogonal basis for the data space. The goal of PCA is to explain the maximum amount of variance with the fewest principal components. It is a form of multidimensional scaling: a linear transformation of the variables into a lower-dimensional space that retains the maximum amount of information about the variables. A principal component is therefore a combination of the original variables after this linear transformation.
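Before turning to H2O, the idea can be sketched with base R's prcomp() on a small built-in dataset (mtcars here is just an illustrative choice, not the chapter's data): the component scores are literally the standardized data multiplied by the rotation matrix, and the resulting columns are mutually uncorrelated.

```r
# A minimal base-R sketch of the idea (not H2O): prcomp() computes
# principal components as linear combinations of the input variables.
data <- scale(mtcars[, c("mpg", "disp", "hp", "wt")])  # standardize 4 variables
pca  <- prcomp(data, center = FALSE, scale. = FALSE)   # data already standardized

# Each score column is a linear combination: scores = data %*% rotation
scores <- data %*% pca$rotation

# The components are mutually uncorrelated (off-diagonal correlations ~ 0)
round(cor(scores), 10)
```

The rotation matrix holds the weights of each linear combination, so inspecting pca$rotation shows exactly how each component mixes the original variables.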

In the following example, we use the h2o package to perform PCA. The h2o.prcomp() function is used to find the principal components of a set of input features. This is unsupervised learning:

library(h2o)
h2o.init()

ausPath = system.file("extdata", "australia.csv", package="h2o")
australia.hex = h2o.uploadFile(path = ausPath)
summary(australia.hex)

pca_model = h2o.prcomp(training_frame = australia.hex,
                       k = 8,
                       transform = "STANDARDIZE")

summary(pca_model)
barplot(as.numeric(pca_model@model$importance[2,]),
        main = "PCA model",
        xlab = "Principal component",
        ylab = "Proportion of variance")

Now, let's go through the code to understand how to apply the h2o package to perform PCA.

We can proceed with loading the library:

library(h2o)

This command loads the library into the R environment. The following function initializes the h2o engine with its default settings:

h2o.init()

The following messages are returned:

> h2o.init()
Connection successful!

R is connected to the H2O cluster:
H2O cluster uptime: 5 hours 40 minutes
H2O cluster version: 3.10.5.3
H2O cluster version age: 2 months and 18 days
H2O cluster name: H2O_started_from_R_lavoro_huu267
H2O cluster total nodes: 1
H2O cluster total memory: 2.63 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.4.1 (2017-06-30)

We can also follow the directions printed at the R prompt and initialize the engine with explicit parameters:

c1 = h2o.init(max_mem_size = "2G",
              nthreads = 2,
              ip = "localhost",
              port = 54321)

This h2o.init() call initiates the h2o engine with a maximum memory size of 2 GB and two parallel cores. The following commands load the data into the R environment:

ausPath = system.file("extdata", "australia.csv", package="h2o")
australia.hex = h2o.uploadFile(path = ausPath)

The first instruction generates the path of the file to upload. The h2o.uploadFile() function uploads a file that is local to your R session into the h2o instance; in the parentheses, we specify the complete URL or normalized file path of the file. Let's now check what is inside:

summary(australia.hex)

This prints a brief summary of the dataset.

To perform PCA on the given dataset, we will use the h2o.prcomp() function:

pca_model = h2o.prcomp(training_frame = australia.hex,
                       k = 8,
                       transform = "STANDARDIZE")

Now let's print a brief summary of the model:

summary(pca_model)

In the following figure, we see a summary of the PCA model:

To better understand the results, we can make a scree plot of the percent variability explained by each principal component. The proportion of variance explained is stored in the importance table of the PCA model (pca_model@model$importance).
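The same quantity can be computed directly from the standard deviations of the components. A base-R sketch of the computation (using prcomp() on an illustrative built-in dataset rather than the H2O model, since the arithmetic is identical): the proportion of variance for each component is its variance divided by the total variance.

```r
# Base-R analogue of the importance table: proportion of variance
# explained by each component is sdev^2 / sum(sdev^2).
pca <- prcomp(scale(mtcars[, c("mpg", "disp", "hp", "wt")]))
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative proportion tells us how many components to keep
cum_var <- cumsum(prop_var)

# Scree plot of the proportions, as done for the H2O model above
barplot(prop_var,
        names.arg = paste0("PC", seq_along(prop_var)),
        main = "Scree plot",
        xlab = "Principal component",
        ylab = "Proportion of variance")
```

The proportions always sum to 1, and the cumulative curve is what we read to decide how many components retain enough of the variance.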

The following figure shows a scree plot of the percent variability explained by each principal component:

The bar plot shows the proportion of variance for each principal component; as you can see, the first two components account for about 70 percent of the variance.
