Principal Component Analysis

One of the greatest difficulties encountered in multivariate statistical analysis is displaying a dataset with many variables. Fortunately, in such datasets some variables are often closely related to each other, because they essentially carry the same information: they measure the same underlying quantity that governs the behavior of the system. These redundant variables add nothing to the model we want to build, so we can simplify the problem by replacing a group of variables with a single new variable that captures their information content. The following figure shows redundant data in a table:

Figure 8.7: Redundant data in a table

PCA generates a new set of mutually uncorrelated variables, called principal components; each principal component is a linear combination of the original variables. All principal components are orthogonal to each other, so there is no redundant information. Taken together, the principal components constitute an orthogonal basis for the data space. The goal of PCA is to explain the maximum amount of variance with the smallest number of principal components. PCA is a form of multidimensional scaling: a linear transformation of the variables into a lower-dimensional space that retains the maximum amount of information about the variables. A principal component is therefore a combination of the original variables after a linear transformation.

In MATLAB, PCA is performed using the pca() function; it returns the principal component coefficients, also known as loadings, for an n-by-p data matrix supplied by the user. Rows of this matrix correspond to observations and columns correspond to variables. The coefficient matrix coeff is p-by-p: each column contains the coefficients for one principal component, and the columns are in descending order of component variance. By default, pca() centers the data and uses the Singular Value Decomposition (SVD) algorithm.
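
To get a feel for the shape of these outputs before touching real data, we can run pca() on a small random matrix; this is just a sketch with a hypothetical matrix X, not part of the analysis that follows:

>> X = randn(100,5);     % hypothetical data: 100 observations of 5 variables
>> coeff = pca(X);       % 5-by-5 matrix of loadings, one column per principal component
>> size(coeff)           % returns 5 5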

As always, we begin by getting the data to be analyzed.

To get the data, we draw on the large collection of data available from the UCI Machine Learning Repository at the following link:
http://archive.ics.uci.edu/ml

In this study, we use a seeds dataset, which contains measurements of the geometrical properties of kernels belonging to three different varieties of wheat. As reported on the site, the examined group comprises kernels belonging to three different varieties of wheat (Kama, Rosa, and Canadian), with 70 elements each, randomly selected for the experiment. High-quality visualization of the internal kernel structure was detected using a soft X-ray technique. The images were recorded on 13 x 18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

The seeds dataset is multivariate, consisting of 210 instances. Seven geometric parameters of the wheat kernel are used as real-valued attributes organizing an instance. These seven attributes are:

  • Area A
  • Perimeter P
  • Compactness C = 4*pi*A/P^2
  • Length of kernel
  • Width of kernel
  • Asymmetry coefficient
  • Length of kernel groove

In summary, the seeds dataset contains 210 records describing the three kinds of wheat seed, with 70 observations of each kind.

To start, we download the data from the UCI Machine Learning Repository and save it in our current folder. To do this, we will use the websave() function; it saves content from the web service specified by a URL and writes it to a file. We store the URL of the specific dataset in a variable named url:

>> url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt';

Now, we save the content into a file named seeds_dataset.csv:

>> websave('seeds_dataset.csv',url);

We set the names of the variables in accordance with the previous example:

>> varnames = {'Area'; 'Perimeter'; 'Compactness'; 'LengthK'; 'WidthK';'AsymCoef';'LengthKG';'Seeds'};

Read the data into a table and specify the variable names:

>> Seeds_dataset = readtable('seeds_dataset.csv');
>> Seeds_dataset.Properties.VariableNames = varnames;

So, the data is now available in the MATLAB workspace in table form. Now we can perform a PCA. Before we start, just a moment! Do you remember the missing values in the dataset used in stepwise regression? Let's see if they are also in this dataset:

>> MissingValue = ismissing(Seeds_dataset);

This command returns a logical array that indicates which elements of an array or table contain missing values. To extract only the rows that contain missing values, simply type:

>> RowsMissValue = find(any(MissingValue==1,2));

This command returns a 22-by-1 column vector containing the indices of the rows that have missing values. To avoid problems in subsequent calculations, we should remove these 22 rows from Seeds_dataset. To this end, we will use the rmmissing() function, which removes missing entries from an array or table:

>> Seeds_dataset =  rmmissing(Seeds_dataset);

Seeds_dataset has thus been reduced from a 221-by-8 table to a 199-by-8 table. The first 15 rows of the cleaned Seeds_dataset table are shown here:

Figure 8.8: First 15 rows of the Seeds_dataset table cleaned (no NaN values are included)
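
A view like the one in Figure 8.8 can be reproduced at the command line; a minimal sketch (head() works on tables in recent MATLAB releases, and plain table indexing is an equivalent alternative):

>> head(Seeds_dataset,15)      % first 15 rows of the cleaned table
>> Seeds_dataset(1:15,:)       % equivalent view using table indexing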

Before passing our data to the pca() function, we need to take a first look at what we've got in the Seeds_dataset table. The first seven columns contain the measured variables, while the eighth column is the type of seed. Let's first try to find out whether the variables are related to each other. We can do this using the plotmatrix() function, which creates a matrix of sub-axes containing scatter plots of the columns of a matrix. But we have a table, so we need to extract the measured variables into a single matrix:

>> VarMeas = table2array(Seeds_dataset(:,1:7));
>> SeedClass = table2array(Seeds_dataset(:,8));

Now we can use the plotmatrix() function:

>> plotmatrix(VarMeas)

The sub-axes along the diagonal are histogram plots of the data in the corresponding column of the matrix, while the rest of the plots are scatter plots of the columns of a matrix (the subplot in the ith row and jth column of the matrix is a scatter plot of the ith column against the jth column), as shown in the following figure:

Figure 8.9: Scatter plot matrix of the measured variables

As we can see in Figure 8.9, scatter plot matrices are a great way to roughly determine whether there is a linear correlation between multiple variables. This is particularly helpful in locating specific variables that might have mutual correlations, indicating a possible redundancy in the data. As already mentioned, along the diagonal are histogram plots of the measured variables, which give us a picture of the distribution of their values. The rest of the plots are scatter plots of the matrix columns. Each plot is present twice: the plots on the ith row are the mirror images of those in the ith column.

From a first analysis of Figure 8.9, you can see several plots showing a linear relationship between the variables. This is the case for the plot showing the relationship between Area and Perimeter, as well as the one between Perimeter and LengthK, just to name a couple; while no correlation can be found for other pairs of variables. For example, the variable AsymCoef shows no clear correlation with any other variable; in fact, its data points are spread throughout the plot area.

This first visual analysis can be confirmed by calculating the linear correlation between the measured variables. We will use the corr() function to return a matrix containing the pairwise linear correlation coefficient (r) between each pair of columns in the matrix given by the user:

>> r = corr(VarMeas)
r =
1.0000 0.9944 0.6099 0.9511 0.9710 -0.2228 0.8627
0.9944 1.0000 0.5318 0.9729 0.9455 -0.2110 0.8895
0.6099 0.5318 1.0000 0.3740 0.7622 -0.3294 0.2270
0.9511 0.9729 0.3740 1.0000 0.8627 -0.1697 0.9321
0.9710 0.9455 0.7622 0.8627 1.0000 -0.2531 0.7482
-0.2228 -0.2110 -0.3294 -0.1697 -0.2531 1.0000 -0.0033
0.8627 0.8895 0.2270 0.9321 0.7482 -0.0033 1.0000

The correlation coefficient always lies between -1.0 and 1.0 (0 = no correlation; values close to -1 or +1 = strong negative or positive correlation). What is suggested by Figure 8.9 is confirmed by the preceding matrix. To better understand the concept of correlation, we analyze in detail the first row of plots shown in Figure 8.9; for each of them, we provide the value of the newly calculated correlation coefficient, as shown in the following figure:

Figure 8.10: Scatter plot of the first variable versus the others with relative correlation coefficient
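
The coefficients quoted in the list below can be read directly from the first row of the r matrix computed above:

>> r(1,2:7)     % correlations of Area with the other six variables
ans =
0.9944 0.6099 0.9511 0.9710 -0.2228 0.8627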

From Figure 8.10, we can understand a lot:

  • In the first plot to the left, the data points are placed on a straight line, showing a correlation of nearly +1, as confirmed by the correlation coefficient (0.9944).
  • In the second plot, the data shows an increasing trend but is less grouped on the straight line, indicating a moderate linear relationship (0.6099).
  • In the third plot, the data is again placed on a straight line showing a correlation of nearly +1, but to a lesser extent than the first plot, as confirmed by the correlation coefficient (0.9511).
  • A similar thing happens for the fourth plot (0.9710).
  • The fifth plot, however, shows a different trend than the ones seen so far. In it, the trend is decreasing (r is negative) and the data is extremely scattered. Indeed, the correlation coefficient is close to zero (-0.2228).
  • Finally, the sixth plot shows a strong uphill (positive) linear relationship (0.8627).

One may ask why we must analyze both the graphs and the coefficients. In some cases, the plots tell us what r does not. Indeed, if the scatter plot doesn’t indicate there’s at least somewhat of a linear relationship, the correlation calculation doesn't mean much. Then, two things can happen:

  • If no relationship at all exists, calculating the correlation doesn't make sense, because correlation only applies to linear relationships
  • If a strong relationship exists but it's not linear, the correlation coefficient may be misleading, because it cannot capture a curved relationship (as the sketch after this list illustrates)
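
As a quick illustration of the second point (a toy example, not part of the seeds analysis), a perfectly curved relationship can yield a correlation coefficient close to zero:

>> x = linspace(-1,1,100)';    % toy predictor, symmetric around zero
>> y = x.^2;                   % strong but purely quadratic relationship
>> corr(x,y)                   % close to zero despite the strong relationship
>> scatter(x,y)                % the scatter plot reveals the curved pattern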

That's why it's critical to examine the scatter plot. As we can see in the r matrix, the correlation between several pairs of variables is well above 0.85 (and in some cases above 0.95). This means that there is a strong linear correlation between some variables, and therefore some redundancy in the measured data matrix. PCA constructs new independent variables that are linear combinations of the original variables. It's time to run PCA:

>> [coeff,score,latent,tsquared,explained,mu] = pca(VarMeas);

This code returns the following information:

  • coeff: The principal component coefficients
  • score: The principal component scores
  • latent: The principal component variances
  • tsquared: Hotelling's T-squared statistic for each observation
  • explained: The percentage of the total variance explained by each principal component
  • mu: The estimated mean of each variable

When the variables are in different units or the difference in the variance of different columns is substantial, scaling of the data or use of weights is often preferable. By default, the pca() function centers the matrix given by the user by subtracting column means before computing singular value decomposition or eigenvalue decomposition.
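
If we wanted to work with standardized variables, one possible approach (just a sketch; the analysis that follows uses the raw measurements) is to z-score the data before calling pca(), or to pass the 'VariableWeights' option; note that with variable weights the returned coefficients are scaled differently from those obtained on z-scored data:

>> [coeff_z,score_z] = pca(zscore(VarMeas));                       % PCA on standardized (z-scored) variables
>> [coeff_w,score_w] = pca(VarMeas,'VariableWeights','variance');  % inverse-variance weighting alternative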

Furthermore, to perform our PCA, three different algorithms are available:

  • Singular value decomposition ('svd')
  • Eigenvalue decomposition of the covariance matrix ('eig')
  • Alternating least squares algorithm ('als')

By default, the pca() function uses the singular value decomposition algorithm.
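
The algorithm can be selected explicitly through the 'Algorithm' name-value pair; a minimal sketch:

>> [coeff_eig,score_eig] = pca(VarMeas,'Algorithm','eig');   % eigenvalue decomposition of the covariance matrix
>> [coeff_als,score_als] = pca(VarMeas,'Algorithm','als');   % alternating least squares, useful when data contains NaNs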

Now, we analyze the results obtained by applying the pca() function in detail. Let's start with the coeff variable, which contains the principal component coefficients:

>> coeff
coeff =
0.8852 0.0936 -0.2625 0.2034 0.1394 -0.2780 -0.0254
0.3958 0.0532 0.2779 -0.5884 -0.5721 0.2922 0.0659
0.0043 -0.0030 -0.0578 0.0581 0.0524 0.0452 0.9942
0.1286 0.0285 0.3967 -0.4293 0.7905 0.1267 0.0003
0.1110 0.0008 -0.3168 0.2392 0.1268 0.8986 -0.0804
-0.1195 0.9903 -0.0659 -0.0262 0.0030 -0.0027 0.0011
0.1290 0.0832 0.7671 0.6057 -0.0973 0.1078 0.0092

The rows of coeff correspond to the seven variables contained in the VarMeas matrix, and its columns correspond to the seven principal components. Each column of coeff contains the coefficients for one principal component, and the columns are in descending order of component variance; these coefficients determine the linear combination of the original variables that represents the information in the new dimensional space.

A generic principal component is defined as a linear combination of the p original variables weighted by a coefficient vector. The first principal component is the linear combination of the p variables with the highest variance; the second is the linear combination with the next highest variance, subject to the constraint of being orthogonal to the previous component; and so on. For example, the first principal component is represented by the following equation:

PC1 = 0.8852*Area + 0.3958*Perimeter + 0.0043*Compactness + 0.1286*LengthK + 0.1110*WidthK - 0.1195*AsymCoef + 0.1290*LengthKG

There are as many principal components as there are observed variables, and each one is obtained as the maximum-variance linear combination subject to being uncorrelated with all the previous ones.
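
Since pca() centers the data by default, each score can be reproduced by applying these coefficients to the centered measurements. A quick numerical check (a sketch that assumes a MATLAB release supporting implicit expansion):

>> norm(score - (VarMeas - mu)*coeff)     % should be numerically close to zero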

The second output, score, contains the coordinates of the original data in the new dimensional space defined by the principal components. The score matrix is of the same size as the input data matrix:

>> size(VarMeas)
ans =
199 7
>> size(score)
ans =
199 7

Our purpose was to represent the dataset in the new space characterized by a smaller number of dimensions. Then, we create a plot of the first two columns of score (the first two principal components).

Remember, the score contains the coordinates of the original data in the new coordinate system defined by the principal components.

To make the graph more understandable, we group the data per class:

>> gscatter(score(:,1),score(:,2),SeedClass,'brg','xo^')

In the following figure, the scatter plot for the new feature obtained (the first two principal components) is shown:

Figure 8.11: The scatter plot for the first two principal components

Figure 8.11 clearly separates the seeds into the three classes. The different points are distributed in different areas of the plot, leaving only minimal uncertainty in the border areas. To locate such points, we can use the gname() function. This function displays a figure window and waits for you to press a mouse button. Moving the mouse over the plot displays a pair of cross hairs. If you position the cross hairs near a point and click once, the plot displays the label corresponding to that point (right-click on a point to remove its label). When you are done labeling points, press Enter or Esc to stop labeling.

The window with the scatter plot is already open, so we can type the command:

>> gname

By clicking on the points near the borders that seem uncertain, we can locate the record number for that observation (row number of the dataset), as shown in the following figure:

Figure 8.12: Points with classification uncertainty

From an analysis of Figure 8.12, we can read the row numbers of the observations whose classification appears uncertain according to the plot. The purpose is to analyze the variable values for these observations in order to understand the reason for their ambiguous positioning.

We now return to the results provided by the pca() function. The third output, named latent, is a vector containing the variance explained by each principal component. Each column of score has a sample variance equal to the corresponding element of latent. As already mentioned, the values are in descending order of component variance:

>> latent
latent =
10.8516
2.0483
0.0738
0.0127
0.0028
0.0016
0.0000
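
A quick check (a sketch using the outputs computed above) confirms that each column of score has a sample variance equal to the corresponding element of latent, and that the percentages in explained are derived from these variances:

>> var(score)'                 % sample variance of each score column; matches latent
>> 100*latent/sum(latent)      % matches the explained output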

It is convenient to chart these values to understand their meaning:

>> plot(latent)
>> xlabel('Principal Component')
>> ylabel('Variance Explained')

A scree plot displays the explained variance associated with a principal component in descending order versus the number of the principal component. We can use scree plots in PCA to visually assess which components explain most of the variability in the data, as shown in the following figure:

Figure 8.13: Variance explained for each principal component

Generally, we extract the components on the steep slope. The components on the shallow slope contribute little to the solution. The last big drop occurs between the second and third components, so we choose the first two components.
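
Once we have decided to keep two components, pca() can also be asked to return only those through the 'NumComponents' option; a minimal sketch:

>> [coeff2,score2] = pca(VarMeas,'NumComponents',2);   % return only the first two components
>> size(coeff2)                                        % 7-by-2 loadings matrix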

To better understand this choice, we can make a scree plot of the percent variability explained by the first two principal components. The percent variability explained is returned in the explained output of the pca() function.

>> figure()
>> pareto(explained(1:2))
>> xlabel('Principal Component')
>> ylabel('Variance Explained (%)')

Figure 8.14: Scree plot of the percent variability explained by each principal component

In Figure 8.14, there are two pieces of information: the bars show the proportion of variance explained by each of the first two principal components, while the upper line shows their cumulative explained variance. An analysis of Figure 8.14 confirms the choice made earlier, since the first two principal components account for more than 99 percent of the total variance.
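
This can be confirmed numerically from the explained output; a quick check:

>> cumsum(explained)      % cumulative percentage of variance; the second entry is above 99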

Finally, we just have to visualize both the principal component coefficients for each variable and the principal component scores for each observation in a single plot. This type of plot is named biplot:

Biplots are a type of exploration plot used to simultaneously display graphic information on the samples and variables of a data matrix. Samples are displayed as points, while variables are displayed as vectors.

>> biplot(coeff(:,1:2),'scores',score(:,1:2),'varlabels',varnames(1:7));

Figure 8.15: Biplot of the principal component coefficients for each variable and principal component scores for each observation

All seven variables are represented in this biplot by a vector, and the direction and length of the vector indicate how each variable contributes to the two principal components in the plot. For example, the first principal component, on the horizontal axis, has positive coefficients for six variables; only the AsymCoef variable has a negative coefficient. That is why six vectors are directed into the right half of the plot and only one is directed into the left half. The largest coefficient in the first principal component is the first element, corresponding to the variable Area.

The second principal component, on the vertical axis, has positive coefficients for six variables and only one negative coefficient, for the Compactness variable (close to zero). Moreover, the length of the vectors clearly shows the weight of each variable in the corresponding principal component. It is clear, then, that the Area variable carries a preponderant weight in the first principal component. The same applies to the AsymCoef variable for the second principal component.
