Describing differences by discriminant analysis

Discriminant analysis is a statistical technique dating back to Fisher (1936), who introduced Linear Discriminant Analysis (LDA), as we have already mentioned. It is a method for describing, through a discriminant function, the difference between two or more groups, and for allocating each observation to its group of origin. This is a classification problem: the groups are known a priori, and one or more new observations are classified into one of the known groups based on the measured characteristics. Each observation must have a score on one or more quantitative predictor measures, and a score on a group measure. Discriminant analysis is useful in determining whether a set of variables is effective in predicting category membership.

In MATLAB, the discriminant analysis model is created on the following assumptions:

  • Each class generates data using a multivariate normal distribution; that is, the model assumes that the data has a Gaussian mixture distribution.
  • For linear discriminant analysis, the model has the same covariance matrix for each class, only the means vary.
  • For quadratic discriminant analysis, both means and covariances of each class vary.
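
We can read these assumptions directly off a fitted model through its Sigma property. The following is a minimal sketch (it uses the fisheriris data introduced below): with the default linear type, Sigma is a single pooled covariance matrix, while with the quadratic type it becomes a three-dimensional array holding one covariance matrix per class.

>> load fisheriris
>> LinModel = fitcdiscr(meas,species);      % 'DiscrimType' is 'linear' by default
>> size(LinModel.Sigma)                     % 4-by-4: one pooled covariance matrix
ans =
4 4
>> QuadModel = fitcdiscr(meas,species,'DiscrimType','quadratic');
>> size(QuadModel.Sigma)                    % 4-by-4-by-3: one covariance matrix per class
ans =
4 4 3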

The discriminant analysis model is created to minimize the expected classification cost. Based on these assumptions, the classification rule assumes the following form:

$$Y = \underset{y=1,\dots,K}{\arg\min} \sum_{k=1}^{K} P(k \mid x)\, C(y \mid k)$$

Where:

  • Y is the predicted classification
  • K is the number of classes
  • P(k|x) is the posterior probability of class k for observation x
  • C(y|k) is the cost of classifying an observation as y when its true class is k
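
To make the rule concrete, here is a small hand-computed sketch; the posterior vector and the cost matrix below are invented values for K = 3 classes, not taken from any real model:

>> P = [0.1 0.7 0.2];         % hypothetical posteriors P(k|x) for the three classes
>> C = ones(3) - eye(3);      % 0-1 cost: C(y,k) is 1 if y differs from k, 0 otherwise
>> ExpCost = C*P';            % ExpCost(y) = sum over k of P(k|x)*C(y|k)
>> [~,Y] = min(ExpCost)       % class 2 has the smallest expected cost (0.3)
Y =
2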

In the previous section, we used the fisheriris dataset, already available in MATLAB, which Fisher himself used as an example of LDA. Why change now? We will therefore refer to the same dataset in the next example. Remember, to import the data, simply type:

>> load fisheriris

To perform a discriminant analysis in the MATLAB environment, we can use the fitcdiscr() function, which returns a fitted discriminant analysis model based on the input and output variables provided. The model estimates the parameters of a Gaussian distribution for each class. Let's train a discriminant analysis model using the entire dataset:

>> DiscrModel = fitcdiscr(meas,species)
DiscrModel =
ClassificationDiscriminant
ResponseName: 'Y'
CategoricalPredictors: []
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'none'
NumObservations: 150
DiscrimType: 'linear'
Mu: [3×4 double]
Coeffs: [3×3 struct]

To access the model properties, use dot notation. In the summary printed in the Command Window, notice the Mu variable in the second-to-last row. It contains the group means of each predictor (length and width of sepal and petal) for each class. To display it onscreen, simply type:

>> DiscrModel.Mu
ans =
5.0060 3.4280 1.4620 0.2460
5.9360 2.7700 4.2600 1.3260
6.5880 2.9740 5.5520 2.0260
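
As a quick cross-check, these values should coincide with the per-species averages computed directly from the data, for example with the grpstats() function of the Statistics and Machine Learning Toolbox:

>> grpstats(meas,species)     % group means by species: same values as DiscrModel.Mu
ans =
5.0060 3.4280 1.4620 0.2460
5.9360 2.7700 4.2600 1.3260
6.5880 2.9740 5.5520 2.0260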

Each row contains the means of the four variables for one class: the first row for the setosa species, the second for versicolor, and the last for virginica. In the last row of the summary shown by the fitcdiscr() function, there is the Coeffs variable. What is this? Let's visualize the content:

>> DiscrModel.Coeffs
ans =
3×3 struct array with fields:
DiscrimType
Const
Linear
Class1
Class2

This is an n-by-n struct array, where n is the number of classes; in our case, a 3-by-3 struct array, because there are three classes. Each element contains the coefficients of the linear boundary between a pair of classes. Why are we now talking about boundaries? It's simple: the model divides the n-dimensional feature space into regions, each belonging to a specific class, and the boundaries delimit those regions. If a new item falls into a specific region, it is attributed to the corresponding class.

So Coeffs(i,j) contains the boundary data between class i and class j. The equation of the boundary between these classes is:

Const + Linear * x = 0

Where x is a vector containing the feature values. From here on, we use only the petal features to build the model, so that everything can be represented in a two-dimensional space:

>> X = [meas(:,3) meas(:,4)];

Train a discriminant analysis model using only the petal features:

>> DiscrModelPetal = fitcdiscr(X,species)
DiscrModelPetal =
ClassificationDiscriminant
ResponseName: 'Y'
CategoricalPredictors: []
ClassNames: {'setosa' 'versicolor' 'virginica'}
ScoreTransform: 'none'
NumObservations: 150
DiscrimType: 'linear'
Mu: [3×2 double]
Coeffs: [3×3 struct]

Display the species distribution on the scatter plot:

>> gscatter(meas(:,3), meas(:,4), species,'rgb','osd');

Retrieve the coefficients of the linear boundary between the setosa and versicolor classes (the first and second classes, in order):

>> Const12 = DiscrModelPetal.Coeffs(1,2).Const;
>> Linear12 = DiscrModelPetal.Coeffs(1,2).Linear;

Plot the curve that separates the first and second classes:

>> hold on
>> Bound12 = @(x1,x2) Const12 + Linear12(1)*x1 + Linear12(2)*x2;
>> B12 = ezplot(Bound12,[0 7.2 0 2.8]);
>> B12.Color = 'r';
>> B12.LineWidth = 2;
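
Note that ezplot() is no longer recommended in recent MATLAB releases; if you are on R2016b or later, a sketch of the equivalent call with fimplicit() is:

>> B12 = fimplicit(Bound12,[0 7.2 0 2.8]);   % same boundary, newer plotting function
>> B12.Color = 'r';
>> B12.LineWidth = 2;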

Retrieve the coefficients of the linear boundary between the versicolor and virginica classes (the second and third classes, in order):

>> Const23 = DiscrModelPetal.Coeffs(2,3).Const;
>> Linear23 = DiscrModelPetal.Coeffs(2,3).Linear;

Plot the curve that separates the second and third classes:

>> Bound23 = @(x1,x2) Const23 + Linear23(1)*x1 + Linear23(2)*x2;
>> B23 = ezplot(Bound23,[0 7.2 0 2.8]);
>> B23.Color = 'b';
>> B23.LineWidth = 2;

Finally, set the axis labels and the title:

>> xlabel('Petal Length')
>> ylabel('Petal Width')
>> title('{\bf Linear Classification by Discriminant Analysis}')

The following figure shows a scatter plot of the Fisher Iris distribution with the boundaries between the species.

Figure 5.7: Scatter plot of Fisher Iris distribution with the boundaries between the species
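
As an alternative way to visualize the class regions delimited by these boundaries, we can classify a dense grid of points and color it by predicted class. The following is a rough sketch of this idea; the grid step and plotting style are arbitrary choices:

>> [x1,x2] = meshgrid(0:0.05:7.2, 0:0.05:2.8);           % dense grid over the plot area
>> GridLabels = predict(DiscrModelPetal,[x1(:) x2(:)]);  % classify every grid point
>> figure
>> gscatter(x1(:), x2(:), GridLabels, 'rgb', '.');       % paint the three decision regions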

We now check that the algorithm works correctly by classifying three new points that, as shown in Figure 5.8, fall within the three species regions. These points are:

  • P1: Petal Length = 2 cm; Petal Width = 0.5 cm
  • P2: Petal Length = 5 cm; Petal Width = 1.5 cm
  • P3: Petal Length = 6 cm; Petal Width = 2 cm

Let's first create two vectors containing the petal lengths and the petal widths of the new data, respectively:

>> NewPointsX=[2 5 6];
>> NewPointsY=[0.5 1.5 2];

To predict the classes of the new data, we can use the predict() function, which returns a vector of predicted class labels for the predictor data provided, based on the trained discriminant analysis classification model.

>> LabelsNewPoints = predict(DiscrModelPetal,[NewPointsX' NewPointsY'])
LabelsNewPoints =
3×1 cell array
'setosa'
'versicolor'
'virginica'
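
In addition to the labels, predict() can return the posterior probabilities (and the expected classification costs) behind each decision, which helps to judge how close a new point is to a boundary. A quick sketch:

>> [Labels,Posterior] = predict(DiscrModelPetal,[NewPointsX' NewPointsY']);
>> Posterior     % one row per point, one column per class; each row sums to 1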

Now, we plot the three points in the scatter plot with boundaries to verify correct classification:

>> plot(NewPointsX,NewPointsY,'*')

The following figure shows the new data points on the scatter plot with the boundaries. It is easy to verify that the classification returned by the predict() function coincides with what we would do visually:

Figure 5.8: Scatter plot with the boundaries between the species and new data points

As can be seen in Figure 5.8, the classification performed by the discriminant analysis is very good, except for some values that fall near the boundary between the versicolor and virginica species. To improve the results, if possible, set the DiscrimType name-value pair to 'pseudoLinear' or 'pseudoQuadratic'.
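
The following is a minimal sketch of how to request an alternative discriminant type; here we fit the quadratic variant, in which each class gets its own covariance matrix and the boundaries become curved. Whether it actually improves the results must be verified, for instance with the resubstitution error computed next:

>> DiscrModelQuad = fitcdiscr(meas,species,'DiscrimType','quadratic');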

To test the performance of the model, compute the resubstitution error, which is the misclassification rate on the same data used for training:

>> DiscrModelResubErr = resubLoss(DiscrModel)
DiscrModelResubErr =
0.0200

The result indicates that 2 percent of the observations are misclassified by the linear discriminant function. To understand how these errors are distributed, we can compute the confusion matrix: first, we collect the model predictions for the available data, and then we compute the matrix.

>> PredictedValue = predict(DiscrModel,meas);
>> ConfMat = confusionmat(species,PredictedValue)
ConfMat =
50 0 0
0 48 2
0 1 49
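
The resubstitution error can also be recovered directly from this matrix, since the off-diagonal entries count the misclassified observations:

>> 1 - trace(ConfMat)/sum(ConfMat(:))    % (150 - 147)/150
ans =
0.0200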

As expected, there are only three errors (2 percent of the available values), and they involve the versicolor and virginica species. We can see which ones they are by drawing an X through the misclassified points:

>> Err = ~strcmp(PredictedValue,species);
>> gscatter(meas(:,3), meas(:,4), species,'rgb','osd');
>> hold on
>> plot(meas(Err,3), meas(Err,4), 'kx');
>> xlabel('Petal length');
>> ylabel('Petal width');

Let's analyze the code we have just written. First, we identified the misclassified values using the strcmp() function, which compares two strings and returns 1 (true) if they are identical and 0 (false) otherwise; negating its result flags the mismatches. Then, we drew a scatter plot grouped by species, using the Petal length and Petal width features, added markers at the misclassified values, and finally added the axis labels.
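
Besides marking them on the plot, we can also list the misclassified observations next to their predicted labels; a small sketch using the variables defined above:

>> [species(Err) PredictedValue(Err)]    % true class (left) versus predicted class (right)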

The following figure shows a scatter plot grouped by species with incorrect values indicated:

Figure 5.9: Scatter plot grouped by species with incorrect values indicated

As expected, the three errors are located on the boundary between the two species versicolor and virginica.
