Discriminant analysis application
LDA is available through the lda() function in the MASS package, which we have already loaded so that we can access the biopsy data. The syntax is very similar to that of the lm() and glm() functions.
We can now begin fitting our LDA model, which is as follows:
> lda.fit <- lda(class ~ ., data = train)
> lda.fit
Call:
lda(class ~ ., data = train)
Prior probabilities of groups:
benign malignant
0.6371308 0.3628692
Group means:
            thick  u.size u.shape   adhsn  s.size    nucl   chrom   n.nuc     mit
benign     2.9205 1.30463 1.41390 1.32450 2.11589 1.39735 2.08278 1.22516 1.09271
malignant  7.1918 6.69767 6.68604 5.66860 5.50000 7.67441 5.95930 5.90697 2.63953
Coefficients of linear discriminants:
LD1
thick 0.19557291
u.size 0.10555201
u.shape 0.06327200
adhsn 0.04752757
s.size 0.10678521
nucl 0.26196145
chrom 0.08102965
n.nuc 0.11691054
mit -0.01665454
This output shows us that the Prior probabilities of groups are approximately 64 percent for benign and 36 percent for malignant. Next is Group means, which is the average of each feature by class. Coefficients of linear discriminants give the standardized linear combination of the features that is used to determine an observation's discriminant score. Here, the higher the score, the more likely that the classification is malignant.
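To make the discriminant score concrete, here is a minimal sketch that rebuilds it by hand from the fitted coefficients. It assumes the raw biopsy data (with the ID column dropped and incomplete rows removed) as a stand-in for our train set; the names dat, lda.demo, and manual.scores are hypothetical and used only for illustration. The features are centered on the prior-weighted average of the group means and then multiplied by the coefficients, which should match what predict() returns in its x element:

```r
library(MASS)

# Assumption: the full biopsy data stands in for the chapter's train set
dat <- na.omit(biopsy[, -1])
lda.demo <- lda(class ~ ., data = dat)

# Feature matrix without the response
X <- as.matrix(dat[, names(dat) != "class"])

# Center on the prior-weighted average of the group means
center <- colSums(lda.demo$prior * lda.demo$means)

# Discriminant score = centered features times the LD1 coefficients
manual.scores <- scale(X, center = center, scale = FALSE) %*% lda.demo$scaling

# Compare with the scores reported by predict()
all.equal(as.numeric(manual.scores),
          as.numeric(predict(lda.demo, newdata = dat)$x))
```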
The plot() function in LDA will provide us with a histogram and/or the densities of the discriminant scores, as follows:
> plot(lda.fit, type = "both")
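The following is the output of the preceding command:
[Figure: histograms with overlaid density curves of the LD1 discriminant scores, one panel for the benign group and one for the malignant group]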
We can see that there is some overlap in the groups, indicating that there will be some incorrectly classified observations.
The predict() function available with LDA provides a list of three elements: class, posterior, and x. The class element is the prediction of benign or malignant, posterior holds the posterior probability of an observation belonging to each class, and x is the linear discriminant score. Let's just extract the probability of an observation being malignant, which is the second column of posterior:
> train.lda.probs <- predict(lda.fit)$posterior[, 2]
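As a quick illustration of how these three elements relate, the following sketch inspects the list returned by predict(). It again assumes the raw biopsy data in place of our train set, and dat, lda.demo, and pred are hypothetical names. With two classes, labelling an observation malignant whenever its posterior probability exceeds 0.5 should reproduce the class element:

```r
library(MASS)

# Assumption: the full biopsy data stands in for the chapter's train set
dat <- na.omit(biopsy[, -1])
lda.demo <- lda(class ~ ., data = dat)

pred <- predict(lda.demo, newdata = dat)
names(pred)            # the three elements: class, posterior, x
head(pred$posterior)   # one column of probabilities per class

# Each row of posterior probabilities sums to one
range(rowSums(pred$posterior))

# Thresholding the malignant column at 0.5 gives the predicted class
by.threshold <- ifelse(pred$posterior[, "malignant"] > 0.5,
                       "malignant", "benign")
all(by.threshold == as.character(pred$class))
```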
> misClassError(trainY, train.lda.probs)
[1] 0.0401
> confusionMatrix(trainY, train.lda.probs)
0 1
0 296 13
1 6 159
Well, unfortunately, it appears that our LDA model has performed worse on the training data than the logistic regression models. The real question is how it will perform on the test data:
> test.lda.probs <- predict(lda.fit, newdata = test)$posterior[, 2]
> misClassError(testY, test.lda.probs)
[1] 0.0383
> confusionMatrix(testY, test.lda.probs)
0 1
0 140 6
1 2 61
That's actually not as bad as I thought, given the weaker performance on the training data. In terms of the proportion of observations correctly classified, it still did not perform as well as logistic regression (about 96 percent versus almost 98 percent).
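The error rates reported above can be checked directly against the confusion matrices. The following is a small sketch that copies the LDA counts by hand; error.rate is a hypothetical helper written for this illustration, not a function from any package:

```r
# Counts copied from the LDA confusion matrices above (filled column by column)
train.cm <- matrix(c(296, 6, 13, 159), nrow = 2)
test.cm  <- matrix(c(140, 2, 6, 61), nrow = 2)

# Hypothetical helper: share of observations off the main diagonal
error.rate <- function(cm) 1 - sum(diag(cm)) / sum(cm)

round(error.rate(train.cm), 4)     # 0.0401, that is, 19 errors out of 474
round(error.rate(test.cm), 4)      # 0.0383, that is, 8 errors out of 209
round(1 - error.rate(test.cm), 4)  # 0.9617 correctly classified
```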
We will now move on to fit a QDA model. In R, QDA is also part of the MASS package and the function is qda(). Building the model is rather straightforward again, and we will store it in an object called qda.fit, as follows:
> qda.fit <- qda(class ~ ., data = train)
> qda.fit
Call:
qda(class ~ ., data = train)
Prior probabilities of groups:
benign malignant
0.6371308 0.3628692
Group means:
            thick u.size u.shape  adhsn s.size   nucl  chrom  n.nuc      mit
benign     2.9205 1.3046  1.4139 1.3245 2.1158 1.3973 2.0827 1.2251 1.092715
malignant  7.1918 6.6976  6.6860 5.6686 5.5000 7.6744 5.9593 5.9069 2.639535
As with LDA, the output has Group means, but it does not have the coefficients because, as discussed previously, the discriminant function is quadratic rather than linear in the features.
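Although there is no single coefficient vector to print, the quadratic rule can still be written out by hand. The sketch below assumes the textbook form of QDA, where each class gets its own mean vector and sample covariance matrix and the posterior is proportional to the prior times the Gaussian density. It uses the raw biopsy data as a stand-in for our train set, and dat, qda.demo, log.score, and manual.post are hypothetical names. Under these assumptions, the result should agree with the posterior element from predict():

```r
library(MASS)

# Assumption: the full biopsy data stands in for the chapter's train set
dat <- na.omit(biopsy[, -1])
qda.demo <- qda(class ~ ., data = dat)

X <- as.matrix(dat[, names(dat) != "class"])

# Log of (prior times Gaussian density), up to a constant shared by both classes
log.score <- sapply(levels(dat$class), function(k) {
  Xk <- X[dat$class == k, , drop = FALSE]
  S  <- cov(Xk)  # class-specific covariance matrix
  log(qda.demo$prior[[k]]) -
    0.5 * as.numeric(determinant(S, logarithm = TRUE)$modulus) -
    0.5 * mahalanobis(X, colMeans(Xk), S)  # squared distance to the class mean
})

# Normalize each row to obtain posterior probabilities
manual.post <- exp(log.score - apply(log.score, 1, max))
manual.post <- manual.post / rowSums(manual.post)

all.equal(manual.post, predict(qda.demo, newdata = dat)$posterior,
          check.attributes = FALSE)
```

The squared Mahalanobis distance is what makes the rule quadratic in the features, and because each class has its own covariance matrix, those quadratic terms do not cancel between classes as they do in LDA.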
The predictions for the train and test data follow the same flow of code as with LDA:
> train.qda.probs <- predict(qda.fit)$posterior[, 2]
> misClassError(trainY, train.qda.probs)
[1] 0.0422
> confusionMatrix(trainY, train.qda.probs)
0 1
0 287 5
1 15 167
> test.qda.probs <- predict(qda.fit, newdata = test)$posterior[, 2]
> misClassError(testY, test.qda.probs)
[1] 0.0526
> confusionMatrix(testY, test.qda.probs)
0 1
0 132 1
1 10 66
We can quickly tell from the confusion matrices that QDA has performed the worst on the training data, with an error rate of about 4.2 percent, and that it has classified the test set poorly, with 11 incorrect predictions out of 209. In particular, it has a high rate of false positives, which account for 10 of those 11 errors.
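The false positive claim can be verified with a little arithmetic on the QDA test matrix. This sketch assumes that the rows of the confusion matrix are the predicted labels and the columns are the actual labels, with 1 denoting malignant; the object names are hypothetical:

```r
# QDA test counts copied from above
# Assumption: rows are predicted labels, columns are actual labels
qda.test.cm <- matrix(c(132, 10, 1, 66), nrow = 2,
                      dimnames = list(predicted = c("0", "1"),
                                      actual    = c("0", "1")))

false.pos <- qda.test.cm["1", "0"]  # predicted malignant, actually benign
false.neg <- qda.test.cm["0", "1"]  # predicted benign, actually malignant

# Rates relative to the actual class totals
fpr <- false.pos / sum(qda.test.cm[, "0"])
fnr <- false.neg / sum(qda.test.cm[, "1"])

round(c(fpr = fpr, fnr = fnr), 4)                       # 0.0704 and 0.0149
round((false.pos + false.neg) / sum(qda.test.cm), 4)    # 0.0526
```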