
Logistic regression with cross-validation

The purpose of cross-validation is to improve our prediction on the test set and minimize the chance of overfitting. With K-fold cross-validation, the dataset is split into K equal-sized parts. The algorithm holds out each of the K parts in turn: it fits a model to the other K-1 parts and obtains predictions for the left-out part. The results are then averaged so as to minimize the error, and the appropriate features are selected. You can also perform the Leave-One-Out Cross-Validation (LOOCV) method, where K is equal to N, the number of observations, so each fold consists of a single observation. Simulations have shown that LOOCV can produce averaged estimates with high variance. As a result, most machine learning experts recommend setting the number of K-folds to 5 or 10.
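
To make the mechanics concrete, here is a minimal sketch of manual K-fold cross-validation for a logistic regression. It assumes a hypothetical data frame df with a 0/1 outcome column named y; neither name comes from the chapter's data:

    > set.seed(123)
    > k <- 5
    > folds <- sample(rep(1:k, length.out = nrow(df)))    # random fold assignment per row
    > cv.errors <- sapply(1:k, function(i) {
        fit <- glm(y ~ ., family = binomial,
                   data = df[folds != i, ])                # fit on the other k-1 folds
        probs <- predict(fit, newdata = df[folds == i, ],
                         type = "response")                # predict the held-out fold
        mean(ifelse(probs > 0.5, 1, 0) != df$y[folds == i])
      })
    > mean(cv.errors)                                      # averaged CV misclassification error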

An R package that will automatically do CV for logistic regression is the bestglm package. This package is dependent on the leaps package that we used for linear regression. The syntax and formatting of the data require some care, so let's walk through this in detail:

    > library(bestglm)
Loading required package: leaps

After loading the package, we need our outcome coded as 0 or 1; if it is left as a factor, the package will not work. The other requirement is that the outcome, or y, is the last column of the data frame and that all extraneous columns have been removed. A new data frame will do the trick for us, via the following code:

    > X <- train[, 1:9]
    > Xy <- data.frame(cbind(X, trainY))
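
If your outcome were still stored as a factor, a recode along the following lines would meet the 0/1 requirement. This is only a sketch; it assumes the class labels are benign and malignant, as in the biopsy data, and that trainY was built in a similar way earlier in the chapter:

    > trainY <- ifelse(train$class == "malignant", 1, 0)   # 1 = malignant, 0 = benign (assumed labels)
    > table(trainY)                                        # sanity check: both classes present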

 Here is the code to run in order to use the CV technique with our data:

    > bestglm(Xy = Xy, IC = "CV",
              CVArgs = list(Method = "HTF", K = 10, REP = 1),
              family = binomial)

The syntax, Xy = Xy, points to our properly formatted data frame. IC = "CV" tells the package that the information criterion to use is cross-validation. CVArgs are the CV arguments that we want to use. The HTF method is K-fold, which is followed by the number of folds, K = 10, and we are asking it to do only one iteration of the random folds with REP = 1. Just as with glm(), we will need to specify family = binomial. On a side note, you can use bestglm for linear regression as well by specifying family = gaussian. After running the analysis, we end up with the following output, which gives us three features for the Best Model, namely thick, u.size, and nucl. The statement on Morgan-Tatar search simply means that an exhaustive search was done over all the possible subsets, as follows:

    Morgan-Tatar search since family is non-gaussian.
    CV(K = 10, REP = 1)
    BICq equivalent for q in (7.16797006619085e-05, 0.273173435514231)

    Best Model:
                  Estimate Std. Error   z value     Pr(>|z|)
    (Intercept) -7.8147191 0.90996494 -8.587934 8.854687e-18
    thick        0.6188466 0.14713075  4.206100 2.598159e-05
    u.size       0.6582015 0.15295415  4.303260 1.683031e-05
    nucl         0.5725902 0.09922549  5.770596 7.899178e-09
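
Regarding the family = gaussian side note above, a best subsets run for ordinary linear regression could look like the following sketch. The mtcars example is hypothetical and only illustrates that the outcome (mpg here) must again be placed in the last column:

    > cars.Xy <- mtcars[, c("wt", "hp", "disp", "mpg")]    # outcome mpg placed last
    > bestglm(Xy = cars.Xy, IC = "BIC", family = gaussian)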

We can put these features in glm() and then see how well the model does on the test set. A predict() call will not work on a bestglm object, so this is a required step:

    > reduce.fit <- glm(class ~ thick + u.size + nucl,
                        family = binomial, data = train)
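
As an aside, the bestglm object itself appears to carry the winning fit in its BestModel component, which is an ordinary glm fit that predict() will accept; this is an assumption about the package's return structure rather than something shown above:

    > cv.fit <- bestglm(Xy = Xy, IC = "CV",
                        CVArgs = list(Method = "HTF", K = 10, REP = 1),
                        family = binomial)
    > # assumes the fitted glm is stored in cv.fit$BestModel
    > head(predict(cv.fit$BestModel, newdata = test, type = "response"))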

As before, the following code allows us to compare the predicted labels with the actual ones on the test set:

    > test.cv.probs <- predict(reduce.fit, newdata = test,
                               type = "response")
    > misClassError(testY, test.cv.probs)
    [1] 0.0383
    > confusionMatrix(testY, test.cv.probs)
        0   1
    0 139   5
    1   3  62
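
For reference, roughly the same numbers can be reproduced with base R alone. This sketch assumes a 0.5 probability cutoff, and the table's row/column orientation may differ from the helper function's output:

    > test.cv.class <- ifelse(test.cv.probs > 0.5, 1, 0)   # assumed 0.5 cutoff
    > mean(test.cv.class != testY)                         # misclassification error
    > table(actual = testY, predicted = test.cv.class)     # confusion matrix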

The reduced feature model is slightly less accurate than the full feature model, but all is not lost. We can utilize the bestglm package again, this time running best subsets with the information criterion set to BIC:

    > bestglm(Xy = Xy, IC = "BIC", family = binomial)
    Morgan-Tatar search since family is non-gaussian.
    BIC
    BICq equivalent for q in (0.273173435514231, 0.577036596263757)

    Best Model:
                  Estimate Std. Error   z value     Pr(>|z|)
    (Intercept) -8.6169613 1.03155250 -8.353391 6.633065e-17
    thick        0.7113613 0.14751510  4.822295 1.419160e-06
    adhsn        0.4537948 0.15034294  3.018398 2.541153e-03
    nucl         0.5579922 0.09848156  5.665956 1.462068e-08
    n.nuc        0.4290854 0.11845720  3.622282 2.920152e-04

These four features provide the minimum BIC score for all possible subsets. Let's try this and see how it predicts the test set, as follows:

    > bic.fit <- glm(class ~ thick + adhsn + nucl + n.nuc,
                     family = binomial, data = train)

    > test.bic.probs <- predict(bic.fit, newdata = test,
                                type = "response")
    > misClassError(testY, test.bic.probs)
    [1] 0.0239
    > confusionMatrix(testY, test.bic.probs)
        0   1
    0 138   1
    1   4  66

Here we have five errors, just like the full model. The obvious question then is: which one is better? In any normal situation, the rule of thumb is to default to the simplest or easiest-to-interpret model, given equal generalization performance. We could run a completely new analysis with a new randomization and different ratios of the train and test sets, among other things. However, let's assume for a moment that we've exhausted the limits of what logistic regression can do for us. We will come back to the full model and the model built on the BIC minimum at the end, when we discuss the methods of model selection. Now, let's move on to our discriminant analysis methods, which we will also include as possibilities in the final recommendation.