Mastering Machine Learning with R（Second Edition）

Qualitative features

A qualitative feature, also referred to as a factor, can take on two or more levels, such as Male/Female or Bad/Neutral/Good. If we have a feature with two levels, say gender, then we can create what is known as an indicator or dummy feature, arbitrarily assigning one level as 0 and the other as 1. If we create a model with just the indicator, our linear model still follows the same formulation as before, that is, Y = B0 + B1x + e. If we code the feature as male being equal to 0 and female equal to 1, then the expectation for male is just the intercept B0, while for female it is B0 + B1, because x = 1. When a feature has n levels with n greater than two, you create n-1 indicators; so, for three levels you would have two indicators. If you created as many indicators as levels while keeping the intercept, you would fall into the dummy variable trap, which results in perfect multicollinearity.
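The n-1 coding and the dummy variable trap described above can be checked directly in base R. The following is a minimal sketch on a small made-up factor; the names `rating`, `X`, and `X_trap` are hypothetical and the data is not from the book. It compares the number of columns in each design matrix with its rank.

```r
# Hypothetical toy data: a three-level factor (not taken from the book)
rating <- factor(c("Bad", "Neutral", "Good", "Bad", "Good", "Neutral"),
                 levels = c("Bad", "Neutral", "Good"))

# Default treatment coding: an intercept plus n - 1 = 2 indicator columns
X <- model.matrix(~ rating)
print(colnames(X))
print(qr(X)$rank)        # rank equals the number of columns

# One indicator per level AND an intercept: the dummy variable trap
X_trap <- cbind(Intercept = 1, model.matrix(~ rating - 1))
print(ncol(X_trap))
print(qr(X_trap)$rank)   # rank is lower than the number of columns
```

In the second matrix the three indicator columns sum to the intercept column, so the rank falls one short of the column count. That is the perfect multicollinearity the text warns about.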

We can examine a simple example to learn how to interpret the output. Let's load the ISLR package and build a model with the Carseats dataset using the following code snippet:

    > library(ISLR)
    
    > data(Carseats)
    
    > str(Carseats)
    
    'data.frame':   400 obs. of  11 variables:
    $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
    $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
    $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
    $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
    $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
    $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
    $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
    $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
    $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
    $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
    $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

For this example, we will predict Sales in the Carseats data using just Advertising, a quantitative feature, and ShelveLoc, a qualitative feature that is a factor with three levels: Bad, Good, and Medium. With factors, R automatically codes the indicators for the analysis. We build and analyze the model as follows:

    > sales.fit <- lm(Sales ~ Advertising + ShelveLoc, data = Carseats)
    
    > summary(sales.fit)
    
    Call:
    lm(formula = Sales ~ Advertising + ShelveLoc, data = Carseats)
    
    Residuals:
        Min      1Q  Median      3Q     Max
    -6.6480 -1.6198 -0.0476  1.5308  6.4098
    
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
    (Intercept)      4.89662    0.25207  19.426  < 2e-16 ***
    Advertising      0.10071    0.01692   5.951 5.88e-09 ***
    ShelveLocGood    4.57686    0.33479  13.671  < 2e-16 ***
    ShelveLocMedium  1.75142    0.27475   6.375 5.11e-10 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 2.244 on 396 degrees of freedom
    Multiple R-squared:  0.3733,    Adjusted R-squared:  0.3685
    F-statistic: 78.62 on 3 and 396 DF,  p-value: < 2.2e-16

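Bad is the baseline level here because R orders factor levels alphabetically by default, so its effect is absorbed into the intercept. With Advertising held at zero, the estimated sales for a bad shelving location equal the intercept, 4.89662. For a good location the estimate is 4.89662 + 4.57686 = 9.47348, almost double, and for a medium location it is 4.89662 + 1.75142 = 6.64804. To see how R codes the indicator features, you can use the contrasts() function: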

    > contrasts(Carseats$ShelveLoc)
    
            Good Medium
    Bad       0      0
    Good      1      0
    Medium    0      1
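The contrast matrix and the coefficient estimates can be combined by hand to recover the expected sales for each shelving location. The following is a minimal sketch that copies the estimates from the summary(sales.fit) output above rather than refitting the model, so it runs without the ISLR package. The names `b`, `coding`, and `expected_sales` are hypothetical helpers introduced for illustration.

```r
# Estimates copied from the summary(sales.fit) output above
b <- c(Intercept = 4.89662, Advertising = 0.10071,
       ShelveLocGood = 4.57686, ShelveLocMedium = 1.75142)

# Treatment contrasts as printed by contrasts(Carseats$ShelveLoc)
coding <- rbind(Bad = c(0, 0), Good = c(1, 0), Medium = c(0, 1))
colnames(coding) <- c("Good", "Medium")

# Hypothetical helper: expected sales for every shelving location
# at a given level of advertising
expected_sales <- function(advertising) {
  b[["Intercept"]] + b[["Advertising"]] * advertising +
    drop(coding %*% b[c("ShelveLocGood", "ShelveLocMedium")])
}

at_zero <- expected_sales(0)
print(round(at_zero, 3))

# Ratio of Good to Bad with no advertising
print(round(at_zero[["Good"]] / at_zero[["Bad"]], 2))
```

Each row of the contrast matrix switches on at most one of the two ShelveLoc coefficients. The Bad row switches on neither, so its expected sales depend only on the intercept and Advertising. With the model object available, calling predict() on sales.fit with a new data frame should give the same values up to rounding of the copied estimates.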