Univariate linear regression
We begin by looking at a simple way to predict a quantitative response, Y, with one predictor variable, x, assuming that Y has a linear relationship with x. The model can be written as Y = B0 + B1x + e. That is, the expected value of Y is a function of the intercept B0 plus the slope B1 times x, with an error term e capturing what the line does not explain. The least squares approach chooses the model parameters that minimize the Residual Sum of Squares (RSS) between the predicted y values and the actual Y values. For a simple example, say the actual values of Y1 and Y2 are 10 and 20 respectively, and the predictions y1 and y2 are 12 and 18. To calculate RSS, we add the squared differences: RSS = (Y1 - y1)^2 + (Y2 - y2)^2, which, with simple substitution, yields (10 - 12)^2 + (20 - 18)^2 = 4 + 4 = 8.
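If you want to verify this arithmetic in R, the calculation takes one line; the vector names actuals and preds below are just for illustration:
> actuals <- c(10, 20) #the observed values Y1 and Y2
> preds <- c(12, 18) #the predicted values y1 and y2
> sum((actuals - preds)^2) #residual sum of squares
[1] 8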
I once remarked to a peer during our Lean Six Sigma Black Belt training that it's all about the sum of squares; understand the sum of squares and the rest will flow naturally. Perhaps that is true, at least to some extent.
Before we begin with an application, I want to point out that, if you read the headlines of various research breakthroughs, you should do so with a jaded eye and a skeptical mind as the conclusion put forth by the media may not be valid. As we shall see, R, and any other software for that matter, will give us a solution regardless of the inputs. However, just because the math makes sense and a high correlation or R-squared statistic is reported doesn't mean that the conclusion is valid.
To drive this point home, let's have a look at the famous Anscombe dataset, which is available in R. The statistician Francis Anscombe produced this set to highlight the importance of data visualization and outliers when analyzing data. It consists of four pairs of X and Y variables that have nearly identical statistical properties but, when plotted, show something very different. I have used the data to train colleagues and to educate business partners on the hazards of fixating on statistics without exploring the data and checking assumptions. I think this is a good place to start should you have a similar need. It is a brief digression before moving on to serious modeling:
> #call up and explore the data
> data(anscombe)
> attach(anscombe)
> anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8.04 9.14 7.46 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
3 13 13 13 8 7.58 8.74 12.74 7.71
4 9 9 9 8 8.81 8.77 7.11 8.84
5 11 11 11 8 8.33 9.26 7.81 8.47
6 14 14 14 8 9.96 8.10 8.84 7.04
7 6 6 6 8 7.24 6.13 6.08 5.25
8 4 4 4 19 4.26 3.10 5.39 12.50
9 12 12 12 8 10.84 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.42 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89
Each of the four pairs has essentially the same correlation coefficient, roughly 0.816. The first two are as follows:
> cor(x1, y1) #correlation of x1 and y1
[1] 0.8164205
> cor(x2, y2) #correlation of x2 and y2
[1] 0.8162365
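Rather than checking each pair by hand, we can compute all four correlations in one pass. The following is a quick sketch (not part of Anscombe's original example) that loops over the column suffixes; each value should come out to roughly 0.816:
> #compute the correlation for each of the four x-y pairs
> sapply(1:4, function(i)
+   cor(anscombe[, paste0("x", i)], anscombe[, paste0("y", i)]))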
The real insight here, as Anscombe intended, comes when we plot all four pairs together, as follows:
> par(mfrow = c(2, 2)) #create a 2x2 grid for plotting
> plot(x1, y1, main = "Plot 1")
> plot(x2, y2, main = "Plot 2")
> plot(x3, y3, main = "Plot 3")
> plot(x4, y4, main = "Plot 4")
The preceding code produces a 2 x 2 grid of scatterplots, one panel for each pair (Plot 1 through Plot 4).
As we can see, Plot 1 appears to have a true linear relationship, Plot 2 is curvilinear, Plot 3 has a dangerous outlier, and Plot 4 is driven by one outlier. There you have it, a cautionary tale about the dangers of solely relying on correlation.
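To push the point one step further, you can fit a simple linear regression to each pair. This is a sketch rather than part of the original exercise, but each fit should return an intercept near 3.0, a slope near 0.5, and an R-squared of roughly 0.67, even though the plots describe four very different relationships:
> #fit y ~ x for each pair and compare the fitted models
> fits <- lapply(1:4, function(i)
+   lm(anscombe[, paste0("y", i)] ~ anscombe[, paste0("x", i)]))
> sapply(fits, coef) #intercepts and slopes are nearly identical
> sapply(fits, function(f) summary(f)$r.squared) #so are the R-squared values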