Multicollinearity
As mentioned in the opening paragraphs of this section, the number of variables is a little high for comfort. When the number of variables is high, the chance of multicollinearity increases. Multicollinearity occurs when two or more of the independent variables are correlated with each other.
From a cursory glance at the data, we can tell that this is in fact the case. For instance, GarageArea is correlated with GarageCars. In real life, this makes sense: a garage that can take two cars would logically be larger in area than a garage that can only store one car. Likewise, zoning is highly correlated with the neighborhood.
A good way to think about the variables is in terms of information included in the variables. Sometimes, the variables have information that overlaps. For example, when GarageArea is 0, that overlaps with the GarageType of NA—after all, if you have no garage, the area of your garage is zero.
The difficult part is going through the list of variables, and deciding which to keep. It's something of an art that has help from algorithms. In fact, the first thing we're going to do is to find out how correlated a variable is with another variable. We do this by calculating the correlation matrix, then plotting out a heatmap.
To calculate the correlation matrix, we use the stat.CorrelationMatrix function in Gonum, as in this snippet:
m64, err := tensor.ToMat64(Xs, tensor.UseUnsafe())
mHandleErr(err)
corr := stat.CorrelationMatrix(nil, m64, nil)
hm, err := plotHeatMap(corr, newHdr)
mHandleErr(err)
mHandleErr(hm.Save(60*vg.Centimeter, 60*vg.Centimeter, "heatmap.png")) // Save also returns an error
Let's go through this line by line:
m64, err := tensor.ToMat64(Xs, tensor.UseUnsafe()) performs the conversion from *tensor.Dense to *mat.Dense. Because we don't want to allocate an additional chunk of memory, and we've determined that it's safe to reuse the data in the matrix, we pass in the tensor.UseUnsafe() function option, which tells Gorgonia to let the Gonum matrix reuse the tensor's underlying memory.
stat.CorrelationMatrix(nil, m64, nil) calculates the correlation matrix. The result is a symmetric matrix, a particularly useful data structure that the Gonum package provides. It is a good fit for this use case because a correlation matrix is mirrored along the diagonal: the correlation of A with B is the same as the correlation of B with A.
Next, we plot the heatmap using the following snippet of code:
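type heatmap struct {
	x mat.Matrix
}
// heatmap presents the matrix as a grid for the plotter: columns map to x, rows to y.
func (m heatmap) Dims() (c, r int) { r, c = m.x.Dims(); return c, r }
func (m heatmap) Z(c, r int) float64 { return m.x.At(r, c) }
func (m heatmap) X(c int) float64 { return float64(c) }
func (m heatmap) Y(r int) float64 { return float64(r) }
// ticks labels each integer position on an axis with the matching variable name.
type ticks []string
func (t ticks) Ticks(min, max float64) []plot.Tick {
	var retVal []plot.Tick
	for i := math.Trunc(min); i <= max; i++ {
		// skip positions that have no label, to avoid indexing out of range
		if i < 0 || int(i) >= len(t) {
			continue
		}
		retVal = append(retVal, plot.Tick{Value: i, Label: t[int(i)]})
	}
	return retVal
}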
func plotHeatMap(corr mat.Matrix, labels []string) (p *plot.Plot, err error) {
	pal := palette.Heat(48, 1)
	m := heatmap{corr}
	hm := plotter.NewHeatMap(m, pal)
	if p, err = plot.New(); err != nil {
		return
	}
	hm.NaN = color.RGBA{0, 0, 0, 0} // transparent black: the alpha channel is 0
	// add and adjust the prettiness of the chart
	p.Add(hm)
	p.X.Tick.Label.Rotation = 1.5
	p.Y.Tick.Label.Font.Size = 6
	p.X.Tick.Label.Font.Size = 6
	p.X.Tick.Label.XAlign = draw.XRight
	p.X.Tick.Marker = ticks(labels)
	p.Y.Tick.Marker = ticks(labels)
	// add legend
	l, err := plot.NewLegend()
	if err != nil {
		return p, err
	}
	thumbs := plotter.PaletteThumbnailers(pal)
	for i := len(thumbs) - 1; i >= 0; i-- {
		t := thumbs[i]
		// only the first and last thumbnails get a numeric label
		if i != 0 && i != len(thumbs)-1 {
			l.Add("", t)
			continue
		}
		var val float64
		switch i {
		case 0:
			val = hm.Min
		case len(thumbs) - 1:
			val = hm.Max
		}
		l.Add(fmt.Sprintf("%.2g", val), t)
	}
	// this is a hack. I place the legends between the axis and the actual heatmap
	// because if the legend is on the right, we'd need to create a custom canvas to take
	// into account the additional width of the legend.
	//
	// So instead, we shrink the legend width to fit snugly within the margins of the plot and the axes.
	l.Left = true
	l.XOffs = -5
	l.ThumbnailWidth = 5
	l.Font.Size = 5
	p.Legend = l
	return
}
The plotter.NewHeatMap function expects an interface, which is why I wrapped mat.Matrix in the heatmap data structure, which provides the interface the plotter needs to draw a heatmap. This pattern will become more and more common in the coming chapters: wrapping a data structure just to provide an additional interface to other functions. Such wrappers are cheap and readily available and should be used to the fullest extent.
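The same wrapping pattern can be sketched without any plotting dependency. In the sketch below, grid is a local stand-in with the same four methods that the heatmap type above implements (it is declared here only for illustration), while rowsGrid and maxZ are hypothetical names. The point is that maxZ never needs to know how the data is stored.

```go
package main

import "fmt"

// grid is a local stand-in with the same four methods as the heatmap type.
// It is declared here only to keep the sketch self-contained.
type grid interface {
	Dims() (c, r int)
	Z(c, r int) float64
	X(c int) float64
	Y(r int) float64
}

// rowsGrid wraps a rectangular, row-major [][]float64 so that it satisfies
// grid. The wrapper copies nothing; it only adds methods.
type rowsGrid struct {
	data [][]float64
}

func (m rowsGrid) Dims() (c, r int)   { return len(m.data[0]), len(m.data) }
func (m rowsGrid) Z(c, r int) float64 { return m.data[r][c] }
func (m rowsGrid) X(c int) float64    { return float64(c) }
func (m rowsGrid) Y(r int) float64    { return float64(r) }

// maxZ accepts any grid, so it works on the wrapper without knowing
// anything about the underlying slice.
func maxZ(g grid) float64 {
	c, r := g.Dims()
	best := g.Z(0, 0)
	for i := 0; i < c; i++ {
		for j := 0; j < r; j++ {
			if z := g.Z(i, j); z > best {
				best = z
			}
		}
	}
	return best
}

func main() {
	g := rowsGrid{data: [][]float64{
		{1.0, 0.2, -0.4},
		{0.2, 1.0, 0.7},
	}}
	fmt.Println(maxZ(g))
}
```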
A large portion of this code involves a hack for the legend. The way Gonum plots work is that when the canvas size is calculated, the legend is considered to be inside the plot. To draw the legend outside the plot, a lot of extra code would have to be written. So, instead, I shrank the legend to fit into the gutter between the axis and the plot itself, so that it does not overlay important areas of the plot.
Of particular note in this heatmap are the white streaks. We expect a variable to correlate with itself completely, which is why the diagonal is white. But notice that there are also white lines running roughly parallel to the diagonal. These are total correlations between different variables. We will need to remove them.
Heatmaps are nice to look at, but they are hard to read precisely: the human eye isn't great at telling hues apart. So we are also going to report the numbers. The correlation between two variables lies between -1 and 1, and we're particularly interested in correlations that are close to either end.
This snippet prints the results:
// heatmaps are nice to look at, but are quite ridiculous.
var tba []struct {
	h1, h2 string
	corr   float64
}
for i, h1 := range newHdr {
	for j, h2 := range newHdr {
		if c := corr.At(i, j); math.Abs(c) >= 0.5 && h1 != h2 {
			tba = append(tba, struct {
				h1, h2 string
				corr   float64
			}{h1: h1, h2: h2, corr: c})
		}
	}
}
fmt.Println("High Correlations:")
for _, a := range tba {
	fmt.Printf("\t%v-%v: %v\n", a.h1, a.h2, a.corr)
}
Here I use an anonymous struct instead of a named struct because we're not going to reuse the data; it exists solely for printing. Go has no tuple type, so an anonymous struct stands in for a throwaway tuple. In most cases this is not the best practice.
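Because the correlation matrix is mirrored along the diagonal, the double loop above reports every pair twice (once as A-B and once as B-A). One possible refinement, sketched here on a plain [][]float64 with made-up values, is to visit only the cells above the diagonal and sort the result by absolute correlation. The names pair and highCorrelations are hypothetical, and a named type is used because the slice is now passed around.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// pair records the correlation between two named variables.
type pair struct {
	h1, h2 string
	corr   float64
}

// highCorrelations returns each pair of distinct variables whose absolute
// correlation is at least threshold, strongest first. Only cells above the
// diagonal are visited, so no pair is reported twice.
func highCorrelations(corr [][]float64, labels []string, threshold float64) []pair {
	var out []pair
	for i := range labels {
		for j := i + 1; j < len(labels); j++ {
			if c := corr[i][j]; math.Abs(c) >= threshold {
				out = append(out, pair{h1: labels[i], h2: labels[j], corr: c})
			}
		}
	}
	sort.Slice(out, func(a, b int) bool {
		return math.Abs(out[a].corr) > math.Abs(out[b].corr)
	})
	return out
}

func main() {
	// Made-up correlations for illustration only.
	labels := []string{"GarageArea", "GarageCars", "LotArea"}
	corr := [][]float64{
		{1.0, 0.88, 0.18},
		{0.88, 1.0, 0.15},
		{0.18, 0.15, 1.0},
	}
	fmt.Println("High Correlations:")
	for _, p := range highCorrelations(corr, labels, 0.5) {
		fmt.Printf("\t%v-%v: %v\n", p.h1, p.h2, p.corr)
	}
}
```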
This correlation plot shows only the correlations among the independent variables. To truly understand multicollinearity, we would have to find the correlation of each variable with every other variable and with the dependent variable. This will be left as an exercise for the reader.
Ultimately, multicollinearity can only be detected after running a regression. The correlation plot is simply a shorthand way of guiding the inclusion and exclusion of variables. The actual process of removing multicollinearity is an iterative one, often with other statistics such as the variance inflation factor to lend a hand in deciding what to include and what not to include.
For the purposes of this chapter, I've identified multiple variables to be included; the majority of variables are excluded. The list can be found in the const.go file. The commented-out lines in the ignored list are the variables that were included in the final model.
As mentioned earlier in this section, it's really a bit of an art, aided by algorithms.