The regression
And so, now we're ready to build the regression model. Bear in mind that this section is highly iterative in real life. I will describe the iterations, but will only share the model that I chose to settle on.
The github.com/sajari/regression package does an admirable job. But we want to extend the package a little to be able to compare models and the coefficients of the parameters. So I wrote this function:
func runRegression(Xs [][]float64, Ys []float64, hdr []string) (r *regression.Regression, stdErr []float64) {
r = new(regression.Regression)
dp := make(regression.DataPoints, 0, len(Xs))
for i, h := range hdr {
r.SetVar(i, h)
}
for i, row := range Xs {
if i < 3 {
log.Printf("Y %v Row %v", Ys[i], row)
}
dp = append(dp, regression.DataPoint(Ys[i], row))
}
r.Train(dp...)
r.Run()
// calculate StdErr
var sseY float64
sseX := make([]float64, len(hdr)+1)
meanX := make([]float64, len(hdr)+1)
for i, row := range Xs {
pred, _ := r.Predict(row)
sseY += (Ys[i] - pred) * (Ys[i] - pred)
for j, c := range row {
meanX[j+1] += c
}
}
sseY /= float64(len(Xs) - len(hdr) - 1) // n - df ; df = len(hdr) + 1
vecf64.ScaleInv(meanX, float64(len(Xs)))
sseX[0] = 1
for _, row := range Xs {
for j, c := range row {
sseX[j+1] += (c - meanX[j+1]) * (c - meanX[j+1])
}
}
sseY = math.Sqrt(sseY)
vecf64.Sqrt(sseX)
vecf64.ScaleInvR(sseX, sseY)
return r, sseX
}
runRegression will perform the regression analysis, and print the outputs of the standard errors of the coefficients. It is an estimate of the standard deviation of the coefficients—imagine this model being run many many times: each time the coefficients might be slightly different. The standard error simply reports amount of variation in the coefficients.
The standard errors are calculated with the help of the gorgonia.org/vecf64 package, which performs in-place operations for vectors. Optionally, you may choose to write them as loops.
This function also introduces us to the API for the github.com/sajari/regression package—to predict, simply use r.Predict(vars). This will be useful in cases where one would like to use this model for production.
For now, let us focus on the other half of the main function:
// do the regessions
r, stdErr := runRegression(it, YsBack, newHdr)
tdist := distuv.StudentsT{Mu: 0, Sigma: 1, Nu: float64(len(it) - len(newHdr) - 1), Src: rand.New(rand.NewSource(uint64(time.Now().UnixNano())))}
fmt.Printf("R^2: %1.3f\n", r.R2)
fmt.Printf("\tVariable \tCoefficient \tStdErr \tt-stat\tp-value\n")
fmt.Printf("\tIntercept: \t%1.5f \t%1.5f \t%1.5f \t%1.5f\n", r.Coeff(0), stdErr[0], r.Coeff(0)/stdErr[0], tdist.Prob(r.Coeff(0)/stdErr[0]))
for i, h := range newHdr {
b := r.Coeff(i + 1)
e := stdErr[i+1]
t := b / e
p := tdist.Prob(t)
fmt.Printf("\t%v: \t%1.5f \t%1.5f \t%1.5f \t%1.5f\n", h, b, e, t, p)
}
...
Here, we run the regression, and then we print the results. We don't just want to output the regression coefficients. We also want to output the standard errors, the t-statistic, and the P-value. This would give us some confidence over the estimated coefficients.
tdist := distuv.StudentsT{Mu: 0, Sigma: 1, Nu: float64(len(it) - len(newHdr) - 1), Src: rand.New(rand.NewSource(uint64(time.Now().UnixNano())))} creates a Student's t-distribution, which we will compare against our data. The t-statistic is very simply calculated by dividing the coefficient by the standard error.