Linear regression
Now that that's all done, let's do some linear regression! But first, let's clean up our code. We'll move our exploratory work so far into a function called exploratory(). Then we'll reread the file, split the dataset into training and testing sets, and perform all the transformations before finally running the regression. For the regression itself, we will use the github.com/sajari/regression package.
The first part looks like this:
func main() {
	// exploratory() // commented out because we're done with exploratory work.
	f, err := os.Open("train.csv")
	mHandleErr(err)
	defer f.Close()

	hdr, data, indices, err := ingest(f)
	mHandleErr(err)
	rows, cols, XsBack, YsBack, newHdr, newHints := clean(hdr, data, indices, datahints, ignored)
	Xs := tensor.New(tensor.WithShape(rows, cols), tensor.WithBacking(XsBack))
	it, err := native.MatrixF64(Xs)
	mHandleErr(err)

	// transform the Ys
	for i := range YsBack {
		YsBack[i] = math.Log1p(YsBack[i])
	}
	// transform the Xs
	transform(it, newHdr, newHints)

	// partition the data
	shuffle(it, YsBack)
	testingRows := int(float64(rows) * 0.2)
	trainingRows := rows - testingRows
	testingSet := it[trainingRows:]
	testingYs := YsBack[trainingRows:]
	it = it[:trainingRows]
	YsBack = YsBack[:trainingRows]
	log.Printf("len(it): %d || %d", len(it), len(YsBack))
	...
We first ingest and clean the data, then create an iterator over the matrix of Xs for easier access. We then transform both the Xs and the Ys. Finally, we shuffle the rows (keeping each row of Xs paired with its Y) and partition them into a training dataset and a testing dataset.
Recall the discussion in the first chapter on knowing whether a model is good: a good model must be able to generalize to previously unseen combinations of values. To guard against overfitting, we must cross-validate our model.
To achieve that, we train on only a subset of the data, then use the model to predict on the test set. Scoring those predictions tells us how well the model performs on data it has never seen.
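The actual scoring comes later in the chapter. As a preview, here is a minimal sketch of what such a scoring function could look like; score is a hypothetical helper (it is not part of the chapter's code) that computes the root mean squared error (RMSE) of a fitted sajari/regression model on the held-out rows, assuming math and github.com/sajari/regression are imported:

// score is a hypothetical helper: it computes the root mean squared
// error (RMSE) of the model's predictions on the held-out testing set.
// Lower is better.
func score(r *regression.Regression, testingSet [][]float64, testingYs []float64) (float64, error) {
	var sse float64
	for i, row := range testingSet {
		pred, err := r.Predict(row) // Predict is part of the sajari/regression API
		if err != nil {
			return 0, err
		}
		d := pred - testingYs[i]
		sse += d * d
	}
	return math.Sqrt(sse / float64(len(testingYs))), nil
}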
Ideally, the split should happen before the data is parsed into the Xs and Ys. But we'd like to reuse the functions we wrote earlier, so we shan't do that. The separation of ingest and clean into distinct functions, however, allows you to do so, and if you visit the repository on GitHub, you will find that all the functions needed for such a split are already there.
For now, we simply take out 20% of the dataset and set it aside as the testing set. The shuffle permutes the rows so that we don't train on the same 80% every time.
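The shuffle function itself isn't shown in the snippet above. One possible implementation, sketched here under the assumption that a Fisher-Yates shuffle from math/rand suffices, permutes the rows of the iterator and the Ys in tandem so that each row stays paired with its label:

// shuffle permutes the rows and their labels in tandem. Note that this
// swaps the row slices of the native iterator rather than moving data
// in the underlying tensor, which is fine here because all subsequent
// access goes through the iterator and YsBack.
func shuffle(it [][]float64, ys []float64) {
	rand.Shuffle(len(it), func(i, j int) {
		it[i], it[j] = it[j], it[i]
		ys[i], ys[j] = ys[j], ys[i]
	})
}

The version in the book's repository may differ in detail, but the in-tandem swap is the important part: shuffling the Xs and Ys independently would destroy the pairing between rows and labels.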
Also, note that the clean function now takes ignored, whereas in the exploratory mode it took nil. This, along with the shuffle, is important for the cross-validation later on.
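Finally, to make the last step concrete, here is a sketch of how the training rows might be fed into github.com/sajari/regression. This is an illustration of the package's API rather than the chapter's exact code; runRegression is a hypothetical helper, while SetObserved, SetVar, DataPoint, Train, and Run are the package's actual functions:

// runRegression is a hypothetical helper: it feeds each training row
// into an ordinary least squares model from github.com/sajari/regression
// and fits it.
func runRegression(it [][]float64, ys []float64, hdr []string) (*regression.Regression, error) {
	r := new(regression.Regression)
	r.SetObserved("log1p of Y") // we regress against the transformed Ys
	for i, name := range hdr {
		r.SetVar(i, name) // name the i-th independent variable
	}
	for i, row := range it {
		r.Train(regression.DataPoint(ys[i], row))
	}
	if err := r.Run(); err != nil {
		return nil, err
	}
	return r, nil
}

After Run returns, the fitted model exposes its formula and R² via r.Formula and r.R2, and new rows can be scored with r.Predict.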