Go Machine Learning Projects

Ingestion and indexing

Perhaps the best way to index the data is to do it at the time of ingestion. We will use the encoding/csv package found in the Go standard library to ingest the data and build the index.

Before we dive into the code, let's look at the notion of an index, and how one might be built. While indexes are extremely commonly used in databases, they are applicable in any production system as well. The purpose of the index is to allow us to access data quickly.

We want to build an index that tells us, at any time, which rows contain a given value. In systems with much larger datasets, a more complicated index structure (such as a B-tree) might be used. For this dataset, however, a map-based index is more than sufficient.

This is what our index looks like: []map[string][]int. It's a slice of maps. The outer slice is indexed by column, meaning that if we want column 0, we simply get index[0], and get map[string][]int in return. The map tells us what values are in the column (the keys of the map), and which rows contain those values (the values of the map).
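To make the shape of the index concrete, here is a minimal sketch that builds one by hand for a tiny two-column table. The buildIndex helper and the three rows are made up for illustration and are not part of the original code; the sketch assumes every row has exactly the stated number of columns.

```go
package main

import "fmt"

// buildIndex is a hypothetical helper (not from the original text). It builds
// a []map[string][]int index over rows that all have exactly cols columns.
func buildIndex(rows [][]string, cols int) []map[string][]int {
	index := make([]map[string][]int, cols)
	for j := range index {
		index[j] = make(map[string][]int)
	}
	for i, row := range rows {
		for j, val := range row {
			// In column j, record that row i contains the value val.
			index[j][val] = append(index[j][val], i)
		}
	}
	return index
}

func main() {
	// Illustrative rows only: column 0 is a zoning code, column 1 a street type.
	rows := [][]string{
		{"RL", "Pave"},
		{"RM", "Pave"},
		{"RL", "Grvl"},
	}
	index := buildIndex(rows, 2)

	fmt.Println(index[0]["RL"])   // rows whose column 0 is "RL": [0 2]
	fmt.Println(index[1]["Pave"]) // rows whose column 1 is "Pave": [0 1]
	fmt.Println(len(index[0]))    // distinct values in column 0: 2
}
```

Looking up a value is a single map access, and the number of keys in each map is the number of distinct values in that column.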

Now, the question turns to: how do you know which variable is associated with which column? A more traditional answer would be to have something like map[string]int, where the key represents the variable name and the value represents the column number. While that is a valid strategy, I prefer to have a []string as the association between the column number and the column name. Searching it is O(N), but for the most part, if you have named variables, N is small. In future chapters, we shall see much larger Ns.
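As a rough sketch of what the O(N) search over a []string looks like, the following hypothetical helper (columnOf is not part of the original code) scans the header and returns the matching column number. The header values are illustrative.

```go
package main

import "fmt"

// columnOf is a hypothetical helper. It scans the header linearly and returns
// the position of name, or -1 if no column has that name.
func columnOf(header []string, name string) int {
	for i, h := range header {
		if h == name {
			return i
		}
	}
	return -1
}

func main() {
	// Illustrative header; in practice this would be the first row of the CSV.
	header := []string{"Id", "MSSubClass", "MSZoning"}

	fmt.Println(columnOf(header, "MSZoning"))     // 2
	fmt.Println(columnOf(header, "NoSuchColumn")) // -1
}
```

The slice also keeps the columns in file order, which a map[string]int would not give us for free when printing results column by column.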

So, we return the column names as a []string. When reading a CSV, this is simply the first row, as shown in the following code snippet:

// ingest is a function that ingests the file and outputs the header, data, and index.
// errors.Errorf is not in the standard library's errors package; it is assumed
// here to come from github.com/pkg/errors.
func ingest(f io.Reader) (header []string, data [][]string, indices []map[string][]int, err error) {
	r := csv.NewReader(f)

	// handle header
	if header, err = r.Read(); err != nil {
		return
	}

	indices = make([]map[string][]int, len(header))
	var rowCount, colCount int = 0, len(header)
	for {
		// Declare rec separately so that err refers to the named return value
		// instead of being shadowed inside the loop.
		var rec []string
		if rec, err = r.Read(); err != nil {
			break
		}
		if len(rec) != colCount {
			return nil, nil, nil, errors.Errorf("Expected Columns: %d. Got %d columns in row %d", colCount, len(rec), rowCount)
		}
		data = append(data, rec)
		for j, val := range rec {
			if indices[j] == nil {
				indices[j] = make(map[string][]int)
			}
			indices[j][val] = append(indices[j][val], rowCount)
		}
		rowCount++
	}
	// io.EOF marks the normal end of the input; any other error is returned to the caller.
	if err == io.EOF {
		err = nil
	}
	return
}
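When reading records with encoding/csv, it helps to separate the normal end of input (io.EOF) from a genuine read error. The following is a minimal sketch of that pattern using only the standard library. The parseCSV helper and the CSV text are invented for illustration; it takes a string rather than an io.Reader only to keep the example short.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// parseCSV is a hypothetical helper. It reads the header and then every
// record, treating io.EOF as the normal end of input and returning any other
// error to the caller.
func parseCSV(text string) ([]string, [][]string, error) {
	r := csv.NewReader(strings.NewReader(text))
	header, err := r.Read()
	if err != nil {
		return nil, nil, err
	}
	var records [][]string
	for {
		rec, err := r.Read()
		if err == io.EOF {
			return header, records, nil
		}
		if err != nil {
			return nil, nil, err
		}
		records = append(records, rec)
	}
}

func main() {
	header, records, err := parseCSV("Id,MSZoning\n1,RL\n2,RM\n")
	fmt.Println(header, len(records), err) // [Id MSZoning] 2 <nil>

	// The second data row has an extra field. With its default settings,
	// csv.Reader reports a row whose field count differs from the first row.
	_, _, err = parseCSV("Id,MSZoning\n1,RL\n2,RM,extra\n")
	fmt.Println(err != nil) // true
}
```

Note that with its default settings the reader itself rejects a row with the wrong number of fields, so a malformed row surfaces as an error from Read rather than as a short record.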

Reading this code snippet, a good programmer would have alarm bells going off in their head. Why is everything a string? The answer to that is quite simple: we'll convert the types later. All we need right now is some basic count-based statistics for exploratory data analysis.

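The key is in the indices returned by the function. For each column, the index maps every distinct value to the rows that contain it, so the number of keys in a column's map is that column's count of unique values. This is how to count them: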

// cardinality counts the number of unique values in a column.
// This assumes that the index i of indices represents a column.
func cardinality(indices []map[string][]int) []int {
	retVal := make([]int, len(indices))
	for i, m := range indices {
		retVal[i] = len(m)
	}
	return retVal
}

With this, we can analyze the cardinality of each individual column, that is, how many distinct values it contains. If a column has as many distinct values as there are rows, then we can be quite sure that the column is not categorical. Or, if we know that the column is categorical, and it has as many distinct values as there are rows, then we know for sure that the column cannot be used in a linear regression.
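The check described above can be sketched as a small helper that lists the columns whose cardinality equals the row count. The uniquePerRow function is hypothetical and not part of the original code; the three cardinalities and the row count are the ones this dataset reports for Id, MSZoning, and SalePrice.

```go
package main

import "fmt"

// uniquePerRow is a hypothetical helper. It returns the names of the columns
// whose number of distinct values equals the number of rows.
func uniquePerRow(header []string, card []int, rows int) []string {
	var names []string
	for i, c := range card {
		if c == rows {
			names = append(names, header[i])
		}
	}
	return names
}

func main() {
	header := []string{"Id", "MSZoning", "SalePrice"}
	card := []int{1460, 5, 663}

	fmt.Println(uniquePerRow(header, card, 1460)) // [Id]
}
```

Here only Id is flagged: every row has its own value, so the column carries no grouping information.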

Our main function now looks like this:

func main() {
	f, err := os.Open("train.csv")
	mHandleErr(err)
	defer f.Close()
	hdr, data, indices, err := ingest(f)
	mHandleErr(err)

	fmt.Printf("Original Data: \nRows: %d, Cols: %d\n========\n", len(data), len(hdr))
	// Declare c exactly once; a second := on the same name does not compile.
	c := cardinality(indices)
	for i, h := range hdr {
		fmt.Printf("%v: %v\n", h, c[i])
	}
	fmt.Println("")
}

For completeness, this is the definition of mHandleErr:

// mHandleErr is the error handler for the main function.
// If an error happens within the main function, it is not
// unexpected for a fatal error to be logged and for the program to immediately quit.
func mHandleErr(err error) {
	if err != nil {
		log.Fatal(err)
	}
}

A quick go run *.go indicates this result (which has been truncated):

$ go run *.go
Rows: 1460
========
Id: 1460
MSSubClass: 15
MSZoning: 5
LotFrontage: 111
LotArea: 1073
SaleCondition: 6
SalePrice: 663

Alone, this tells us a lot of interesting facts, chief amongst which is that there is a lot more categorical data than there is continuous data. Additionally, for some columns that are indeed continuous in nature, there are only a few discrete values available. One particular example is the LowQualFinSF column: it's a continuous variable, but there are only 24 unique values.

We'd like to calculate the CEF of the discrete covariates for further analysis. But before that can happen, we would need to clean up the data. While we're at it, we might also want to create a logical grouping of data structures.