Effective Amazon Machine Learning

The predictive analytics workflow

We have been talking about training the model. What does that mean in practice?

In supervised learning, the dataset is usually split into three unequal parts: training, validation, and test:

  • The training set is the data on which you train your model. It has to be large enough to give the model as much information about the data as possible. This subset is used by the algorithm to estimate the best parameters of the model. In our case, the SGD algorithm will use the training subset to find the optimal weights of the linear regression model.
  • The validation set is used to assess the performance of a trained model. By measuring the performance of the trained model on a subset that was not used for training, we obtain an objective assessment of its performance. This allows us to train different models with different meta parameters and see which one performs best on the validation set, a process also called model selection. Note that this creates a feedback loop: the validation dataset now influences your model selection. Another model may have performed worse on that particular validation subset and yet generalized better to new data.
  • The test set corresponds to data that is set aside until you have fully optimized your features and model. The test subset is also called the held-out dataset (a sketch of the three-way split follows this list).
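
The split itself takes only a few lines of code. The following is a minimal sketch in Python, not an Amazon ML feature: the DataFrame name df, the 60/20/20 proportions, and the helper train_validation_test_split are illustrative assumptions.

    import numpy as np
    import pandas as pd

    def train_validation_test_split(df, train_frac=0.6, val_frac=0.2, seed=42):
        """Shuffle the rows, then slice them into train/validation/test subsets."""
        shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
        n = len(shuffled)
        train_end = int(train_frac * n)
        val_end = int((train_frac + val_frac) * n)
        train = shuffled.iloc[:train_end]
        validation = shuffled.iloc[train_end:val_end]
        test = shuffled.iloc[val_end:]   # held-out set: only used once, at the very end
        return train, validation, test

    # Usage on a toy dataset
    df = pd.DataFrame({"x": range(100), "y": np.random.randn(100)})
    train, validation, test = train_validation_test_split(df)
    print(len(train), len(validation), len(test))   # 60 20 20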

In real life, your model will face previously unseen data, since the ultimate raison d'être of a model is to make predictions on new data. Therefore, it is important to assess the performance of the model on data it has never encountered before. The held-out dataset is a proxy for this yet unseen data. It is paramount to leave this dataset aside until the end: it should never be used to optimize the model or the data attributes.

These three subsets should be large enough to represent the real data accurately. More precisely, the distribution of each variable should be similar across the three subsets. If the original dataset is ordered in some way, make sure the data is shuffled prior to the training/validation/test split.

As mentioned previously, the model you choose based on its performance on the validation set may be positively biased toward that particular dataset. To minimize such a dependency, it is common to train and evaluate several models with the same parameter settings and to average their performance over several training/validation dataset pairs. This reduces the dependence of the model selection on the specific distribution of variables in the validation dataset.
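
As a minimal sketch of this repeated training/validation evaluation, assuming scikit-learn's SGDRegressor as a stand-in for a linear model trained with SGD, RMSE as the performance metric, and NumPy arrays X and y (none of this is an Amazon ML API):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    def repeated_holdout_rmse(X, y, n_repeats=5):
        """Average the validation RMSE over several random training/validation splits."""
        scores = []
        for seed in range(n_repeats):
            X_train, X_val, y_train, y_val = train_test_split(
                X, y, test_size=0.2, random_state=seed)
            model = SGDRegressor(random_state=seed).fit(X_train, y_train)
            rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
            scores.append(rmse)
        return np.mean(scores), np.std(scores)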

This three-way split method is basic, and as we have seen, the model could end up depending on some specificities of the validation subset. Cross-validation is a standard method to reduce that dependency and improve our model selection. Cross-validation consists in carrying out several training/validation splits and averaging the model's performance over the different validation subsets. The most frequent technique is k-fold cross-validation, which consists in splitting the dataset into k chunks and successively using each chunk as the validation set and the other k-1 chunks as the training set. Other techniques include Monte Carlo cross-validation, where the different training and validation sets are randomly sampled from the initial dataset. We will implement Monte Carlo cross-validation in a later chapter. Cross-validation is not a feature included in the Amazon ML service and needs to be implemented programmatically. In Amazon ML, the training and evaluation of a model is done on a single training/validation split.
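
A minimal k-fold cross-validation sketch along these lines (k = 5 here; X and y are assumed to be NumPy arrays, and the scikit-learn KFold splitter and SGDRegressor are used purely for illustration, since Amazon ML exposes no such feature):

    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def kfold_rmse(X, y, k=5):
        """Use each of the k chunks in turn as validation, the other k-1 as training."""
        fold_scores = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=42).split(X):
            model = SGDRegressor(random_state=42).fit(X[train_idx], y[train_idx])
            rmse = np.sqrt(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
            fold_scores.append(rmse)
        return np.mean(fold_scores)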