上QQ阅读APP看书，第一时间看更新

Extracting features to predict outcomes

That available data needs to be accessible and meaningful in order for the algorithm to extract information.

Let's consider a simple example. Imagine that we want to predict the market price of a house in a given city. We can think of many variables that would be predictors of the price of a house: the number of rooms or bathrooms, the neighborhood, the surface, the heating system, and so on. These variables are called features, attributes, or predictors. The value that we want to predict is called the outcome or the target.

If we want our predictions to be reliable, we need several features. Predicting the price of a house based on its surface alone would not be very efficient. Many other factors influence the price of a house and our dataset should include as many as possible (with conditions).

It's often possible to add large numbers of attributes to a model to try to improve the predictions. For instance, in our housing pricing prediction, we could add all the characteristics of the house (bathroom, superficies, heating system, the number of windows). Some of these variables would bring more information to our pricing model and increase the accuracy of our predictions, while others would just add noise and confuse the algorithm. Adding new variables to a predicting model does not always improve the predictions.

In order to make reliable predictions, each of the new features you bring to your model must bring some valuable piece of information. However, this is also not always the case. As we will see in Chapter 2, Machine Learning Definitions and Concepts, correlated predictors can hurt the performances of the model.

Predictive analytics is built on several assumptions and conditions:

The value you are trying to predict is predictable and not just some random noise.
You have access to data that has some degree of association to the target.
The available dataset is large enough. Reliable predictions cannot be inferred from a dataset that is too small. (For instance, you can define and therefore predict a line with two points but you cannot infer data that follows a sine curve from only two points.)
The new data you will base future predictions on is similar to the one you parameterized and trained your model on.

You may have a great dataset, but that does not mean it will be efficient for predictions.

These conditions on the data are very general. In the case of SGD, the conditions are more constrained.