Polynomial regression
In linear regression, the relationship between the independent and the dependent variable is represented by a straight line. Real-life datasets, however, are often more complex and do not follow a linear cause-and-effect relationship. In those cases, a straight-line equation does not fit the data points and therefore cannot produce an effective predictive model.
In such cases, we can consider using a higher-order polynomial for the predictor function. Given x as an independent variable and y as a dependent variable, the polynomial function takes forms such as the following, where the coefficients a, b, c, and d are learned from the data:

y = a + bx (first order: a straight line)
y = a + bx + cx² (second order: quadratic)
y = a + bx + cx² + dx³ (third order: cubic)
These can be visualized with a small set of sample data as follows:
Figure 3.11 Polynomial prediction function
Note that the straight line cannot accurately represent the relationship between x and y. As we model the prediction function with higher-order polynomials, R² improves, meaning the model fits the observed data more closely.
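This effect can be illustrated with a short sketch (not the book's code; the sample data here is made up) that fits polynomials of increasing degree with NumPy and compares their R² scores on the same data:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical noisy data with a curved (roughly quadratic) trend.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 + 0.5 * x + 0.8 * x**2 + rng.normal(0, 3, size=x.size)

scores = {}
for degree in (1, 2, 3):
    coeffs = np.polyfit(x, y, degree)   # least-squares polynomial fit
    y_pred = np.polyval(coeffs, x)      # evaluate the fitted polynomial
    scores[degree] = r_squared(y, y_pred)

for degree, score in scores.items():
    print(f"degree {degree}: R^2 = {score:.3f}")
```

Because each higher-degree model contains the lower-degree one as a special case, R² on the fitting data can only go up as the degree rises, which is exactly why R² alone cannot tell us when to stop.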
We might think it best to use the highest possible order for the prediction function in order to get the best-fitting model. However, that is not the case: a regression curve that passes through every training point typically fails to predict outcomes accurately for data outside the training sample (the test data). This problem is called overfitting. At the other extreme, we may encounter underfitting, where the model does not fit the training data well and therefore also performs poorly on the test data.