
Accepting non-linear patterns

A linear regression model implies that the outcome can be estimated by a linear combination of the predictors. This, of course, is not always the case, as features often exhibit nonlinear patterns.

Consider the following graph, where Y depends on X but the relationship displays an obvious quadratic pattern. Fitting a straight line (y = aX + b) as a prediction model of Y as a function of X does not work:

Some models and algorithms, such as tree-based models or support vector machines with non-linear kernels, handle non-linearities naturally. Linear regression and SGD do not.

Transformations: One way to deal with these non-linear patterns in the context of linear regression is to transform the predictors. In the preceding simple example, adding the square of the predictor X to the model would give a much better result. The model would now be of the form y = aX² + bX + c, which is still linear in its coefficients a, b, and c.

And as shown in the following diagram, the new quadratic model fits the data much better:
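To make the comparison concrete, here is a minimal sketch in plain Python, not tied to Amazon ML. It assumes synthetic data drawn from a hypothetical quadratic (y = 2x² - x + 1 plus Gaussian noise) and fits both a straight line and a model with the added squared predictor by ordinary least squares. The helper names (solve_linear_system, fit_least_squares, r_squared) are illustrative only and do not come from any library.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(rows, y, w):
    """Share of the variance of y explained by the fitted model."""
    pred = [sum(wi * ri for wi, ri in zip(w, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot


# Assumed synthetic data: a quadratic relationship with a little noise.
rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [2 * x * x - x + 1 + rng.gauss(0, 0.5) for x in xs]

linear_rows = [[1.0, x] for x in xs]            # intercept, X
quadratic_rows = [[1.0, x, x * x] for x in xs]  # intercept, X, X squared

w_linear = fit_least_squares(linear_rows, ys)
w_quadratic = fit_least_squares(quadratic_rows, ys)
r2_linear = r_squared(linear_rows, ys, w_linear)
r2_quadratic = r_squared(quadratic_rows, ys, w_quadratic)

print(f"R^2 with X only:      {r2_linear:.3f}")
print(f"R^2 with X and X^2:   {r2_quadratic:.3f}")
```

The fitting procedure is the same in both cases. Only the set of predictors changes, which is why a transformation of this kind can be used with any linear learner.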

We are not restricted to the quadratic case: a power function of higher order can be used to transform existing attributes and create new predictors. Other useful transformations include the logarithm, the exponential, sine and cosine, and so on. The Box-Cox transformation (http://onlinestatbook.com/2/transformations/box-cox.html) is worth citing at this point. It is an efficient data transformation that reduces the skewness and kurtosis of a variable's distribution, reshaping it into one closer to a Gaussian distribution.
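As a rough illustration of the Box-Cox idea, the sketch below applies the one-parameter transform, (x^λ - 1) / λ for λ ≠ 0 and log(x) for λ = 0, to an assumed right-skewed sample (log-normal draws). It picks λ from a small hypothetical grid by minimizing absolute skewness, which is a simplification: in practice λ is usually estimated by maximum likelihood, as SciPy's scipy.stats.boxcox does. The function names are illustrative, and the transform requires strictly positive data.

```python
import math
import random


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def skewness(values):
    """Sample skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumed example data: log-normal draws, which are strongly right-skewed.
rng = random.Random(0)
data = [rng.lognormvariate(0.0, 1.0) for _ in range(1000)]

# Simplified selection of lambda: the grid value with the least skewed result.
candidate_lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0]
best_lambda = min(
    candidate_lambdas,
    key=lambda lam: abs(skewness([box_cox(x, lam) for x in data])),
)
transformed = [box_cox(x, best_lambda) for x in data]

print(f"skewness before: {skewness(data):.2f}")
print(f"best lambda:     {best_lambda}")
print(f"skewness after:  {skewness(transformed):.2f}")
```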

Splines are an excellent and more powerful alternative to polynomial interpolation. Splines are piecewise polynomials that join smoothly. At their simplest, splines consist of line segments connected at different points. Splines are not available in Amazon ML.
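The simplest case, line segments joined at chosen points (knots), can be sketched with hinge features of the form max(0, x - knot). A model that is linear in these features is a continuous piecewise-linear function of x. The example below is only an illustration with hand-picked, hypothetical weights rather than fitted ones: with a single knot at 0 it reproduces |x| exactly, since |x| = -x + 2·max(0, x).

```python
def linear_spline_features(x, knots):
    """Return [1, x, max(0, x - k1), max(0, x - k2), ...] for one input."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]


def predict(x, knots, weights):
    """Evaluate a model that is linear in the spline features."""
    features = linear_spline_features(x, knots)
    return sum(w * f for w, f in zip(weights, features))


knots = [0.0]                # a single knot where the slope changes
weights = [0.0, -1.0, 2.0]   # intercept, slope, slope change at the knot

xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
predictions = [predict(x, knots, weights) for x in xs]
print(predictions)  # [2.0, 0.5, 0.0, 0.5, 2.0]
```

In a real setting the weights would be estimated by the regression itself, and the knots chosen from the data, for example at quantiles of the predictor.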

Quantile binning is the Amazon ML solution to non-linearities. By splitting the data into N bins, each holding roughly the same number of observations, you remove any non-linearities within each bin's interval, because every value in a bin is treated identically. Although binning has several drawbacks (http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous), the main one being that information is discarded in the process, it has been shown to generate excellent prediction performance on the Amazon ML platform.
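The idea behind quantile binning can be sketched in a few lines of standard-library Python. This is an illustration of the general technique under assumed data, not Amazon ML's implementation: cut points are taken at the quantiles of the variable, each value is replaced by a one-hot bin indicator, and a least-squares linear model on those indicators (without a separate intercept) then predicts the mean target of each bin, giving a step-function approximation of the curve. The function names are hypothetical.

```python
import bisect
import statistics


def quantile_bin_edges(values, n_bins):
    """Return the n_bins - 1 cut points that split values into quantile bins."""
    return statistics.quantiles(values, n=n_bins)


def assign_bin(x, edges):
    """Index of the bin that x falls into (0 .. len(edges))."""
    return bisect.bisect_right(edges, x)


def one_hot(index, n_bins):
    """Encode a bin index as a list of 0/1 indicator features."""
    return [1 if i == index else 0 for i in range(n_bins)]


# Assumed example: a quadratic target on an evenly spaced predictor.
n_bins = 4
values = [i / 10 for i in range(-30, 31)]
targets = [x * x for x in values]

edges = quantile_bin_edges(values, n_bins)
bins = [assign_bin(x, edges) for x in values]
counts = [bins.count(b) for b in range(n_bins)]

# Per-bin mean of the target: the value a least-squares linear model
# on the one-hot features would predict for every point in that bin.
bin_targets = [[] for _ in range(n_bins)]
for b, y in zip(bins, targets):
    bin_targets[b].append(y)
bin_means = [sum(group) / len(group) for group in bin_targets]

print("cut points:", edges)
print("bin sizes: ", counts)
print("bin means: ", [round(m, 2) for m in bin_means])
print("features for x = 2.0:", one_hot(assign_bin(2.0, edges), n_bins))
```

The outer bins get a high predicted value and the inner bins a low one, so the non-linear shape is captured, at the cost of ignoring all variation inside each bin.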