Imputation of missing data

When dealing with imperfect or incomplete datasets, a missing entry adds no value to the model by itself, but the other elements of the same row may still be useful. This is especially true when the dataset has a high percentage of incomplete values, so that rows cannot simply be discarded.

The main question in this process is "how do you interpret a missing value?" There are many ways, and they usually depend on the problem itself.

A very naive approach is to set the value to zero, which implicitly assumes that the mean of the data distribution is 0. A better approach relates the missing value to the surrounding data, assigning it the average of the whole column or of an interval of n neighboring elements in the same column. Another option is to use the column's median or most frequent value.
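The strategies just described can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `impute` and `impute_local_mean`, the use of `None` to mark a missing value, and the sample `ages` column are all assumptions made for this example, not part of any particular library.

```python
from statistics import mean, median, mode


def impute(column, strategy="mean"):
    """Return a copy of `column` with None entries replaced by one fill value.

    `column` is a list of numbers in which None marks a missing value
    (an assumption of this sketch).
    """
    observed = [x for x in column if x is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "most_frequent":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if x is None else x for x in column]


def impute_local_mean(column, n=2):
    """Replace each None with the mean of the observed values that lie
    within n positions on either side of it in the same column."""
    result = []
    for i, x in enumerate(column):
        if x is not None:
            result.append(x)
            continue
        window = column[max(0, i - n): i + n + 1]
        neighbors = [v for v in window if v is not None]
        # If the whole window is missing, leave the gap unfilled.
        result.append(mean(neighbors) if neighbors else None)
    return result


# Hypothetical column with two missing entries.
ages = [22, None, 35, 35, None, 58]

print(impute(ages, "zero"))
print(impute(ages, "mean"))
print(impute(ages, "median"))
print(impute(ages, "most_frequent"))
print(impute_local_mean(ages, n=1))
```

Note how the whole-column strategies fill every gap with the same value, while the local mean gives each gap a value that depends on its neighbors. Which behavior is preferable depends on the problem, for example on whether the order of the rows carries any meaning.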

Additionally, there are more advanced techniques, such as robust methods and even k-nearest neighbors, that we won't cover in this book.