Effective Amazon Machine Learning

Classic datasets versus real-world datasets

Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, composed of 150 samples of three types of iris flowers, is one of the most commonly used to demonstrate or to teach predictive analytics. It has been around since 1936!

The Boston housing dataset and the Titanic dataset are other very popular datasets for predictive analytics. For text classification, the Reuters and the 20 newsgroups text datasets are very common, while image recognition datasets are used to benchmark deep learning models. These classic datasets are used to establish baselines when evaluating the performance of algorithms and models. Their characteristics are well known, and data scientists know what performance to expect.

These classic datasets are publicly available for download.

However, classic datasets can be weak equivalents of real datasets, which have been extracted and aggregated from a diverse set of sources: databases, APIs, free-form documents, social networks, spreadsheets, and so on. In a real-life situation, the data scientist must often deal with messy data that has missing values, absurd outliers, human errors, weird formatting, strange inputs, and skewed distributions.
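To make two of these issues concrete, here is a minimal sketch in plain Python that audits a small table for missing values and implausible outliers. The records, the column names, and the helper functions `missing_counts` and `mad_outliers` are all hypothetical and invented for illustration. The outlier check uses the median absolute deviation (MAD) with a cutoff of 3.5, which is one common rule of thumb among several, not a method prescribed by this book.

```python
import statistics

# Hypothetical raw records: None marks a missing value,
# and 999 is an implausible age left by a data-entry error.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 48000.0},
    {"age": 29, "income": None},
    {"age": 999, "income": 51000.0},
    {"age": 41, "income": 250000.0},
    {"age": 38, "income": 47000.0},
]


def missing_counts(rows):
    """Count the missing (None) entries in each column."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |v - median| / MAD. Unlike the
    mean and standard deviation, the median and MAD are barely
    affected by the outliers we are trying to detect.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against; nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


ages = [r["age"] for r in rows if r["age"] is not None]
print(missing_counts(rows))  # {'age': 1, 'income': 1}
print(mad_outliers(ages))    # [999]
```

An audit like this only flags suspicious entries. Deciding whether to drop, correct, or impute them remains a judgment call that depends on the dataset and on the model that will consume it.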

The first task in a predictive analytics project is to clean up the data. In the following section, we will look at the main issues with raw data and what strategies can be applied. Since we will ultimately be using a linear model for our predictions, we will process the data with that in mind.