Classic datasets versus real-world datasets

Data scientists and machine-learning practitioners often use classic datasets to demonstrate the behavior of certain models. The Iris dataset, which contains 150 samples from three species of iris flowers, is one of the datasets most commonly used to demonstrate or teach predictive analytics. It has been around since 1936!

The Boston housing dataset and the Titanic dataset are other very popular datasets for predictive analytics. For text classification, the Reuters and the 20 newsgroups text datasets are very common, while image recognition datasets are used to benchmark deep learning models. These classic datasets are used to establish baselines when evaluating the performance of algorithms and models. Their characteristics are well known, and data scientists know what level of performance to expect.

These classic datasets can be downloaded from the following locations:

Iris: http://archive.ics.uci.edu/ml/datasets/Iris
Boston housing: https://archive.ics.uci.edu/ml/datasets/Housing
Titanic dataset: https://www.kaggle.com/c/titanic or http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/
Reuters: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection
20 newsgroups: http://scikit-learn.org/stable/datasets/twenty_newsgroups.html
Image recognition and deep learning: http://deeplearning.net/datasets/

However, classic datasets can be weak equivalents of real datasets, which have been extracted and aggregated from a diverse set of sources: databases, APIs, free-form documents, social networks, spreadsheets, and so on. In a real-life situation, the data scientist must often deal with messy data that has missing values, absurd outliers, human errors, weird formatting, strange inputs, and skewed distributions.
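To make these issues concrete, here is a minimal sketch in Python (standard library only) that audits a tiny raw extract for missing values, an absurd outlier, and inconsistent formatting. The data, the column names, the list of missing-value tokens, and the helper functions are all invented for illustration. The outlier rule is a median-based modified z-score with a cutoff of 3.5, which is one common convention among several and not a method prescribed by this book.

```python
import csv
import io
import statistics

# Hypothetical raw extract, invented for illustration only.
RAW = """age,income,city
34,52000,Boston
,48000, boston
29,N/A,Chicago
41,61000,CHICAGO
230,58000,Boston
38,55000,
"""

# Assumption: these tokens all stand for "no value" in this extract.
MISSING_TOKENS = {"", "n/a", "na", "null", "none"}


def is_missing(value):
    """Return True when a raw cell carries no usable value."""
    return value is None or value.strip().lower() in MISSING_TOKENS


def to_float(value):
    """Convert a raw cell to float, or None when it is missing or unparseable."""
    if is_missing(value):
        return None
    try:
        return float(value)
    except ValueError:
        return None


def find_outliers(values, threshold=3.5):
    """Flag values whose median-based modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread around the median, so nothing can be flagged
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


rows = list(csv.DictReader(io.StringIO(RAW)))

# 1. Missing values per column.
missing_counts = {col: sum(is_missing(row[col]) for row in rows) for col in rows[0]}

# 2. Outliers among the ages that could be parsed.
ages = [a for a in (to_float(row["age"]) for row in rows) if a is not None]
age_outliers = find_outliers(ages)

# 3. Inconsistent formatting: normalize whitespace and case before comparing.
raw_cities = {row["city"] for row in rows if not is_missing(row["city"])}
cities = {c.strip().title() for c in raw_cities}

print("missing values per column:", missing_counts)
print("suspicious ages:", age_outliers)
print("distinct city spellings:", len(raw_cities), "-> after normalization:", len(cities))
```

On this toy extract, every column has one missing cell, the age of 230 is flagged as suspicious, and four different spellings collapse into two cities. A median-based rule is used here instead of a mean-based one because a single extreme value inflates the mean and standard deviation enough to hide itself in a small sample.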

The first task in a predictive analytics project is to clean up the data. In the following section, we will look at the main issues with raw data and what strategies can be applied. Since we will ultimately be using a linear model for our predictions, we will process the data with that in mind.
