Hands-On Predictive Analytics with Python

Diamond prices – data collection and preparation

Great! The project, together with your proposed solution, has been approved and now it is time for the second phase in the predictive analytics process: data collection and preparation. Finally, it's time for us to get our hands dirty!

As mentioned in the first chapter, the data collection process is entirely dependent on the project. Sometimes you will need to get the data yourself using extract, transform, load (ETL) technologies; sometimes you will need access to an internal database; or you may get external data via services such as Bloomberg or Quandl, from public APIs, and so on. The point is that this process is unique to each predictive analytics project, so there is not much we can say about it in general. In the rest of the chapter, we will introduce some of the recurring topics you will find in this phase: missing values, outliers, feature transformations, and so on.

Now, back to our example, consider the following:

  1. We already have a dataset provided to us, so the data has been collected, but now we need to prepare it.
  2. As we stated in Chapter 1, The Predictive Analytics Process, the goal of this stage is to get a dataset that is ready for analysis.
  3. Fortunately for us, the dataset is already cleaned and almost ready for analysis, unlike most projects in the real world, where a good portion of your time will be spent cleaning and preparing the dataset.
  4. In our case (intentionally), very little data preparation needs to be done for this project; as with the data collection process, data cleaning is very much unique to each project.
Data cleaning often takes a lot of time and effort. There is no standard way to proceed, since this process is unique to every dataset. It includes identifying corrupt, incomplete, useless, or incorrect data and replacing or removing such pieces of data from the dataset. A programming language such as Python is almost always used for this process because of its many libraries and its strong support for regular expressions.
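
Data cleaning is highly dataset-specific, but to illustrate the kind of task where pandas and regular expressions shine, here is a minimal, hypothetical sketch; the messy price strings below are invented for illustration and are not part of our dataset:

import pandas as pd

# hypothetical messy values: inconsistent currency symbols and separators
raw = pd.Series(['$1,200', ' 950 USD', '18,823$', 'n/a'])

# strip everything except digits with a regular expression,
# then convert to numbers; entries that cannot be parsed become NaN
cleaned = pd.to_numeric(raw.str.replace(r'[^0-9]', '', regex=True), errors='coerce')
print(cleaned)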
  5. Most of the time, after cleaning the data, you will arrive at a dataset that looks like the one we have. The following code loads the dataset:
# loading important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

# Loading the data
DATA_DIR = '../data'
FILE_NAME = 'diamonds.csv'
data_path = os.path.join(DATA_DIR, FILE_NAME)
diamonds = pd.read_csv(data_path)
diamonds.shape
  6. After running the preceding code, we find that our dataset has 53,940 rows and 10 columns:
(53940, 10)
  7. Now, it is time for us to check whether the dataset is ready for analysis; let's begin by checking the summary statistics of the numerical variables of the dataset:
diamonds.describe()
  8. This is what we get:

[Output: summary statistics table from diamonds.describe(), showing the count, mean, std, min, 25%, 50%, 75%, and max of each numerical column]
This output is very convenient for quickly checking for strange values in the numerical variables. For instance, given the definitions of all of these variables, we would not expect to find negative values, and indeed, based on the min row, all values are non-negative, which is good.
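
We can confirm this directly from the DataFrame; here is a quick check, which should print True for our dataset:

# every numerical column should have a non-negative minimum
print((diamonds.select_dtypes(include='number').min() >= 0).all())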

Let's begin our analysis with the carat column. The maximum value of 5.01 seems a little too high; why would 5.01 be considered high? Well, considering that the 75th percentile is close to 1.0 and the standard deviation is 0.47, the maximum value is more than eight standard deviations above the 75th percentile, which is definitely a big difference.
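We can compute this distance directly from the data, rather than reading it off the table; a quick sketch:

# how many standard deviations is the max carat above the 75th percentile?
carat = diamonds['carat']
print((carat.max() - carat.quantile(0.75)) / carat.std())

This should print a value slightly above eight, confirming the back-of-the-envelope calculation.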

This diamond with a carat of 5.01 is a candidate outlier: a value so distant from the typical range of variability that it may indicate an error in the measurement or recording of the data.

Even if an outlier is a legitimate value, it may be so rare that excluding it from the analysis is appropriate, since we are almost always interested in the generality of what we are analyzing. For example, in a study of the income of the general population of the USA, would you include Jeff Bezos in your sample? Probably not. We won't do anything about the rare heavy diamond at this moment; let's just make a mental note of it and continue with the remaining columns:

  • Let's continue with the next columns, depth and table; since, by definition, these two quantities are percentages, all values should be between 0 and 100, which is the case, so everything looks OK with those columns (see the quick check after this list).
  • Now, let's take a look at the descriptive statistics for the price column; remember, this one is our target.
  • The cheapest diamond we observe has a price of USD 326, the mean price is almost USD 4,000, and the most expensive diamond has a price of USD 18,823; could this price be an outlier?
  • Let's quickly evaluate how far, in terms of standard deviations, this price is from the 75th percentile: (18,823 - 5,324.25) / 3,989.4 ≈ 3.38 standard deviations.
  • So, although it is indeed very expensive, given the high variability observed in the prices (a standard deviation of 3,989.4), I would not consider the maximum price an outlier.
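
Both observations are easy to verify directly from the DataFrame; here is a quick sketch:

# depth and table are percentages, so they should lie between 0 and 100
print(diamonds[['depth', 'table']].agg(['min', 'max']))

# how many standard deviations is the max price above the 75th percentile?
price = diamonds['price']
print((price.max() - price.quantile(0.75)) / price.std())

The first check should confirm that both columns stay within the 0-100 range, and the second should reproduce the value of about 3.38.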