Descriptive Statistics
As mentioned before, descriptive statistics and inferential statistics are the two main categories in the field of statistics. With descriptive statistics, our goal is to compute specific quantities that can convey important information about—or in other words, describe—our data.
Within descriptive statistics, there are two main subcategories: central tendency statistics and dispersion statistics. The terms are suggestive of their respective meanings: central tendency statistics describe the center of the distribution of the given data, while dispersion statistics convey information about the spread or range of the data away from its center.
One of the clearest examples of this distinction is from the familiar normal distribution, whose statistics include a mean and a standard deviation. The mean, which is calculated to be the average of all the values from the probability distribution, is suitable for estimating the center of the distribution. In its standard form, as we have seen, the normal distribution has a mean of 0, indicating that its data revolves around point 0 on the axis.
The standard deviation, on the other hand, represents how much the data points vary from the mean. Without going into much detail, it is calculated as the square root of the average squared distance between the data points and the mean of the distribution. A low-valued standard deviation indicates that the data does not deviate too much from the mean, while a high-valued standard deviation implies that the individual data points are quite different from the mean.
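To make this distinction concrete, here is a minimal sketch using two made-up datasets (the names tight and wide are our own, chosen only for illustration). Both share the same mean, but their standard deviations differ considerably:

```python
import numpy as np

# Two small, made-up datasets with the same center but different spread.
tight = np.array([9.0, 10.0, 10.0, 11.0])
wide = np.array([0.0, 5.0, 15.0, 20.0])

# Central tendency: both means are 10.
tight_mean, wide_mean = tight.mean(), wide.mean()

# Dispersion: np.std() uses the population formula (ddof=0) by default.
tight_std, wide_std = tight.std(), wide.std()

print('means:', tight_mean, wide_mean)
print('standard deviations:', tight_std, wide_std)
```

The mean alone cannot tell these two datasets apart; the standard deviation is what reveals that the values in wide lie much further from the center.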
Overall, these types of statistics and their characteristics can be summarized in the following table:
There are also other, more specialized descriptive statistics, such as skewness, which measures the asymmetry of the data distribution, or kurtosis, which measures the sharpness of the distribution peak. However, these are not as commonly used as the ones we listed previously, and therefore will not be covered in this chapter.
In the next subsection, we will start discussing each of the preceding statistics in more depth, starting with central tendency measures.
Central Tendency
Formally, the three commonly used central tendency statistics are the mean, the median, and the mode. The median is defined as the middlemost value when all the data points are ordered along the axis. The mode, as we have mentioned before, is the value that occurs the most. Due to their characteristics, the mean and the median are only applicable for numerical data, while the mode is often used on categorical data.
All three of these statistics capture the concept of central tendency well by representing the center of a dataset in different ways. This is also why they are often used as replacements for missing values in an attribute. As such, with a missing numerical value, you can choose either the mean or the median as a potential replacement, while the mode could be used if a categorical attribute contains missing values.
In particular, it is not arbitrary that the mean is often used to fill in missing values in a numerical attribute. If we were to fit a probability distribution to the given numerical attribute, the mean of that attribute would be the sample mean, an estimate of the true population mean. Another term for the population mean is the expected value of that population, which, in other words, is what we should expect an arbitrary value from that population to be.
This is why the mean, or the expectation of a value from the corresponding distribution, should be used to fill in missing values in certain cases. While it is not exactly the case for the median, a somewhat similar argument can be made for its role in replacing missing numerical values. The mode, on the other hand, is a good estimate for missing categorical values, being the most commonly occurring value in an attribute.
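As an illustration of this idea, the following is a minimal sketch of filling in missing values with pandas. The toy dataset and its column names (height and color) are hypothetical and serve only to demonstrate the technique:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with one missing value in each column.
df = pd.DataFrame({
    'height': [1.0, 2.0, np.nan, 3.0, 10.0],
    'color': ['red', 'blue', 'red', None, 'green'],
})

# Numerical attribute: fill with the mean or the median.
# Both mean() and median() skip missing values by default.
height_mean_filled = df['height'].fillna(df['height'].mean())
height_median_filled = df['height'].fillna(df['height'].median())

# Categorical attribute: fill with the mode.
# mode() returns a Series (there may be ties), so we take its first entry.
color_filled = df['color'].fillna(df['color'].mode()[0])

print(height_mean_filled.tolist())
print(height_median_filled.tolist())
print(color_filled.tolist())
```

Note how the single large value (10.0) pulls the mean well above the median here; when an attribute contains such extreme values, the median may be the more representative replacement.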
Dispersion
Unlike central tendency statistics, dispersion statistics attempt to quantify how much variation there is in a dataset. Some common dispersion statistics are the standard deviation, the range (the difference between the maximum and the minimum), and quartiles.
The standard deviation, as we have mentioned, is computed by taking the difference between each data point and the mean of a numerical attribute, squaring these differences, taking their average, and finally taking the square root of the result. The further away the individual data points are from the mean, the larger this quantity gets, and vice versa. This is why it is a good indicator of how dispersed a dataset is.
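The steps just described can be sketched directly in NumPy and compared with the built-in np.std() function. The intermediate variable names below are our own:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Step by step: differences from the mean, squared, averaged, square-rooted.
deviations = data - data.mean()
squared = deviations ** 2
variance = squared.mean()
std_manual = np.sqrt(variance)

# np.std() performs the same computation (it divides by n, that is, ddof=0).
std_numpy = np.std(data)

# Passing ddof=1 divides by n - 1 instead, giving the sample standard deviation.
std_sample = np.std(data, ddof=1)

print(std_manual, std_numpy, std_sample)
```

The distinction between the two divisors matters when comparing results across libraries: NumPy divides by n by default, while the std() methods in pandas divide by n - 1 by default.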
The range (the distance between the maximum and the minimum, or equivalently between the 0th and 100th percentiles) is another, simpler way to describe the level of dispersion of a dataset. However, because of its simplicity, this statistic sometimes does not convey as much information as the standard deviation or the quartiles.
A percentile (more generally, a quantile) is defined to be a threshold below which a specific portion of a given dataset falls. For example, the median, the middlemost value of a numerical dataset, is the 50th percentile of that dataset, as (roughly) half of the dataset is less than that number. Similarly, we can compute other common percentiles such as the 5th, 25th, 75th, and 95th. The 25th, 50th, and 75th percentiles are known as the quartiles, since they split the ordered data into four roughly equal parts. These quantities are arguably more informative than the range in terms of quantifying how dispersed our data is, as they can account for different distributions of the data.
In addition, the interquartile range, another common dispersion statistic, is defined to be the difference between the 75th and 25th percentiles (the third and first quartiles) of a dataset.
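These quantities can all be computed with NumPy. The following is a minimal sketch on a small made-up array; note that np.percentile() interpolates linearly between data points by default, so other tools may report slightly different values on small datasets:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Range: the difference between the maximum and the minimum.
data_range = data.max() - data.min()

# The 0th and 100th percentiles coincide with the minimum and the maximum.
p0, p100 = np.percentile(data, [0, 100])

# The 25th, 50th (median), and 75th percentiles.
p25, p50, p75 = np.percentile(data, [25, 50, 75])

# Interquartile range: the width of the middle half of the data.
iqr = p75 - p25

print('range:', data_range)
print('25th, 50th, 75th percentiles:', p25, p50, p75)
print('interquartile range:', iqr)
```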
So far, we have discussed the concepts of central tendency statistics and dispersion statistics. Let's go through a quick exercise to reinforce some of these important ideas.
Exercise 3.03: Visualizing Probability Density Functions
In Exercise 2.04, Visualization of Probability Distributions of Chapter 2, Python's Main Tools for Statistics, we considered the task of comparing the PDF of a probability distribution against the histogram of its sampled data. Here, we will implement an extension of that program, where we also visualize various descriptive statistics for each of these distributions:
- In the first cell of a new Jupyter notebook, import NumPy and Matplotlib:
import numpy as np
import matplotlib.pyplot as plt
- In a new cell, randomly generate 1,000 samples from the normal distribution using np.random.normal(). Compute the mean, the median, and the 25- and 75-percent quartiles as follows:
samples = np.random.normal(size=1000)
mean = np.mean(samples)
median = np.median(samples)
q1 = np.percentile(samples, 25)
q2 = np.percentile(samples, 75)
- In the next cell, visualize the samples using a histogram. We will also indicate where the various descriptive statistics are by drawing vertical lines—a red vertical line at the mean point, a black one at the median, a blue line at each of the quartiles:
plt.hist(samples, bins=20)
plt.axvline(x=mean, c='red', label='Mean')
plt.axvline(x=median, c='black', label='Median')
plt.axvline(x=q1, c='blue', label='Interquartile')
plt.axvline(x=q2, c='blue')
plt.legend()
plt.show()
Note here that we are combining the specification of the label argument in various plotting function calls and the plt.legend() function. This will help us create a legend with appropriate labels, as can be seen here:
One thing is of interest here: the mean and the median almost coincide on the x axis. This is one of the many mathematically convenient features of a normal distribution that is not found in many other distributions: its mean is equal to both its median and its mode.
- Apply the same process to a Beta distribution with parameters 2 and 5, as follows:
samples = np.random.beta(2, 5, size=1000)
mean = np.mean(samples)
median = np.median(samples)
q1 = np.percentile(samples, 25)
q2 = np.percentile(samples, 75)
plt.hist(samples, bins=20)
plt.axvline(x=mean, c='red', label='Mean')
plt.axvline(x=median, c='black', label='Median')
plt.axvline(x=q1, c='blue', label='Interquartile')
plt.axvline(x=q2, c='blue')
plt.legend()
plt.show()
This should generate a graph similar to the following:
- Apply the same process to a Gamma distribution with parameter 5, as follows:
samples = np.random.gamma(5, size=1000)
mean = np.mean(samples)
median = np.median(samples)
q1 = np.percentile(samples, 25)
q2 = np.percentile(samples, 75)
plt.hist(samples, bins=20)
plt.axvline(x=mean, c='red', label='Mean')
plt.axvline(x=median, c='black', label='Median')
plt.axvline(x=q1, c='blue', label='Interquartile')
plt.axvline(x=q2, c='blue')
plt.legend()
plt.show()
This should generate a graph similar to the following:
With this exercise, we have learned how to compute various descriptive statistics of a dataset using NumPy and visualize them in a histogram.
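As a numerical complement to the plots, the following sketch compares the mean and the median for the same three distributions. It uses NumPy's default_rng() generator with an arbitrarily chosen seed and a larger sample size than the exercise, so that sampling noise is small; the dictionary names are our own:

```python
import numpy as np

rng = np.random.default_rng(0)  # the seed is arbitrary, for reproducibility only
size = 100_000

samples_by_name = {
    'normal': rng.normal(size=size),
    'beta(2, 5)': rng.beta(2, 5, size=size),
    'gamma(5)': rng.gamma(5, size=size),
}

# Gap between the mean and the median for each distribution.
gaps = {}
for name, samples in samples_by_name.items():
    gaps[name] = np.mean(samples) - np.median(samples)
    print(f'{name}: mean - median = {gaps[name]:.4f}')
```

For the symmetric normal distribution, the gap should be close to zero. The Beta and Gamma distributions with these parameters are skewed to the right, so we would expect their means to sit above their medians.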
Note
To access the source code for this specific section, please refer to https://packt.live/2YTobpm.
You can also run this example online at https://packt.live/2CZf26h.
In addition to computing descriptive statistics, Python also offers other methods to describe data, which we will discuss in the next section.
Python-Related Descriptive Statistics
Here, we will examine two intermediate methods for describing data. The first is the describe() method, to be called on a DataFrame object. From the official documentation (which can be found at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html), the function "generate(s) descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values."
Let's see the effect of this method in action. First, we will create a sample dataset with a numerical attribute, a categorical attribute, and an ordinal one, as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'numerical': np.random.normal(size=5),
                   'categorical': ['a', 'b', 'a', 'c', 'b'],
                   'ordinal': [1, 2, 3, 5, 4]})
Now, if we were to call the describe() method on the df variable, a tabular summary would be generated:
df.describe()
The output is as follows:
       numerical   ordinal
count   5.000000  5.000000
mean   -0.251261  3.000000
std     0.899420  1.581139
min    -1.027348  1.000000
25%    -0.824727  2.000000
50%    -0.462354  3.000000
75%    -0.192838  4.000000
max     1.250964  5.000000
As you can see, each row in the printed output denotes a different descriptive statistic about each attribute in our dataset: the number of values (count), mean, standard deviation, and various quartiles. Since both the numerical and ordinal attributes were interpreted as numerical data (given the data they contain), describe() only generates these reports for them by default. The categorical column, on the other hand, was excluded. To force the reports to apply to all columns, we can specify the include argument as follows:
df.describe(include='all')
The output is as follows:
        numerical categorical   ordinal
count    5.000000           5  5.000000
unique        NaN           3       NaN
top           NaN           a       NaN
freq          NaN           2       NaN
mean    -0.251261         NaN  3.000000
std      0.899420         NaN  1.581139
min     -1.027348         NaN  1.000000
25%     -0.824727         NaN  2.000000
50%     -0.462354         NaN  3.000000
75%     -0.192838         NaN  4.000000
max      1.250964         NaN  5.000000
This forces the method to compute other statistics that apply for categorical data, such as the number of unique values (unique), the mode (top), and the count/frequency of the mode (freq). As we have discussed, most of the descriptive statistics for numerical data do not apply for categorical data and vice versa, which is why NaN values are used in the preceding reports to indicate such a non-application.
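Since the report returned by describe() is itself a DataFrame, individual statistics can be looked up by row and column label. The following is a minimal sketch that rebuilds only the deterministic columns of the sample dataset; the variable names on the left-hand side are our own:

```python
import pandas as pd

# Only the deterministic columns of the sample dataset are rebuilt here.
df = pd.DataFrame({'categorical': ['a', 'b', 'a', 'c', 'b'],
                   'ordinal': [1, 2, 3, 5, 4]})

summary = df.describe(include='all')

# Rows are statistics and columns are attributes, so .loc can pick out values.
ordinal_mean = summary.loc['mean', 'ordinal']
ordinal_median = summary.loc['50%', 'ordinal']
n_unique = summary.loc['unique', 'categorical']
mode_freq = summary.loc['freq', 'categorical']

print(ordinal_mean, ordinal_median, n_unique, mode_freq)

# Statistics that do not apply to an attribute are reported as NaN.
print(pd.isna(summary.loc['mean', 'categorical']))
```

One caveat: in this dataset, 'a' and 'b' both occur twice, so the value reported in the top row is just one of the tied modes.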
Overall, the describe() method from pandas offers a quick way to summarize and obtain an overview of a dataset and its attributes. This especially comes in handy during exploratory data analysis tasks, where we'd like to broadly explore a new dataset that we are not familiar with yet.
The second descriptive statistics-related method that is supported by Python is the visualization of boxplots. Obviously, a boxplot is a visualization technique that is not unique to the language itself, but Python, specifically its seaborn library, provides a rather convenient API, the sns.boxplot() function, to facilitate the process.
A boxplot is another way to visualize the distribution of a numerical dataset. It can be generated with the sns.boxplot() function:
import seaborn as sns

sns.boxplot(x=np.random.normal(2, 5, size=1000))
plt.show()
This code will produce a graph roughly similar to the following:
In the preceding boxplot, the blue box in the middle denotes the interquartile range of the input data (from the 25- to the 75-percent quartile). The vertical line in the middle of the box is the median, while the two thresholds on the left and right but outside of the box (the whiskers) mark the smallest and largest data points that are not considered outliers.
It is important to note that these thresholds are generally not the true minimum and maximum of the data. By default, the left whisker extends to the smallest data point that is no less than the 25-percent quartile minus 1.5 times the interquartile range, and the right whisker to the largest data point that is no more than the 75-percent quartile plus 1.5 times the interquartile range. It is common practice to consider any number outside of this range to be an outlier, visualized as black dots in the preceding graph.
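The 1.5 times interquartile range rule can be sketched with NumPy as follows. This is an illustration of the rule on a small made-up array, not the plotting library's actual implementation, and the names lower_fence and upper_fence are our own:

```python
import numpy as np

# Made-up data with one value (30) far away from the rest.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Thresholds beyond which a point is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (data < lower_fence) | (data > upper_fence)
outliers = data[is_outlier]

# The whiskers end at the most extreme points that are not outliers.
whisker_low = data[~is_outlier].min()
whisker_high = data[~is_outlier].max()

print('fences:', lower_fence, upper_fence)
print('outliers:', outliers)
print('whiskers:', whisker_low, whisker_high)
```

Here, only the value 30 falls outside the thresholds, so it would be drawn as a separate dot, while the whiskers would end at 1 and 9.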
In essence, a boxplot visually represents the statistics computed by the describe() method from pandas. What sets sns.boxplot() apart from other visualization tools is how easily it creates multiple boxplots, one for each group defined by a given criterion.
Let's see this in this next example, where we extend the sample dataset to 1000 rows with random data generation:
df = pd.DataFrame({
    'numerical': np.random.normal(size=1000),
    'categorical': np.random.choice(['a', 'b', 'c'], size=1000),
    'ordinal': np.random.choice([1, 2, 3, 4, 5], size=1000),
})
Here, the 'numerical' attribute contains random draws from the standard normal distribution, the 'categorical' attribute contains values randomly chosen from the list ['a', 'b', 'c'], while 'ordinal' also contains values randomly chosen from a list, [1, 2, 3, 4, 5].
Our goal with this dataset is to generate a slightly more complex boxplot visualization—a boxplot representing the distribution of the data in 'numerical' for the different values in 'categorical'. The general process is to split the dataset into different groups, each corresponding to the unique value in 'categorical', and for each group, we'd like to generate a boxplot using the respective data in the 'numerical' attribute.
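The splitting step of this general process can be sketched with the groupby() method from pandas. Rather than plotting, this sketch prints the statistics that each boxplot would depict for every group; the seeded generator and the group_summary name are our own choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # the seed is arbitrary, for reproducibility only
df = pd.DataFrame({
    'numerical': rng.normal(size=1000),
    'categorical': rng.choice(['a', 'b', 'c'], size=1000),
})

# Split 'numerical' by the unique values in 'categorical' and summarize each group.
group_summary = df.groupby('categorical')['numerical'].describe()

# The quartiles are what the box of each group's boxplot would show.
print(group_summary[['count', '25%', '50%', '75%']])
```

Each row of the result corresponds to one group, which is the same information that a separate boxplot per group would convey visually.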
However, with seaborn, we can streamline this process by specifying the x and y arguments for the sns.boxplot() function. Specifically, we will have our x axis contain the different unique values in 'categorical' and the y axis represent the data in 'numerical' with the following code:
sns.boxplot(y='numerical', x='categorical', data=df)
plt.show()
This will generate the following plot:
The visualization contains what we wanted to display: the distribution of the data in the 'numerical' attribute, represented as boxplots and separated by the unique values in the 'categorical' attribute. Considering the unique values in 'ordinal', we can apply the same process as follows:
sns.boxplot(y='numerical', x='ordinal', data=df)
plt.show()
This will generate the following graph:
As you can imagine, this method of visualization is ideal when we'd like to analyze the differences in the distribution of a numerical attribute with respect to categorical or ordinal data.
And that concludes the topic of descriptive statistics in this chapter. In the next section, we will talk about the other category of statistics: inferential statistics.