The Statistics and Calculus with Python Workshop

Types of Data in Statistics

In statistics, there are two main types of data: categorical data and numerical data. Depending on which type an attribute or a variable in your dataset belongs to, its data processing, modeling, analysis, and visualization techniques might differ. In this section, we will explain the details of these two main data types and discuss relevant points for each of them, which are summarized in the following table:

Figure 3.1: Data type comparison

For the rest of this section, we will go into more detail about each of the preceding comparisons, starting with categorical data in the next subsection.

Categorical Data

When an attribute or a variable is categorical, the possible values it can take belong to a predetermined and fixed set of values. For example, in a weather-related dataset, you might have an attribute describing the overall weather for each day, in which case the value of that attribute must come from a set of discrete values such as "sunny", "windy", "cloudy", "rain", and so on. A cell in this attribute column must take on one of these possible values; it cannot contain, for example, a number or an unrelated string like "apple". Another term for this type of data is nominal data.

Because of the nature of the data, in most cases, there is no ordinal relationship between the possible values of a categorical attribute. For example, there is no comparison operation that can be applied to the weather-related data we described previously: "sunny" is neither greater than nor less than "windy", and so on. This is to be contrasted with numerical data, which, although we haven't discussed it yet, expresses clear ordinality.

On the topic of differences between data types, let's now go through a number of points to keep in mind when working with categorical data.

If a categorical attribute is to be modeled as an unknown variable using a probability distribution, a categorical distribution will be required. Such a distribution describes the probability that the variable is one out of K predefined possible categories. Luckily for us, most of the modeling will be done in the backend of various statistical/machine learning models when we call them from their respective libraries, so we don't have to worry about the problem of modeling right now.
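
For instance, a minimal sketch of drawing samples from a categorical distribution over K = 4 weather types could look like the following (the category probabilities here are made up purely for illustration):

import numpy as np

# Hypothetical probabilities for K = 4 predefined categories
categories = ['sunny', 'windy', 'cloudy', 'rain']
probabilities = [0.5, 0.1, 0.2, 0.2]

# Draw 10 samples from this categorical distribution
np.random.choice(categories, size=10, p=probabilities)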

In terms of data processing, an encoding scheme is typically used to convert the categorical values in an attribute into numerical, machine-interpretable values. This is necessary because string values, which are highly common in categorical data, cannot be fed directly to the many models that only accept numerical input.

For example, a simple encoding scheme assigns each possible value a distinct integer and replaces the original values with their respective numbers. Consider the following sample dataset (stored in the variable named weather_df):

weather_df

The output will be as follows:

   temp weather
0    55   windy
1    34  cloudy
2    80   sunny
3    75    rain
4    53   sunny
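
If you'd like to follow along, a DataFrame matching the preceding output can be constructed as follows (a quick sketch; in the book's materials, this data may instead be loaded from a file):

import pandas as pd

weather_df = pd.DataFrame({'temp': [55, 34, 80, 75, 53], \
                           'weather': ['windy', 'cloudy', 'sunny', \
                                       'rain', 'sunny']})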

Now, you could potentially call the map() method on the weather attribute and pass in the dictionary {'windy': 0, 'cloudy': 1, 'sunny': 2, 'rain': 3} (the map() method simply applies the mapping defined by the dictionary on the attribute) to encode the categorical attribute like so:

weather_df['weather_encoded'] = weather_df['weather'].map(\
                                {'windy': 0, 'cloudy': 1, \
                                 'sunny': 2, 'rain': 3})

This DataFrame object will now hold the following data:

weather_df

The output is as follows:

   temp weather  weather_encoded
0    55   windy                0
1    34  cloudy                1
2    80   sunny                2
3    75    rain                3
4    53   sunny                2

We see that the categorical column weather has been successfully converted to numerical data in weather_encoded via a one-to-one mapping. However, this technique can be potentially dangerous: the new attribute implicitly places an order on its data. Since 0 < 1 < 2 < 3, we are inadvertently imposing the same ordering on the original categorical data; this is especially dangerous if the model we are using specifically interprets that as truly numerical data.

This is the reason why we must be careful when transforming our categorical attributes into a numerical form. We have actually already discussed a certain technique that is able to convert categorical data without imposing a numerical relationship in the previous chapter: one-hot encoding. In this technique, we create a new attribute for every unique value in a categorical attribute. Then, for each row in the dataset, we place a 1 in a newly created attribute if that row has the corresponding value in the original categorical attribute and 0 in the other new attributes.

The following code snippet reiterates how we can implement one-hot encoding with pandas and what effect it will have on our current sample weather dataset:

pd.get_dummies(weather_df['weather'])

This will produce the following output:

   cloudy  rain  sunny  windy
0       0     0      0      1
1       1     0      0      0
2       0     0      1      0
3       0     1      0      0
4       0     0      1      0

Among the various descriptive statistics that we will discuss later in this chapter, the mode — the value that appears the most — is typically the only statistic that can be used on categorical data. As a consequence of this, when there are values missing from a categorical attribute in our dataset and we'd like to fill them with a central tendency statistic, a concept we will define later on in this chapter, the mode is the only one that should be considered.
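
As a minimal sketch (our sample weather_df has no missing values, so this is purely illustrative), filling missing entries in a categorical column with its mode can be done as follows:

# Compute the mode of the categorical column and use it to fill NaN entries
mode_value = weather_df['weather'].mode()[0]
weather_df['weather'] = weather_df['weather'].fillna(mode_value)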

In terms of making predictions, if a categorical attribute is the target of our machine learning pipeline (as in, if we want to predict a categorical attribute), classification models are needed. As opposed to regression models, which make predictions on numerical, continuous data, classification models, or classifiers for short, keep in mind the possible values their target attribute can take and only predict among those values. Thus, when deciding which machine learning model(s) you should train on your dataset to predict categorical data, make sure to only use classifiers.
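
As a brief, hedged illustration (assuming scikit-learn is available; the model choice and features here are arbitrary), a classifier trained on our sample weather data would only ever predict one of the known weather categories:

from sklearn.tree import DecisionTreeClassifier

# Hypothetical example: predict the weather category from the temperature
X = weather_df[['temp']]
y = weather_df['weather']

clf = DecisionTreeClassifier()
clf.fit(X, y)
clf.predict(X)  # predictions are drawn only from the known categories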

The last big difference between categorical data and numerical data is in visualization techniques. A number of visualization techniques were discussed in the previous chapter that are applicable for categorical data, two of the most common of which are bar graphs (including stacked and grouped bar graphs) and pie charts.

These types of visualization focus on the portion of the whole dataset each unique value takes up.

For example, with the preceding weather dataset, we can create a pie chart using the following code:

weather_df['weather'].value_counts().plot.pie(autopct='%1.1f%%')
plt.ylabel('')
plt.show()

This will create the following visualization:

Figure 3.2: Pie chart for weather data

We can see that in the whole dataset, the value 'sunny' occurs 40 percent of the time, while each of the other values occurs 20 percent of the time.

We have so far covered most of the biggest theoretical differences between categorical attributes and numerical attributes, the latter of which we will discuss in the next section. However, before moving on, there is another subtype of the categorical data type that should be mentioned: binary data.

A binary attribute is a categorical attribute whose set of possible values contains exactly two Boolean values: True and False. Since Boolean values can be easily interpreted by machine learning and mathematical models, there is usually no need to convert a binary attribute into any other form.

In fact, binary attributes that are not originally in the Boolean form should be converted into True and False values. We encountered an example of this in the sample student dataset in the previous chapter:

student_df

The output is as follows:

    name     sex class  gpa  num_classes
0  Alice  female    FY   90            4
1    Bob    male    SO   93            3
2  Carol  female    SR   97            4
3    Dan    male    SO   89            4
4    Eli    male    JR   95            3
5   Fran  female    SR   92            2

Here, the column 'sex' is a categorical attribute whose values can be either 'female' or 'male'. To make this data more machine-friendly (while ensuring no information is lost or added), we can binarize the attribute, which is done via the following code:

student_df['female_flag'] = student_df['sex'] == 'female'
student_df = student_df.drop('sex', axis=1)
student_df

The output is as follows:

    name class  gpa  num_classes  female_flag
0  Alice    FY   90            4         True
1    Bob    SO   93            3        False
2  Carol    SR   97            4         True
3    Dan    SO   89            4        False
4    Eli    JR   95            3        False
5   Fran    SR   92            2         True

Note

Since the newly created column 'female_flag' contains all the information from the column 'sex' and only that, we can simply drop the latter from our dataset.

Aside from that, binary attributes can be treated as categorical data in any other way (processing, making predictions, and visualization).

Let's now apply what we have discussed so far in the following exercise.

Exercise 3.01: Visualizing Weather Percentages

In this exercise, we are given a sample dataset that includes the weather in a specific city across five days. This dataset can be downloaded from https://packt.live/2Ar29RG. We aim to visualize the categorical information in this dataset to examine the percentages of different types of weather using the visualization techniques for categorical data that we have discussed so far:

  1. In a new Jupyter notebook, import pandas, Matplotlib, and seaborn and use pandas to read in the aforementioned dataset:

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    weather_df = pd.read_csv('weather_data.csv')
    weather_df.head()

    When the first five rows of this dataset are printed out, you should see the following output:

    Figure 3.3: The weather dataset

    As you can see, each row of this dataset tells us what the weather was on a given day in a given city. For example, on day 0, it was sunny in St Louis while it was cloudy in New York.

  2. In the next code cell in the notebook, compute the counts (the numbers of occurrences) for all the weather types in our dataset and visualize that information using the plot.bar() method:

    weather_df['weather'].value_counts().plot.bar()
    plt.show()

    This code will produce the following output:

    Figure 3.4: Counts of weather types

  3. Visualize the same information we have in the previous step as a pie chart using the plot.pie(autopct='%1.1f%%') method:

    weather_df['weather'].value_counts().plot.pie(autopct='%1.1f%%')
    plt.ylabel('')
    plt.show()

    This code will produce the following output:

    Figure 3.5: Percentages of weather types

  4. Now, we would like to visualize these counts of weather types, together with the information on what percentage each weather type accounts for in each city. First, this information can be computed using the groupby() method, as follows:

    weather_df.groupby(['weather', 'city'])['weather'].count()\
                                            .unstack('city')

    The output is as follows:

    city     New York  San Francisco  St Louis
    weather
    cloudy        3.0            NaN       3.0
    rain          1.0            NaN       1.0
    sunny         1.0            4.0       1.0
    windy         NaN            1.0       NaN

    We see that this object contains the information that we wanted. For example, looking at the cloudy row in the table, we see that the weather type cloudy occurs three times in New York and three times in St Louis. There are multiple places where we have NaN values, which denote non-occurrences.

  5. We finally visualize the table we have in the previous step as a stacked bar plot:

    weather_df.groupby(['weather', 'city'])\
                       ['weather'].count().unstack('city')\
                       .fillna(0).plot(kind='bar', stacked=True)
    plt.show()

    This will produce the following plot:

Figure 3.6: Counts of weather types with respect to cities

Throughout this exercise, we have put our knowledge regarding categorical data into practice to visualize various types of counts computed from a sample weather dataset.

Note

To access the source code for this specific section, please refer to https://packt.live/2ArQAtw.

You can also run this example online at https://packt.live/3gkIWAw.

With that, let's move on to the second main type of data: numerical data.

Numerical Data

The term itself is intuitive in helping us understand what type of data this is. A numerical attribute contains continuous, real-valued numbers. The values belonging to a numerical attribute can have a specific range; for example, they can be positive, negative, or between 0 and 1. However, an attribute being numerical implies that its data can take any value within its given range. This is notably different from the values in a categorical attribute, which only belong to a given discrete set of values.

There are many examples of numerical data: the height of the members of a population, the weight of the students in a school, the price of houses that are for sale in certain areas, the average speed of track-and-field athletes, and so on. As long as the data can be represented as real-valued numbers, it is most likely numerical data.

Given its nature, numerical data is vastly different from categorical data. In the following text, we will lay out some of the most important differences with respect to statistics and machine learning that we should keep in mind.

As opposed to the few probability distributions that can be used to model categorical data, there are numerous probability distributions for numerical data. These include the normal distribution (also known as the bell curve distribution), the uniform distribution, the exponential distribution, the Student's t distribution, and many more. Each of these probability distributions is designed to model specific types of data. For example, the normal distribution is typically used to model quantities that cluster symmetrically around a central value, such as heights or students' test scores, while the exponential distribution models the amount of time between occurrences of a given event.

It is important, therefore, to research what specific probability distribution is suitable for the numerical attribute that you are attempting to model. An appropriate distribution will allow for coherent analysis as well as accurate predictions; on the other hand, an unsuitable choice of probability distribution might lead to unintuitive and incorrect conclusions.
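
To make this concrete, here is a minimal sketch of generating samples from a few of these distributions with NumPy (the parameters are arbitrary):

import numpy as np

normal_samples = np.random.normal(loc=0, scale=1, size=1000)        # bell curve
uniform_samples = np.random.uniform(low=0, high=1, size=1000)       # flat over [0, 1)
exponential_samples = np.random.exponential(scale=2.0, size=1000)   # waiting times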

On another topic, many processing techniques can be applied to numerical data. Two of the most common of these include scaling and normalization.

Scaling involves adding and/or multiplying all the values in a numerical attribute by a fixed quantity to scale the range of the original data to another range. This method is used when statistical and machine learning models can only handle values within a given range (for example, positive numbers or numbers between 0 and 1 can be processed and analyzed more easily).

One of the most commonly used scaling techniques is the min-max scaling method, which is explained by the following formula, where a and b are real numbers with a < b:

Figure 3.7: Formula for min-max scaling

X' and X denote the data after and before the transformation, while Xmax and Xmin denote the maximum and minimum values within the data, respectively. The output of the formula always lies between a and b: the minimum of the data maps to a and the maximum maps to b. We will come back to this scaling method again in our next exercise.
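
As a quick sketch of the formula in action (a full implementation follows in Exercise 3.02), scaling the toy array [2, 5, 8] into the range [0, 1] works out as follows:

import numpy as np

data = np.array([2, 5, 8])
a, b = 0, 1

# X' = a + (b - a) * (X - Xmin) / (Xmax - Xmin)
scaled = a + (b - a) * (data - data.min()) / (data.max() - data.min())
# scaled is now array([0. , 0.5, 1. ])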

As for normalization, even though this term is sometimes used interchangeably with scaling, it denotes the process of specifically scaling a numerical attribute to the normalized form with respect to its probability distribution. The goal is for us to obtain a transformed dataset that nicely follows the shape of the probability distribution we have chosen.

For example, say the data we have in a numerical attribute follows a normal distribution with a mean of 4 and a standard deviation of 10. The following code randomly generates that data synthetically and visualizes it:

samples = np.random.normal(4, 10, size=1000)
plt.hist(samples, bins=20)
plt.show()

This produces the following plot:

Figure 3.8: Histogram for normally distributed data

Now, say you have a model that assumes the standard form of the normal distribution for this data, with a mean of 0 and a standard deviation of 1. If the input data is not in this form, the model will have difficulty learning from it. Therefore, you'd like to somehow transform the preceding data into this standard form, without sacrificing the true pattern (specifically the general shape) of the data.

Here, we can apply the normalization technique for normally distributed data, in which we subtract the true mean from the data points and divide the result by the true standard deviation. This scaling process is more generally known as a standard scaler. Since the preceding data is already a NumPy array, we can take advantage of vectorization and perform the normalization as follows:

normalized_samples = (samples - 4) / 10
plt.hist(normalized_samples, bins=20)
plt.show()

This code will generate the histogram for our newly transformed data, which is shown here:

Figure 3.9: Histogram for normalized data

We see that the data has been successfully shifted to the range we want: it now centers around 0 and most of it lies between -3 and 3, matching the standard form of the normal distribution, while the general shape of the data has not been altered. In other words, the relative differences between the data points have not been changed.

On an additional note, in practice, when the true mean and/or the true standard deviation are not available, we can approximate those statistics with the sample mean and standard deviation as follows:

sample_mean = np.mean(samples)
sample_sd = np.std(samples)

With a large number of samples, these two statistics offer a good approximation that can be further used for this type of transformation. With that, we can now feed this normalized data to our statistical and machine learning models for further analysis.
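
For example, the normalization from before can be approximated using these sample statistics:

# Standardize using the sample statistics instead of the true parameters
normalized_samples = (samples - sample_mean) / sample_sd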

Speaking of the mean and the standard deviation, those two statistics are usually used to describe numerical data. To fill in missing values in a numerical attribute, central tendency measures such as the mean and the median are typically used. In some special cases such as a time-series dataset, you can use more complex missing value imputation techniques such as interpolation, where we estimate the missing value to be somewhere in between the ones immediately before and after it in a sequence.
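
As a minimal sketch (the series values here are made up), pandas offers an interpolate() method for exactly this kind of imputation:

import pandas as pd
import numpy as np

# A toy sequence with one missing value in the middle
series = pd.Series([1.0, 2.0, np.nan, 4.0])
series.interpolate()  # the NaN is estimated as 3.0, halfway between 2.0 and 4.0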

When we'd like to train a predictive model to target a numerical attribute, regression models are used. Instead of making predictions on which possible categorical values an entry can take like a classifier, a regression model looks for a reasonable prediction across a continuous numerical range. As such, similar to what we have discussed, we must take care to only apply regression models on datasets whose target values are numerical attributes.
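
As a brief, hedged counterpart to the earlier classification sketch (again assuming scikit-learn; the choice of model and features is arbitrary), a regression model predicting a numerical target could look like this:

from sklearn.linear_model import LinearRegression

# Hypothetical example: predict gpa from the number of classes taken
X = student_df[['num_classes']]
y = student_df['gpa']

reg = LinearRegression()
reg.fit(X, y)
reg.predict(X)  # continuous predictions, not restricted to a fixed set of values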

Finally, in terms of visualizing numerical data, we have seen a wide range of visualization techniques that we can use. Immediately before this, we saw histograms being used to describe the distribution of a numerical attribute, which tells us how the data is dispersed along its range.

In addition, line graphs and scatter plots are generally good tools to visualize patterns of an attribute with respect to other attributes. (For example, we plotted the PDF of various probability distributions as line graphs.) Lastly, we also saw a heatmap being used to visualize a two-dimensional structure, which can be applied to represent correlations between numerical attributes in a dataset.

Before we move on to our next topic of discussion, let's perform a quick exercise on the concept of scaling/normalization. Again, one of the most popular scaling/normalization methods is min-max scaling, which allows us to transform all the values in a numerical attribute into an arbitrary range [a, b]. We will explore this method next.

Exercise 3.02: Min-Max Scaling

In this exercise, we will write a function that facilitates the process of applying min-max scaling to a numerical attribute. The function should take in three parameters: data, a, and b. While data should be a NumPy array or a pandas Series object, a and b should be real numbers (with a < b) denoting the endpoints of the numerical range that data should be transformed into.

Referring back to the formula included in the Numerical Data section, Min-Max scaling is given by the following:

Figure 3.10: Formula for min-max scaling

Let's have a look at the steps that need to be followed to meet our goal:

  1. Create a new Jupyter notebook and in its first code cell, import the libraries that we will be using for this exercise, as follows:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    In the dataset that we will be using, the first column is named 'Column 1' and contains 1,000 samples from a normal distribution with a mean of 4 and a standard deviation of 10. The second column is named 'Column 2' and contains 1,000 samples from a uniform distribution from 1 to 2. The third column is named 'Column 3' and contains 1,000 samples from a Beta distribution with parameters 2 and 5. In the next code cell, read in the 'data.csv' file, which we generated for you beforehand (and which can be found at https://packt.live/2YTrdKt), as a DataFrame object using pandas and print out the first five rows:

    df = pd.read_csv('data.csv')
    df.head()

    You should see the following numbers:

        Column 1  Column 2  Column 3
    0  -1.231356  1.305917  0.511994
    1   7.874195  1.291636  0.155032
    2  13.169984  1.274973  0.183988
    3  13.442203  1.549126  0.391825
    4  -8.032985  1.895236  0.398122

  2. In the next cell, write a function named min_max_scale() that takes in three parameters: data, a, and b. As mentioned, data should be an array of values in an attribute of a dataset, while a and b specify the range that the input data is to be transformed into.
  3. Given the (implicit) requirement we have about data (a NumPy array or a pandas Series object—both of which can utilize vectorization), implement the scaling function with vectorized operations:

    def min_max_scale(data, a, b):
        data_max = np.max(data)
        data_min = np.min(data)
        return a + (b - a) * (data - data_min) / (data_max \
                                                  - data_min)

  4. We will consider the data in the 'Column 1' attribute first. To observe the effect that this function will have on our data, let's first visualize the distribution of what we currently have:

    plt.hist(df['Column 1'], bins=20)
    plt.show()

    This code will generate a plot that is similar to the following:

    Figure 3.11: Histogram of unscaled data

  5. Now, use the same plt.hist() function to visualize the returned value of the min_max_scale() function when called on df['Column 1'] to scale that data to the range [-3, 3]:

    plt.hist(min_max_scale(df['Column 1'], -3, 3), bins=20)
    plt.show()

    This will produce the following:

    Figure 3.12: Histogram of scaled data

    We see that while the general shape of the data distribution remains the same, the range of the data has been effectively changed to be from -3 to 3.

  6. Go through the same process (visualizing the data before and after scaling with histograms) for the 'Column 2' attribute. First, we visualize the original data:

    plt.hist(df['Column 2'], bins=20)
    plt.show()

  7. Now we visualize the scaled data, which should be scaled to the range [0, 1]:

    plt.hist(min_max_scale(df['Column 2'], 0, 1), bins=20)
    plt.show()

  8. The second block of code should produce a graph similar to the following:

    Figure 3.13: Histogram of scaled data

  9. Go through the same process (visualizing the data before and after the scaling with histograms) for the 'Column 3' attribute. First, we visualize the original data:

    plt.hist(df['Column 3'], bins=20)
    plt.show()

  10. Now we visualize the scaled data, which should be scaled to the range [10, 20]:

    plt.hist(min_max_scale(df['Column 3'], 10, 20), \
                              bins=20)
    plt.show()

  11. The second block of code should produce a graph similar to the following:

Figure 3.14: Histogram of scaled data

In this exercise, we have considered the concept of scaling/normalization for numerical data in more detail. We have also revisited the plt.hist() function as a method to visualize the distribution of numerical data.

Note

To access the source code for this specific section, please refer to https://packt.live/2VDw3JP.

You can also run this example online at https://packt.live/3ggiPdO.

This exercise concludes the topic of numerical data in this chapter. Together with categorical data, numerical data makes up most of the data types that you are likely to see in a given dataset. However, there is another, less common data type in addition to these two, which we will discuss in the next section.

Ordinal Data

Ordinal data is somewhat of a combination of categorical data (values in an ordinal attribute belonging to a specific given set) and numerical data (where the values are numbers—this fact implies an ordered relationship between them). The most common examples of ordinal data are letter scores ("A", "B", "C", "D", and "E"), integer ratings (for example, on a scale of 1 to 10), or quality ranking (for example, "excellent", "okay", and "bad", where "excellent" implies a higher level of quality than "okay", which in itself is better than "bad").

Since entries in an ordinal attribute can only take on one out of a specific set of values, categorical probability distributions should be used to model this type of data. For the same reason, missing values in an ordinal attribute can be filled out using the mode of the attribute, and visualization techniques for categorical data can be applied to ordinal data as well.

However, other processes might prove different from what we have discussed for categorical data. In terms of data processing, you could potentially assign a one-to-one mapping between each ordinal value and a numerical value/range.

In the letter score example, it is commonly the case that the grade "A" corresponds to the range [90, 100] in the raw score, and other letter grades have their own continuous ranges as well. In the quality ranking example, "excellent", "okay", and "bad" can be mapped to 10, 5, and 0, respectively, as an example; however, this type of transformation is undesirable unless the degree of difference in quality between the values can be quantified.
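
As a brief sketch (the example values are illustrative), an ordinal attribute can be encoded with an explicit ordering using pandas' ordered Categorical type:

import pandas as pd

quality = pd.Series(['okay', 'excellent', 'bad', 'okay'])

# Encode with the explicit order: bad < okay < excellent
quality_cat = pd.Categorical(quality, \
                             categories=['bad', 'okay', 'excellent'], \
                             ordered=True)
quality_cat.codes  # array([1, 2, 0, 1], dtype=int8)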

In terms of fitting a machine learning model to the data and having it predict unseen values of an ordinal attribute, classifiers should be used for this task. Furthermore, since ranking is a unique task with its own learning structures, considerable effort has been dedicated to learning to rank, where models are designed and trained specifically to predict ranking data.

This discussion concludes the topic of data types in statistics and machine learning. Overall, we have learned that there are two main data types commonly seen in datasets: categorical and numerical data. Depending on which type your data belongs to, you will need to employ different data processing, machine learning, and visualization techniques.

In the next section, we will talk about descriptive statistics and how they can be computed in Python.