Python Machine Learning Blueprints

groupby

Let's now look at an operation that is highly useful, but often difficult for new pandas users to get their heads around: the .groupby() method. We'll walk through a number of examples step by step to illustrate its most important functionality.

The groupby operation does exactly what it says: it groups data based on some class or classes you choose. Let's take a look at a simple example using our iris dataset. We'll go back and reimport our original iris dataset, and run our first groupby operation:
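A minimal sketch of this first groupby is shown here. It assumes a small hand-typed sample standing in for the full iris data, and the column names (species, petal width, and so on) are assumptions rather than the names produced by the original import.

```python
import pandas as pd

# Small illustrative sample standing in for the full iris dataset.
# The column names are assumed for this sketch.
df = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 5.9, 6.3, 6.3],
    "sepal width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.9],
    "petal length": [1.4, 1.4, 4.7, 4.8, 6.0, 5.6],
    "petal width": [0.2, 0.2, 1.4, 1.8, 2.5, 1.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Partition the rows by species, then average every feature column.
species_means = df.groupby("species").mean()
print(species_means)
```

The grouping column becomes the index of the result, so each row of the output corresponds to one species.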

Here, data for each species is partitioned and the mean for each feature is provided. Let's take it a step further now and get full descriptive statistics for each species:
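As a rough sketch, the per-species descriptive statistics can be obtained by chaining .describe() onto the grouped object. The sample rows and column names below are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative sample standing in for the full iris dataset.
# The column names are assumed for this sketch.
df = pd.DataFrame({
    "sepal length": [5.1, 4.9, 7.0, 5.9, 6.3, 6.3],
    "sepal width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.9],
    "petal length": [1.4, 1.4, 4.7, 4.8, 6.0, 5.6],
    "petal width": [0.2, 0.2, 1.4, 1.8, 2.5, 1.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One row per species; the columns form a two-level index of
# (feature, statistic), for example ("petal width", "mean").
described = df.groupby("species").describe()

# Select the statistics for a single feature to keep the output readable.
print(described["petal width"])
```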

Statistics for each species

And now, we can see the full breakdown bucketed by species. Let's now look at some other groupby operations we can perform. We saw previously that petal length and width had some relatively clear boundaries between species. Now, let's examine how we might use groupby to see that:
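One way to sketch this is to group by the measurement itself and collect the species seen at each value. The sample rows and column names are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative sample standing in for the full iris dataset.
# The column names are assumed for this sketch.
df = pd.DataFrame({
    "petal length": [1.4, 1.4, 4.7, 4.8, 6.0, 5.6],
    "petal width": [0.2, 0.2, 1.4, 1.8, 2.5, 1.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# For each distinct petal width, list the species that have it.
by_width = df.groupby("petal width")["species"].unique().to_frame()
print(by_width)
```

Widths that map to a single species indicate a clean boundary, while a width listing two species marks where the classes overlap.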

In this case, we have grouped by petal width and listed the unique species associated with each value. This is a manageable number of measurements to group by, but if it were to become much larger, we would likely need to partition the measurements into brackets. As we saw previously, that can be accomplished by means of the apply function.
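The bracketing idea might look something like the following sketch. The helper name width_bracket, the bracket boundaries, the sample rows, and the column names are all hypothetical choices made for illustration.

```python
import pandas as pd

# Small illustrative sample standing in for the full iris dataset.
# The column names are assumed for this sketch.
df = pd.DataFrame({
    "petal width": [0.2, 0.2, 1.4, 1.8, 2.5, 1.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

def width_bracket(width):
    """Map a petal width to a coarse bracket (boundaries are hypothetical)."""
    if width < 1.0:
        return "under 1.0"
    if width < 2.0:
        return "1.0 to 2.0"
    return "2.0 and over"

# Use apply to label every row with its bracket, then group on the label.
df["width bracket"] = df["petal width"].apply(width_bracket)
by_bracket = df.groupby("width bracket")["species"].unique()
print(by_bracket)
```

Grouping on the bracket label rather than the raw measurement keeps the number of groups small no matter how many distinct widths appear.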

Let's now take a look at a custom aggregation function:
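A minimal sketch of such an aggregation follows, using pandas named aggregation to label the output columns. The output column name delta, the sample rows, and the input column names are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative sample standing in for the full iris dataset.
# The column names are assumed for this sketch.
df = pd.DataFrame({
    "petal width": [0.2, 0.2, 1.4, 1.8, 2.5, 1.8],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Built-in aggregations are referenced by name; the lambda is a custom
# aggregation that receives each group's values as a Series.
width_stats = df.groupby("species")["petal width"].agg(
    max="max",
    min="min",
    delta=lambda widths: widths.max() - widths.min(),
)
print(width_stats)
```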

In this code, we grouped petal width by species and applied the .max() and .min() functions, along with a lambda function that returns the difference between the maximum and minimum petal width.

We've only just touched on the functionality of groupby; there is a lot more to learn, so I encourage you to read the documentation available at http://pandas.pydata.org/pandas-docs/stable/.

Hopefully, you now have a solid base-level understanding of how to manipulate and prepare data for our next step, which is modeling. We will now move on to discuss the primary libraries in the Python machine learning ecosystem.