Python Machine Learning Cookbook (Second Edition)

How to do it...

Let's see how to preprocess data in Python:

Let's start by importing the library:

>> from sklearn import preprocessing

scikit-learn, imported as sklearn, is a free, open source machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines (SVMs), random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries, NumPy and SciPy.

To understand the outcome of mean removal on our data, we first print the mean and standard deviation of each column of the array we have just created:

>> print("Mean: ",data.mean(axis=0))
>> print("Standard Deviation: ",data.std(axis=0))

The mean() method of a NumPy array returns the arithmetic mean of its elements. The std() method returns the standard deviation, a measure of how widely the elements are spread around the mean; by default it computes the population standard deviation (ddof=0). The axis parameter specifies the axis along which these values are computed: axis=0 aggregates down the rows and returns one value per column, while axis=1 aggregates across the columns and returns one value per row.
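The effect of the axis argument is easier to see on a small array. The following is a minimal sketch; the sample array is hypothetical and is used only for illustration, it is not the recipe's data:

```python
import numpy as np

# Hypothetical 2x3 array, used only to illustrate the axis argument.
sample = np.array([[1.0, 2.0, 3.0],
                   [5.0, 6.0, 7.0]])

col_means = sample.mean(axis=0)   # one mean per column (3 values)
row_means = sample.mean(axis=1)   # one mean per row (2 values)

# std() divides by N by default (ddof=0); ddof=1 divides by N - 1 instead.
population_std = sample.std(axis=0)
sample_std = sample.std(axis=0, ddof=1)

print("Column means:", col_means)
print("Row means:", row_means)
print("Population std per column:", population_std)
print("Sample std per column:", sample_std)
```

The recipe uses axis=0 because each column is treated as one feature, so the statistics are computed per feature.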

The following results are returned:

Mean: [ 1.33333333 1.93333333 -0.06666667 -2.53333333]
Standard Deviation: [1.24721913 2.44449495 1.60069429 3.30689515]

Now we can proceed with standardization:

>> data_standardized = preprocessing.scale(data)

The preprocessing.scale() function standardizes a dataset along a given axis; by default this is axis=0, so each column (feature) is processed independently. It centers each column on its mean and then rescales it so that it has unit variance.
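To make the operation concrete, the sketch below applies the same centering and rescaling by hand and compares the result with preprocessing.scale(). The data array is an assumption: it is not shown in this section, and its values were chosen so that the column means match the ones printed earlier.

```python
import numpy as np
from sklearn import preprocessing

# Assumed input: not shown in this section; the values are chosen so that
# the column means match the output printed earlier in the recipe.
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

# Manual standardization: subtract the column mean, then divide by the
# column (population) standard deviation.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# The library call used in the recipe.
data_standardized = preprocessing.scale(data)

print(np.allclose(manual, data_standardized))
```

Both arrays should agree up to floating-point rounding, which is why the comparison uses np.allclose() rather than exact equality.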

Now we recalculate the mean and standard deviation on the standardized data:

>> print("Mean standardized data: ",data_standardized.mean(axis=0))
>> print("Standard Deviation standardized data: ",data_standardized.std(axis=0))

The following results are returned:

Mean standardized data: [ 5.55111512e-17 -1.11022302e-16 -7.40148683e-17 -7.40148683e-17]
Standard Deviation standardized data: [1. 1. 1. 1.]

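You can see that the mean of each column is almost 0 and the standard deviation is 1. The means are on the order of 1e-17 to 1e-16 rather than exactly 0 because of floating-point rounding error.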
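Rather than reading the printed values by eye, the result can also be checked with a tolerance-based comparison. This is a minimal sketch; the data array is again an assumption, since it is defined outside this section:

```python
import numpy as np
from sklearn import preprocessing

# Assumed input, as the array is defined outside this section.
data = np.array([[3, -1.5, 2, -5.4],
                 [0, 4, -0.3, 2.1],
                 [1, 3.3, -1.9, -4.3]])

data_standardized = preprocessing.scale(data)

# np.allclose() tolerates tiny rounding errors such as 5.55e-17.
mean_is_zero = np.allclose(data_standardized.mean(axis=0), 0.0)
std_is_one = np.allclose(data_standardized.std(axis=0), 1.0)

print("Mean is approximately 0:", mean_is_zero)
print("Standard deviation is approximately 1:", std_is_one)
```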