Python Machine Learning Cookbook（Second Edition）

上QQ阅读APP看书，第一时间看更新

How to do it…

Let's see how to build a linear regressor in Python:

Create a file called regressor.py and add the following lines:

filename = "VehiclesItaly.txt"
X = []
y = []
with open(filename, 'r') as f:
    for line in f.readlines():
        xt, yt = [float(i) for i in line.split(',')]
        X.append(xt)
        y.append(yt)

We just loaded the input data into X and y, where X refers to the independent variable (explanatory variables) and y refers to the dependent variable (response variable). Inside the loop in the preceding code, we parse each line and split it based on the comma operator. We then convert them into floating point values and save them in X and y.

When we build a machine learning model, we need a way to validate our model and check whether it is performing at a satisfactory level. To do this, we need to separate our data into two groups—a training dataset and a testing dataset. The training dataset will be used to build the model, and the testing dataset will be used to see how this trained model performs on unknown data. So, let's go ahead and split this data into training and testing datasets:

num_training = int(0.8 * len(X))
num_test = len(X) - num_training

import numpy as np

# Training data
X_train = np.array(X[:num_training]).reshape((num_training,1))
y_train = np.array(y[:num_training])

# Test data
X_test = np.array(X[num_training:]).reshape((num_test,1))
y_test = np.array(y[num_training:])

First, we have put aside 80% of the data for the training dataset and the remaining 20% is for the testing dataset. Then, we have built four arrays: X_train, X_test,y_train, and y_test.

We are now ready to train the model. Let's create a regressor object, as follows:

from sklearn import linear_model

# Create linear regression object
linear_regressor = linear_model.LinearRegression()

# Train the model using the training sets
linear_regressor.fit(X_train, y_train)

First, we have imported linear_model methods from the sklearn library, which are methods used for regression, wherein the target value is expected to be a linear combination of the input variables. Then, we have used the LinearRegression() function, which performs ordinary least squares linear regression. Finally, the fit() function is used to fit the linear model. Two parameters are passed—training data (X_train), and target values (y_train).

We just trained the linear regressor, based on our training data. The fit() method takes the input data and trains the model. To see how it all fits, we have to predict the training data with the model fitted:

y_train_pred = linear_regressor.predict(X_train)

To plot the outputs, we will use the matplotlib library as follows:

import matplotlib.pyplot as plt
plt.figure()
plt.scatter(X_train, y_train, color='green')
plt.plot(X_train, y_train_pred, color='black', linewidth=4)
plt.title('Training data')
plt.show()

When you run this in the Terminal, the following diagram is shown:

In the preceding code, we used the trained model to predict the output for our training data. This wouldn't tell us how the model performs on unknown data, because we are running it on the training data. This just gives us an idea of how the model fits on training data. Looks like it's doing okay, as you can see in the preceding diagram!
Let's predict the test dataset output based on this model and plot it, as follows:

y_test_pred = linear_regressor.predict(X_test)
plt.figure()
plt.scatter(X_test, y_test, color='green')
plt.plot(X_test, y_test_pred, color='black', linewidth=4)
plt.title('Test data')
plt.show()

When you run this in the Terminal, the following output is returned:

As you might expect, there's a positive association between a state's population and the number of vehicle registrations.