Mastering Machine Learning on AWS

Building a Naive Bayes model through SageMaker notebooks

Let's get started with SageMaker notebooks, the tool we will use to run the code that trains our model. Among other things, SageMaker allows us to create notebook instances that host Jupyter Notebooks. Jupyter is a web UI that allows a data scientist or programmer to code interactively by creating paragraphs of code (called cells in Jupyter) that are executed on demand. It works like an IDE, but with the additional ability to render the output of the code in visually relevant forms (for example, charts, tables, and markdown), and it also supports writing paragraphs in different languages within the same notebook. We will use notebooks extensively throughout this book, and we recommend them as a way to share and present data science findings. Notebooks support reproducible research, as the code necessary for a particular research objective can be validated and reproduced by re-running the code paragraphs in the notebook.

You can learn more on SageMaker's AWS console page at https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/dashboard.

Let's look at the AWS SageMaker console page in the following screenshot:

Before creating the notebook instance, you may want to attach a Git repository so that the notebooks that accompany this book are available on the notebook instance as soon as it starts, as you will see later:

Click on Add repository, choose your authentication mechanism, and add the repository found at https://github.com/mg-um/mastering-ml-on-aws:

We can now proceed to launch a notebook instance. There are several options to configure the hardware, networking, and security of the server that will host the notebook. However, we will not go into much detail for now, and will accept the defaults. The AWS documentation is an excellent resource if you want to limit access or choose a more powerful AWS machine.

Since we attached the Git repository, once you open Jupyter, you should see the notebooks we created for this book, and you can re-run them, modify them, or improve them:

In this section, we focus on the train_scikit Python notebook and go over code snippets to explain how we can build and test a model for our tweet classification problem. We encourage you to run all the paragraphs of this notebook to get an idea of its purpose.

The first thing we will do is load the stopwords and the two sets of tweets into variables:

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy import sparse

# Location of the cloned repository on the notebook instance
SRC_PATH = '/home/ec2-user/SageMaker/mastering-ml-on-aws/chapter2/'

# One stopword per line
with open(SRC_PATH + 'stop_words.txt', 'r') as file:
    stop_words = [word.strip() for word in file]
# One tweet per line for each party
with open(SRC_PATH + 'dem.txt', 'r') as file:
    dem_text = [line.strip('\n') for line in file]
with open(SRC_PATH + 'gop.txt', 'r') as file:
    gop_text = [line.strip('\n') for line in file]

We will then proceed to use the utilities in scikit-learn to construct our matrix. In order to do that, we will use the CountVectorizer class, which assigns each distinct word to a column while at the same time filtering out the stopwords. We will consider both sets of tweets; for our example, we'll keep only the 1,200 most frequent words:

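# The tweets themselves are supplied when the vectorizer is fitted; the input
# parameter only accepts 'content', 'filename', or 'file', so we leave its default
vectorizer = CountVectorizer(stop_words=stop_words,
                             max_features=1200)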
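To see what stop_words and max_features do, here is a minimal sketch on a toy corpus. The tweets, the stopword list, and the names toy_tweets, toy_stop_words, and toy_vectorizer are invented for illustration and are not part of the book's notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus and stopword list, invented for illustration only.
toy_tweets = [
    "the senate passed the bill",
    "the senate blocked the bill",
    "vote on the bill today",
]
toy_stop_words = ["the", "on"]

# Drop the stopwords, then keep only the 2 most frequent remaining terms.
toy_vectorizer = CountVectorizer(stop_words=toy_stop_words, max_features=2)
toy_bow = toy_vectorizer.fit_transform(toy_tweets)

# Columns follow the alphabetical order of the retained terms.
print(sorted(toy_vectorizer.vocabulary_))  # ['bill', 'senate']
print(toy_bow.toarray())                   # one row per tweet, one column per term
```

Here "bill" appears three times and "senate" twice, so those are the two terms that survive; every other word is discarded before the matrix is built.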

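Through vectorizer we can now construct two matrices, one for Republican party tweets and one for Democratic party tweets. We fit the vectorizer once on both sets of tweets so that the two matrices share the same vocabulary and column order:

vectorizer.fit(dem_text + gop_text)
dem_bow = vectorizer.transform(dem_text)
gop_bow = vectorizer.transform(gop_text)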
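One detail worth checking is that fit_transform learns a new vocabulary every time it is called, so the columns of two matrices only line up when the vectorizer has been fitted once on the combined corpus. The following sketch, which uses invented toy texts (group_a and group_b), compares both approaches:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts, invented for illustration only.
group_a = ["healthcare for all", "protect healthcare"]
group_b = ["lower taxes", "cut taxes"]

# Refitting per group: each call learns its own vocabulary.
refit = CountVectorizer()
a_refit = refit.fit_transform(group_a)
b_refit = refit.fit_transform(group_b)

# Fitting once on everything: both matrices share the same columns.
shared = CountVectorizer().fit(group_a + group_b)
a_shared = shared.transform(group_a)
b_shared = shared.transform(group_b)

print(a_refit.shape, b_refit.shape)    # column counts differ
print(a_shared.shape, b_shared.shape)  # column counts match
```

Matrices with mismatched columns cannot be stacked meaningfully, and even when the widths happen to coincide, the same column index would refer to different words in each matrix.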

These two bag-of-words matrices (dem_bow and gop_bow) are represented in a sparse data structure to minimize memory usage, but can be examined by converting them to arrays:

>>> gop_bow.toarray()

array([[0, 0, 1, ..., 0, 1, 0],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 1, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]], dtype=int64)
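For large matrices, converting to a dense array can be expensive, so it is often enough to inspect a few summary properties of the sparse matrix instead. The following is a small sketch on a hand-made matrix (the values and the name bow are illustrative assumptions, not the tweet data):

```python
import numpy as np
from scipy import sparse

# A tiny hand-made count matrix standing in for a bag-of-words matrix.
dense = np.array([[0, 0, 1, 0],
                  [0, 2, 0, 0],
                  [0, 0, 0, 0]])
bow = sparse.csr_matrix(dense)

n_rows, n_cols = bow.shape
density = bow.nnz / (n_rows * n_cols)  # fraction of entries that are stored

# Total word count per row, without building the dense array.
words_per_row = np.asarray(bow.sum(axis=1)).ravel()

print(bow.shape, bow.nnz, density)
print(words_per_row)
```

Only the non-zero entries (nnz) are stored, which is why a matrix with 1,200 columns and mostly zeros stays small in memory.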

In order to train our model, we need to provide two arrays: the BoW matrix (for both parties), which we will call x, and the labels (class variables) for each of the tweets, which we will call y. To construct x, we will vertically stack both matrices (one for each party):

x = sparse.vstack((dem_bow, gop_bow))

To construct the labels vector, we will just assemble a vector with ones for Democrat positions and zeros for Republican positions:

# One label per tweet (200 per party in this dataset)
ones = np.ones(dem_bow.shape[0])
zeros = np.zeros(gop_bow.shape[0])
y = np.hstack((ones, zeros))

Before we train our models, we will split the tweets (rows on our x matrix) randomly so that some are used to build a model and others are used to check whether the model predicts the correct political party (label):

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
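The stacking, labeling, and splitting steps can be checked end to end on toy data. In this sketch the matrices a_bow and b_bow are made up, and the stratify argument is an optional addition (not used in the book's notebook) that keeps the class proportions equal in the training and testing sets:

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Two made-up bag-of-words matrices with the same number of columns.
a_bow = sparse.csr_matrix(np.eye(4, 3, dtype=int))
b_bow = sparse.csr_matrix(np.ones((4, 3), dtype=int))

# Rows of class 1 first, then rows of class 0, with labels in the same order.
x_toy = sparse.vstack((a_bow, b_bow))
y_toy = np.hstack((np.ones(a_bow.shape[0]), np.zeros(b_bow.shape[0])))
assert x_toy.shape[0] == y_toy.shape[0]  # one label per row

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

print(x_tr.shape, x_te.shape)  # (6, 3) (2, 3)
```

Fixing random_state makes the split repeatable, which matters when comparing models across runs.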

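Now that we have our training and testing datasets, we proceed to train our model using Naive Bayes. We use a Bernoulli Naive Bayes, which models each word as either present or absent in a tweet; by default, scikit-learn's BernoulliNB converts any count greater than zero to one before fitting: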

from sklearn.naive_bayes import BernoulliNB
naive_bayes = BernoulliNB()
model = naive_bayes.fit(x_train, y_train)
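To illustrate the binarization step, the following sketch fits BernoulliNB on a small made-up count matrix and on its explicit presence/absence version, and checks that both yield the same learned parameters. The arrays counts and labels are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up word counts (rows are documents, columns are words) and labels.
counts = np.array([[2, 0, 1],
                   [3, 0, 0],
                   [0, 1, 2],
                   [0, 4, 0]])
labels = np.array([1, 1, 0, 0])

# Explicit presence/absence version of the same matrix.
presence = (counts > 0).astype(int)

# With the default binarize=0.0, counts above zero are treated as 1.
nb_counts = BernoulliNB().fit(counts, labels)
nb_presence = BernoulliNB().fit(presence, labels)

same_parameters = np.allclose(nb_counts.feature_log_prob_,
                              nb_presence.feature_log_prob_)
print(same_parameters)  # True
```

In other words, how many times a word occurs in a tweet does not affect this model; only whether it occurs at all.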

As you can see in the preceding code, it is very simple to fit a Naive Bayes model. We need to provide the training matrix and the labels. The model is now capable of predicting the label (political party) of arbitrary tweets (as long as we have them in a BoW matrix representation). Fortunately, we set aside some of the tweets for testing, so we can run these through the model and see how often the model predicts the right label (note that we know the actual party that wrote the tweet for every tweet in the testing dataset).

To get the predictions, it's as simple as invoking the predict method of the model:

y_predictions = model.predict(x_test)

Now, we can see how many of the predictions match the ground truth:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predictions)

The output score of the code block is 0.95.

In this example, we are using accuracy as an evaluation metric. Accuracy can be calculated using formula 7:

\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}

Formula 7
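As a quick check of formula 7, the ratio can be computed by hand and compared with accuracy_score. The label vectors below are invented for illustration and are unrelated to the tweet dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented ground-truth labels and predictions (8 examples, 6 of them correct).
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])

# Correct predictions divided by total predictions.
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)  # 0.75 0.75
```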

There are various evaluation metrics that a data scientist can use to evaluate an ML algorithm. We will present evaluation measures such as precision, recall, F1 measure, root mean squared error (RMSE), and area under curve (AUC) in our next chapters for different examples. Evaluation metrics should be selected based on the business need of implementing an algorithm, and should indicate whether or not the ML algorithm is performing at the standards required to achieve a task.
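As a brief preview of why the choice of metric matters, consider the following sketch with invented labels in which only one example out of ten is positive. A classifier that always predicts the negative class looks good on accuracy but never finds the positive case, which recall makes visible:

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented, heavily imbalanced labels: a single positive among ten examples.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# A trivial classifier that always predicts the negative class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)  # 9 of 10 predictions are correct
recall = recall_score(y_true, y_pred)      # 0 of 1 positives were found

print(accuracy, recall)  # 0.9 0.0
```

If the business need is to catch the rare positive cases, accuracy alone would be misleading here.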

Since this is the first example we are working on, we will use the simplest evaluation measure, which is accuracy. As specified in formula 7, accuracy is the ratio of correct predictions to the total number of predictions made by the classifier. It turns out that our Naive Bayes model is very accurate, with an accuracy of 95%. It is possible that some words, such as the names of members of each party, can quickly make the model give a correct prediction. We will explore this using decision trees in Chapter 4, Predicting User Behavior with Tree-Based Methods.
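One hedged way to look for such telltale words, ahead of the decision tree analysis, is to compare the per-class log probabilities that a fitted BernoulliNB stores in feature_log_prob_. The sketch below uses four invented sentences and made-up names ("alice", "bob"); it is an illustration of the idea, not the method used later in the book.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Invented texts: "alice" only appears in class 1, "bob" only in class 0.
texts = ["alice backs the bill", "alice speaks today",
         "bob backs the bill", "bob speaks today"]
labels = np.array([1, 1, 0, 0])

vec = CountVectorizer()
bow = vec.fit_transform(texts)
nb = BernoulliNB().fit(bow, labels)

# classes_ is sorted, so row 0 belongs to class 0 and row 1 to class 1.
log_ratio = nb.feature_log_prob_[1] - nb.feature_log_prob_[0]

# Map column indices back to words and rank them by the log ratio.
index_to_word = {index: word for word, index in vec.vocabulary_.items()}
order = np.argsort(log_ratio)
most_class0_word = index_to_word[order[0]]
most_class1_word = index_to_word[order[-1]]

print(most_class0_word, most_class1_word)  # bob alice
```

Words that appear equally often in both classes get a log ratio of zero, so the names stand out at the two ends of the ranking.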

Note that, during this process, we had to prepare and transform the data in order to fit a model. This process is very common, and both scikit-learn and Spark support the concept of pipelines, which allow the data scientist to declare the necessary transformations needed to build a model without having to manually obtain intermediary results.

We can produce an equivalent model by creating a pipeline with the following two stages:

  • Count vectorizer
  • Naive Bayes trainer

The following code snippet builds, fits, and evaluates this pipeline:

from sklearn.pipeline import Pipeline
# Split the raw tweets (not the BoW matrix); the pipeline vectorizes them itself
x_train, x_test, y_train, y_test = train_test_split(dem_text + gop_text, y, test_size=0.25, random_state=5)
pipeline = Pipeline([('vect', vectorizer), ('nb', naive_bayes)])
pipeline_model = pipeline.fit(x_train, y_train)
y_predictions = pipeline_model.predict(x_test)
accuracy_score(y_test, y_predictions)

This allows our modeling to be a bit more concise and declarative. By calling the pipeline.fit() method, the library applies all the necessary transformations and estimations. Note that, in this case, we split the raw texts (rather than the matrices) as the fit() method now receives the raw input. As we shall see in the next section, pipelines can contain two kinds of stages, transformers and estimators, depending on whether the stage needs to compute a model out of the data, or simply transform the data declaratively.
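To make the two stages concrete, the following sketch builds a small pipeline on invented texts and inspects its steps through named_steps. The texts, labels, and the name toy_pipeline are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Invented training texts and labels.
texts = ["alice backs the bill", "alice speaks today",
         "bob backs the bill", "bob speaks today"]
labels = [1, 1, 0, 0]

toy_pipeline = Pipeline([("vect", CountVectorizer()), ("nb", BernoulliNB())])
toy_pipeline.fit(texts, labels)

# The first stage turns text into a matrix; the last stage makes predictions.
vect_step = toy_pipeline.named_steps["vect"]
nb_step = toy_pipeline.named_steps["nb"]
vect_can_transform = hasattr(vect_step, "transform")
nb_can_predict = hasattr(nb_step, "predict")

# Raw text goes in; the pipeline vectorizes it before predicting.
prediction = toy_pipeline.predict(["alice votes"])
print(vect_can_transform, nb_can_predict, prediction[0])  # True True 1
```

Because the vectorizer is fitted inside the pipeline, its vocabulary is learned from the training texts only, and words unseen during training (such as "votes" here) are simply ignored at prediction time.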