Using SageMaker's BlazingText built-in ML service
We saw how to perform ML tasks using the scikit-learn and Apache Spark libraries. However, sometimes it's more appropriate to use an ML service. SageMaker provides ways for us to create, tune, and deploy models supporting a variety of built-in ML algorithms just by invoking a service. In a nutshell, you place the data in S3 (an Amazon service for storing large amounts of data) and call the SageMaker service, providing all the necessary details: the actual ML algorithm, the location of the data, and the kind and number of machines to use for training. In this section, we go through the process of training our tweet-classification model through SageMaker's BlazingText ML service. BlazingText is an algorithm that supports text classification using word2vec, a technique that transforms words into vectors that capture precise syntactic and semantic word relationships. We won't dive into the details of SageMaker's architecture yet, but we will show the reader how to use this AWS service as an alternative to scikit-learn or Spark.
We will start by importing the SageMaker libraries, creating a session, and obtaining a role (that is, the role that the notebook instance is using; see https://aws.amazon.com/blogs/aws/iam-roles-for-ec2-instances-simplified-secure-access-to-aws-service-apis-from-ec2).
Additionally, we specify the S3 bucket we will be using to store all our data and models:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
sess = sagemaker.Session()
role = get_execution_role()
bucket = "mastering-ml-aws"
prefix = "chapter2/blazingtext"
The next step is to put some data in S3 for training. BlazingText expects each input line to follow the __label__X TEXT pattern. In our case, this means prefixing each tweet with a label representing the originating party:
__label__1 We are forever g..
__label__0 RT @AFLCIO: Scott Walker.
__label__0 Democrats will hold this
__label__1 Congratulations to hundreds of thousands ...
To do that, we perform some preprocessing of our tweets and prefix the right label:
with open(SRC_PATH + 'dem.txt', 'r') as file:
dem_text = ["__label__0 " + line.strip('\n') for line in file]
with open(SRC_PATH + 'gop.txt', 'r') as file:
gop_text = ["__label__1 " + line.strip('\n') for line in file]
corpus = dem_text + gop_text
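Before splitting the corpus, we can optionally check how many tweets each party contributes and peek at one labeled line. This quick sanity check is not part of the main flow and can be skipped:
print(len(dem_text), len(gop_text))  # number of labeled tweets per party
print(corpus[0][:80])                # first labeled tweet, truncated for display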
We then proceed to create the sets for training and testing as text files:
from sklearn.model_selection import train_test_split
corpus_train, corpus_test = train_test_split(corpus, test_size=0.25, random_state=42)
corpus_train_txt = "\n".join(corpus_train)
corpus_test_txt = "\n".join(corpus_test)
with open('tweets.train', 'w') as file:
file.write(corpus_train_txt)
with open('tweets.test', 'w') as file:
file.write(corpus_test_txt)
Once we have our training and validation text files, we upload them into S3:
train_path = prefix + '/train'
validation_path = prefix + '/validation'
sess.upload_data(path='tweets.train', bucket=bucket, key_prefix=train_path)
sess.upload_data(path='tweets.test', bucket=bucket, key_prefix=validation_path)
s3_train_data = 's3://{}/{}'.format(bucket, train_path)
s3_validation_data = 's3://{}/{}'.format(bucket, validation_path)
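If we want to confirm that both files landed in the expected locations, we can list the objects under our prefix with boto3. This is an optional check, assuming the bucket and prefix defined earlier:
s3_client = boto3.client('s3')
# List every object stored under our chapter prefix, with its size in bytes
for obj in s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix).get('Contents', []):
    print(obj['Key'], obj['Size'])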
We then proceed to instantiate an Estimator, specifying all the necessary details: the type and number of machines to use for training, as well as the S3 path where the models will be stored:
container = sagemaker.amazon.amazon_estimator.get_image_uri('us-east-1', "blazingtext", "latest")
s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)
bt_model = sagemaker.estimator.Estimator(container,
role,
train_instance_count=1,
train_instance_type='ml.c4.4xlarge',
train_volume_size = 30,
train_max_run = 360000,
input_mode= 'File',
output_path=s3_output_location,
sagemaker_session=sess)
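Note that we hardcoded the us-east-1 region when looking up the BlazingText container image. If your notebook instance runs in a different region, a small variation (not required for this example) is to derive the region from the current boto3 session instead:
# Look up the BlazingText image for whatever region the notebook is running in
region = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region, "blazingtext", "latest")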
As we discussed in the Naive Bayes model on SageMaker notebooks using Apache Spark section, an estimator is capable of creating models by processing training data. The next step is to set the hyperparameters, define the input data channels, and fit the model with the training data:
bt_model.set_hyperparameters(mode="supervised", epochs=10, min_count=3,
                             learning_rate=0.05, vector_dim=10,
                             early_stopping=False, patience=5,
                             min_epochs=5, word_ngrams=2)
train_data = sagemaker.session.s3_input(s3_train_data,
                                        distribution='FullyReplicated',
                                        content_type='text/plain',
                                        s3_data_type='S3Prefix')
validation_data = sagemaker.session.s3_input(s3_validation_data,
                                             distribution='FullyReplicated',
                                             content_type='text/plain',
                                             s3_data_type='S3Prefix')
data_channels = {'train': train_data, 'validation': validation_data}
bt_model.fit(inputs=data_channels, logs=True)
We won't go into much detail about this algorithm's hyperparameters in this section, but the reader can find the details at https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html.
This particular algorithm also takes validation data, as it runs over the data several times (epochs) to reduce the error. Once we have fit the model, we can deploy it as a web service so that applications can use it:
predictor = bt_model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')
In our case, we will just hit the endpoint to get the predictions and evaluate the accuracy:
corpus_test_no_labels = [x[11:] for x in corpus_test]
payload = {"instances": corpus_test_no_labels}
response = predictor.predict(json.dumps(payload))
predictions = json.loads(response)
print(json.dumps(predictions, indent=2))
After running the preceding code we get the following output:
[ { "prob": [ 0.5003 ], "label": [ "__label__0" ] }, { "prob": [ 0.5009 ], "label": [ "__label__1" ] }...
As you can see in the preceding output, each prediction comes with a probability (which we will ignore for now). Next, we compute how many of these predicted labels match the original ones:
predicted_labels = [prediction['label'][0] for prediction in predictions]
predicted_labels[:4]
After running the preceding code, we get the following output:
['__label__0', '__label__1', '__label__0', '__label__0']
Then we run the next lines of code to extract the actual labels from the test set:
actual_labels = [x[:10] for x in corpus_test]
actual_labels[:4]
As you can see in the following output from the previous code block, some of the predicted labels match the actual values, while others don't:
['__label__1', '__label__1', '__label__0', '__label__1']
Next, we run the following code to build a true or false Boolean vector depending on whether the actual value matches the predicted result:
matches = [(actual_label == predicted_label) for (actual_label, predicted_label) in zip(actual_labels, predicted_labels)]
matches[:4]
After running the preceding code, we get the following output:
[False, True, True, False]
Finally, we run the following code to calculate the ratio of matching cases relative to the total number of instances:
matches.count(True) / len(matches)
The following output from the previous block shows the accuracy score:
0.61
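Equivalently, since scikit-learn is already available in the notebook, we could compute the same ratio with accuracy_score; this is just an alternative way to obtain the identical number:
from sklearn.metrics import accuracy_score
# Compare the predicted labels against the actual labels from the test set
accuracy_score(actual_labels, predicted_labels)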
We can see that the accuracy is lower than in our previous examples. There are several reasons for this. For starters, we did not invest much in data preparation in this case (for example, we did not remove stopwords). However, the main reason for the lower accuracy is that we're using so little data; these models work best on larger datasets.
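Finally, keep in mind that the deployed endpoint keeps running (and being billed) until we remove it. Once we're done experimenting, we can delete it through the session; in this version of the SDK, predictor.endpoint holds the endpoint name:
# Tear down the endpoint so we stop incurring charges
sess.delete_endpoint(predictor.endpoint)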