Implementing linear regression through Apache Spark
You may be interested in training regression models on datasets that are too large to handle comfortably in scikit-learn. Apache Spark is a good candidate for this scenario. As we mentioned in the previous chapter, Apache Spark can easily run training algorithms on a cluster of machines using Elastic MapReduce (EMR) on AWS. We will explain how to set up EMR clusters in the next chapter. In this section, we'll explain how you can use the Spark ML library to train linear regression algorithms:
- The first step is to create a DataFrame from our training data:
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)
The following screenshot shows the first few rows of the dataset:
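If you are running these steps yourself, a minimal session setup might look like the following sketch; the sql handle, the application name, and the SRC_PATH value are assumptions here, so adjust them to your environment:
from pyspark.sql import SparkSession

# Hypothetical setup: `sql` stands in for a SparkSession and SRC_PATH for
# the directory that contains train.csv.
sql = SparkSession.builder.appName('linear-regression').getOrCreate()
SRC_PATH = 'data/'

# Same read as above, repeated so the sketch is self-contained.
housing_df = sql.read.csv(SRC_PATH + 'train.csv', header=True, inferSchema=True)
housing_df.show(3)  # print the first few rows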
- Spark ML estimators expect the input dataset to have a single column containing a vector of numbers that represents all the training features. In Chapter 2, Classifying Twitter Feeds with Naive Bayes, we used CountVectorizer to create such a column. In this chapter, since the feature values are already numeric columns in our dataset, we just need to assemble them into a single column using a VectorAssembler transformer:
from pyspark.ml.feature import VectorAssembler

training_features = ['crim', 'zn', 'indus', 'chas', 'nox',
                     'rm', 'age', 'dis', 'tax', 'ptratio', 'lstat']
vector_assembler = VectorAssembler(inputCols=training_features,
                                   outputCol="features")
df_with_features_vector = vector_assembler.transform(housing_df)
The following screenshot shows the first few rows of the df_with_features_vector dataset:
Note how the vector assembler created a new column called features, which packs all the features used for training into a single vector per row.
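If you want to verify the assembled column yourself, one way is to select it alongside the target column, medv (the median house value we will predict):
# Sanity check: print the assembled vectors next to the target column.
# truncate=False prints the full vector instead of an abbreviated form.
df_with_features_vector.select('features', 'medv').show(3, truncate=False)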
- As usual, we will split our DataFrame into training and testing sets:
train_df, test_df = df_with_features_vector.randomSplit([0.8, 0.2],
                                                         seed=17)
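To confirm that the split is roughly 80/20, you can count the rows in each DataFrame:
# The two counts should add up to the original row count, split roughly 80/20.
print(train_df.count(), test_df.count())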
- We can now instantiate our regressor and fit a model:
from pyspark.ml.regression import LinearRegression
linear = LinearRegression(featuresCol="features", labelCol="medv")
linear_model = linear.fit(train_df)
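Before evaluating on the test set, you can inspect what the model learned; the fitted LinearRegressionModel exposes the coefficients, the intercept, and a training summary:
# One coefficient per entry in training_features, plus the intercept.
print(linear_model.coefficients)
print(linear_model.intercept)

# Fit statistics that Spark computes on the training data during fitting.
print(linear_model.summary.r2)
print(linear_model.summary.rootMeanSquaredError)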
- Using this model, we obtain predictions for each row in the test dataset:
predictions_df = linear_model.transform(test_df)
predictions_df.show(3)
The output of the show() command is as follows:
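To compare the actual values against the predictions side by side, you can also project just those two columns:
# medv holds the actual median house value; prediction is the model's output.
predictions_df.select('medv', 'prediction').show(3)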
- We can easily find the R2 value by using RegressionEvaluator:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(labelCol="medv",
                                predictionCol="prediction",
                                metricName="r2")
evaluator.evaluate(predictions_df)
In this case, we get an R2 of 0.688, which is similar to the result we obtained with scikit-learn.
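The same evaluator class can compute other regression metrics by changing metricName; for example, a sketch for obtaining the root mean squared error on the same predictions:
# RMSE on the test predictions, using the same label and prediction columns.
rmse_evaluator = RegressionEvaluator(labelCol="medv",
                                     predictionCol="prediction",
                                     metricName="rmse")
rmse_evaluator.evaluate(predictions_df)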