
How it works...
A matrix can be thought of as a collection of column vectors. Matrices are the workhorses of distributed computation involving linear algebra transformations. A variety of attributes or feature representations can be collected and operated on via matrices.
In short, matrices are two-dimensional m x n arrays of numbers (usually real numbers) whose elements can be referenced using a two-element subscript, i and j:
A matrix is represented as follows:
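In standard notation, an m x n matrix A with entries a_ij is written as:

```latex
A = \begin{pmatrix}
  a_{11} & a_{12} & \cdots & a_{1n} \\
  a_{21} & a_{22} & \cdots & a_{2n} \\
  \vdots & \vdots & \ddots & \vdots \\
  a_{m1} & a_{m2} & \cdots & a_{mn}
\end{pmatrix}
```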

A matrix transpose is represented as follows:
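In standard notation, the transpose swaps rows and columns, so an m x n matrix becomes an n x m matrix:

```latex
(A^{T})_{ij} = A_{ji}
```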

Matrix multiplication is represented as follows:
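In standard notation, the product of an m x n matrix A and an n x p matrix B is the m x p matrix whose entries are:

```latex
(AB)_{ij} = \sum_{k=1}^{n} a_{ik} \, b_{kj}
```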

Vector-matrix multiplication, and the closely related vector "dot" product, are represented as follows:
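In standard notation, the dot product of two n-vectors x and y, and the vector-matrix product of an m-vector x with an m x n matrix A, are:

```latex
x \cdot y = x^{T} y = \sum_{i=1}^{n} x_{i} y_{i}
\qquad
(x^{T} A)_{j} = \sum_{i=1}^{m} x_{i} a_{ij}
```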


Distributed matrices in the Spark 2.0 ML library: In the next four recipes, we will cover the four types of distributed matrices in Spark. Spark provides full support for distributed matrices backed by RDDs right out of the box. The fact that Spark supports distributed computing does not relieve the developer of the need to plan their algorithms with parallelism in mind.
The underlying RDDs provide full parallelism and fault tolerance for the data stored in the matrix. Spark is bundled with MLlib and its linalg package, which jointly provide a public interface and support for matrices that are not local and need full cluster support due to their size or the complexity of chained operations.
Spark ML provides four types of distributed matrices to support parallelism: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix:
- RowMatrix: Represents a row-oriented distributed matrix, compatible with the ML library, whose rows carry no meaningful indices
- IndexedRowMatrix: Similar to RowMatrix, with the additional benefit of indexed rows. This is a specialized version of RowMatrix in which the matrix is created from an RDD of the IndexedRow (Index, Vector) data structure. To visualize it, imagine a matrix where each row is a (Long, Vector) pair and the work of pairing them (the zip function) has been done for you. This allows you to carry the index together with the row vector along its computational path in a given algorithm (matrix operations at scale)
- CoordinateMatrix: A distributed matrix stored in coordinate format, that is, as (row index, column index, value) entries; a very useful format for coordinate-style data (for example, x, y, z coordinates in a projection space)
- BlockMatrix: A distributed matrix made of blocks of locally maintained matrices
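As a quick plain-Python illustration (this is not the Spark API, just the idea behind IndexedRow), pairing row indices with row vectors is simply a zip, and the index then travels with each row through any per-row computation:

```python
# Plain-Python sketch (not the Spark API): an "indexed row" is just an
# (index, vector) pair, which is exactly what zipping indices onto rows
# produces.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
indexed_rows = list(zip(range(len(rows)), rows))
# indexed_rows == [(0, [1.0, 2.0]), (1, [3.0, 4.0]), (2, [5.0, 6.0])]

# The index travels with the row through per-row computations, so you
# always know which row a result belongs to, e.g. scaling every row:
scaled = [(i, [2.0 * x for x in row]) for i, row in indexed_rows]
# scaled == [(0, [2.0, 4.0]), (1, [6.0, 8.0]), (2, [10.0, 12.0])]
```

In an IndexedRowMatrix, this pairing is maintained for you by the data structure itself rather than by an explicit zip.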
We cover the creation of all four types in a brief recipe each, and then quickly move to a more complicated (in both code and concept) use case involving RowMatrix: a typical ML scenario in which a massively parallel distributed matrix takes part in an operation (for example, multiplication) with a local matrix.
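To see why this use case parallelizes so well, here is a plain-Python sketch (not the Spark API): when a row-distributed matrix is multiplied by a small local matrix, each output row depends on only one distributed row, so every partition can compute its share of the result independently:

```python
# Plain-Python sketch: multiplying a row-distributed matrix by a small
# local matrix. Each output row depends only on one distributed row,
# which is why this maps cleanly over partitions with no shuffling of rows.
def row_times_local(row, local):
    # One output row: the dot product of `row` with each column of `local`.
    cols = len(local[0])
    return [sum(row[k] * local[k][j] for k in range(len(row)))
            for j in range(cols)]

# Imagine each of these rows living on a different partition.
distributed_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
local_b = [[0.0, 1.0], [1.0, 0.0]]  # small local matrix (here: column swap)

# In Spark this would be a map over the RDD of rows; no row ever needs
# to see another row, only the broadcast local matrix.
result = [row_times_local(r, local_b) for r in distributed_rows]
# result == [[2.0, 1.0], [4.0, 3.0], [6.0, 5.0]]
```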
If you plan to code or design large matrix operations, you must dig into Spark internals, such as core Spark and how staging, pipelining, and shuffling work in each version of Spark (there is continuous improvement and optimization in each release).
We also recommend the following before embarking on a large-scale matrix and optimization journey:
- The source for matrix computations and optimization in Apache Spark is available at http://www.kdd.org/kdd2016/papers/files/adf0163-bosagh-zadehAdoi.pdf and https://pdfs.semanticscholar.org/a684/fc37c79a3276af12a21c1af1ebd8d47f2d6a.pdf
- The source for efficient large-scale distributed matrix computation with Spark is available at https://www.computer.org/csdl/proceedings/big-data/2015/9926/00/07364023.pdf and http://dl.acm.org/citation.cfm?id=2878336&preflayout=flat
- The source for exploring matrix dependency for efficient distributed matrix computation is available at http://net.pku.edu.cn/~cuibin/Papers/2015-SIGMOD-DMac.pdf and http://dl.acm.org/citation.cfm?id=2723712