Learning Data Mining with Python (Second Edition)

Distance metrics

A key underlying concept in data mining is that of distance. If we have two samples, we need to answer questions such as: are these two samples more similar to each other than another pair of samples? Answering questions like this is important to the outcome of the data mining exercise.

The most commonly used metric is Euclidean distance, which is the real-world distance between two objects. If you were to plot the points on a graph and measure the distance with a ruler, the result would be the Euclidean distance.

A little more formally, the Euclidean distance between points a and b is the square root of the sum of the squared differences for each feature.
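To make the formula concrete, here is a minimal sketch using NumPy; the function name and the example values are ours for illustration, not part of any library:

import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of the squared per-feature differences
    a, b = np.asarray(a), np.asarray(b)
    return np.sqrt(np.sum((a - b) ** 2))

# Two samples with three features each; the result here is 5.0
print(euclidean_distance([1, 2, 3], [4, 6, 3]))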

Euclidean distance is intuitive, but it provides poor accuracy if some features have larger values than others. It can also be thrown off when many features have a value of 0, a situation known as a sparse matrix.

There are other distance metrics in use; two commonly employed ones are the Manhattan and Cosine distances.

The Manhattan distance is the sum of the absolute differences in each feature (with no squaring of the differences).

Intuitively, we can imagine the Manhattan distance as the number of moves a Rook piece (also called a Castle) would make in Chess if it were limited to moving one square at a time. While the Manhattan distance does suffer if some features have larger values than others, the effect is not as dramatic as in the case of Euclidean distance.
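Continuing the earlier sketch, a hedged NumPy version of the Manhattan distance simply swaps the squared differences for absolute ones (again, the function name is ours):

import numpy as np

def manhattan_distance(a, b):
    # Sum of the absolute per-feature differences
    a, b = np.asarray(a), np.asarray(b)
    return np.sum(np.abs(a - b))

# Same two samples as before; the result here is 7
print(manhattan_distance([1, 2, 3], [4, 6, 3]))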

The Cosine distance is better suited to cases where some features are larger than others and when there are lots of zeros in the dataset.

Intuitively, we draw a line from the origin to each of the samples and measure the angle between those lines. We can observe the differences between the algorithms in the following diagram:

In this example, each of the gray circles is exactly the same distance from the white circle. In (a), the distances are Euclidean, and therefore, similar distances fit around a circle. This distance can be measured using a ruler. In (b), the distances are Manhattan, also called City Block. We compute the distance by moving across rows and columns, like how a Rook (Castle) in Chess moves. Finally, in (c), we have the Cosine distance, which is measured by computing the angle between the lines drawn from the origin to each sample, ignoring the actual length of the lines.
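A minimal sketch of the Cosine distance follows the same pattern as the earlier ones; we compute one minus the cosine of the angle between the two vectors (the function name is again ours):

import numpy as np

def cosine_distance(a, b):
    # One minus the cosine of the angle between the two vectors
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Vectors pointing the same way have a distance of (almost) 0,
# regardless of their lengths
print(cosine_distance([1, 2, 3], [2, 4, 6]))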

The distance metric chosen can have a large impact on the final performance.

For example, if you have many features, the Euclidean distances between random samples converge to nearly the same value (the famous curse of dimensionality). Euclidean distance in high dimensions has a hard time comparing samples, as the distances are all nearly the same!
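You can verify this convergence with a quick experiment; the sample count and dimensions below are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)

for n_features in (2, 1000):
    # 50 random samples, then all pairwise Euclidean distances
    X = rng.random((50, n_features))
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(50, k=1)]  # keep each pair once
    # The spread-to-mean ratio shrinks as the dimension grows,
    # meaning the distances become nearly indistinguishable
    print(n_features, dists.std() / dists.mean())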

Manhattan distance can be more stable in these circumstances, but if some features have very large values, this can drown out similarities in other features. For example, if feature A has values between 1 and 2, and feature B has values between 1000 and 2000, feature A is unlikely to have any impact on the result. This problem can be addressed through normalization, which makes Manhattan (and Euclidean) distance more reliable across features with different scales, as we will see later in this chapter.
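As a hedged sketch of that fix, scikit-learn's MinMaxScaler rescales each feature to the same range; the toy values below mirror the feature A and feature B example:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Feature A spans 1-2, feature B spans 1000-2000
X = np.array([[1.0, 1000.0],
              [1.5, 2000.0],
              [2.0, 1500.0]])

# After scaling, both features lie in [0, 1], so neither
# dominates a Manhattan or Euclidean distance computation
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)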

Finally, Cosine distance is a good metric for comparing items with many features, but it discards some information about the length of the vectors, which is useful in some applications. We would often use Cosine distance in text mining due to the large number of features inherent in that domain (see Chapter 6, Social Media Insight Using Naive Bayes).

Ultimately, either a theoretical approach is needed to determine which distance metric to use, or an empirical evaluation is needed to see which performs more effectively. I prefer the empirical approach, but either can yield good results.

For this chapter, we will stay with Euclidean distance, using other metrics in later chapters. If you'd like to experiment, then try setting the metric to Manhattan and see how that affects the results.
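If you want to try that, one possible sketch with scikit-learn's nearest neighbor classifier follows; the Iris dataset stands in here for whatever dataset you are working with:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small stand-in dataset and split it for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14)

# metric='manhattan' replaces the default Euclidean-style metric
estimator = KNeighborsClassifier(metric='manhattan')
estimator.fit(X_train, y_train)
print(estimator.score(X_test, y_test))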