Mastering Machine Learning Algorithms

t-SNE

This algorithm, proposed by van der Maaten and Hinton and formally known as t-Distributed Stochastic Neighbor Embedding (t-SNE), is one of the most powerful manifold dimensionality reduction techniques. Contrary to the other methods, this algorithm starts with a fundamental assumption: the similarity between two N-dimensional points xi and xj can be represented as the conditional probability p(xj|xi), where each point is represented by a Gaussian distribution centered at xi with variance σi. The variances are selected starting from the desired perplexity, defined as:

Perp(p_i) = 2^{H(p_i)} \quad \text{where} \quad H(p_i) = -\sum_{j} p(x_j|x_i) \log_2 p(x_j|x_i)
Low perplexity values indicate low uncertainty and are normally preferable. In common t-SNE tasks, values in the range of 10 to 50 are usually acceptable.
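As the perplexity is just two raised to the entropy of the conditional distribution, it can be checked directly. The following is a minimal sketch (not the book's code) that computes the perplexity of a single row of conditional probabilities; a uniform distribution over k neighbors yields a perplexity of exactly k:

```python
import numpy as np

# Perplexity of one row of conditional probabilities p(x_j|x_i):
# Perp(p_i) = 2^H(p_i), with H(p_i) the Shannon entropy in bits
def perplexity(p_row, eps=1e-12):
    p = p_row[p_row > eps]         # ignore zero entries (e.g., p(x_i|x_i))
    h = -np.sum(p * np.log2(p))    # entropy H(p_i)
    return 2.0 ** h

# A uniform distribution over 30 neighbors has perplexity 30
p_uniform = np.full(30, 1.0 / 30.0)
print(perplexity(p_uniform))       # ~30.0
```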

The assumption on the conditional probabilities can be interpreted as follows: if two samples are very similar, the probability of the first sample conditioned on the second one is high, while dissimilar points yield low conditional probabilities. For example, thinking about images, a point centered on the pupil can have as neighbors some points belonging to an eyelash. In terms of probabilities, we can say that p(eyelash|pupil) is quite high, while p(nose|pupil) is obviously lower. t-SNE models these conditional probabilities as:

p(x_j|x_i) = \frac{\exp\left(-\|x_j - x_i\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_k - x_i\|^2 / 2\sigma_i^2\right)}
The probabilities p(xi|xi) are set to zero, so the previous formula can be extended to the whole graph. To make the problem easier to solve, the conditional probabilities are also symmetrized:

p_{ij} = \frac{p(x_j|x_i) + p(x_i|x_j)}{2N}
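The following is a minimal sketch (not the book's implementation) of the symmetrized joint distribution P. For simplicity, it assumes a single fixed bandwidth σ for every point, while the actual algorithm calibrates each σi with a binary search so that the resulting perplexity matches the desired value:

```python
import numpy as np

def joint_probabilities(X, sigma=1.0):
    n = X.shape[0]
    # Squared pairwise distances ||x_i - x_j||^2
    sq_d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian affinities; the diagonal enforces p(x_i|x_i) = 0
    logits = -sq_d / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)
    P_cond = np.exp(logits)
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # each row is p(.|x_i)
    # Symmetrization: p_ij = (p(x_j|x_i) + p(x_i|x_j)) / 2N
    return (P_cond + P_cond.T) / (2.0 * n)

X = np.random.RandomState(1000).randn(100, 10)
P = joint_probabilities(X)
print(P.sum())                                    # ~1.0 (valid distribution)
```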
The probability distribution obtained in this way represents the high-dimensional input relationship. As our goal is to reduce the dimensionality to a value M < N, we can consider a similar probabilistic representation for the target points φi, using a Student's t-distribution with one degree of freedom:

q_{ij} = \frac{\left(1 + \|\phi_i - \phi_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\phi_k - \phi_l\|^2\right)^{-1}}
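A sketch of the low-dimensional affinities under the same assumptions (a Student's t-distribution with one degree of freedom is equivalent to a Cauchy kernel, whose heavy tails allow moderately dissimilar points to be placed farther apart in the embedding):

```python
import numpy as np

def q_joint(Phi):
    # Squared pairwise distances ||phi_i - phi_j||^2 in the embedding
    sq_d = np.sum((Phi[:, None, :] - Phi[None, :, :]) ** 2, axis=-1)
    inv = 1.0 / (1.0 + sq_d)    # Student's t kernel with 1 degree of freedom
    np.fill_diagonal(inv, 0.0)  # exclude self-affinities
    return inv / inv.sum()      # normalize over all pairs
```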
We want the low-dimensional distribution Q to be as close as possible to the high-dimensional distribution P; therefore, the aim of the t-SNE algorithm is to minimize the Kullback-Leibler divergence between P and Q:

D_{KL}(P \| Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} = \sum_{i,j} p_{ij} \log p_{ij} - \sum_{i,j} p_{ij} \log q_{ij}
The first term depends only on the entropy of the original distribution P and is constant with respect to the embedding, while the second one is the cross-entropy H(P, Q), which has to be minimized to solve the problem. The standard approach is based on a gradient-descent algorithm, but there are also some useful variations that can improve the performance; they are discussed in Visualizing Data Using t-SNE, van der Maaten L.J.P., Hinton G.E., Journal of Machine Learning Research 9 (Nov), 2008.
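Reusing the joint_probabilities() and q_joint() sketches above, the cost can be evaluated directly for any candidate embedding; in practice, however, the full algorithm (including the gradient-descent optimization and the perplexity-based calibration of each σi) is available in scikit-learn. Both snippets below are illustrative sketches; the parameter values are arbitrary choices, not prescriptions:

```python
import numpy as np

# D_KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij); terms with p_ij = 0 vanish
def kl_divergence(P, Q, eps=1e-12):
    mask = P > eps
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))

rs = np.random.RandomState(1000)
X = rs.randn(100, 10)    # high-dimensional samples
Phi = rs.randn(100, 2)   # a random (unoptimized) 2D embedding
print(kl_divergence(joint_probabilities(X), q_joint(Phi)))
```

The scikit-learn implementation performs the whole procedure end to end:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional handwritten-digit samples
X, _ = load_digits(return_X_y=True)

tsne = TSNE(n_components=2, perplexity=30.0, random_state=1000)
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)      # (1797, 2)
```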