Example of label propagation based on Markov random walks
For this Python example of label propagation based on Markov random walks, we are going to use a bidimensional dataset containing 50 labeled samples belonging to two different classes, and 1,950 unlabeled samples:
from sklearn.datasets import make_blobs
nb_samples = 2000
nb_unlabeled = 1950
nb_classes = 2
X, Y = make_blobs(n_samples=nb_samples,
                  n_features=2,
                  centers=nb_classes,
                  cluster_std=2.5,
                  random_state=500)
Y[nb_samples - nb_unlabeled:] = -1
The plot of the dataset is shown in the following diagram (the crosses represent the unlabeled samples):
Partially labeled dataset
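The plotting code isn't included in the original snippet; a minimal sketch using matplotlib (the marker and color choices are arbitrary) that reproduces a similar figure is:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))

# Unlabeled samples are drawn as crosses, labeled ones are colored by class
ax.scatter(X[Y == -1, 0], X[Y == -1, 1], marker='x', color='gray', label='Unlabeled')
ax.scatter(X[Y == 0, 0], X[Y == 0, 1], marker='o', color='blue', label='Class 0')
ax.scatter(X[Y == 1, 0], X[Y == 1, 1], marker='o', color='red', label='Class 1')

ax.legend()
plt.show()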
We can now create the graph (using n_neighbors=15) and the weight matrix:
import numpy as np
from sklearn.neighbors import kneighbors_graph
def rbf(x1, x2, sigma=1.0):
    # Standard RBF kernel based on the Euclidean distance
    d = np.linalg.norm(x1 - x2)
    return np.exp(-np.power(d, 2.0) / (2 * np.power(sigma, 2)))
W = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=True).toarray()
for i in range(nb_samples):
    for j in range(nb_samples):
        if W[i, j] != 0.0:
            W[i, j] = rbf(X[i], X[j])
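The double loop is easy to read but slow for large datasets. The following snippet is not part of the original example (the rest of the section keeps using the W built by the loop); it only sketches an equivalent, vectorized construction that reuses kneighbors_graph with mode='distance' to obtain the Euclidean distances of the same neighbors:
# Connectivity mask (which pairs are neighbors) and the corresponding Euclidean distances
W_conn = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=True).toarray()
W_dist = kneighbors_graph(X, n_neighbors=15, mode='distance', include_self=True).toarray()

sigma = 1.0

# Apply the RBF kernel only where a connection exists
# (the self-distance is 0, so the diagonal entries map to 1, as in the loop version)
W_fast = np.where(W_conn != 0.0, np.exp(-np.power(W_dist, 2.0) / (2 * np.power(sigma, 2))), 0.0)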
Now, we need to compute the unlabeled part of the unnormalized graph Laplacian and the unlabeled-labeled part of the matrix W:
D = np.diag(np.sum(W, axis=1))
L = D - W
Luu = L[nb_samples - nb_unlabeled:, nb_samples - nb_unlabeled:]
Wul = W[nb_samples - nb_unlabeled:, 0:nb_samples - nb_unlabeled]
Yl = Y[0:nb_samples - nb_unlabeled]
At this point, it's possible to solve the linear system using the NumPy function np.linalg.solve(), which accepts as parameters the matrix A and the vector b of a generic system in the form Ax = b. Once we have the solution, we can merge the new labels with the original ones (where the unlabeled samples had been marked with -1). In this case, we don't need to convert the probabilities into different label values, because we are using 0 and 1 as labels; in general, it's necessary to apply a threshold (0.5) to select the right label, which is what np.round() does here:
Yu = np.round(np.linalg.solve(Luu, np.dot(Wul, Yl)))
Y[nb_samples - nb_unlabeled:] = Yu.copy()
Replotting the dataset, we get:
Dataset after a complete Markov random walk label propagation
As expected, without any iteration, the labels have been successfully propagated to all samples, in perfect compliance with the clustering assumption. Both this algorithm and label propagation can work using a closed-form solution, so they are very fast even when the number of samples is high; however, there's a fundamental problem regarding the choice of σ/γ for the RBF kernel. As the authors Zhu and Ghahramani themselves remark, there is no standard solution, but it's possible to consider the two limiting cases, σ → 0 and σ → ∞. In the first case, only the nearest point has an influence, while in the second case, the influence is extended to the whole sample space and the unlabeled points tend to acquire the same label. The authors suggest considering the entropy of all samples and trying to find the σ value that minimizes it. This solution can be very effective, but sometimes the minimum entropy corresponds to a label configuration that isn't possible to achieve with these algorithms. The best approach is therefore to try different values (at different scales) and select the one corresponding to a valid configuration with the lowest entropy. In our case, it's possible to compute the entropy of the unlabeled samples as H(Yu) = -Σ p(yu) log p(yu).
The Python code to perform this computation is:
Pu = np.linalg.solve(Luu, np.dot(Wul, Yl))
H = -np.sum(Pu * np.log(Pu + 1e-6))
The term 1e-6 has been added to avoid numerical problems when the probability is null. Repeating this computation for different σ values allows us to find a set of candidates that can be restricted to a single value with a direct evaluation of the labeling accuracy (for example, when there is no precise information about the real distribution, it's possible to consider the coherence of each cluster and the separation between them).
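As a minimal sketch of this search (the helper function entropy_for_sigma, the candidate values, and the clipping of the solution are illustrative choices introduced here, not prescribed by the algorithm), it's possible to rebuild the weight matrix and recompute the entropy for a few σ values at different scales:
def entropy_for_sigma(sigma):
    # Rebuild the weight matrix with the new kernel width
    Ws = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=True).toarray()

    for i in range(nb_samples):
        for j in range(nb_samples):
            if Ws[i, j] != 0.0:
                Ws[i, j] = rbf(X[i], X[j], sigma=sigma)

    # Recompute the Laplacian blocks and solve the system
    Ds = np.diag(np.sum(Ws, axis=1))
    Ls = Ds - Ws
    Luu_s = Ls[nb_samples - nb_unlabeled:, nb_samples - nb_unlabeled:]
    Wul_s = Ws[nb_samples - nb_unlabeled:, 0:nb_samples - nb_unlabeled]

    Pu_s = np.linalg.solve(Luu_s, np.dot(Wul_s, Yl))

    # Clip to (0, 1] to avoid invalid logarithms when the solution overshoots
    Pc = np.clip(Pu_s, 1e-6, 1.0)
    return -np.sum(Pc * np.log(Pc)), Pu_s

# Evaluate candidates at different scales and keep the valid configuration
# (probabilities approximately in [0, 1]) with the lowest entropy
for sigma in (0.1, 0.5, 1.0, 2.5, 5.0):
    H_s, Pu_s = entropy_for_sigma(sigma)
    print('sigma = {:.1f} -> H = {:.2f}'.format(sigma, H_s))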
Another approach is called class rebalancing, and it's based on the idea of reweighting the probabilities of the unlabeled samples so as to rebalance the number of points belonging to each class when the new unlabeled samples are added to the set. If we have N labeled points and M unlabeled ones, with K classes, the weight factor wj for class j can be obtained as a ratio of two averages: the numerator is the average computed over the labeled samples belonging to class j, while the denominator is the average over the unlabeled ones whose estimated class is j. The final decision about a class is no longer based only on the highest probability, but on the highest reweighted probability wj p(yu = j).
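A minimal sketch of this idea for the two-class example (a plain NumPy implementation written only for illustration; the variable names and the use of the clipped solution as class-1 probabilities are assumptions, and the denominator is computed as the average estimated probability mass per class, which is one possible reading of the definition above) could be:
# Estimated probabilities for class 1 on the unlabeled samples (class 0 is the complement);
# the clipping is only a safeguard against values slightly outside [0, 1]
Pu = np.clip(np.linalg.solve(Luu, np.dot(Wul, Yl)), 0.0, 1.0)
P_unlabeled = np.column_stack([1.0 - Pu, Pu])

# Numerator: label proportions over the labeled samples
labeled_fraction = np.array([np.mean(Yl == k) for k in range(nb_classes)])

# Denominator: average estimated probability mass per class over the unlabeled samples
unlabeled_fraction = np.mean(P_unlabeled, axis=0)

# Weight factors: ratio between the labeled and unlabeled class averages
w = labeled_fraction / unlabeled_fraction

# Final decision: class with the highest reweighted probability
Yu_rebalanced = np.argmax(w * P_unlabeled, axis=1)
Y[nb_samples - nb_unlabeled:] = Yu_rebalanced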