
Example of label propagation

We can implement the algorithm in Python, using a bidimensional test dataset:

from sklearn.datasets import make_classification

nb_samples = 100
nb_unlabeled = 75

X, Y = make_classification(n_samples=nb_samples, n_features=2, n_informative=2, n_redundant=0, random_state=1000)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0

As in the other examples, we set y = 0 for all unlabeled samples (75 out of 100). The corresponding plot is shown in the following graph:

Partially labeled dataset
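The previous figure can be reproduced with a few lines of Matplotlib; this is only a minimal sketch (not part of the original example, with arbitrary marker styles):

import matplotlib.pyplot as plt

# labeled samples as circles/squares, unlabeled ones (Y == 0) as crosses
plt.scatter(X[Y == 1, 0], X[Y == 1, 1], marker='o', label='Class +1')
plt.scatter(X[Y == -1, 0], X[Y == -1, 1], marker='s', label='Class -1')
plt.scatter(X[Y == 0, 0], X[Y == 0, 1], marker='x', label='Unlabeled')
plt.legend()
plt.show()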

The dots marked with a cross are unlabeled. At this point, we can define the affinity matrix. In this case, we compute it using both methods:

from sklearn.neighbors import kneighbors_graph

nb_neighbors = 2

W_knn_sparse = kneighbors_graph(X, n_neighbors=nb_neighbors, mode='connectivity', include_self=True)
W_knn = W_knn_sparse.toarray()

The KNN matrix is obtained using the Scikit-Learn function kneighbors_graph() with the parameters n_neighbors=2 and mode='connectivity'; the alternative is 'distance', which returns the actual distances instead of the 0/1 values indicating the absence/presence of an edge. The include_self=True parameter is useful, as we want to have Wii = 1.
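For comparison, the 'distance' variant could be computed as follows (a minimal sketch, not needed for the rest of the example; note that the self-distance is always 0):

W_knn_dist_sparse = kneighbors_graph(X, n_neighbors=nb_neighbors, mode='distance', include_self=True)
W_knn_dist = W_knn_dist_sparse.toarray()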

The RBF matrix, instead, needs to be defined manually:

import numpy as np

def rbf(x1, x2, gamma=10.0):
    # Gaussian RBF based on the squared Euclidean distance
    n = np.linalg.norm(x1 - x2)
    return np.exp(-gamma * np.power(n, 2))

W_rbf = np.zeros((nb_samples, nb_samples))

for i in range(nb_samples):
    for j in range(nb_samples):
        W_rbf[i, j] = rbf(X[i], X[j])
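As an aside (not part of the original example), the same matrix can also be built in vectorized form; a minimal sketch using SciPy's cdist():

from scipy.spatial.distance import cdist

# squared Euclidean distances between all pairs of samples, then the RBF kernel
W_rbf_fast = np.exp(-10.0 * cdist(X, X, metric='sqeuclidean'))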

The default value for γ is 10, corresponding to a standard deviation σ equal to about 0.22 (from the relation γ = 1/(2σ²)). When using this method, it's important to set a correct value for γ; otherwise, the propagation can degenerate into the predominance of a single class (for example, when γ is too small). Now, we can compute the degree matrix and its inverse. As the procedure is identical, from this point on we continue using only the RBF affinity matrix:

D_rbf = np.diag(np.sum(W_rbf, axis=1))
D_rbf_inv = np.linalg.inv(D_rbf)
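As a quick sanity check (not in the original example), the row-normalized matrix P = D⁻¹W must be row-stochastic, so every row has to sum to 1; a minimal sketch:

P_check = np.dot(D_rbf_inv, W_rbf)

# each row of the transition matrix must sum to 1
assert np.allclose(np.sum(P_check, axis=1), 1.0)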

The algorithm is implemented as an iteration that stops when the L1 norm of the difference between two consecutive label vectors drops below a tolerance threshold. The value adopted here is 0.01:

tolerance = 0.01

Yt = Y.copy()
Y_prev = np.zeros((nb_samples,))
iterations = 0

while np.linalg.norm(Yt - Y_prev, ord=1) > tolerance:
    P = np.dot(D_rbf_inv, W_rbf)
    Yt = np.dot(P, Yt)
    # clamp the originally labeled samples to their initial values
    Yt[0:nb_samples - nb_unlabeled] = Y[0:nb_samples - nb_unlabeled]
    Y_prev = Yt.copy()
    iterations += 1

Y_final = np.sign(Yt)
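Incidentally, the same transductive problem can also be solved with the built-in scikit-learn estimator LabelPropagation (whose implementation differs slightly from the procedure shown here); a minimal sketch, assuming the labels are remapped to its convention, where -1 marks the unlabeled samples:

from sklearn.semi_supervised import LabelPropagation

# scikit-learn expects unlabeled samples to be marked with -1, so remap:
# class -1 -> 0, class 1 -> 1, unlabeled (0) -> -1
Y_sk = np.full(nb_samples, -1, dtype=int)
Y_sk[0:nb_samples - nb_unlabeled] = (Y[0:nb_samples - nb_unlabeled] + 1) // 2

lp = LabelPropagation(kernel='rbf', gamma=10.0)
lp.fit(X, Y_sk)
Y_sk_final = lp.transduction_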

The final result is shown in the following double plot:

Original dataset (left); dataset after a complete label propagation (right)

As it's possible to see, in the original dataset there's a round dot (at about (-0.9, -1)) surrounded by square ones. As this algorithm keeps the original labels fixed, we find the same situation after the propagation. This condition could be acceptable, even though both the smoothness and clustering assumptions are contradicted. If we assume instead that a correction is reasonable, it's possible to force it by relaxing the algorithm, that is, by removing the clamping step:

tolerance = 0.01

Yt = Y.copy()
Y_prev = np.zeros((nb_samples,))
iterations = 0

while np.linalg.norm(Yt - Y_prev, ord=1) > tolerance:
    P = np.dot(D_rbf_inv, W_rbf)
    # no clamping step: the original labels can also be overwritten
    Yt = np.dot(P, Yt)
    Y_prev = Yt.copy()
    iterations += 1

Y_final = np.sign(Yt)

In this way, we don't reset the original labels, letting the propagation change all the values that disagree with their neighborhood. The result is shown in the following plot:

Original dataset (left); dataset after a complete label propagation with overwrite (right)
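To quantify the effect of the relaxation, a small check (not in the original example) can count how many of the originally labeled samples were overwritten:

# number of originally labeled samples whose class changed after the relaxed propagation
n_changed = np.sum(Y_final[0:nb_samples - nb_unlabeled] != Y[0:nb_samples - nb_unlabeled])
print('Overwritten labels: {}'.format(n_changed))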