Learning Data Mining with Python (Second Edition)

Implementing the OneR algorithm

OneR is a simple algorithm that predicts the class of a sample by finding the most frequent class for each feature value. OneR is shorthand for One Rule, indicating that we use only a single rule for this classification, choosing the feature whose rule performs best. While some of the later algorithms are significantly more complex, this simple algorithm has been shown to perform well on some real-world datasets.

The algorithm starts by iterating over every value of every feature. For each value, count the number of samples from each class that have that feature value. Record the most frequent class for the feature value, and the error of that prediction.

For example, if a feature has two values, 0 and 1, we first check all samples that have the value 0. For that value, we may have 20 samples in Class A, 60 in Class B, and a further 20 in Class C. The most frequent class for this value is B, so the prediction for this feature value is B, with an error of 40, as 40 samples belong to a class other than the prediction. We then follow the same procedure for the value 1 of this feature, and then for all other feature/value combinations.

Once these combinations are computed, we compute the error for each feature by summing up the errors for all values for that feature. The feature with the lowest total error is chosen as the One Rule and then used to classify other instances.
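
To make the counting concrete, the following short sketch (using the hypothetical counts from the example above) shows how the prediction and error fall out for a single feature value:

class_counts = {"A": 20, "B": 60, "C": 20}  # hypothetical counts for value 0
# The prediction is the most frequent class for this feature value...
most_frequent_class = max(class_counts, key=class_counts.get)
# ...and the error counts every sample belonging to any other class
error = sum(count for cls, count in class_counts.items()
            if cls != most_frequent_class)
print(most_frequent_class, error)  # prints: B 40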

In code, we will first create a function that computes the class prediction and error for a specific feature value. We have two necessary imports, defaultdict and itemgetter, that we used in earlier code:

from collections import defaultdict 
from operator import itemgetter

Next, we create the function definition, which needs the dataset, the classes, the index of the feature we are interested in, and the value we are computing the prediction for. It loops over each sample and counts the number of times each feature value corresponds to a specific class. We then choose the most frequent class for the current feature/value pair:

def train_feature_value(X, y_true, feature, value):
    # Create a simple dictionary to count how frequently samples with this
    # feature value belong to each class
    class_counts = defaultdict(int)
    # Iterate through each sample and count the frequency of each
    # class/value pair
    for sample, y in zip(X, y_true):
        if sample[feature] == value:
            class_counts[y] += 1
    # Now get the best one by sorting (highest first) and choosing the
    # first item
    sorted_class_counts = sorted(class_counts.items(),
                                 key=itemgetter(1), reverse=True)
    most_frequent_class = sorted_class_counts[0][0]
    # The error is the number of samples that do not classify as the most
    # frequent class *and* have this feature value
    error = sum([class_count for class_value, class_count
                 in class_counts.items()
                 if class_value != most_frequent_class])
    return most_frequent_class, error

As a final step, we also compute the error of this rule. In the OneR algorithm, any sample with this feature value would be predicted as being the most frequent class. Therefore, we compute the error by summing up the counts for the other classes (not the most frequent). These counts represent the training samples that this rule classifies incorrectly.
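
As a quick sanity check, we can call train_feature_value on a small made-up dataset (the arrays below are purely illustrative):

import numpy as np

X = np.array([[0, 1],
              [0, 0],
              [1, 1],
              [0, 1],
              [1, 0]])
y = np.array([0, 0, 1, 1, 1])
# Samples with feature 0 equal to 0 have classes 0, 0, 1,
# so we predict class 0 with an error of 1
print(train_feature_value(X, y, feature=0, value=0))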

With this function, we can now compute the error for an entire feature by looping over all the values for that feature, summing the errors, and recording the predicted classes for each value.

The function needs the dataset, the classes, and the index of the feature we are interested in. It then iterates through that feature's different values, recording the most frequent class (and the error) for each value; together these form this feature's rule in OneR:

def train(X, y_true, feature):
    # Check that the feature index is valid
    n_samples, n_features = X.shape
    assert 0 <= feature < n_features
    # Get all of the unique values that this feature takes
    values = set(X[:, feature])
    # Stores the predictors dictionary that is returned
    predictors = dict()
    errors = []
    for current_value in values:
        most_frequent_class, error = train_feature_value(
            X, y_true, feature, current_value)
        predictors[current_value] = most_frequent_class
        errors.append(error)
    # Compute the total error of using this feature to classify on
    total_error = sum(errors)
    return predictors, total_error

Let's have a look at this function in a little more detail.

After the initial validity checks, we find all the unique values that the given feature takes. The indexing in the next line selects the whole column for the given feature and returns it as an array. We then use the set function to find only the unique values:

    values = set(X[:, feature])

Next, we create the dictionary that will store the predictors. This dictionary will have feature values as the keys and the predicted class as the value. An entry with key 1.5 and value 2 would mean that, when the feature has a value of 1.5, we classify the sample as belonging to class 2. We also create a list storing the errors for each feature value:

    predictors = dict()
    errors = []
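
For instance, after training, a predictors dictionary for one feature might look like this (the values here are hypothetical):

predictors = {1.5: 2, 3.0: 0}  # value 1.5 predicts class 2, value 3.0 predicts class 0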

As the main section of this function, we iterate over all the unique values for this feature and use our previously defined train_feature_value function to find the most frequent class and the error for a given feature value. We store the results as outlined earlier:

Finally, we compute the total error of this rule and return the predictors along with this value:

    total_error = sum(errors)
    return predictors, total_error
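
To complete the OneR picture sketched at the start of this section, we would train one rule per feature and keep the feature with the lowest total error. The following is a minimal sketch, assuming X and y_true are the NumPy training arrays; the predict helper and variable names are illustrative, not part of the code above:

# Train one rule per feature and record its total error
all_predictors = {feature: train(X, y_true, feature)
                  for feature in range(X.shape[1])}
errors = {feature: error
          for feature, (predictors, error) in all_predictors.items()}
# The One Rule uses the feature with the lowest total training error
best_feature, best_error = min(errors.items(), key=itemgetter(1))
model = {'feature': best_feature,
         'predictor': all_predictors[best_feature][0]}

# Classify new samples by looking up their value for the chosen feature
# (this assumes every value seen at prediction time was seen in training)
def predict(X_new, model):
    feature = model['feature']
    predictor = model['predictor']
    return [predictor[sample[feature]] for sample in X_new]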