Practical Data Analysis
上QQ阅读APP看书,第一时间看更新

Learning and classification

When we want to automatically identify to which category a specific value (categorical value) belongs, we need to implement an algorithm that can predict the most likely category for the value, based on the previous data. This is called Classification. In the words of Tom Mitchell:

"How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?"

The keyword here is learning (supervised learning in this case), and also how to train an algorithm to identify categorical elements. The common examples are spam classification , speech recognition , search engines , computer vision , and language detection ; but there are a large number of applications for a classifier. We can find two kinds of problems in classification. The binary classification is where we have only two categories (spam or not spam) and multiclass classification is where many categories are involved (for example, opinions can be positive, neutral, negative, and so on). We can find several algorithms for classification, the most frequently used are support vector machines, neural networks , decision trees , Naïve Bayes , and hidden Markov models . In this chapter, we will implement a probabilistic classification using Naïve Bayes algorithm, but in the following chapters we will implement several other classification algorithms for a variety of problems.

The general steps involved in supervised classification are shown in the following figure. First we will collect training data (previously classified), then we will perform feature extraction (relevant features for the categorization). Next, we will train the algorithm with the features vector. Once we get our trained classifier, we may insert new strings, extract their features, and send them to the classifier. Finally, the classifier will give us the most likely class (category) for the new string.

Additionally we will test the classifier accuracy by using a hand-classified test set. Due to this, we will split the data into two sets, the training data and the test data.