Classifying Twitter Feeds with Naive Bayes
Machine learning (ML) plays a major part in analyzing large datasets and extracting actionable insights from them. ML algorithms perform tasks such as predicting outcomes, clustering data to extract trends, and building recommendation engines. Knowledge of ML algorithms helps data scientists understand the nature of the data they are dealing with and plan which algorithms should be applied to achieve the desired outcomes. Because several algorithms can often perform the same task, it is important for data scientists to know the pros and cons of each. The choice of algorithm can depend on various factors, such as the size of the dataset, the budget for the clusters used to train and deploy ML models, and the cost of errors. Although AWS offers a large number of options for selecting and deploying ML models, a data scientist still has to know which algorithms to use in different situations.
In this part of the book, we present various popular ML algorithms and examples of applications where they can be applied effectively. We will explain the advantages and disadvantages of each algorithm and the situations in which it should be selected on AWS. As this book is written with data science students and professionals in mind, we will first show how each algorithm can be implemented using simple Python libraries, and then how it can be deployed on AWS clusters using Spark and Amazon SageMaker for larger datasets. These chapters should help data scientists become familiar with popular ML algorithms and understand the nuances of implementing them in big data environments on AWS clusters.
Chapter 2, Classifying Twitter Feeds with Naive Bayes, Chapter 3, Predicting House Value with Regression Algorithms, and Chapter 4, Predicting User Behavior with Tree-Based Methods, present three algorithms that can be used to predict an outcome based on a feature set. Chapter 5, Customer Segmentation Using Clustering Algorithms, explains clustering algorithms and demonstrates how they can be used for applications such as customer segmentation. Chapter 6, Analyzing Visitor Patterns to Make Recommendations, presents a recommendation algorithm that can be used to recommend new items to users based on their purchase history. Chapter 7, Implementing Deep Learning Algorithms, turns to deep learning.
This chapter will introduce the basics of the Naive Bayes algorithm and present a text classification problem that will be addressed using this algorithm and language models. We'll provide examples of how to use it with scikit-learn, Apache Spark, and SageMaker's BlazingText. Additionally, we'll explore how the ideas behind Bayesian reasoning can be applied in more complex scenarios.
In this chapter, we will cover the following topics:
- Classification algorithms
- Naive Bayes classifier
- Classifying text with language models
- Naive Bayes — pros and cons
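As a preview of the idea behind the topics listed above, here is a minimal sketch of a multinomial Naive Bayes text classifier written with only the Python standard library. It is not the implementation used later in the chapter. The function names (`train_naive_bayes`, `predict`), the whitespace tokenizer, the add-one smoothing, and the toy tweets and labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior P(class)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # skip words never seen during training
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up tweets used only to exercise the sketch.
docs = [
    "great game tonight what a win",
    "the team played a great match",
    "new budget vote in the senate",
    "senate debate on the tax vote",
]
labels = ["sports", "sports", "politics", "politics"]

model = train_naive_bayes(docs, labels)
prediction = predict(model, "great win for the team")
print(prediction)
```

The "naive" part is the loop over tokens: each word contributes to the score independently of the others, so training reduces to counting and prediction reduces to summing logarithms.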