R Machine Learning Essentials

Identifying hidden patterns

Some of the information in a dataset is evident, but much more of it is implicit. Sometimes, the solution to a business problem requires information that is less evident and may be partly subjective. This section shows how some machine learning techniques discover hidden structures and patterns in the data.

Data contains hidden information

Data that tracks an activity is typically recorded by a technological device. For instance, in a supermarket, the checkout machines track the purchases, so information about the past sales of each item is available. This information is the Point of Sale (POS) data, and it describes each transaction through the following attributes:

Item ID
Number of units that have been sold
Price of the item
Date and time of the purchase
The checkout machine's ID
Customer ID (for customers that use a Nectar card)

Some information is manifest and easily accessible by analyzing the data, whereas other information is hidden. Starting from the transactions, it's easy to determine the total sales in the past. For instance, we can count how many units of a product were sold in a day. Doing so takes two steps:

Select the transactions based on the product ID and the day.
Add the number of units.
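The two steps above can be sketched in a few lines of Python. This is only an illustration: the field names (`item_id`, `units`, `timestamp`) and the sample transactions are assumptions made for the example, not part of any real POS schema.

```python
from datetime import date, datetime

# Hypothetical POS transactions; field names and values are assumptions.
transactions = [
    {"item_id": "A1", "units": 2, "timestamp": datetime(2015, 3, 2, 9, 15)},
    {"item_id": "A1", "units": 1, "timestamp": datetime(2015, 3, 2, 18, 40)},
    {"item_id": "B7", "units": 5, "timestamp": datetime(2015, 3, 2, 11, 5)},
    {"item_id": "A1", "units": 4, "timestamp": datetime(2015, 3, 3, 10, 0)},
]


def units_sold(transactions, item_id, day):
    """Return the total units of one item sold on one calendar day."""
    # Step 1: select the transactions based on the product ID and the day.
    selected = [
        t for t in transactions
        if t["item_id"] == item_id and t["timestamp"].date() == day
    ]
    # Step 2: add the number of units.
    return sum(t["units"] for t in selected)


total = units_sold(transactions, "A1", date(2015, 3, 2))
print(total)
```

The same function answers the question for any item and day, which is why this kind of information is considered evident rather than hidden.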

It's still easy to obtain some slightly more elaborate information. We can divide the items into departments, and to find the total units that have been sold in each department in the previous year, we can:

Generate a list of product IDs for each department.
For each department, select the transactions from the previous year that involve the department's product IDs.
Add the number of units.
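The department totals can be sketched the same way. The mapping from item to department (`department_of`) and the sample data are hypothetical; in practice the mapping would come from the supermarket's product catalog.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical mapping from item ID to department (an assumption).
department_of = {"A1": "bakery", "A2": "bakery", "B7": "household"}

# Hypothetical POS transactions.
transactions = [
    {"item_id": "A1", "units": 2, "timestamp": datetime(2014, 3, 2, 9, 15)},
    {"item_id": "A2", "units": 3, "timestamp": datetime(2014, 7, 9, 17, 30)},
    {"item_id": "B7", "units": 5, "timestamp": datetime(2014, 11, 20, 11, 5)},
    {"item_id": "A1", "units": 4, "timestamp": datetime(2015, 1, 3, 10, 0)},
]


def units_by_department(transactions, department_of, year):
    """Return the total units sold per department in the given year."""
    # Step 1: generate a list of product IDs for each department.
    items_in = defaultdict(set)
    for item_id, department in department_of.items():
        items_in[department].add(item_id)

    totals = {}
    for department, item_ids in items_in.items():
        # Step 2: select the transactions of the year and of the
        # department's product IDs.
        selected = [
            t for t in transactions
            if t["timestamp"].year == year and t["item_id"] in item_ids
        ]
        # Step 3: add the number of units.
        totals[department] = sum(t["units"] for t in selected)
    return totals


totals_2014 = units_by_department(transactions, department_of, 2014)
print(totals_2014)
```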

It's possible to extract any other kind of information about the overall sales in the past. What if the targets of the analysis are the customers instead of the sales?

We can use the customer ID in order to track the purchases of each customer. For instance, given a single customer ID, we can determine the total number of units that the customer purchased. This data is still easy to obtain, so it doesn't qualify as a hidden pattern. However, there is still a lot of information about the customers that cannot be read directly from the data.
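As a small illustration of the per-customer total mentioned above, the transactions can be grouped by customer ID. The field names are assumptions, and the sketch assumes that purchases made without a loyalty card carry no customer ID.

```python
from collections import Counter

# Hypothetical transactions; None marks a purchase without a loyalty card.
transactions = [
    {"customer_id": "C1", "units": 2},
    {"customer_id": "C2", "units": 1},
    {"customer_id": "C1", "units": 6},
    {"customer_id": None, "units": 3},
]


def units_per_customer(transactions):
    """Return the total units purchased by each identified customer."""
    totals = Counter()
    for t in transactions:
        # Anonymous purchases cannot be attributed to anyone, so skip them.
        if t["customer_id"] is not None:
            totals[t["customer_id"]] += t["units"]
    return totals


totals = units_per_customer(transactions)
print(totals["C1"])
```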

Some customers have similar purchase habits. Examples of customer categories are:

Students
Housewives
Elderly people

Each group of people displays specific purchase habits, which differ in the following aspects:

Available money to spend
Products that the customers are interested in
Date and time of the purchase

For instance, students have, on average, less money to spend than other people. Housewives are keener to buy groceries and products for the house. Students are more likely to go to the supermarket after school, whereas elderly people will go at almost any time of day.

The data doesn't display which customer IDs are associated with each category of customers, even though it contains some information about their behavior. It's hard to identify which customers are similar by performing a simple analysis operation. In addition, in order to identify the groups, we need to have an initial guess about the categories of customers.

Business problems require hidden information

A business problem might require some hidden information. In the supermarket example, we want to target some groups of customers with an ad hoc marketing and discount campaign.

The options of the marketing campaign determine the following:

Which items are advertised
Which items are discounted
The size of the discount
Which weekdays are affected by the promotion

If the supermarket were very small, it would be possible to extract the data about each customer and address each one with a specific campaign. However, the supermarket is big and there are many customers, so it's impossible to take each one of them into account separately without some data processing.

A possibility is to define a method that automatically reads the data about each customer and consequently chooses the marketing campaign. This approach requires the following:

Organizing the data and selecting the relevant information
Modeling the data
Defining the action

This approach works, although it has some drawbacks. Deciding on a marketing campaign requires a general picture of the customer base. Only after understanding the patterns in the customer behavior is it possible to define a method that chooses the marketing campaign starting from that behavior. Therefore, this method requires some previous analysis.

Another solution is to identify groups of customers that have similar habits. Once the groups are defined, it's possible to analyze each group separately in order to understand its common purchase behavior.

In the accompanying chart (not reproduced here), the customers are represented by small circles and the homogeneous groups of customers by big circles.

In this way, the supermarket has some information about each group that helps them identify the right marketing campaign by combining the following:

Some aggregated information about the customers of the group
Some business knowledge that allows them to define a proper marketing campaign

Assuming that each customer will have the same habits in the future, at least in the short term, it's possible to identify the purchase behavior and interests of each group of customers and consequently target them with the same campaign.

Reshaping the data

Starting from the POS data, we want to model the purchase habits of the supermarket customers in order to identify homogeneous groups. Although the POS data doesn't display the customer behavior directly, it contains the customer ID. The behavior of each customer can be modeled by measuring their habits. For instance, we can measure the total number of units that they have purchased over the last few years. Similarly, we can define some other Key Performance Indicators (KPIs), which are values describing different aspects of the behavior. After extracting all the transactions related to a customer, we can define KPIs as follows:

The total number of units that they purchased in the previous year
The total amount of money that they spent in the previous year
The percentage of units that they purchased between 6 p.m. and 7 p.m.
The total money spent in a specific item department
The percentage of money that they spent in summer
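A minimal sketch of how these KPIs could be computed for one customer follows. It rests on several assumptions that the POS description doesn't settle: `price` is taken to be the price per unit, each transaction carries a `department` field, and summer means June to August. All names and sample values are hypothetical.

```python
from datetime import datetime

# Hypothetical transactions; "price" is assumed to be the price per unit.
transactions = [
    {"customer_id": "C1", "units": 2, "price": 1.50, "department": "bakery",
     "timestamp": datetime(2014, 7, 2, 18, 20)},
    {"customer_id": "C1", "units": 6, "price": 0.50, "department": "household",
     "timestamp": datetime(2014, 12, 5, 10, 0)},
    {"customer_id": "C2", "units": 1, "price": 4.00, "department": "bakery",
     "timestamp": datetime(2014, 8, 14, 9, 30)},
]

SUMMER_MONTHS = {6, 7, 8}  # assumption: June to August


def spent(t):
    """Money spent in one transaction (units times unit price)."""
    return t["units"] * t["price"]


def customer_kpis(transactions, customer_id, year, department):
    """Compute the five example KPIs for one customer and one year."""
    rows = [
        t for t in transactions
        if t["customer_id"] == customer_id and t["timestamp"].year == year
    ]
    total_units = sum(t["units"] for t in rows)
    total_spent = sum(spent(t) for t in rows)
    # Hour 18 covers purchases from 6:00 p.m. up to, but excluding, 7:00 p.m.
    evening_units = sum(t["units"] for t in rows if t["timestamp"].hour == 18)
    department_spent = sum(spent(t) for t in rows if t["department"] == department)
    summer_spent = sum(
        spent(t) for t in rows if t["timestamp"].month in SUMMER_MONTHS
    )
    return {
        "total_units": total_units,
        "total_spent": total_spent,
        "pct_units_6pm_7pm": 100 * evening_units / total_units if total_units else 0.0,
        "spent_in_department": department_spent,
        "pct_spent_summer": 100 * summer_spent / total_spent if total_spent else 0.0,
    }


kpis_c1 = customer_kpis(transactions, "C1", 2014, "bakery")
print(kpis_c1)
```

Each customer ends up described by the same small set of numbers, which is what makes customers comparable with one another.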

There are different options for choosing the KPIs and they should be relevant to the problem. In our example, we want to determine in which products the customers are likely to be interested.

Some KPIs that are relevant to the problem are as follows:

The total money spent in the last year, in order to identify the maximum amount of money that a customer can spend
The percentage of money spent in different item departments, in order to identify what the customer is interested in
The percentage of purchases in the morning and in the early afternoon, in order to identify housewives and pensioners

Given a small set of customers, it's easy to identify homogeneous groups by observing the data. However, if we have many customers and/or KPIs, we need computing tools to uncover the hidden patterns in the data.

Identifying patterns with unsupervised learning

There are some machine learning algorithms that identify hidden structures, and this branch of techniques is called "unsupervised learning". Starting from the data, the unsupervised learning algorithms identify patterns and labels that are not directly displayed.

In our example, we model the customers using a proper set of KPIs that describe their purchase behavior. Our target is to identify groups that have similar values for the KPIs.

In order to associate the customers, the first step is to measure how similar they are. Observing the data of two customers, we can see that they are similar if the values of their KPIs are similar. Since there are many customers, we can't observe data manually, so we need to define a criterion. The criterion is a function that takes as an input the KPIs of two customers and computes a distance, which is a number that expresses the dissimilarity between the values. In this way, there is an objective way to state how similar two customers are.
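One possible criterion is the Euclidean distance between the KPI values of two customers. The text doesn't prescribe a specific distance, so this choice, the KPI names, and the sample customers are assumptions made for illustration. The sketch also assumes that the KPIs are on comparable scales (here, fractions between 0 and 1); otherwise, the KPI with the largest range would dominate the result.

```python
import math


def kpi_distance(kpis_a, kpis_b):
    """Euclidean distance between two customers described by the same KPIs."""
    names = sorted(kpis_a)
    if names != sorted(kpis_b):
        raise ValueError("both customers must have the same KPIs")
    return math.sqrt(sum((kpis_a[n] - kpis_b[n]) ** 2 for n in names))


# Hypothetical customers; KPI values are fractions between 0 and 1.
student = {"pct_morning": 0.10, "pct_groceries": 0.30}
other_student = {"pct_morning": 0.15, "pct_groceries": 0.25}
pensioner = {"pct_morning": 0.70, "pct_groceries": 0.60}

# A smaller distance means that the customers are more similar.
print(kpi_distance(student, other_student) < kpi_distance(student, pensioner))
```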

We have modeled the customers through objects whose similarity can be measured. There are several machine learning algorithms that group similar objects, and they're called clustering techniques. The techniques group together similar customers and consequently identify homogeneous groups.

There are different options to group the customers, depending on:

The number of desired clusters
The relevance of each KPI
The way to identify clusters

Most clustering algorithms also have parameters of their own. In order to choose the proper technique and setup, we need to explore the data and understand the business problem.
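To make the idea concrete, here is a minimal sketch of one widely used clustering technique, k-means (Lloyd's algorithm). It is offered as an example of how a clustering algorithm can work, not as the method that this section prescribes. The parameter `k` is the number of desired clusters, the customers are hypothetical tuples of KPI values, and the initial centroids are simply the first `k` customers so that the run is reproducible (practical implementations usually pick them at random). The code needs Python 3.8 or later for `math.dist`.

```python
import math


def kmeans(points, k, iterations=20):
    """Group points into k clusters; return (labels, centroids)."""
    # Assumption: start from the first k points to keep the run reproducible.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(v) / len(members) for v in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (share of morning purchases, share spent on groceries).
customers = [
    (0.10, 0.30), (0.70, 0.60), (0.15, 0.25),
    (0.75, 0.65), (0.12, 0.28), (0.68, 0.58),
]
labels, centroids = kmeans(customers, k=2)
print(labels)
```

The relevance of each KPI could be reflected by rescaling or weighting the values before clustering, which this sketch leaves out.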

This is just an introductory chapter, and clustering is just one example of unsupervised learning.

Making business decisions with unsupervised learning

Clustering techniques allow us to identify homogeneous groups of customers. For each cluster, the supermarket has to define a marketing campaign targeting its customers using promotions and discounts.

For each cluster, it's possible to define a summary table showing the average customer's behavior. Combining this information with some business expertise, the supermarket can maximize the positive impact of the campaign.
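Such a summary table can be sketched as the average of each KPI over the customers of a cluster. The KPI names, the values, and the cluster labels below are hypothetical; the labels are assumed to come from a clustering step that has already been run.

```python
from collections import defaultdict


def cluster_summary(kpi_rows, labels):
    """Return, per cluster, the customer count and the average of each KPI."""
    grouped = defaultdict(list)
    for row, label in zip(kpi_rows, labels):
        grouped[label].append(row)

    summary = {}
    for label, rows in grouped.items():
        averages = {
            name: sum(r[name] for r in rows) / len(rows) for name in rows[0]
        }
        summary[label] = {"customers": len(rows), **averages}
    return summary


# Hypothetical KPIs of three customers and their cluster labels.
kpi_rows = [
    {"total_spent": 200.0, "pct_morning": 0.25},
    {"total_spent": 400.0, "pct_morning": 0.75},
    {"total_spent": 1000.0, "pct_morning": 0.50},
]
labels = [0, 0, 1]

summary = cluster_summary(kpi_rows, labels)
for label in sorted(summary):
    print(label, summary[label])
```

A business expert can then read one row per cluster instead of one row per customer.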

In conclusion, clustering allows us to convert a massive volume of data into a small set of relevant information. Then, a business expert can read and understand the clustering results to make the best decisions.

This example shows how data and expertise are strongly linked. The machine learning algorithms require KPIs that are defined using business expertise. After the algorithm has processed the data, business expertise is again necessary to identify the right action.