Correlation
Correlation is another basic and important statistical idea, used when looking for a relationship between two variables. In a sense, correlation is the premise of predictive modelling: it is because variables are correlated that we can hope to predict one from the other.
A good correlation between two variables suggests that there is some dependence between them: when one changes, the other tends to change as well. A good correlation points to a mathematical relationship between the two variables, and it is this relationship that may let us predict outcomes. The relationship can take any form. If x and y are two correlated variables, then one can write:
$$y = f(x)$$
If f is a linear function, then x and y are linearly correlated. If f is an exponential function, then x and y are exponentially correlated.
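The degree of correlation between the two variables x and y is quantified by the correlation coefficient h, given by the following equation:
$$h = \frac{\sum (x - x_m)(y - y_m)}{\sqrt{\sum (x - x_m)^2 \, \sum (y - y_m)^2}}$$
where $x_m$ and $y_m$ are the mean values of x and y, and each sum runs over all observations.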
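As a small worked illustration, the coefficient can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below does this in plain Python; the function name `pearson_h` and the toy values are assumptions made for illustration and are not taken from the advertising dataset.

```python
import math

def pearson_h(x, y):
    """Correlation coefficient from deviations about the means (illustrative helper)."""
    xm = sum(x) / len(x)
    ym = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means
    num = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    den = math.sqrt(sum((xi - xm) ** 2 for xi in x) * sum((yi - ym) ** 2 for yi in y))
    return num / den

# Hypothetical toy data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

h = pearson_h(x, y)
print(round(h, 4))  # 0.7746
```

Here the means are 3 and 4, the numerator is 6, and the denominator is the square root of 10 × 6 = 60, which gives 6/√60 ≈ 0.7746.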
A few points to note about the correlation coefficient are as follows:
- The value of the correlation coefficient can range from -1 to 1, that is, -1 ≤ h ≤ 1.
- A positive correlation coefficient means that there is a direct relationship between the two variables; if one variable increases, the other variable will also increase and if one decreases the other will decrease as well.
- A negative correlation coefficient means that there is an inverse relationship between the two variables; if one variable increases, the other variable will decrease and if one decreases the other will increase.
- The larger the absolute value of the correlation coefficient, the stronger the relationship between the two variables.
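These points can be checked on toy data. The following is a minimal sketch, assuming NumPy is available and using `np.corrcoef`, whose off-diagonal entry is the correlation coefficient of the two inputs; all the arrays are made up for illustration.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)

# Hypothetical toy series built from x
y_direct = 2.0 * x + 1.0     # rises exactly in step with x
y_inverse = -3.0 * x + 5.0   # falls exactly as x rises
y_noisy = x + np.array([4, -5, 3, -4, 5, -3, 4, -5, 3, -4], dtype=float)  # same trend, large disturbances

h_direct = np.corrcoef(x, y_direct)[0, 1]    # close to +1: perfect direct relationship
h_inverse = np.corrcoef(x, y_inverse)[0, 1]  # close to -1: perfect inverse relationship
h_noisy = np.corrcoef(x, y_noisy)[0, 1]      # positive but well below 1: weaker relationship

print(h_direct, h_inverse, h_noisy)
```

The sign of each coefficient gives the direction of the relationship, while the absolute value shrinks as the points stray further from a straight line.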
Although a strong correlation suggests that there is some kind of relationship that can be leveraged to predict one variable from the other, it does not imply that one variable causes the other; several other factors may explain the relationship. Hence, the most often used quote related to correlation is, "Correlation doesn't imply causation."
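One way to see this is a small simulation in which a hidden third factor drives two variables that never influence each other. This is only a sketch under assumed conditions: the variable names, the noise level of 0.5, and the sample size are arbitrary choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 5000

# Hypothetical hidden factor that drives both observed variables
z = rng.normal(size=n)
x = z + 0.5 * rng.normal(size=n)  # x depends on z only
y = z + 0.5 * rng.normal(size=n)  # y depends on z only, never on x

h_xy = np.corrcoef(x, y)[0, 1]
print(h_xy)  # strongly positive, even though x does not cause y
```

With these settings the theoretical coefficient is 1 / (1 + 0.25) = 0.8, so the two variables look strongly related although changing x would have no effect on y.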
Let us try to understand this concept better by looking at a dataset and finding the correlation between its variables. The dataset we will look at is a very popular one that records the advertising costs incurred on different media and the sales of a particular product. We will use it later to explore the concepts of linear regression. Let us import the dataset and calculate the correlation coefficients:
import pandas as pd
advert = pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Linear Regression/Advertising.csv')
advert.head()
Fig. 4.8: Dummy dataset
Let us try to find out the correlation between the advertisement costs on TV and the resultant sales. The following code will do the job:
import numpy as np
# Numerator terms: product of the deviations of TV and Sales from their means
advert['corrn'] = (advert['TV'] - np.mean(advert['TV'])) * (advert['Sales'] - np.mean(advert['Sales']))
# Denominator terms: squared deviations of each variable from its mean
advert['corrd1'] = (advert['TV'] - np.mean(advert['TV']))**2
advert['corrd2'] = (advert['Sales'] - np.mean(advert['Sales']))**2
corrcoeffn = advert['corrn'].sum()
corrcoeffd1 = advert['corrd1'].sum()
corrcoeffd2 = advert['corrd2'].sum()
corrcoeffd = np.sqrt(corrcoeffd1 * corrcoeffd2)
corrcoeff = corrcoeffn / corrcoeffd
corrcoeff
In this code snippet, the formula written above has been converted to code. The value of the correlation coefficient comes out to be 0.78, indicating a decent positive correlation between TV advertisement costs and sales: higher TV advertisement costs tend to go with higher sales.
Let us convert the preceding calculation to a function, so that we can quickly calculate the correlation coefficient for any pair of variables just by changing the variable names. The following snippet defines a function that takes the data frame and the names of the two columns as parameters:
def corrcoeff(df, var1, var2):
    # Deviations of each column from its mean (no helper columns are added to df)
    dev1 = df[var1] - np.mean(df[var1])
    dev2 = df[var2] - np.mean(df[var2])
    # Numerator: sum of the products of the deviations
    corrcoeffn = (dev1 * dev2).sum()
    # Denominator: square root of the product of the summed squared deviations
    corrcoeffd = np.sqrt((dev1**2).sum() * (dev2**2).sum())
    return corrcoeffn / corrcoeffd
This function can be used to calculate the correlation coefficient for any two variables of any data frame.
For example, to calculate the correlation between the TV and Sales columns of the advert data frame, we can write it as follows:
corrcoeff(advert, 'TV', 'Sales')
We can summarize the pair-wise correlation coefficients between the variables in the following table:
This table is called a correlation matrix. As you can see, it is a symmetric matrix because the correlation between TV and Sales is the same as that between Sales and TV. Along the diagonal, all the entries are 1 because, by definition, the correlation of a variable with itself is always 1. As can be seen, the strongest correlation is between TV advertisement cost and sales.
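A matrix of this kind does not have to be assembled pair by pair. As a hedged sketch, pandas provides `DataFrame.corr()`, which returns all pairwise coefficients at once (Pearson by default). The data frame below is hypothetical: its column names and values are invented for illustration and are not the advertising data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
toy = pd.DataFrame({
    "spend_a": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "spend_b": [5.0, 3.0, 8.0, 1.0, 9.0, 4.0],
    "revenue": [12.0, 18.0, 33.0, 35.0, 52.0, 55.0],
})

corr_matrix = toy.corr()  # pairwise Pearson coefficients by default
print(corr_matrix.round(2))

# The two properties described above: symmetry and a diagonal of ones
is_symmetric = np.allclose(corr_matrix.values, corr_matrix.values.T)
diagonal_is_one = np.allclose(np.diag(corr_matrix.values), 1.0)
print(is_symmetric, diagonal_is_one)
```

The same call on a data frame holding only the advertising and sales columns should reproduce a table like the one described, assuming no helper columns have been added to it beforehand.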
Let us see the nature of this correlation by plotting the TV and Sales variables of the advert data frame. We can do this using the following code snippet:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(advert['TV'], advert['Sales'], 'ro')
plt.title('TV vs Sales')
The result is similar to the following plot:
Fig. 4.9: Scatter plot of TV vs Sales
Looking at this plot, we can see that the points are fairly compact rather than widely scattered, and that as the TV advertisement cost increases, the sales also tend to increase. This is characteristic of two positively correlated variables, and it is supported by the strong correlation coefficient of 0.78.
Let us plot the other variables and see how they are distributed to corroborate their correlation coefficients. For Radio and Sales, this can be plotted as follows:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(advert['Radio'], advert['Sales'], 'ro')
plt.title('Radio vs Sales')
The plot we get is as shown in the following figure:
Fig. 4.10: Scatter plot of Radio vs Sales
For Radio and Sales, the points are a little more scattered than for TV and Sales, which is corroborated by the fact that the correlation coefficient for this pair (0.57) is less than that for TV and Sales (0.78).
For plotting the Newspaper vs Sales data, we can write something similar to the following code:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(advert['Newspaper'], advert['Sales'], 'ro')
plt.title('Newspaper vs Sales')
The output plot looks similar to the following figure:
Fig. 4.11: Scatter plot of Newspaper vs Sales
For Newspaper and Sales, the points are far more scattered than for either TV and Sales or Radio and Sales. This is consistent with the small correlation coefficient of 0.23 between Newspaper and Sales, compared to 0.78 between TV and Sales and 0.57 between Radio and Sales.