Python:Advanced Predictive Analytics
上QQ阅读APP看书,第一时间看更新

Merging/joining datasets

Merging or joining is a mission critical step for predictive modelling and, more often than not, while working on actual problems, an analyst will be required to do it. The readers who are familiar with relational databases know how there are multiple tables connected by a common key column across which the required columns are scattered. There can be instances where two tables are joined by more than one key column. The merges and joins in Python are very similar to a table merge/join in a relational database except that it doesn't happen in a database but rather on the local computer and that these are not tables, rather data frames in pandas. For people familiar with Excel, you can find similarity with the VLOOKUP function in the sense that both are used to get an extra column of information from a sheet/table joined by a key column.

There are various ways in which two tables/data frames can be merged/joined. The most commonly used ones are Inner Join, Left Join, Right Join, and so on. We will go in to detail and understand what each of these mean. But before that, let's go ahead and perform a simple merge to get a feel of how it is done.

We will be using a different dataset to illustrate the concept of merge and join. These datasets can be found in the Google Drive folder in Merge and the Join/Medals folder. The main dataset Medals.csv contains details of medals won by inpidual players at different Olympic events. The two subsidiary datasets contain details of the nationality and sports of the inpidual player. What if we want to see the nationality or sport played by the player together with all the other medal information for each player? The answer is to merge both the datasets and to get the relevant columns. In data science parlance, merging, joining, and mapping are used synonymously; although, there are minor technical differences.

Let us import all of them and have a cursory look at them:

import pandas as pd
data_main=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Medals.csv')
data_main.head()

The Medals.csv looks similar to the following screenshot:

Fig. 3.47: First few entries of the Medals dataset

As we can see, this is the information about the Olympic Year in which the medals were won, details of how many Gold, Silver, and Bronze medals were won and the Age of the player. There are 8,618 rows in the dataset. One more thing one might be interested to know about this dataset is how many unique athletes are there in the dataset, which will come in handy later when we learn and apply different kinds of joins:

a=data_main['Athlete'].unique().tolist()
len(a)

The output of this snippet is 6956, which means that there are many athletes for whom we have records in the datasets. The other entries come because many athletes may have participated in more than one Olympics.

Let us now import the Athelete Country Map.csv and have a look at it:

country_map=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Athelete_Country_Map.csv')
country_map.head()

The output data frame looks similar to the following screenshot, with two columns: Athlete and Country:

Fig. 3.48: First few entries of the Athelete_Country_Map dataset

There are 6,970 rows in this dataset. If you try to find out the unique number of athletes in this data frame, it will still be 6,956. The 14 extra rows come from the fact that some players have played for two countries in different Olympics and have won medals. Search for Aleksandar Ciric and you will find that he has played for both Serbia and Serbia and Montenegro.

Note

(Disclaimer: This might not be the actual case and this might be an issue with the mapping file, which can be taken care of by removing duplicate values, as we would show later in this chapter).

You can do this by using the following code snippet:

country_map[country_map['Athlete']=='Aleksandar Ciric']

Fig. 3.49: Subsetting the country_map data frame for Aleksandar Ciric

Let us finally import the Athelete Sports Map.csv and have a look at it:

sports_map=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/Medals/Athelete_Sports_Map.csv')
sports_map.head()

The sports_map data frame looks as shown in the following screenshot:

Fig. 3.50: First few entries of the Athelete_Sports_Map dataset

There are 6,975 rows in this dataset because, yes you guessed it right, there are very few athletes in this mapping data frame who have played more than one game and have won medals. Watch out for athletes, such as Chen Jing, Richard Thompson and Matt Ryan who have played more than one game.

This can be done by writing a code, such as the following snippet:

sports_map[(sports_map['Athlete']=='Chen Jing') | (sports_map['Athlete']=='Richard Thompson') | (sports_map['Athlete']=='Matt Ryan')]

The output looks similar to the following screenshot:

Fig. 3.51: Subsetting the sports_map data frame for athletes Richard Thompson and Matt Ryan

Let's now merge the data_main and country_map data frames to get the country for all the athletes. There is a merge method in pandas, which facilitates this:

import pandas as pd
merged=pd.merge(left=data_main,right=country_map,left_on='Athlete',right_on='Athlete')
merged.head()

The output looks, as follows. It has a country column as expected:

Fig. 3.52: First few entries of the merged data frame. It has a country column.

The length of the merged data frame is 8,657, which is more than the total number of rows (8,618) in the data_main data frame. This is because when we join these two data frames without any specified conditions, an inner join is performed wherein the join happens based on the common key-values present in both the data frames. Also, we saw that some athletes have played for two countries and the entries for such athletes will be duplicated for such athletes. If you look at Aleksandar Ciric in the merged data frame, you will find something similar to this:

merged[merged['Athlete']=='Aleksandar Ciric']

Fig. 3.53 Subsetting the merged data frame for athlete Aleksandar Ciric

The problem is not with the type of join but with the kind of mapping file we have. This mapping file is one-many and hence the number increases because for each key multiple rows are created in such a case.

To rectify this issue, one can remove the duplicate entries from the country_map data frame and then perform the merge with data_main. Let's do that. This can be done using the drop_duplicates method, as shown:

country_map_dp=country_map.drop_duplicates(subset='Athlete')

The length of the country_map_dp is 6,956 rows, which is the same as the number of unique athletes. Let us now merge this with data_main.

merged_dp=pd.merge(left=data_main,right=country_map_dp,left_on='Athlete',right_on='Athlete')
len(merged_dp)

The number of rows in the merged_dp is indeed 8,618, which is the actual number of rows in the data_main.

The next step is to merge sports_map with the merged_dp to get the country and sports along with other details in the same data frame.

We have seen similar issue of increase in the number of rows for sports_map, as was the case for country_map data frame. To take care of that, let's remove the duplicates from the sports_map before merging it with merged_dp:

sports_map_dp=sports_map.drop_duplicates(subset='Athlete')
len(sports_map_dp)

The length of the sports_map_dp is 6,956, which is the same as the number of rows in the data_main data frame, as expected.

The next step is to merge this with the merge_pd data frame to get the sports played by the athlete in the final merged table:

merged_final=pd.merge(left=merged_dp,right=sports_map_dp,left_on='Athlete',right_on='Athlete')
merged_final.head()

Fig. 3.54: First few entries of the merged_final dataset. The duplicates from country_map were deleted before the merge

As we can see, the Sport column is present in the merged_final data frame after the merge. The merged_final data frame has 8,618 rows as expected.

Let us now look at various kinds of merge/joins that we can apply to two data frames. Although you would come across many kinds of joins in different texts, it is sufficient to know the concept behind the three of them—Inner Join, Left Join, and Right Join. If you consider the two tables/data frames as sets, then these joins can be well represented by Venn Diagrams.

Inner Join

The characteristics of the Inner Join are as follows:

  • Returns a data frame containing rows, which have a matching value in both the original data frames being merged.
  • The number of rows will be equal to the minimum of the row numbers of the two data frames. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame will have 80 rows.
  • The Inner Join can be thought of as an intersection of two sets, as illustrated in the following figure:

    Fig. 3.55: Inner Join illustrated via a Venn diagram

Left Join

The characteristics of the Left Join are, as follows:

  • Returns a data frame containing rows, which contains all the rows from the left data frame irrespective of whether it has a match in the right data frame or not.
  • In the final data frame, the rows with no matches in the right data frame will return NAs in the columns coming from right data frame.
  • The number of rows will be equal to the number of rows in the left data frame. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame would have 100 rows.
  • The Left Join can be thought of as the set containing the entire left data frame, as illustrated in the following figure:

    Fig. 3.56: Left Join illustrated via a Venn Diagram

Right Join

The characteristics of the Right Join are as follows:

  • Returns a data frame containing rows, which contains all the rows from the right data frame irrespective of whether it has a match in the left data frame or not.
  • In the final data frame, the rows with no matches in the left data frame will return NAs in the columns coming from left data frame.
  • The number of rows will be equal to the number of rows in the left data frame. If data frame A containing 100 rows is being merged with data frame B having 80 rows, the merged data frame will have 80 rows
  • The Right Join can be thought of as the set containing the entire right data frame, as illustrated in the following figure:

Fig. 3.57: Right Join illustrated via a Venn diagram

The comparison between join type and set operation is summarized in the following table:

Let us see some examples of how different kinds of mappings actually work. For that, a little data preparation is needed. Currently, both our mapping files contain matching entries for all the rows in the actual data frame data_main. So, we can't see the effects of different kind of merges. Let's create a country and sports mapping file which doesn't have the information for some of the athletes and let's see how it reflects in the merged table. This can be done by creating a new data frame that doesn't have country/sports information for some of the athletes, as shown in the following code:

country_map_dlt=country_map_dp[(country_map_dp['Athlete']<>'Michael Phelps') & (country_map_dp['Athlete']<>'Natalie Coughlin') & (country_map_dp['Athlete']<>'Chen Jing')
                    & (country_map_dp['Athlete']<>'Richard Thompson') & (country_map_dp['Athlete']<>'Matt Ryan')]
len(country_map_dlt)

Using this snippet, we have created a country_map_dlt data frame that doesn't have country mapping for five athletes, that is Michael Phelps, Natalie Coughlin, Chen Jing, Richard Thompson, and Matt Ryan. The length of this data frame is 6,951; it is five less than the actual mapping file, indicating that the information for five athletes has been removed.

Let's do the same for sports_map as well as the data_main data frame using the following snippets:

sports_map_dlt=sports_map_dp[(sports_map_dp['Athlete']<>'Michael Phelps') & (sports_map_dp['Athlete']<>'Natalie Coughlin') & (sports_map_dp['Athlete']<>'Chen Jing')
                    & (sports_map_dp['Athlete']<>'Richard Thompson') & (sports_map_dp['Athlete']<>'Matt Ryan')]
len(sports_map_dlt)

data_main_dlt=data_main[(data_main['Athlete']<>'Michael Phelps') & (data_main['Athlete']<>'Natalie Coughlin') & (data_main['Athlete']<>'Chen Jing')
                    & (data_main['Athlete']<>'Richard Thompson') & (data_main['Athlete']<>'Matt Ryan')]
len(data_main_dlt)

The length of data_main_dlt becomes 8,605 because the data_main contains multiple rows for an athlete.

An example of the Inner Join

One example of Inner join would be to merge data_main data frame with country_map_dlt. This can be done using the following snippet:

merged_inner=pd.merge(left=data_main,right=country_map_dlt,how='inner',left_on='Athlete',right_on='Athlete')
len(merged_inner)

This merge should give us information for the athletes who are present in both the data frames. As the country_map_dlt doesn't contain information about five athletes present in data_main, these five athletes wouldn't be a part of the merged table.

The length of the merged_inner comes out to be 8,605 (similar to data_main_dlt) indicating that it doesn't contain information about those five athletes.

An example of the Left Join

One example of Left Join would be to merge data_main data frame with country_map_dlt. This can be done using the following snippet:

merged_left=pd.merge(left=data_main,right=country_map_dlt,how='left',left_on='Athlete',right_on='Athlete')
len(merged_left)

This merge should give us the information about all the athletes that are present in the left data frame (data_main) even if they aren't present in the right data frame (country_map_dlt). So, the merged_left data frame should contain 8,618 rows (similar to the data_main) even if the country_map_dlt doesn't contain information about five athletes present in data_main. These five athletes will have a NaN value in the Country column.

The length of merged_left indeed comes out to be 8,618. Let's check the merged_left for an athlete whose information is not present in the country_map_dlt. It should contain NaN for the Country column:

merged_left_slt=merged_left[merged_left['Athlete']=='Michael Phelps']
merged_left_slt

The output is similar to the following screenshot. It indeed contains NaN for Michael Phelps' Country because it doesn't have a mapping in country_map_dlt:

Fig. 3.58: Merged_left data frame sub-setted for Michael Phelps contains NaN values, as expected

An example of the Right Join

One example of Right Join will be to merge data frame data_main with country_map_dlt. This can be done using the following snippet:

merged_right=pd.merge(left=data_main_dlt,right=country_map_dp,how='right',left_on='Athlete',right_on='Athlete')
len(merged_right)

This should contain the NaN values for the columns coming from data_main_dlt, in the rows where there is no athlete information in data_main_dlt.

As shown in the following table:

Fig. 3.59: merged_right data frame sub-setted for Michael Phelps contains NaN values, as expected

There will be one row created for each athlete who is not there in the data_main_dlt but is present in the country_map_dp. Hence, there will be five extra rows, one for each deleted athlete. The number of rows in the merged_right is thus equal to 8,610.

There are other joins like Outer Joins, which can be illustrated as the Union of two data frames. The Outer join would contain rows from both the data frames, even if they are not present in the other. It will contain NaN for the columns which it can't get values for. It can be easily performed setting the how parameter of the merge method to outer.

Summary of Joins in terms of their length

The effect of these joins can be more effectively explained if we summarize the number of samples present in the data frames that were used for merging and in the resultant data frames.

The first table provides the number of samples present in the data frames that were used for merging. All these data frames have been defined earlier in this section of the chapter: