Python: Advanced Predictive Analytics

Concatenating and appending data

All the required information to build a model doesn't always come from a single table or data source. In many cases, two datasets need to be joined/merged to bring in more information (that is, new columns/variables). Sometimes, smaller datasets need to be appended together to make a bigger dataset that contains the complete picture. Thus, merging and appending are important weapons in an analyst's armory.

Let's learn each of these methods one by one. To illustrate them, we will be using several new and interesting datasets. The first one describes the chemical contents of wine; we will have separate datasets for red and white wine. Each row represents a different sample of red or white wine.

Let us import this dataset and have a look at it. The delimiter for this dataset is a semicolon (;), which needs to be taken care of while importing:

import pandas as pd
data1=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/winequality-red.csv',sep=';')
data1.head()

The output of this input snippet is similar to the following screenshot:

Fig. 3.38: First few entries of the wine quality-red dataset

The column names are as follows:

data1.columns.values

Fig. 3.39: Column names of the wine quality-red dataset

The size of the dataset can be found using the following snippet:

data1.shape

The output is 1599x12, implying that the dataset has 1599 rows and 12 columns.

Let us import the second dataset, which is very similar to the preceding dataset except that the data points are collected for white wine:

import pandas as pd
data2=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/winequality-white.csv',sep=';')
data2.head()

The output of this input snippet looks similar to the following screenshot:

Fig. 3.40: First few entries of the winequality-white dataset

As we can see, this dataset looks very similar to the preceding dataset. Let us confirm this by getting the column names for this dataset. They should be the same as the preceding array of column names:
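data2.columns.values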

Fig. 3.41: Column names of the winequality-white dataset

The size of the dataset can be found as follows:

data2.shape

The output is 4898x12, meaning that the dataset has 4898 rows and the same 12 columns.

So, we can see that data1 and data2 are very similar in terms of column names and column types; only the number of rows differs between the two datasets. These are ideal circumstances to append two datasets along the row axis (axis=0).

Tip

In pandas, axis=0 denotes the row axis (datasets are stacked one below the other), while axis=1 denotes the column axis (datasets are placed side by side).
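To see the difference on a small example, here is a minimal sketch with two tiny, made-up data frames (df_a and df_b are hypothetical names used only for this illustration):

import pandas as pd
df_a=pd.DataFrame({'x':[1,2],'y':[3,4]})
df_b=pd.DataFrame({'x':[5,6],'y':[7,8]})
# axis=0 stacks the rows: 2+2 rows, same 2 columns
print(pd.concat([df_a,df_b],axis=0).shape)   # (4, 2)
# axis=1 places the columns side by side: 2 rows, 2+2 columns
print(pd.concat([df_a,df_b],axis=1).shape)   # (2, 4)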

Let us append these two datasets along axis=0. This can be done using the concat method of the pandas library. After appending the datasets, the number of rows in the final dataset should be the sum of the numbers of rows in the two datasets.

This can be accomplished as follows:

wine_total=pd.concat([data1,data2],axis=0)

Let us check the number of rows of the appended dataset wine_total:

wine_total.shape

The output is 6497x12. It indicates that the final appended dataset has 6497 (6497=1599+4898) rows. One can see that the number of rows in the appended dataset is the sum of the numbers of rows of the individual datasets.
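If you prefer to verify this in code rather than by reading the output, a one-line check (using the data frames already defined above) is:

# the appended frame should have exactly as many rows as the two inputs combined
assert wine_total.shape[0]==data1.shape[0]+data2.shape[0]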

Let us have a look at the final dataset just to ensure everything looks fine. While appending over axis=0, the two datasets are just stacked one over the other. In this case, data1 will be stacked over data2. So, the first few rows of the final dataset wine_total will look similar to the first few rows of the first dataset, data1. Let us check that:

wine_total.head()

The output looks similar to the following screenshot:

Fig. 3.42: First few entries of the final dataset obtained from appending data1 and data2

The preceding output is the same as the first few rows of data1.

This concat method can also be used to scramble data, that is, to take a few rows from here and there and stack them over one another. The concat method can take more than two datasets as arguments as well. The datasets are stacked one over the other in their order of appearance: if the datasets are data1, data2, and data3, in that order, then data1 will be stacked over data2, which will be stacked over data3.

Let us look at an example of such scrambling. We will use the data1 dataset (coming from winequality-red.csv) and take 50 rows from its head, middle, and tail to create three different data frames. These data frames will then be stacked one over the other to create a final dataset:

data1_head=data1.head(50)
data1_middle=data1[500:550]
data1_tail=data1.tail(50)
wine_scramble=pd.concat([data1_middle,data1_head,data1_tail],axis=0)
wine_scramble

The output dataset will contain 150 rows, as confirmed by the following snippet:

wine_scramble.shape

This returns 150x12 as the output.

The output dataset wine_scramble looks similar to the following screenshot:

Fig. 3.43: First few rows of the scrambled data frame with rows from the data1_middle at the top

Since the order of the appended dataset is data1_middle, data1_head, data1_tail, the rows contained in data1_middle come at the top, followed by the data1_head and data1_tail rows.

If you change the order of the stacking, the view of the appended dataset will change. Let's try that:

data1_head=data1.head(50)
data1_middle=data1[500:550]
data1_tail=data1.tail(50)
wine_scramble=pd.concat([data1_head,data1_middle,data1_tail],axis=0)
wine_scramble

The output looks similar to the following screenshot, wherein, as expected, the rows in data1_head appear at the top:

Fig. 3.44: First few rows of the scrambled data frame with rows from the data1_head at the top

Let's see another scenario where the concat function comes to the rescue. Suppose you have to deal with the kind of data that comes in several files containing similar data. One example of such a scenario is the daily weather data of a city, where the same metrics are tracked every day but the data for each day is stored in a separate file. Many a time, when we analyze such files, we are required to append them into a single consolidated file before running any analyses or models.

I have curated some data with these kinds of properties to illustrate the method of consolidating such files into a single file. Navigate to the Chapter 3 folder in the Google Drive and then to the lotofdata folder. You will see 332 CSV files, each named with its serial number. Each CSV file contains pollutant measure levels at different points of time in a single day; each file thus represents a day's worth of pollutant measure data.

Let us go ahead and import the first file and have a look at it. Looking at one file will be enough, as the others will be very similar to it (identical, in fact, except for the number of rows and the data points):

data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata/001.csv')
data.head()

One notable feature of this data is that it is very sparse: it contains a lot of missing values, as will be visible once we look at the output of the preceding code snippet. The sparseness doesn't affect the analyses in this case because the dataset has sufficient rows with non-missing values. Even 100 rows with non-missing values should give us a good picture of the pollutant levels for a day. However, for the same reason, appending such datasets becomes all the more important, so that we have a significant amount of non-missing data for our analyses.

Let us look at the first CSV file of the lot and the output of the preceding snippet:

Fig. 3.45: First few entries of the first file out of 332 CSV files

The size of the dataset is 1461x4, indicating that there are 1,461 rows and four columns. The names of the columns are Date, sulfate, nitrate, and ID. ID is 1 for the first dataset, 2 for the second, and so on. The number of rows in the other CSV files should be in the same range, while the number of rows with non-missing values might vary.
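These numbers, as well as the sparseness, can be confirmed with a few quick checks; the following is a minimal sketch using the data frame imported above:

data.shape               # (1461, 4)
data.columns.values      # array(['Date', 'sulfate', 'nitrate', 'ID'], dtype=object)
len(data.dropna())       # count of complete rows, that is, rows with no missing values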

Let us now move towards the goal of this discussion, which is to demonstrate how to consolidate such small and similar files into a single file. To be able to do so, one needs to do the following, in this sequence:

  1. Import the first file.
  2. Loop through the remaining files.
  3. Import them one by one.
  4. Append them to the first file.
  5. Repeat the loop.

Let us now look at the code snippet, which will achieve this:

import pandas as pd
filepath='E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata'
data_final=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata/001.csv')
for i in range(2,333):
    # 001.csv is already in data_final, so the loop starts from the second file;
    # the serial number is zero-padded to three digits to match the filenames
    if i<10:
        filename='00'+str(i)+'.csv'
    if 10<=i<100:
        filename='0'+str(i)+'.csv'
    if i>=100:
        filename=str(i)+'.csv'

    file=filepath+'/'+filename
    data=pd.read_csv(file)

    # stack the newly imported file below everything appended so far
    data_final=pd.concat([data_final,data],axis=0)

In the code snippet, read_csv takes a file variable that consists of the filepath and filename variables. The if conditions take care of the changing filenames: three cases arise, depending on whether the serial number has one, two, or three digits, and the number is zero-padded to three characters accordingly (for example, 001, 045, and 245).

The first file is imported and named data_final. The subsequent files are imported one by one and appended to data_final. Since the first file is already in place, the for loop runs over the remaining 331 files, wherein the importing and appending of the files occur.

The size of the data_final data frame is 772087x4, indicating that it has 772,087 rows, because it contains the rows from all the 332 files.
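This can be checked with the usual shape attribute:

data_final.shape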

If one looks at the last rows of the data_final data frame, one can confirm that all the files have been appended: the ID column should contain 332 as its value, meaning that the last few rows come from the 332nd file.
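The check itself is a one-liner:

# the last few rows of the consolidated frame should have ID equal to 332
data_final.tail()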

Fig. 3.46: Last few entries of the data_final data frame; they have ID as 332, indicating that they come from the 332nd CSV file

The ID column of these last rows indeed contains the value 332, confirming that all the 332 files have been successfully appended.

Another way to confirm that all the rows from all the files have been successfully appended is to sum up the number of rows of each file and compare it to the number of rows of the final appended data frame. The two numbers should be equal if the append was successful.

Let us check. We can use the same code as the preceding one for this process, with some minor tweaks. Let us see how:

import pandas as pd
filepath='E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata'
data_final=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata/001.csv')
data_final_size=len(data_final)
for i in range(2,333):
    if i<10:
        filename='00'+str(i)+'.csv'
    if 10<=i<100:
        filename='0'+str(i)+'.csv'
    if i>=100:
        filename=str(i)+'.csv'

    file=filepath+'/'+filename
    data=pd.read_csv(file)
    data_final_size+=len(data)
    data_final=pd.concat([data_final,data],axis=0)
print(data_final_size)

Here, we are summing up the number of rows of all the files (in the data_final_size+=len(data) line), and the summed-up number is printed in the last line. The output is 772,087; it confirms that the final data frame has the same number of rows as the sum of the rows of all the files.
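As a closing note, a more concise (and faster) way to achieve the same consolidation is to read all the files into a list of data frames and concatenate them in a single call, instead of growing data_final inside the loop. The following is a minimal sketch, assuming the lotofdata folder contains only the 332 CSV files; glob is part of the Python standard library, and all_files and all_data are hypothetical names used only here:

import glob
import pandas as pd
filepath='E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Merge and Join/lotofdata'
# collect every CSV in the folder, sorted so the zero-padded names stay in serial order
all_files=sorted(glob.glob(filepath+'/*.csv'))
# read each file once and stack them all in a single concat call
all_data=pd.concat([pd.read_csv(f) for f in all_files],axis=0)
print(len(all_data))   # should match the row total computed above

Concatenating once at the end avoids copying the growing data frame on every iteration of the loop, which matters more and more as the number of files grows.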