
Dataset understanding using EDA
Goal: Understand your dataset.
Once you have collected the dataset, it is time to start understanding it using exploratory data analysis (EDA), a combination of numerical and visualization techniques that reveal the characteristics of the dataset, its variables, and the potential relationships between them. The boundaries between this phase and the previous and next ones are often blurry. You may think that your dataset is ready for analysis, but once you start you may realize that you have five months of historical data from one source and only two months from another. You may also find that three features are redundant, or that you need to combine some features to create a new one. So, after a few trips back to the previous phase, you may finally get your dataset ready for analysis.
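As a minimal sketch of how a coverage mismatch between sources might be spotted early, the snippet below counts the months available per source using pandas. The DataFrame, its column names (source, month, value), and all the values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical data: source A covers five months, source B only two.
df = pd.DataFrame({
    "source": ["A"] * 5 + ["B"] * 2,
    "month": pd.to_datetime([
        "2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01", "2023-05-01",
        "2023-04-01", "2023-05-01",
    ]),
    "value": [10, 12, 11, 13, 14, 20, 21],
})

# First month, last month, and number of distinct months per source.
coverage = df.groupby("source")["month"].agg(["min", "max", "nunique"])
print(coverage)
```

If the nunique column differs a lot between sources, that is a hint that a trip back to the data collection phase may be needed before going further.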
Start by answering questions such as the following:
- What types of variables are there in the dataset?
- What do their distributions look like?
- Do we still have missing values?
- Are there redundant variables?
- What are the relationships between the features?
- Do we observe outliers?
- How do the different pairs of features correlate with each other?
- Do these correlations make sense?
- What is the relationship between the features and the target?
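Several of the questions above can be approached with a few pandas calls. The following is only a sketch on synthetic data: the column names (x1, x2, x3, category, target), the injected missing value and outlier, and the 1.5 x IQR outlier rule are assumptions chosen for illustration, not a prescribed procedure.

```python
import numpy as np
import pandas as pd

# Synthetic, hypothetical dataset built only to illustrate the checks.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + 1.0,  # redundant: an exact linear function of x1
    "x3": rng.normal(size=n),
    "category": rng.choice(["a", "b", "c"], size=n),
})
df["target"] = 3.0 * df["x1"] + rng.normal(scale=0.5, size=n)
df.loc[0, "x3"] = np.nan  # inject one missing value
df.loc[1, "x3"] = 25.0    # inject one obvious outlier

# What types of variables are there, and what do their distributions look like?
print(df.dtypes)
print(df.describe())

# Do we still have missing values?
missing = df.isna().sum()
print(missing)

# How do pairs of numeric features correlate? Values near +1 or -1
# between two features suggest that one of them may be redundant.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))

# Do we observe outliers? Here: values beyond 1.5 * IQR from the quartiles.
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
is_outlier = (
    numeric.lt(q1 - 1.5 * iqr, axis="columns")
    | numeric.gt(q3 + 1.5 * iqr, axis="columns")
)
outliers = is_outlier.sum()
print(outliers)

# What is the relationship between the features and the target?
target_corr = corr["target"].drop("target").sort_values(ascending=False)
print(target_corr)
```

Numerical summaries like these are usually paired with plots such as histograms and scatter plots. Note that a correlation coefficient only captures linear relationships, so whether a correlation makes sense still has to be judged against what you know about the problem.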
All the questions that you try to answer in this phase must be guided by the goal of the project: always keep in mind the problem you are trying to solve. Once you have a good understanding of the data, you will be ready for the next phase: model building.