Data Wrangling, EDA — What You Need To Know
As a beginner in data science, I was perplexed by the first steps of the data cleaning process. I found myself struggling with the concepts of data wrangling, EDA and feature engineering.
In an attempt to untie the knot, I took a Udacity course, which brought me one step closer to understanding the differences between these terms.
Processes of Data Analysis
Just by Googling “data analytics process” or a similar keyword, one can come across many resources pointing out basic processes similar to the ones shown in Image 1, i.e. defining business objectives, data gathering, data exploring, data preparation, and modeling and evaluation.
The defining business objectives, data gathering and communication steps can clearly speak for themselves. However, data exploring and data preparation seem to be intertwined with one another. Let’s dive into the details now.
Data Exploring vs. Data Preparation
Once data is obtained, it is essential to verify its quality to get an insight into whether it provides us with enough information to solve our business objectives and whether it is ready for modeling. This step is called data exploring.
When all of the above are noted and defined, it is time to actually clean the data and put it in a format that enables further analyses, i.e. data preparation.
So what is Data Wrangling?
All the activity that you do on the raw data to make it “clean” enough to input to your analytical algorithm is called data wrangling or data munging. — Shubham Simar Tomar 2016
The above definition indicates that both data exploring and data preparation can be combined into one term: data wrangling, also known as data munging.
Data wrangling consists of three steps — gather, assess and clean.
Gather
Data is obtained by multiple methods — downloading from a site, web scraping, API queries, connecting with databases, etc.
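In pandas, most of these gathering methods converge on the same reader functions. As a minimal sketch (the CSV content here is hypothetical, standing in for a downloaded file), note that `pd.read_csv` accepts file paths, URLs and file-like objects alike:

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file downloaded from a site
raw = "id,city,temp\n1,Oslo,12\n2,Cairo,35\n"

# read_csv works the same whether given a path, a URL, or a file-like object
df = pd.read_csv(io.StringIO(raw))
print(df.shape)  # (2, 3)
```

The same dataframe could equally come from an API response (`pd.DataFrame(response.json())`) or a database query (`pd.read_sql`).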
Assess
Once data is obtained, we can assess the dataset either visually or by using code (i.e. programmatically). Either way, what needs to be assessed is the data’s quality and tidiness.
Data quality is measured by many metrics/dimensions (Table 1). For instance, while assessing data visually or programmatically, we need to observe whether data entries are consistent, whether columns’ datatypes are what they are supposed to be, and whether there are any missing values or placeholder values.
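A programmatic assessment of those quality dimensions might look like the following sketch (the small dataframe is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with two deliberate quality issues
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40],                # one missing value
    "joined": ["2020-01-05", "2019-11-30",
               "2021-02-14", "2020-07-01"],      # dates stored as strings
})

# Missing values per column
print(df.isna().sum())

# Datatypes: 'joined' shows up as object, not datetime
print(df.dtypes)

# Duplicated rows
print(df.duplicated().sum())
```

Each printout flags a concrete issue to note down for the cleaning step: a missing `age`, and a `joined` column that needs converting with `pd.to_datetime`.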
Tidiness is harder to explain. Simply put, tidiness refers to the structure of the data. Hadley Wickham examined this concept in his article, where he stated that:
In tidy data:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Common things that one can notice in a messy dataset are:
- Column headers are values, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables
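The first messiness pattern, column headers that are values, is the easiest to demonstrate. A minimal sketch with invented population figures, using `melt` to recover the tidy form:

```python
import pandas as pd

# Messy: the headers '2021' and '2022' are values of a 'year' variable
messy = pd.DataFrame({
    "country": ["NO", "EG"],
    "2021": [5.4, 102.3],
    "2022": [5.5, 104.5],
})

# Tidy: each variable (country, year, population) now forms its own column
tidy = messy.melt(id_vars="country", var_name="year", value_name="population")
print(tidy)
```

After melting, each row is one observation (a country in a year), satisfying Wickham’s first two rules.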
Clean
In the actual cleaning process, we first define the problems in the data, then code to fix them, and finally test to confirm that the changes have been made.
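The define–code–test rhythm maps naturally onto comments, pandas operations and assertions. A minimal sketch on an invented dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "name": [" Ann", "Bo ", "Cy"],
})

# Define: 'age' has a missing value; 'name' has stray whitespace.

# Code: impute the median age and strip the whitespace
df["age"] = df["age"].fillna(df["age"].median())
df["name"] = df["name"].str.strip()

# Test: confirm the changes actually took effect
assert df["age"].isna().sum() == 0
assert (df["name"] == df["name"].str.strip()).all()
```

The choice of fix (median imputation here) is one option among many; the point is that every defined problem gets a matching code and test step.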
EDA
The next step after obtaining a cleaned dataset is EDA. EDA is the abbreviation of Exploratory Data Analysis.
Be aware of another term with the same abbreviation — Explanatory Data Analysis. In data science, exploratory data analysis involves examining the distribution of variables in the dataset, identifying outliers, finding trends and patterns, and looking for relationships between variables using heat maps or correlation metrics. Explanatory data analysis, on the other hand, focuses on relaying the message of what our data is trying to say to the public. It focuses on explaining why and how an event happened or is about to happen.
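The exploratory activities listed above (distributions, outliers, correlations) can be sketched in a few lines of pandas. The data here is randomly generated for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic dataset: 200 hypothetical height/weight measurements
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 8, 200),
})

# Distributions: summary statistics per variable
print(df.describe())

# Relationships: correlation matrix (the basis for a heat map)
print(df.corr())

# Outliers: a simple check for values beyond 3 standard deviations
z = (df - df.mean()) / df.std()
print((z.abs() > 3).sum())
```

In practice these numbers would be paired with plots (histograms, scatter plots, a seaborn heat map of `df.corr()`) to make the patterns visible.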
To Sum Up
Data wrangling is the process of working with raw data and transforming it into a format that can be passed on to exploratory data analysis.
Data wrangling is wrapped up in three steps — gather, assess and clean — in which the quality and tidiness of the dataset are examined. EDA, meanwhile, is about identifying the anomalies, patterns, trends and relationships hidden in the data. Explanatory data analysis, on the other hand, is about communicating those insights to the public.
References
Udacity — Data Analyst Nanodegree Programme