Data Wrangling, EDA — What You Need To Know

Ly Nguyenova
4 min read · Apr 27, 2020


As a beginner in data science, I was very perplexed by the first steps of the data cleaning process. I found myself struggling with the concepts of data wrangling, EDA and feature engineering.

In an attempt to untie the knot, I took a Udacity course, which brought me one step closer to understanding the differences between these terms.

Processes of Data Analysis

Just by Googling "data analytics process" or any similar keyword, one can certainly come across many resources pointing out the basic steps shown in Image 1, i.e. defining business objectives, data gathering, data exploring, data preparation, modeling and evaluation.

Image 1: Data Analytics Process Flow (Data Flair 2019)

Defining business objectives, data gathering and communication clearly speak for themselves. However, the data exploring and data preparation steps seem to be intertwined with one another. Let's dive into the details now.

Data Exploring vs. Data Preparation

Once data is obtained, it is essential to verify its quality to get an insight into whether it provides us with enough information to solve our business objectives and whether it is ready for modeling. This step is called data exploring.

When all of the above issues are noted and defined, it is time to actually clean the data and put it in a format that enables further analysis. This is data preparation.

So what is Data Wrangling?

All the activity that you do on the raw data to make it “clean” enough to input to your analytical algorithm is called data wrangling or data munging. — Shubham Simar Tomar 2016

The above definition indicates that both data exploring and data preparation can be combined under one term: data wrangling, also known as data munging.

Data wrangling consists of three steps: gather, assess and clean.

Gather

Data is obtained by multiple methods — downloading from a site, web scraping, API queries, connecting with databases, etc.
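As a minimal sketch of the gathering step, reading a downloaded CSV file with pandas is often all it takes; here an in-memory string stands in for a file on disk (the column names are made up for illustration):

```python
import io

import pandas as pd

# An in-memory string standing in for a CSV file downloaded from a site.
raw_csv = """name,score
alice,90
bob,85
"""

# pd.read_csv accepts a file path, a URL, or any file-like object,
# so the same call covers local files and direct downloads.
df = pd.read_csv(io.StringIO(raw_csv))
print(df.shape)  # 2 rows, 2 columns
```

For API queries or database connections, the gathering code changes (e.g. an HTTP client or `pd.read_sql`), but the goal is the same: get the raw data into a structure you can assess.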

Assess

Once data is obtained, we can assess the dataset either visually or by using code (i.e. programmatically). Either way, what needs to be assessed is the data's quality and tidiness.

Data quality is measured along many metrics/dimensions (Table 1). For instance, while assessing data visually or programmatically, we need to observe whether data entries are consistent, whether columns' datatypes are what they are supposed to be, and whether any missing values or placeholder values are present.
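A programmatic assessment can be sketched in a few lines of pandas; the toy DataFrame below (invented for illustration) has exactly the kinds of quality issues mentioned above: a missing value, a duplicated row, and a numeric column stored as strings:

```python
import pandas as pd

# A toy dataset with deliberate quality issues.
df = pd.DataFrame({
    "city": ["Prague", "Brno", "Brno", None],
    "population": ["1300000", "380000", "380000", "95000"],
})

# Datatype check: 'population' is stored as strings (object),
# not as an integer type.
print(df.dtypes)

# Missing values: one entry in 'city' is null.
print(df.isnull().sum())

# Consistency: one row is a full duplicate of another.
print(df.duplicated().sum())
```

None of these checks changes the data; at the assess stage we only take note of the problems, which are then fixed in the clean step.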

Table 1: Data quality dimensions (Pipino, Lee & Wang 2002)

Tidiness is harder to explain. Simply put, tidiness refers to the structure of the data. Hadley Wickham examined this concept in his article, where he stated that:

In tidy data:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

Common things that one can notice in a messy dataset are:

  • Column headers are values, not variable names
  • Multiple variables are stored in one column
  • Variables are stored in both rows and columns
  • Multiple types of observational units are stored in the same table
  • A single observational unit is stored in multiple tables
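The first of these problems, column headers that are values rather than variable names, can be illustrated with `pd.melt`. The tiny dataset below is invented for illustration: the year columns are really values of a "year" variable, and melting them restores one observation per row:

```python
import pandas as pd

# Messy: '2019' and '2020' are values of a 'year' variable,
# yet they appear as column headers.
messy = pd.DataFrame({
    "country": ["CZ", "SK"],
    "2019": [100, 80],
    "2020": [110, 85],
})

# pd.melt gathers the year columns into a single 'year' variable,
# so each row becomes one (country, year, value) observation.
tidy = messy.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```

After melting, the table has four rows (two countries times two years) and three columns, one per variable, matching Wickham's three rules above.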

Clean

In the actual cleaning process, we first define the problems in the data, then write code to fix them, and finally test to see whether the changes have been made.
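The define, code, test loop can be sketched on a single made-up quality issue (a price column stored as dollar strings):

```python
import pandas as pd

df = pd.DataFrame({"price": ["$10", "$25", "$7"]})

# Define: 'price' stores dollar strings instead of numbers.

# Code: strip the currency symbol and convert to an integer dtype.
df["price"] = df["price"].str.lstrip("$").astype("int64")

# Test: assert that the fix actually took effect.
assert df["price"].dtype == "int64"
assert df["price"].sum() == 42
```

Writing the test as an assertion keeps the cleaning honest: if a later change to the code breaks the fix, the notebook fails loudly instead of silently carrying bad data forward.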

EDA

The next step after obtaining a cleaned dataset is EDA, which stands for Exploratory Data Analysis.

Be aware of another term with the same abbreviation: Explanatory Data Analysis. In data science, exploratory data analysis involves examining the distributions of the variables in the dataset, identifying outliers, finding trends and patterns, and looking for relationships between variables, for example by using heat maps or correlation matrices. Explanatory data analysis, on the other hand, focuses on relaying the message of what our data is trying to say to the public. It focuses on explaining why and how an event happened or is about to happen.
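A minimal exploratory sketch, on a toy dataset invented for illustration, shows the two basic moves: summarising each variable's distribution and checking for a relationship between variables:

```python
import pandas as pd

# A toy dataset: does studying longer go with higher scores?
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score":    [52, 60, 65, 72, 81],
})

# Distribution summaries for every numeric variable.
print(df.describe())

# Pearson correlation between the two variables; in a larger dataset
# df.corr() would give the full matrix to feed into a heat map.
corr = df["hours_studied"].corr(df["exam_score"])
print(corr)  # close to 1: a strong positive relationship
```

In practice this is where plots come in (histograms, scatter plots, heat maps of the correlation matrix), but the questions they answer are the same as these two printed summaries.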

To Sum Up

Data wrangling is the process of working with raw data and transforming it into a format in which it can be passed on to exploratory data analysis.

Data wrangling is wrapped up in three steps (gather, assess and clean), in which the quality and tidiness of the dataset are examined, while EDA is about identifying the anomalies, patterns, trends and relationships hidden in the data. Explanatory data analysis, on the other hand, is about communicating those insights to the public.

References

Udacity — Data Analyst Nanodegree Programme

