Data Exploration

Akilankm
2 min read · Jan 25, 2021

Hi Everyone,
In this article, we will look at data exploration and its techniques. Let’s go through them one by one…

Photo by Myriam Jessier on Unsplash

1.1 Variables

Definition: Any measurable property or characteristic of a phenomenon being observed. These are called ‘variables’ because the values they take may vary (and usually do) across a population.

Dependent and Independent variables

A dependent variable is the variable that captures the phenomenon we are studying.

Independent variables are the ones through which we try to explain the value of the dependent (output) variable, by modeling a relationship between the independent and dependent variables.
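
In code, this distinction usually shows up as a split between a table of features and a target column. Here is a minimal sketch with pandas, assuming a made-up housing table in which `price` is the dependent variable (all column names and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
df = pd.DataFrame({
    "area_sqft": [650, 800, 1200, 1500],
    "bedrooms": [1, 2, 3, 3],
    "price": [70_000, 95_000, 150_000, 180_000],
})

target = "price"                 # the dependent variable we want to explain
X = df.drop(columns=[target])    # independent variables
y = df[target]                   # dependent variable

print(list(X.columns))
print(y.name)
```

Which column plays the role of the dependent variable depends on the question being asked, not on the data itself.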

Types of Variables

1.2 Variable Identification

Definition: Identifying the data type of each variable.

Every variable has a specific data type associated with it. Sometimes a variable is recorded on the wrong scale of measurement; we fix this by typecasting the variable.
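
A small sketch of what typecasting might look like in pandas. The table below is invented for illustration: it assumes an `age` column that was read in as text and a `gender_code` column that is really a category stored as integers.

```python
import pandas as pd

# Hypothetical data: a numeric column stored as text,
# and a nominal category coded as integers.
df = pd.DataFrame({
    "age": ["23", "35", "41"],
    "gender_code": [0, 1, 0],
})

print(df.dtypes)   # inspect the recorded data types first

df["age"] = df["age"].astype(int)                          # text -> integer
df["gender_code"] = df["gender_code"].astype("category")   # integer -> categorical

print(df.dtypes)   # the data types after typecasting
```

Checking `df.dtypes` before and after the cast is a cheap way to confirm that each variable ended up on the intended scale.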

Note: In reality, a variable may hold values of mixed type for a variety of reasons. For example, in credit scoring, “Missed payment status” is a common variable that can take the values 1, 2, 3, meaning that the customer has missed 1–3 payments on their account. It can also take the value D if the customer defaulted on that account. We may have to convert data types after certain steps of data cleaning.
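
One possible way to handle such a mixed column is to split it into a numeric part and a flag. This is only a sketch: the series below is made up to mirror the credit-scoring example, and splitting it this way is an assumption, not the only reasonable cleaning strategy.

```python
import pandas as pd

# Hypothetical values mirroring the "Missed payment status" example.
status = pd.Series(["1", "2", "D", "3", "1"], name="missed_payment_status")

# Flag defaults separately, then convert the remaining values to numbers.
defaulted = status.eq("D")
missed_payments = pd.to_numeric(status, errors="coerce")  # "D" becomes NaN

clean = pd.DataFrame({
    "missed_payments": missed_payments,
    "defaulted": defaulted,
})
print(clean)
```

With `errors="coerce"`, anything that cannot be parsed as a number turns into a missing value, so it is worth flagging the special codes before converting.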

1.3 Univariate Analysis

Univariate analysis applies descriptive statistics to one variable at a time, which gives us an overall summary of each variable.
pandas provides built-in methods for univariate analysis, such as describe() and value_counts().
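
As a quick illustration, here is a sketch that summarizes one numeric and one categorical column. The data is invented for the example.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 35],
    "city": ["Chennai", "Mumbai", "Chennai", "Delhi", "Chennai"],
})

numeric_summary = df["age"].describe()        # count, mean, std, min, quartiles, max
category_counts = df["city"].value_counts()   # frequency of each category

print(numeric_summary)
print(category_counts)
```

`describe()` suits numeric variables, while `value_counts()` is usually the more informative summary for categorical ones.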

1.4 Bi-variate Analysis

Descriptive statistics on the relationship between two variables. The same tools can be applied pairwise when there are more than two variables.

A scatter plot is a type of plot that uses Cartesian coordinates to display the values of (typically) two variables for a set of data. If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation.
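
To make the two slopes concrete, here is a sketch on synthetic data. The columns `rising` and `falling` are generated for illustration, and `plot_scatter` is a hypothetical helper that assumes matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
noise = rng.normal(0, 1, size=x.size)
df = pd.DataFrame({
    "x": x,
    "rising": 2 * x + noise,     # dots slope from lower left to upper right
    "falling": -2 * x + noise,   # dots slope from upper left to lower right
})

r_rising = df["x"].corr(df["rising"])     # Pearson correlation, positive here
r_falling = df["x"].corr(df["falling"])   # Pearson correlation, negative here
print(r_rising, r_falling)


def plot_scatter(data, x_col, y_col):
    """Draw a scatter plot of two columns (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(data[x_col], data[y_col])
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return ax
```

Calling `plot_scatter(df, "x", "rising")` and then `plot_scatter(df, "x", "falling")` should show the two patterns described above, and the sign of each correlation coefficient matches the direction of the slope.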

A correlation plot can be used to find insights quickly. It shows the dependence between multiple variables at the same time and highlights the most strongly correlated variables in a data table.
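
The numbers behind a correlation plot come from a correlation matrix. Here is a sketch that computes one with pandas and picks out the most strongly correlated pair. The table is invented for illustration, and ranking pairs by absolute Pearson correlation is just one reasonable choice.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 68, 74, 79],
    "hours_slept": [8, 7, 7, 6, 6, 5],
    "shoe_size": [9, 7, 10, 8, 9, 8],
})

corr = df.corr()   # pairwise Pearson correlation between all numeric columns
print(corr.round(2))

# Rank each distinct pair of columns by the absolute value of its correlation.
cols = list(corr.columns)
pairs = [
    (abs(corr.loc[a, b]), a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
]
strength, col_a, col_b = max(pairs)
print(col_a, col_b, round(strength, 2))
```

Using the absolute value matters here, because a strong negative correlation is just as informative as a strong positive one.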

A heat map (or heatmap) is a graphical representation of data in which the individual values contained in a matrix are represented as colors.
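
A common use is to draw a correlation matrix as a heat map. Below is a minimal sketch, assuming matplotlib is installed. The data is made up, and `plot_heatmap` is a hypothetical helper written for this example, not a pandas or matplotlib function.

```python
import pandas as pd

# Hypothetical data, made up for illustration.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 5, 4, 5],
    "c": [5, 3, 4, 2, 1],
})
corr = df.corr()   # the matrix whose values will be shown as colors


def plot_heatmap(matrix):
    """Draw a DataFrame of values as a colored grid (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    ax.set_xticks(range(len(matrix.columns)))
    ax.set_xticklabels(matrix.columns)
    ax.set_yticks(range(len(matrix.index)))
    ax.set_yticklabels(matrix.index)
    fig.colorbar(image, ax=ax)   # legend mapping colors back to values
    return ax
```

Calling `plot_heatmap(corr)` draws one colored cell per pair of variables. Fixing the color scale to the range -1 to 1 keeps the colors comparable across different datasets.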

I hope this article gave you a good understanding of data exploration and how to get started on your own. If you like my work and would like to support me, join me on LinkedIn.

LinkedIn: https://www.linkedin.com/mynetwork/
