The greatest value of a picture is when it forces us to notice what we never expected to see. -John Tukey
In the bigger picture of data-driven science, we start by collecting a data set of reasonable size and then look for patterns that ideally will play the role of hypotheses for future analysis. Exploratory data analysis (EDA) is the search for patterns and trends in a given data set. Its goal is to get you thinking about your data and reasoning about your question, i.e., to make sure you have the right data, to spot any problems with the data set, to determine whether the data can answer your question, and to get a rough idea of what the answer will look like.
People are not very good at looking at a column of numbers or a whole spreadsheet and then determining important characteristics of the data. They find looking at numbers to be tedious, boring, and/or overwhelming. Exploratory data analysis techniques have been devised as an aid in this situation. Most of these techniques work in part by hiding certain aspects of the data while making other aspects more clear.
Overall, the goal of exploratory analysis is to examine or explore the data and find relationships that weren’t previously known. Exploratory analyses explore how different measures might be related to each other but do not confirm that relationship as causal. After the basic exploratory analysis, we can pause and consider whether our question needs refinement or whether we need to collect more or new data.
Exploratory data analysis is about looking carefully at your dataset to identify errors in data collection or processing, find violations of statistical assumptions, and suggest interesting hypotheses.
As a first step, we could answer the following questions: Who constructed this dataset, when, and why? What is the size of the data, and what do the various columns describe?
EDA always precedes formal (confirmatory) data analysis. EDA is useful for:
- Detection of mistakes
- Checking of assumptions
- Determining relationships among the explanatory variables
- Assessing the direction and rough size of relationships between explanatory and outcome variables
- Preliminary selection of appropriate models of the relationship between an outcome variable and one or more explanatory variables.
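As one illustration of "detection of mistakes", the sketch below flags values that lie far outside the bulk of a column using quartile-based fences. The 1.5 multiplier is a common convention rather than something stated in this text, and the function name `iqr_outliers` and the sample ages are made up for the example. A flagged value is only a candidate for a closer look, not proof of an error.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical column of ages with one likely data-entry error.
ages = [23, 25, 31, 35, 38, 41, 44, 52, 250]
suspects = iqr_outliers(ages)
print(suspects)
```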
EDA Methods:
- An EDA method is either non-graphical or graphical.
- Each method is either univariate or multivariate (usually just bivariate). Overall, the four types of EDA are univariate non-graphical, multivariate non-graphical, univariate graphical, and multivariate graphical.
- Non-graphical methods generally involve calculation of summary statistics, while graphical methods obviously summarize the data in a diagrammatic or pictorial way.
- Univariate methods look at one variable (data column) at a time, while multivariate methods look at two or more variables at a time to explore relationships. Usually our multivariate EDA will be bivariate (looking at exactly two variables), but occasionally it will involve three or more variables. It is almost always a good idea to perform univariate EDA on each of the components of a multivariate EDA before performing the multivariate EDA.
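The distinction between these types can be sketched on a tiny table. The records below are invented for illustration, and the example uses only the Python standard library; a crude text bar chart stands in for a real plot, and multivariate graphical methods (for example, a scatter plot) are left out because they need a plotting library.

```python
from collections import Counter, defaultdict
import statistics

# Hypothetical records: (coast, season, pm25). All values are made up.
rows = [
    ("east", "winter", 9.8),
    ("east", "summer", 7.1),
    ("east", "winter", 10.4),
    ("west", "summer", 6.2),
    ("west", "winter", 8.9),
    ("west", "summer", 5.7),
]

# Univariate non-graphical: frequency table for a single column.
season_counts = Counter(season for _, season, _ in rows)

# Bivariate non-graphical: summary of one column grouped by another.
by_coast = defaultdict(list)
for coast, _, pm25 in rows:
    by_coast[coast].append(pm25)
mean_by_coast = {coast: statistics.mean(vals) for coast, vals in by_coast.items()}

# Univariate graphical: a crude text bar chart of the frequency table.
for season, count in sorted(season_counts.items()):
    print(f"{season:<8}{'#' * count}")

print(mean_by_coast)
```

Note that the univariate step comes first here, in line with the advice to look at each variable on its own before relating it to others.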
Before EDA:
- Check the size and type of data
- See if the data is in an appropriate format
- Convert the data to a format you can easily manipulate (without changing the data itself)
- Sample a test set, set it aside and never look at it
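Setting a test set aside can be as simple as a seeded random split. The following is a minimal sketch, assuming the data fits in memory as a list of rows; the function name `split_train_test`, the 20% fraction, and the seed are illustrative choices, not prescribed by this text.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split off a held-out test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy, so the original order is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 data rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed means the same rows land in the test set on every run, so exploration never leaks into the held-out rows by accident.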
Key Steps of EDA:
- Grab a copy of the data and read it in using appropriate Python or R libraries
- Check the packaging, e.g., check the number of rows and columns and see that they match the dataset description
- Study each attribute and its characteristics (name, type, % of missing values, noisiness, usefulness, type of distribution)
- Look at the top and the bottom of your data, e.g., using the head() and tail() commands, to get a sense of what the rows look like
- Compute summary statistics, e.g., Tukey’s five-number summary is a great start for numerical values; it consists of the extreme values (max and min) plus the median and the quartiles
- Visualize the data: compute pairwise correlations and plot distributions to identify relationships and classes
- Formulate your question, e.g., are air pollution levels higher on the east coast than on the west coast? For supervised machine learning, identify the target attribute
- Study how you would solve the problem manually
- Identify the promising transformations you may want to apply
- Make plans to collect more or different data (if needed and if possible)
All of this is an iterative process to get at the truth and answer our questions.
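The first few key steps above (reading the data, checking the packaging, looking at the top and bottom, measuring missing values, and computing a five-number summary) can be sketched with the Python standard library alone. The CSV content, the column names, and the rule that an empty string means "missing" are all assumptions made for this example. Quartiles can be defined in several ways; the "inclusive" method is used here as one reasonable choice.

```python
import csv
import io
import statistics

# Hypothetical CSV content standing in for a real file; all values are made up.
RAW = """station,coast,pm25
s1,east,8.1
s2,east,9.4
s3,east,
s4,west,6.7
s5,west,7.2
s6,west,12.0
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# Check the packaging: do the dimensions match the dataset description?
n_rows, n_cols = len(rows), len(rows[0])
print(f"{n_rows} rows x {n_cols} columns")

# Look at the top and the bottom of the data.
head, tail = rows[:2], rows[-2:]
print(head)
print(tail)

# Percentage of missing values per column (empty string = missing here).
missing_pct = {
    col: 100 * sum(1 for r in rows if r[col] == "") / n_rows
    for col in rows[0]
}
print(missing_pct)

# Five-number summary of a numeric column, ignoring missing entries.
pm25 = sorted(float(r["pm25"]) for r in rows if r["pm25"] != "")
q1, median, q3 = statistics.quantiles(pm25, n=4, method="inclusive")
five_number = (pm25[0], q1, median, q3, pm25[-1])
print(five_number)
```

With pandas, the library that the head() and tail() commands mentioned above come from, the same checks would typically be `df.shape`, `df.head()`, `df.tail()`, `df.isna().mean()`, and `df.describe()`.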
Statistical knowledge required:
(Expectation and mean, variance and standard deviation, covariance and correlation, median, quartiles, interquartile range, percentile/quantile, mode)
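As a quick refresher, each of these quantities can be computed on a toy sample with the standard library. The two lists are made-up numbers. Variance, standard deviation, and covariance use the sample (n - 1) convention here, and quartiles and percentiles use the "inclusive" method; other conventions exist and give slightly different values. The `covariance` helper is written out by hand for illustration.

```python
import statistics

# Two made-up numeric columns of equal length.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
y = [1.0, 3.0, 2.0, 5.0, 6.0, 7.0]

mean_x = statistics.mean(x)
var_x = statistics.variance(x)   # sample variance (divides by n - 1)
std_x = statistics.stdev(x)      # square root of the sample variance
median_x = statistics.median(x)
mode_x = statistics.mode(x)      # most frequent value

# Quartiles, interquartile range, and the 90th percentile.
q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
iqr_x = q3 - q1
p90_x = statistics.quantiles(x, n=100, method="inclusive")[89]

def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

cov_xy = covariance(x, y)
# Pearson correlation: covariance rescaled by both standard deviations.
corr_xy = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(mean_x, var_x, median_x, mode_x, iqr_x, p90_x, cov_xy, round(corr_xy, 3))
```

Correlation is covariance stripped of units, which is why it always falls between -1 and 1 and can be compared across pairs of variables.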
Visualization knowledge required:
Knowing which methods are suitable for which type of data.
- Andrew Abela. Advanced Presentations by Design: Creating Communication that Drives Action. Pfeiffer, 2nd edition, 2013.
- The Art of Data Science, A Guide for Anyone Who Works with Data by Roger D. Peng and Elizabeth Matsui
- Chapter 4, Experimental Design and Analysis by Howard J. Seltman
- Section 1, Hands-On Exploratory Data Analysis with Python
- Chapter 1, Practical Statistics for Data Scientists, 2nd Edition