Introduction
Exploratory Data Analysis (EDA) is a crucial step in the data science and analytics workflow. At its core, it involves examining and understanding a dataset; in this post, I will also cover why EDA matters, along with some best practices and pitfalls to avoid.
What is Exploratory Data Analysis (EDA)?
Exploratory data analysis is a systematic approach to examining and understanding the dataset. It involves summarizing main characteristics, often with the help of visualizations, to uncover patterns, anomalies, and relationships within the dataset, helping you gain insights and understand your data before diving into more advanced analytics.
The EDA approach is precisely that – an approach – not a set of techniques, but an attitude or philosophy about how data analysis should be carried out.
Importance of EDA in Data Science
Why is EDA important? Why can’t we just plug and chug with whatever dataset we are given, assuming it is correct?
Well, how do you know it is correct? What if there are errors, such as missing values or data that was entered incorrectly during collection? We don’t know until we explore the data. EDA helps spot errors and inconsistencies. It can also reveal trends, patterns, and relationships that an initial glance at the dataset might miss, and it is an important step in feature selection for machine learning models. Essentially, EDA is the foundation of any successful data analysis.
Exploratory data analysis is more than just cleaning and preprocessing. It is about asking the right questions, visualizing the dataset and understanding its nuances. By mastering basic EDA, we are well-equipped to tackle more complex tasks.
The EDA Process
We can split the exploratory data analysis process into five general steps:
- Data collection
- Data cleaning
- Initial data exploration
- Hypothesis generation
- Advanced EDA
This is not a linear process but an iterative one, and each of these general steps can be broken down further.
Data Collection
Before starting any analysis, it is important to obtain the data and load it into your environment or tool. This means collecting data from sources such as databases, APIs, or files, and then importing it into the tool you are working in, such as Python or R. Ensure the data is complete and accurately represents the question or problem at hand.
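To make this concrete, here is a minimal sketch in Python using pandas, assuming the data lives in a CSV file; the file name sales_data.csv is just a placeholder for whatever source you are actually working with.

```python
import pandas as pd

# Load the raw data; "sales_data.csv" is a hypothetical file name
df = pd.read_csv("sales_data.csv")

# A few quick checks that the load worked and the data looks complete
print(df.shape)    # number of rows and columns
print(df.head())   # first few records
print(df.dtypes)   # column data types
```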
Data Cleaning
This step involves identifying and addressing potential issues within the dataset, such as missing or NULL values, duplicates, outliers, and other inconsistencies. Cleaning the dataset ensures its quality and reliability for the analysis.
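Continuing the pandas sketch from above (the column name "price" is purely illustrative), a first cleaning pass might look like this:

```python
# Count missing values per column to see where the gaps are
print(df.isna().sum())

# Drop exact duplicate rows
df = df.drop_duplicates()

# Simple median imputation for a numeric column (one of several possible strategies)
df["price"] = df["price"].fillna(df["price"].median())

# Flag potential outliers with the 1.5 * IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers flagged")
```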
Initial Data Exploration
Data Summarization
In this step, we compute descriptive statistics to understand the dataset’s basic properties. This gives information about the distribution, mean, median, mode, variance, standard deviation, and quartiles. These numbers provide an overview of the dataset’s central tendencies and spread and can reveal potential issues that still need cleaning.
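With pandas, most of these summaries are one-liners; here is a quick sketch, again using the hypothetical df and "price" column from above:

```python
# Count, mean, std, min, quartiles, and max for every numeric column
print(df.describe())

# Individual statistics for a single column
print(df["price"].median())
print(df["price"].mode())
print(df["price"].var(), df["price"].std())
```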
Univariate, Bivariate, and Multivariate Analysis
Univariate analysis examines individual variables in the dataset to understand their distribution, range, and basic statistics. Histograms, box plots, and bar charts are common ways to explore these variables.
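As a rough sketch of univariate plots in Python with seaborn (column names are illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of a single numeric variable
sns.histplot(data=df, x="price", bins=30)
plt.show()

# Spread and potential outliers
sns.boxplot(data=df, x="price")
plt.show()

# Frequencies of a categorical variable
sns.countplot(data=df, x="category")
plt.show()
```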
Bivariate analysis investigates relationships between pairs of variables to uncover correlations or associations. Scatter plots, correlation matrices, and pair plots help show how these variables relate to each other.
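A corresponding bivariate sketch, assuming the same hypothetical DataFrame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Relationship between two numeric variables
sns.scatterplot(data=df, x="price", y="quantity")
plt.show()

# Pairwise correlations between all numeric columns, shown as a heatmap
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Scatter plots for every pair of numeric columns
sns.pairplot(df.select_dtypes("number"))
plt.show()
```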
Multivariate analysis explores the interactions between multiple variables simultaneously. Dimensionality reduction techniques such as Principal Component Analysis (PCA) are used to understand complex relationships within the dataset.
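Here is a minimal PCA sketch with scikit-learn, using the numeric part of the same hypothetical DataFrame; features are standardized first because PCA is sensitive to scale:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Keep only numeric columns and drop rows with missing values
numeric = df.select_dtypes("number").dropna()
scaled = StandardScaler().fit_transform(numeric)

# Project onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component

plt.scatter(components[:, 0], components[:, 1], alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```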
Hypothesis Generation
Based on the insights gained from the initial exploration, we can generate hypotheses, ideas, and questions for further analysis and modeling. These can prompt us to go back and explore further, or they can guide more formal statistical tests or models to validate the relationships we think we see.
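For instance, a hypothesis like "weekend orders have a different average price than weekday orders" could be checked with a simple two-sample test. Here is a hedged sketch using SciPy, where the is_weekend and price columns are assumptions made purely for illustration:

```python
from scipy import stats

# Hypothetical boolean column splitting the data into two groups
weekend = df.loc[df["is_weekend"], "price"]
weekday = df.loc[~df["is_weekend"], "price"]

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(weekend, weekday, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```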
Tools and Resources
Most data professionals use programming languages or platforms such as Excel, PowerBI, Qlik, and Tableau, which are also common visualization tools. Python and R are the most popular programming languages for EDA, and both come with many useful libraries (Python: pandas, matplotlib, numpy, and seaborn; R: tidyverse, dplyr, ggplot2, etc.).
Sometimes data isn’t readily available for us to use, but luckily there are many open-source datasets and repositories, which I list below.
Data Visualization Tools and Languages
I mentioned some visualization tools above, such as Excel and the other platforms. I will go over the benefits and pitfalls of each here:
- Excel: a widely used tool in industry for data analysis due to its accessibility and familiarity among a broad range of users.
- Benefits: Excel offers a user-friendly interface, making it easy to use for users with varying levels of expertise, and it is quick and easy to create simple visualizations. It also lets users perform a quick, basic analysis with calculations, sorting, and filtering on the data before visualization. It can handle various data formats and is compatible with other Microsoft Office products.
- Pitfalls: However, in my opinion, there are some disadvantages to using Excel. One is its limited ability to handle large or complex datasets: it can struggle to process extensive amounts of data, limiting its use for in-depth analysis. Another is that, while it offers several chart types, the level of customization is limited compared to dedicated data visualization tools.
- PowerBI: Another Microsoft product and a powerful business analytics tool, PowerBI offers various benefits, along with some pitfalls to consider when comparing visualization and business intelligence tools.
- Benefits: Like Excel, it offers a user-friendly interface, making it accessible for users with varying levels of expertise. It connects seamlessly with numerous platforms and data sources for analysis. It also offers interactive and dynamic visualizations that allow users to explore data and gain insights through filters, slicers, and drill-down features, which enhance the user experience and facilitate deeper data exploration. It provides a diverse selection of visualization types, including standard charts (bar, line, pie), maps, tables, matrices, and custom visuals, allowing for comprehensive and visually appealing representations of data. Compared to Excel as described above, PowerBI can handle large datasets efficiently, and its performance scales well even with extensive amounts of data, making it suitable for enterprise-level use.
- Pitfalls: While the interface is user-friendly, mastering the full capabilities of PowerBI involves a learning curve, particularly for advanced functionality and the DAX (Data Analysis Expressions) language for complex calculations. Also, while it offers a free version, the full functionality comes at a cost, especially for enterprise-level use or additional features, which may be a consideration for smaller organizations or individual users. However, some users may find the subscription for individual use affordable.
- Tableau: This is a very powerful and popular tool that offers various benefits as well as pitfalls.
- Benefits: Tableau provides an intuitive and user-friendly interface, making it accessible for users with varying levels of expertise in data analysis and visualization. Its drag-and-drop functionality simplifies the creation of visualizations. Also, Tableau connects to a wide range of data sources, allowing users to fetch data from multiple sources for analysis. It integrates well with various platforms, databases, and file types. Like PowerBI, Tableau offers highly interactive and dynamic visualizations, allowing users to explore data and gain insights through filters, parameters, and drill-down features. It provides a comprehensive range of visualization types, including standard charts, maps, dashboards, and storyboarding options. Users can create visually appealing and informative dashboards using various visualization options. Tableau efficiently handles large datasets and maintains performance even when dealing with extensive amounts of data. Its in-memory data engine allows for quick data exploration and analysis. Due to its popularity, Tableau has a vast and active user community. It offers extensive resources, forums, and user-generated content, providing support, guidance, and shared knowledge for users at all levels.
- Pitfalls: Tableau can be expensive, especially for larger organizations or for accessing advanced features. The cost of licensing may be a barrier for smaller businesses or individual users, although there is a free, public version with limited capabilities. Mastering Tableau’s more advanced features also involves a learning curve, especially when using complex calculations or custom SQL.
- Qlik: Qlik is a robust business intelligence and data visualization platform that comes with several benefits, as well as certain limitations that users should consider.
- Benefits: Qlik uses an associative model, enabling users to dynamically explore and analyze data without predefined paths or queries. This allows for flexible and intuitive data discovery. It employs in-memory processing, allowing for fast data analysis and interactive visualizations, even with large datasets. Users can explore and manipulate data swiftly without waiting for extensive load times. Qlik offers smart and dynamic visualizations, and the ability to create interactive dashboards. Users can easily generate compelling visual representations with drag-and-drop functionalities. It connects to a wide array of data sources and provides powerful data integration capabilities, enabling users to combine multiple data sources for a comprehensive analysis.
- Pitfalls: Qlik can be expensive, particularly for larger-scale implementations or for accessing advanced features. The cost of licensing and maintenance might be a concern for smaller businesses or individual users. Qlik’s associative data model and complex functionalities might result in a steep learning curve for new users, particularly when understanding the data structure and leveraging advanced features effectively. Larger-scale Qlik deployments may require significant planning and setup, along with ongoing maintenance and governance efforts, which could be challenging for organizations without dedicated resources.
Each tool comes with its own benefits and limitations to weigh before adopting it. Personally, I like using PowerBI and Tableau for personal, simple dashboards, and Python or R for more complex visualizations and apps. I will go over some advantages and disadvantages of using Python or R instead:
- Python: It has a vast ecosystem of libraries like Matplotlib, Seaborn, Plotly, and Bokeh for visualization, each offering different strengths, from basic plotting to interactive and complex visualizations. Python seamlessly integrates visualization with data analysis, making it easy to transition from data manipulation and analysis to creating visual representations. It allows extensive customization of visualizations, enabling users to fine-tune every aspect of a plot, making it suitable for both simple and highly complex visualizations. Python has a large and active community. Users can access numerous resources, tutorials, and community support for data visualization. Unfortunately, Python, especially its visualization libraries, might have a learning curve, particularly for beginners or users without programming experience.
- R: It provides specialized packages like ggplot2, Plotly, and lattice, designed specifically for visualization. These packages offer sophisticated and publication-quality visualizations. R seamlessly integrates statistical analysis with visualization, making it a preferred choice for researchers and statisticians. It has a strong user community, providing a vast collection of packages and active support for creating diverse visualizations. Unfortunately, R’s syntax and approach might have a steep learning curve, especially for users without a programming or statistical background. Also, while R excels in statistical analysis and visualization, its use might be limited for applications outside of these areas.
Both Python and R are powerful tools for data visualization, offering numerous libraries and capabilities for creating diverse and sophisticated visualizations. Python is versatile, known for its broad applicability and extensive libraries, making it suitable for various domains. R, on the other hand, is specifically tailored for statistical analysis and visualization, excelling in those areas.
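As a small example of the interactivity Python offers, here is a sketch with Plotly Express (column names are again illustrative); the resulting chart supports hover tooltips, zooming, and panning out of the box:

```python
import plotly.express as px

# Interactive scatter plot colored by a categorical column
fig = px.scatter(df, x="price", y="quantity", color="category")
fig.show()
```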
Online Datasets and Repositories
As mentioned previously, sometimes we do not have data readily available for us to use. Luckily, there are several open source datasets and online repositories available for us to practice our skills. Below is a list of some of the most popular:
- DataHub
- Kaggle
- World Bank
- DrivenData
- Time Series Classification Website
- Repository on GitHub
- US Government
Best Practices
Employing best practices during exploratory data analysis ensures that the dataset is understood, errors are identified, and valuable insights are extracted. Here are some best practices for EDA:
- Understand Context: It’s important to comprehend the purpose of the analysis, the data sources, and the business problem the analysis aims to address.
- Data Profiling: Profile the dataset to understand its structure, dimensions, types, and distributions of variables. This includes summary statistics, missing values, and unique values within each column.
- Data Cleaning: Handle missing values, outliers, and inconsistencies in the dataset. Use methods like imputation, deletion, or transformation to ensure data quality.
- Data Visualizations: Use a variety of visualizations (histograms, box plots, scatter plots, etc.) to understand the distribution, relationships, and patterns within the data. Visualization aids in identifying trends, clusters, outliers, and potential correlations.
- Iterative Process: EDA is an iterative process. Insights gained from initial analysis may prompt further exploration. Revise and refine the analysis as new findings come to light.
- Documentation: Document all observations, findings, and transformations made during the EDA process. Keeping track of these insights helps in communicating results and understanding the process later.
- Collaboration and Feedback: EDA often benefits from collaboration. Seeking feedback from domain experts or peers can provide different perspectives and improve the quality of the analysis.
- Focus on Business Objectives: Align the EDA process with the business objectives. Ensure that the insights gained support the overall goals of the analysis.
Applying these best practices ensures a systematic and comprehensive understanding of the data, leading to valuable insights that can drive informed decision-making and subsequent analysis.
Pitfalls to Avoid
Exploratory data analysis is a critical step in data analysis, but it comes with its share of potential pitfalls that analysts and data scientists should be aware of:
- Overlooking Data Quality Issues: EDA assumes that the data is clean and accurate. However, overlooking data quality issues, such as missing values, outliers, or errors, can lead to incorrect conclusions.
- Confirmation Bias: Analysts may subconsciously seek patterns or insights that confirm their preconceived notions or hypotheses. Confirmation bias can lead to the misinterpretation of data and the failure to consider alternative explanations. Remember the saying “If you torture the data long enough, it will confess.”
- Cherry-Picking Results: Selectively focusing on specific aspects of the data that support a particular narrative while ignoring other relevant findings can lead to biased and incomplete insights.
- Lack of Hypothesis Testing: EDA should be hypothesis-driven. Failing to set hypotheses before conducting the analysis can lead to a lack of focus and structured exploration. Hypothesis testing helps guide the analysis and ensures that findings are statistically meaningful.
- Not Considering External Factors: EDA often focuses solely on the data without considering external factors or context. Neglecting the influence of external variables or events can lead to inaccurate conclusions.
- Data Selection Bias: EDA should be conducted on a representative sample of data. Selecting non-representative subsets of data can skew results and misrepresent the overall population.
- Ignoring Multicollinearity: Multicollinearity occurs when independent variables in a regression analysis are highly correlated. Failing to address multicollinearity can result in misleading conclusions about the relationships between variables (a quick way to check for it is sketched after this list).
- Not Considering Outliers: Outliers can significantly impact the distribution and relationships in the data. Ignoring or mishandling outliers can distort EDA results.
- Overemphasizing Data Visualization: While visualization is a powerful EDA tool, relying solely on visuals and not complementing them with quantitative analysis can lead to superficial insights and missed details.
- Ignoring Ethical Considerations: Failing to consider ethical issues related to data privacy, bias, or potential harm when conducting EDA can have serious consequences, including legal and reputational risks.
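As one concrete guard against the multicollinearity pitfall above, here is a sketch that computes variance inflation factors (VIFs) with statsmodels; it assumes the hypothetical DataFrame df used earlier, and values well above roughly 5-10 are a common warning sign:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Build a design matrix from the numeric predictors (add_constant adds an intercept column)
X = add_constant(df.select_dtypes("number").dropna())

# One VIF per column; high values indicate a variable is largely explained by the others
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))
```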
To mitigate these pitfalls, it’s essential to approach exploratory data analysis with a structured plan, clear hypotheses, and a critical mindset. Being aware of potential biases, data quality issues, and the limitations of the analysis can help ensure that EDA leads to accurate and valuable insights.
Wrap Up
In summary, exploratory data analysis plays a fundamental role in understanding the nature of data, enabling informed decision-making, and forming the basis for more advanced analytics and modeling, making it an indispensable step in the journey from raw data to valuable insights.
References and Further Reading
- Bruce, P., Bruce, A., & Gedeck, P. (2019). Practical Statistics for Data Scientists (2nd ed.). O’Reilly.
- Mendenhall, W., & Sincich, T. (1995). Statistics for Engineering and the Sciences. Prentice Hall.
- Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical Statistics with Applications (7th ed.). Thomson Learning.
- Spiegel, M. R., Schiller, J., & Srinivasan, R. A. (2013). Probability and Statistics (4th ed.). McGraw Hill.