In the vast realm of data analysis, descriptive statistics are a powerful tool for uncovering meaningful insights and summarizing key characteristics of a dataset. Whether you’re a data enthusiast or a professional in any field, understanding descriptive statistics is essential for making informed decisions. In this blog post, we’ll unravel the essence of descriptive statistics and explore its significance in extracting valuable information from data.
What are descriptive statistics?
Given a data set, how can we make sense of the information it contains? In other words, how can we organize and summarize the data in a meaningful way?
Descriptive statistics refers to the collection of methods used to organize, summarize, and present data. It provides a concise and informative summary of the main features, patterns, and trends within a dataset. Summarizing the data set is also a good first step when doing exploratory data analysis (EDA). By utilizing various measures, descriptive statistics allows us to gain a comprehensive understanding of the data’s central tendency, variability, and shape.
In this post, we will go over some basic statistical tools to summarize and describe the data such as tables, graphs and charts.
Definitions
First, let's get some definitions out of the way:
- Deviation – the difference between the observed values and the estimate of location.
- Interquartile range – the difference between the 75th percentile and the 25th percentile
- Mean – the sum of all values divided by the number of values, also called the average.
- Mean absolute deviation – the mean of the absolute values of the deviations from the mean.
- Median – the value such that one-half of the data lies above and below.
- Another phrase for this value is the 50th percentile.
- A percentile is the value such that P percent of the data lies below. Another term for this is quantile
- Median absolute deviations from the median – the median of the absolute values of the deviations from the median.
- Order statistics – metrics based on the data values sorted from smallest to largest.
- Range – the difference between the largest and the smallest value in the data set
- Robust – Not sensitive to extreme values
- Another term for extreme value is outlier.
- An outlier is a data value that is very different from most of the data
- Standard deviation – the square root of the variance
- Trimmed mean, or truncated mean – the average of all values after dropping a fixed number of extreme values
- Variance – the sum of squared deviations from the mean divided by n-1, where n is the number of observations.
- Weighted mean – the sum of all values, each multiplied by its weight, divided by the sum of the weights; just as the mean is known as the average, this can be referred to as the weighted average.
- Weighted median – the value such that one-half of the total weight lies above it and one-half below it in the sorted data.
Using Descriptive Statistics to Summarize Data
Univariate analysis
Univariate analysis is a statistical method to analyze a single variable or data set. It focuses on describing and summarizing the characteristics of that particular variable. The main purpose of univariate analysis is to understand the distribution, central tendency (mean, median, mode), dispersion (variance, standard deviation), and other relevant statistical measures of the variable under consideration. Common graphical representations used in univariate analysis include histograms, box plots, and bar charts.
Bivariate analysis
Bivariate analysis involves the examination of the relationship between two variables simultaneously. It aims to determine whether there is a statistical association, correlation, or dependency between the two variables. Bivariate analysis helps in understanding how changes in one variable are related to changes in another. The analysis uses different statistical techniques such as correlation analysis, scatter plots, contingency tables, and cross-tabulations. It provides insights into the strength, direction, and nature of the relationship between the variables.
Multivariate analysis
Multivariate analysis is an extension of bivariate analysis and involves the simultaneous analysis of three or more variables. It explores complex relationships among multiple variables to gain a comprehensive understanding of the data. Multivariate analysis allows researchers to investigate the interdependencies, interactions, and patterns within a dataset. It utilizes advanced statistical methods such as multiple regression, factor analysis, cluster analysis, and principal component analysis. By examining multiple variables together, multivariate analysis enables researchers to uncover hidden patterns, identify underlying factors, and make predictions or classifications based on the data.
Estimates of Location
We often use the term estimate to refer to a value calculated from the data set to highlight a distinction between the data and the theoretical or true value. In business, you may have heard this referred to as a metric. When we talk about looking at estimates of location, we really mean we want to see where the majority of the data is located. We also call this “measures of central tendency.”
When we discuss estimates of location, a first thought is to take the MEAN.
Mean
The mean is the average. To calculate the mean, one adds up every observation, then divides by the number of observations.
For example, with the set of numbers {50, 72, 98, 92, 83}, we add 50 + 72 + 98 + 92 + 83 = 395 and divide by 5, since there are 5 values: 395 / 5 = 79. So the average value of that set of numbers is 79.
You may see the mathematical notation of the mean, which is below:

$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

and we pronounce the variable to the left of the equals sign as “x bar”.
In data science, we use N (or n) to refer to the total number of observations. In the simple example of the set of numbers, each number is an observation. However, in statistics, this notation is a little more vital than in data science. A capital N refers to a population, and a lower case n is for a sample FROM a population. But in data science, you may see it used interchangeably.
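As a quick sanity check, here is the same calculation in Python; this is just a minimal sketch using the numbers from the example above.

```python
# Computing the mean of the example set by hand and with numpy
import numpy as np

values = [50, 72, 98, 92, 83]

mean_manual = sum(values) / len(values)  # 395 / 5
mean_numpy = np.mean(values)

print(mean_manual, mean_numpy)  # both print 79.0
```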
Other Forms of the Mean
Sometimes using the familiar mean is not preferable, such as in the sport of diving.
Trimmed Mean
There are usually five judges; the top score and the bottom score are dropped, and the remaining three scores are averaged to produce the final score. We call this a TRIMMED MEAN. It is more resistant to outliers than the ordinary mean, which prevents a single judge from skewing the score in favor of their own team or country.
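Here is a minimal sketch of that scoring scheme in Python; the judges' scores are made up for illustration, and it assumes SciPy is available for the built-in helper.

```python
# Five judges' scores: drop the highest and lowest, average the remaining three
import numpy as np
from scipy import stats

scores = np.array([8.0, 8.5, 9.0, 9.5, 6.5])

# Manual trimmed mean: sort, drop one score from each end, average the rest
trimmed_manual = np.sort(scores)[1:-1].mean()

# scipy equivalent: proportiontocut=0.2 drops 20% (one of five) from each end
trimmed_scipy = stats.trim_mean(scores, proportiontocut=0.2)

print(trimmed_manual, trimmed_scipy)  # both print 8.5
```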
Weighted Mean
Another type of mean is the WEIGHTED MEAN. Sometimes, the data collected does not provide an accurate representation of the measured groups, such as in a survey that may not accurately represent the user base of interest. To rectify this, we use the weighted mean, and give higher weight to the values of the groups that are underrepresented.
Another example is if we are taking measurements and averages of data from sensors, and some of the sensors we know may not be accurate, then we give a lower weight to the measurements from the less accurate sensors.
How to Calculate
To calculate the weighted mean, you multiply each value xi by the specified weight wi, and divide the sum by the sum of the weights.
As an example, say your professor administers four exams during the semester, but the last exam happened to be easier than the ones before it. The professor decides to give it less weight, so the weights for the exams go from 25% each, to 30% for the first three, and 10% for the last exam. Let’s say you scored 72, 81, 84, and 98 on each exam. Then the weighted average for the class is calculated as:
| Weighted score | Value |
|---|---|
| 0.3 * 72 | 21.6 |
| 0.3 * 81 | 24.3 |
| 0.3 * 84 | 25.2 |
| 0.1 * 98 | 9.8 |
| Total (21.6 + 24.3 + 25.2 + 9.8) | 80.9 |

Then, divide by the sum of the weights, which add to 1, so 80.9 / 1 = 80.9.

But if the weights don't add to 1, say the last exam is only worth 5%, then 0.05 * 98 = 4.9 and the weights add up to 0.95. So the weighted sum is 21.6 + 24.3 + 25.2 + 4.9 = 76.0, and 76.0 / 0.95 = 80. So 80% is the weighted average.
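As a quick check of the arithmetic above, numpy's average function accepts weights and divides by their sum automatically; this is a minimal sketch reusing the exam scores and weights from the example.

```python
import numpy as np

scores = np.array([72, 81, 84, 98])

# Weights summing to 1: three exams at 30% and the last at 10%
weights = np.array([0.30, 0.30, 0.30, 0.10])
print(np.average(scores, weights=weights))  # 80.9

# Weights summing to 0.95: the last exam only counts for 5%
weights_alt = np.array([0.30, 0.30, 0.30, 0.05])
print(np.average(scores, weights=weights_alt))  # 80.0
```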
Be aware that this mean is NOT robust to extreme values, and is not always a good statistic to rely on. So, while the weighted mean is easy to compute and practical, it is not always the best value to use.
Median and Other Robust Estimates
The median is the middle number on a sorted list of the data. It only depends on the values at the center of the data, while the mean depends on all of the observations. While depending only on the center values might seem like a disadvantage, there are many situations when the median is the better estimate of location, since the mean is much more sensitive to extreme values.
Weighted Median
And just like one could use the weighted mean, there is also a WEIGHTED MEDIAN, and it is also robust against outliers. The weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list.
To find this value, first sort the data in ascending order, then find the cumulative sum of the weights. Then, find the value associated with the weight whose cumulative sum crosses 50% of the total sum of weights.
Say we are given a set S with elements {4, 5, 7, 11} and weights {1, 2, 3, 5}. The total weight is 1 + 2 + 3 + 5 = 11, and the cumulative weight up through the value 7 is (1 + 2 + 3) / 11 = 0.5455, so the weighted median is 7 because it is the first value whose cumulative share of the weight crosses 50%.
| Weights | Cumulative sum of weights | Cumulative sum divided by total sum of weights |
|---|---|---|
| 1 + 2 | 3 | 3 / 11 = 0.2727 |
| 1 + 2 + 3 | 6 | 6 / 11 = 0.5455 |
| 1 + 2 + 3 + 5 | 11 | 11 / 11 = 1.00 |
Because the cumulative weight of 6 divided by the total weight of 11 equals 0.5455, which is the first share to cross 50%, the value corresponding to that weight, 7, is the weighted median.
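There is no weighted median built into numpy or pandas, so below is a minimal sketch of a helper that implements the procedure just described: sort the values, accumulate the weights, and return the first value whose cumulative weight reaches half of the total.

```python
import numpy as np

def weighted_median(values, weights):
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)           # sort the values, carrying the weights along
    values, weights = values[order], weights[order]
    cum_weights = np.cumsum(weights)     # running total of the weights
    cutoff = weights.sum() / 2.0         # half of the total weight
    return values[np.searchsorted(cum_weights, cutoff)]

print(weighted_median([4, 5, 7, 11], [1, 2, 3, 5]))  # 7
```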
Putting these together
These measures provide a snapshot of where the data tends to cluster, helping us gain insights into its overall behavior.
So, putting the mean, median, and other estimates together in a more realistic example, we have a dataset collected by the U.S. Army Corps of Engineers on fish contaminated by toxic waste from a chemical plant located on the banks of a river in Alabama. Because of the river's location, these contaminated fish could in turn contaminate other animals that prey on them across a wide area.
To view these estimates in a single view in python, we use the built-in method of pandas .describe():
import pandas as pd
# load the dataset (path truncated here)
df = pd.read_excel(r'\...\DDT.xls')
# separate into numeric and categorical columns
df_num = df.select_dtypes(include='number')
df_cat = df.select_dtypes(include='object')
# summary statistics for every numeric column
df_num.describe()
Here, we get a view of the count, mean, standard deviation, minimum value, maximum value, and different percentiles, such as the 50th percentile, which is also the median. To look at how you can customize this built-in method, check out the documentation.
To take a look at the dataset and accompanying code, check it out on my GitHub or Kaggle.
Estimates of Variability
Location is not the only summarizing statistic of a feature. For that reason, we are also interested in estimates of variability. These estimates represent the amount of dispersion in the data; in other words, they measure whether the data points are tightly clustered or spread out. Variability is measured relative to an estimate of location: the differences between the observed values and that estimate are called errors, deviations, or residuals. We also refer to these estimates as “measures of dispersion.”
Variance & Standard Deviation
The most commonly used estimates are the variance and the standard deviation. The variance is the average of the squared distances from the mean, and the standard deviation is the square root of the variance.
The symbols used to represent the standard deviation and the variance are as follows:

σ – Standard Deviation
σ² – Variance

The equation to find the variance of a population goes as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}$$

where μ is the population mean and N is the number of observations.
When calculating the variance of a population, use the formula provided. However, if you’re estimating the population variance based on a sample, remember to adjust the denominator to N-1. This adjustment helps ensure an unbiased estimation that doesn’t underestimate the population variance.
Because the variance is a squared value, it is not on the same scale as the original data. The standard deviation, being the square root of the variance, is expressed in the same units as the original data, which makes it much easier to interpret and to compare directly against the observations.
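Here is a short sketch of computing both statistics in Python. Note the ddof argument: pandas defaults to the sample formula (divide by n - 1), while numpy defaults to the population formula (divide by N); the numbers reuse the earlier example set.

```python
import numpy as np
import pandas as pd

values = pd.Series([50, 72, 98, 92, 83])

print(values.var(), values.std())        # sample variance and std (ddof=1)
print(np.var(values), np.std(values))    # population variance and std (ddof=0)
print(values.var(ddof=0), np.var(values, ddof=1))  # overriding the defaults
```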
Percentiles & Quantiles
Another way to measure dispersion is through the use of quantiles and percentiles.
Quantiles
First, let's define a quantile. Quantiles are cut points that relate to the rank order of values in a distribution: they divide the sorted data into equal-sized, adjacent subgroups. In terms of a probability distribution, they divide it into areas of equal probability.
Percentiles
Another way to measure spread is through the use of percentiles (which are special cases of quantiles), which is based on looking at the dispersion of the sorted data. Statistics that rely on sorted data are order statistics.
The Pth percentile is the value such that at least P percent of the values are less than or equal to it, and (100 - P) percent of the values are greater than or equal to it. To help visualize how to find a percentile, note that the 50th percentile is also the median. So, we sort the data and find the value that is 50% of the way to the largest value. If we wanted to find the 75th percentile, we would find the value that is 75% of the way to the largest value in the sorted data set.
IQR: Interquartile Range
Commonly, we subtract the 25th percentile from the 75th percentile to get the interquartile range.
In python, numpy.quantile uses linear interpolation to compute the percentile. When using a data frame, pandas has some built in functionality to calculate these statistics as well. Below are some snippets of examples using these methods:
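For example, here is a minimal sketch of those methods, using a small made-up array for numpy and reusing the df_num frame from the describe() example above for pandas.

```python
import numpy as np

data = np.array([50, 72, 98, 92, 83])

q25, q75 = np.quantile(data, [0.25, 0.75])  # linear interpolation by default
print(q25, q75, q75 - q25)                  # 72.0 92.0 20.0 (the IQR)

# pandas equivalents on the numeric columns of the data frame
print(df_num.quantile([0.25, 0.5, 0.75]))              # quartiles for every numeric column
print(df_num.quantile(0.75) - df_num.quantile(0.25))   # IQR per column
```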
Data Distribution & Graphical Methods
The previously mentioned statistics help describe important features of the data, but we may also want to look at the distribution of the entire data set. There are several ways one can visualize the distribution.
Distribution Visualizations
Visualizations play a crucial role in representing descriptive statistics.
Histograms provide a graphical representation of the data’s distribution, while boxplots showcase the distribution’s quartiles, outliers, and overall spread.
When using histograms, always explore different bin widths. Below are histograms for the age distribution of the commonly used titanic data set. This is a good example showing that a bin width of 1 year is too small and a width of 15 years is too large, but a width of 3-5 years works well.
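Below is a hedged sketch of how those histograms could be produced, assuming seaborn's bundled copy of the titanic data set is available.

```python
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Compare a bin width that is too small, one that works well, and one that is too large
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, width in zip(axes, [1, 4, 15]):
    sns.histplot(titanic['age'].dropna(), binwidth=width, ax=ax)
    ax.set_title(f'Bin width = {width} years')
plt.tight_layout()
plt.show()
```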
Density plots
Histograms are popular because they are relatively easy to make. However, with the advanced computing resources that we have readily available, we also see the use of density plots.
To visualize the data distribution, we use a method called kernel density estimation. This involves drawing a smooth curve to estimate the shape of the data. An example is given below:
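A minimal sketch of a kernel density estimate for the same titanic age distribution, again assuming seaborn is available; the second plot layers the density curve on top of a histogram, an option discussed below.

```python
import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')
ages = titanic['age'].dropna()

# Smooth curve estimating the shape of the data (Gaussian kernel by default)
sns.kdeplot(ages, fill=True)
plt.title('Kernel density estimate of passenger age')
plt.show()

# Density curve drawn on top of a histogram
sns.histplot(ages, binwidth=4, kde=True)
plt.show()
```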
Density plots vs histograms
There is quite a bit of debate on whether density plots or histograms are better for visualizing distributions. Personally, I believe the use can vary by use case, and I typically use a density curve on top of the histogram if I can’t determine which might be better for the data at hand.
Now, there are some cases where I firmly believe that density plots are better than histograms, and that is in the case of showing multiple distributions. Multiple histograms tend to look a bit messy, and less interpretable. I think that Claus O. Wilke does a great job of explaining this in the book Fundamentals of Data Visualization.
Visualizing multiple distributions
Below are some examples of when to choose density plots over histograms, and even better uses for easier interpretability.
The continuous lines on the density plots help visually keep the distributions separate, avoiding the issue of the histograms seen in the images directly above.
An even better use of density plots to visualize and interpret the distribution is shown below.
To summarize, density plots work better than histograms when visualizing more than one distribution, as we saw above when comparing two, and kernel density plots remain the better choice for more than two distributions.
Visualizing relationships between two or more quantitative variables
Scatter plots illustrate the relationship between two variables. These visual tools enable us to interpret data patterns, identify outliers, and spot potential correlations quickly.
We can also show the comparison against each variable in a scatter plot matrix:
Sometimes we may want to visualize the correlation coefficients of the variables, and we do this via correlograms:
One can also encode the magnitude of the correlation with the size of the markers:
The color scale is the same between the two correlation matrices; however, the magnitude of the correlation is further encoded via the size of the bubbles, so that variables with little to no correlation are suppressed and high correlations stand out.
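Below is a hedged sketch of all three relationship plots on the numeric columns from earlier (df_num): a scatter plot matrix, a basic correlogram, and a bubble-style correlogram where the marker size encodes the magnitude of the correlation.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Scatter plot matrix of every pair of numeric variables
sns.pairplot(df_num)
plt.show()

# Correlogram: heatmap of the correlation matrix
corr = df_num.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

# Size-encoded correlogram: larger bubbles mean stronger correlation
cols = corr.columns
x, y = np.meshgrid(np.arange(len(cols)), np.arange(len(cols)))
plt.scatter(x.ravel(), y.ravel(), s=np.abs(corr.values).ravel() * 500,
            c=corr.values.ravel(), cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(np.arange(len(cols)), cols, rotation=45)
plt.yticks(np.arange(len(cols)), cols)
plt.colorbar(label='correlation')
plt.show()
```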
Binary & Categorical Data
Descriptive statistics are not limited to numerical data alone; they are also important for analyzing binary and categorical data. These data types are everywhere and are especially prevalent in the fields of marketing, healthcare, and the social sciences.
Binary data consists of two distinct categories, usually represented as 0 or 1, or “Yes” and “No.”
Categorical data consists of more than two groups that are not inherently ordered such as demographic information like gender or ethnicity.
A fundamental practice for analyzing binary and categorical data is to look at the frequency distribution of the categories. This shows the occurrences of each group and presents them in a table or chart. This helps identify which categories are the most common, rare, or if the data is imbalanced.
One common summary statistic used in analyzing these types of data is the mode. This value represents the category with the highest frequency and provides insight into the predominant category. We can use a bar chart to see which value has the highest frequency:
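A short sketch of this, using the categorical columns split off earlier (df_cat); the SPECIES column name is an assumption about the DDT data, so substitute whichever categorical column you have.

```python
import matplotlib.pyplot as plt

counts = df_cat['SPECIES'].value_counts()  # frequency of each category (column name assumed)
print(counts)
print(df_cat['SPECIES'].mode())            # the mode: category with the highest frequency

counts.plot(kind='bar')
plt.ylabel('Frequency')
plt.show()
```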
The technique of cross-tabulation, also called a contingency table, is valuable for analyzing the relationship between two categorical variables. This provides insight about associations, dependencies, patterns and relationships among the categorical variables. Below is a simple example of this kind of table:
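A minimal sketch of building such a table with pandas, again assuming hypothetical RIVER and SPECIES columns in the categorical frame.

```python
import pandas as pd

# Frequency of each species within each river (column names assumed)
table = pd.crosstab(df_cat['RIVER'], df_cat['SPECIES'])
print(table)
```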
Chi-Square
The chi-square test is a statistical method used to determine if there is a significant association between two categorical variables. This test is primarily used to examine whether two categorical variables are independent of each other. There is a second use for this test, called the chi-square goodness of fit test, but that is out of the scope of this post.
The chi-square test for independence compares two variables in a contingency table, which shows the frequency counts of the variables, to see if they are related; in other words, it tests whether the distributions of the categorical variables differ from one another.
The formula for the test statistic is:

$$\chi^2_c = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

The subscript c represents the degrees of freedom, O_i represents an observed frequency, and E_i is the corresponding expected frequency. There are several programs available to compute this statistic, since doing it by hand would become too tedious for large amounts of data.
It is important to note that the chi-square test has assumptions.
Chi-square Test Assumptions
These assumptions are: independence of the observations, and expected frequencies that are not too low (a common rule of thumb is an expected count of at least 5 in each cell). If these assumptions are not met, then this test may not be valid. It is also important to keep in mind that while the test does indicate a relationship, it does not indicate the magnitude of that relationship.
How to use the chi-square test
Below are the steps briefly summarized of how to use this test:
- Formulate hypotheses: state the null and the alternative hypothesis, which usually say that there is no relationship between the variables and that there is a relationship between the variables, respectively.
- Create contingency table: gather and organize the data into a contingency table
- Calculate expected frequencies
- Calculate the statistic using the formula above.
- Determine the degrees of freedom: these are determined by the formula: (rows – 1) x (columns – 1). This value is used to determine the critical value from the chi-square distribution.
- Find P-value (critical value): Using the calculated chi-square statistic and the degrees of freedom, find the critical value either from a chi-square distribution table or calculate the p-value associated with the chi-square statistic. The p-value represents the probability of obtaining a result as extreme as the one observed, assuming the null hypothesis is true.
- Compare this p-value to a significance level (usually 0.05, 0.01, or 0.001), which we call alpha, chosen based on the desired level of confidence. If the p-value is less than alpha, you reject the null hypothesis, indicating that there is a significant relationship between the variables. If it is greater than alpha, then we fail to reject the null hypothesis, indicating that there is not a significant relationship. A short code sketch of the whole procedure follows below.
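Here is a hedged sketch of the procedure with SciPy, run on the hypothetical RIVER-by-SPECIES contingency table built earlier; chi2_contingency returns the statistic, the p-value, the degrees of freedom, and the expected frequencies in one call.

```python
from scipy.stats import chi2_contingency

chi2, p_value, dof, expected = chi2_contingency(table)
print(f'chi-square = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}')

alpha = 0.05
if p_value < alpha:
    print('Reject the null hypothesis: the variables appear to be related.')
else:
    print('Fail to reject the null hypothesis: no significant relationship detected.')
```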
Visualizations
When looking at frequency distributions, the most commonly used charts are bar charts and pie charts. Bar charts present the frequency or proportion of each category as individual bars. Pie charts represent the categories as slices of a circle, like a pie. The size of each slice represents the proportion of the corresponding category. More recently, some people have been using donut charts in lieu of pie charts, either for design aesthetics or because of their opinion on pie chart vs. donut chart interpretability.
Pie Chart vs Donut Chart
Here are some examples of when to use a pie chart vs donut chart:
A pie chart is best used for representing parts of a whole relationship. Make sure the total sum of percentages equals 100% for the chart to make sense. Also, be sure to not have too many categories of data or it may be difficult to read.
A donut chart is like a pie chart, but with the center cut out. These charts are also used to show proportions of categories that make up the whole, and the center can also be used to show data. In my opinion, these are best to use to compare a handful of categories, and how they relate to the whole.
At the end of the day, if you are dealing with just a few categories, either chart will work well. My rule of thumb is that with 2-4 categories I tend to stick with a donut chart, and with more than 4 I may use a pie chart. It really comes down to aesthetics and how you would like to display the data.
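For reference, here is a small sketch of both chart types in matplotlib; the category labels and counts are made up for illustration, and the donut is just a pie chart with a hollowed-out center.

```python
import matplotlib.pyplot as plt

labels = ['A', 'B', 'C']
counts = [45, 35, 20]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: slices representing parts of a whole
ax1.pie(counts, labels=labels, autopct='%1.0f%%')
ax1.set_title('Pie chart')

# Donut chart: the same pie with narrow wedges, leaving the center open
ax2.pie(counts, labels=labels, autopct='%1.0f%%', wedgeprops={'width': 0.4})
ax2.set_title('Donut chart')

plt.show()
```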
Interpreting Descriptive Statistics
We have gone over a plethora of ways to digest descriptive statistics and their meanings, but interpreting descriptive statistics is not a one size fits all venture. It is crucial to consider the context of your analysis, the domain you are working in, and the specific objectives of the research and decision making.
As you look at measures of central tendency, dispersion, and the shape of the distribution, you would be doing so with the goal of identifying trends and patterns, making comparisons to identify unique characteristics or significant differences, and drawing conclusions.
Interpreting descriptive statistics is the cornerstone of extracting valuable insights from data. By comprehending measures of central tendency, dispersion, distribution shape, and other statistical tools, you can transform data summaries into actionable knowledge. Whether you’re conducting research, analyzing business trends, or making data-driven decisions, understanding how to interpret descriptive statistics is essential.
Limitations of Descriptive Statistics
Lack of inferential capabilities and sensitivity to outliers
While descriptive statistics provides valuable insights, it’s essential to acknowledge its limitations. Descriptive statistics alone cannot establish causation or make inferences about a larger population. It describes the observed data without venturing into statistical inference or hypothesis testing. For that, inferential statistics techniques are needed. Here are some other points to consider:
- Simplification of data: complex data is summarized into a few key metrics like the measures of location and dispersion. While these metrics provide a snapshot of the data, they can also oversimplify the underlying patterns in the dataset. This oversimplification can also lead to a loss of detail such as hiding extreme values and outliers, leading to an incomplete understanding of the dataset.
- Dependence on distribution: Typically, we assume the data follows a certain distribution, such as the normal distribution. If this assumption is not met, then we may get an inaccurate picture of the dataset.
- No causation inference: Descriptive statistics can provide insights into correlations and associations between variables, but they cannot establish causation.
- Contextual Interpretation Required: Descriptive statistics alone may not provide a complete picture. To derive meaningful insights, it’s essential to interpret the results in the context of the research question, domain knowledge, and the specific dataset.
Wrap Up
Descriptive statistics acts as a guiding light in the world of data analysis, allowing us to extract meaningful information and make informed decisions based on evidence. By harnessing measures of central tendency, dispersion, and visualization techniques, we can uncover patterns, summarize data, and gain a comprehensive understanding of its characteristics. Embracing descriptive statistics equips us with a solid foundation for further statistical analysis and empowers us to extract valuable insights from data.
How do you like to use descriptive statistics when analyzing data? Perhaps you can think of something not mentioned here!
References
- Wilke, C. O. (2020). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O'Reilly Media.
- Pie chart and donut chart images: https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart