Summary Statistics: The Essential Methods for Describing Data

In the vast realm of data analysis, descriptive statistics are a powerful tool for uncovering meaningful insights and summarizing key characteristics of a dataset. Whether you’re a data enthusiast or a professional in any field, understanding descriptive statistics is essential for making informed decisions. In this blog post, we’ll unravel the essence of descriptive statistics and explore its significance in extracting valuable information from data.

What are descriptive statistics?

Given a data set, how can we make sense of the information it contains? In other words, how can we organize and summarize the data in a meaningful way?

Descriptive statistics refers to the collection of methods used to organize, summarize, and present data. It provides a concise and informative summary of the main features, patterns, and trends within a dataset. Summarizing the data set is also a good first step when doing exploratory data analysis (EDA). By utilizing various measures, descriptive statistics allows us to gain a comprehensive understanding of the data’s central tendency, variability, and shape.

In this post, we will go over some basic statistical tools to summarize and describe the data such as tables, graphs and charts.

Definitions

First, let's get some definitions out of the way:

  • Deviation – the difference between an observed value and the estimate of location.
  • Interquartile range – the difference between the 75th percentile and the 25th percentile.
  • Mean – the sum of all values divided by the number of values; also called the average.
  • Mean absolute deviation – the mean of the absolute values of the deviations from the mean.
  • Median – the value such that one-half of the data lies above it and one-half lies below it.
    • Another name for this value is the 50th percentile.
    • The Pth percentile is the value such that P percent of the data lies below it. Another term for a percentile is a quantile.
  • Median absolute deviation from the median – the median of the absolute values of the deviations from the median.
  • Order statistics – metrics based on the data values sorted from smallest to largest.
  • Range – the difference between the largest and the smallest value in the data set.
  • Robust – not sensitive to extreme values.
    • Another term for an extreme value is an outlier.
    • An outlier is a data value that is very different from most of the data.
  • Standard deviation – the square root of the variance.
  • Trimmed mean, or truncated mean – the average of all values after dropping a fixed number of extreme values.
  • Variance – the sum of squared deviations from the mean divided by n − 1, where n is the number of observations.
  • Weighted mean – the sum of each value multiplied by its weight, divided by the sum of the weights; just as the mean is also known as the average, this is sometimes called the weighted average.
  • Weighted median – the value such that one-half of the total weight lies above and one-half lies below it in the sorted data.

Using Descriptive Statistics to Summarize Data

Univariate analysis

Univariate analysis is a statistical method to analyze a single variable or data set. It focuses on describing and summarizing the characteristics of that particular variable. The main purpose of univariate analysis is to understand the distribution, central tendency (mean, median, mode), dispersion (variance, standard deviation), and other relevant statistical measures of the variable under consideration. Common graphical representations used in univariate analysis include histograms, box plots, and bar charts.

Bivariate analysis

Bivariate analysis involves the examination of the relationship between two variables simultaneously. It aims to determine whether there is a statistical association, correlation, or dependency between the two variables. Bivariate analysis helps in understanding how changes in one variable are related to changes in another. The analysis uses different statistical techniques such as correlation analysis, scatter plots, contingency tables, and cross-tabulations. It provides insights into the strength, direction, and nature of the relationship between the variables.

Multivariate analysis

Multivariate analysis is an extension of bivariate analysis and involves the simultaneous analysis of three or more variables. It explores complex relationships among multiple variables to gain a comprehensive understanding of the data. Multivariate analysis allows researchers to investigate the interdependencies, interactions, and patterns within a dataset. It utilizes advanced statistical methods such as multiple regression, factor analysis, cluster analysis, and principal component analysis. By examining multiple variables together, multivariate analysis enables researchers to uncover hidden patterns, identify underlying factors, and make predictions or classifications based on the data.

Estimates of Location

We often use the term estimate to refer to a value calculated from the data set to highlight a distinction between the data and the theoretical or true value. In business, you may have heard this referred to as a metric. When we talk about looking at estimates of location, we really mean we want to see where the majority of the data is located. We also call this “measures of central tendency.”

When we discuss estimates of location, a first thought is to take the MEAN.

Mean

The mean is the average. To calculate the mean, one adds up every observation, then divides by the number of observations.

For example, take the set of numbers {50, 72, 98, 92, 83}. To calculate the mean, we add 50 + 72 + 98 + 92 + 83, which equals 395, and divide by 5, since there are 5 values: 395 / 5 = 79. So the average value of that set of numbers is 79.
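As a quick sanity check, here is the same calculation in Python (a trivial sketch of the arithmetic above):

scores = [50, 72, 98, 92, 83]
mean = sum(scores) / len(scores)  # 395 / 5
print(mean)  # 79.0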

You may see the mathematical notation of the mean, which is below:

x̄ = (x₁ + x₂ + … + xₙ) / n

and we pronounce the variable to the left of the equal sign as “X bar”.

In data science, we use N (or n) to refer to the total number of observations. In the simple example of the set of numbers, each number is an observation. In statistics, however, this notation carries a bit more weight than it usually does in data science: a capital N refers to a population, while a lowercase n refers to a sample FROM a population. In data science, you may see the two used interchangeably.

Other Forms of the Mean

Sometimes using the familiar mean is not preferable, such as in the sport of diving.

Trimmed Mean

There are usually five judges; the top score and the bottom score are dropped, and the other three scores are averaged to create the final score. We call this a TRIMMED MEAN. It is more resistant to outliers than the ordinary mean, and it prevents a single judge from skewing the score to favor their own team or country.
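SciPy provides a trimmed mean out of the box. A minimal sketch with five hypothetical judges' scores (proportiontocut=0.2 drops one score from each end when there are five values):

from scipy.stats import trim_mean

judge_scores = [7.0, 7.5, 8.0, 8.5, 9.5]  # hypothetical scores from five judges

# Drop the lowest and the highest score, then average the remaining three
print(trim_mean(judge_scores, proportiontocut=0.2))  # (7.5 + 8.0 + 8.5) / 3 = 8.0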

Weighted Mean

Another type of mean is the WEIGHTED MEAN. Sometimes, the data collected does not provide an accurate representation of the measured groups, such as in a survey that may not accurately represent the user base of interest. To rectify this, we use the weighted mean, and give higher weight to the values of the groups that are underrepresented.

Another example is if we are taking measurements and averages of data from sensors, and some of the sensors we know may not be accurate, then we give a lower weight to the measurements from the less accurate sensors.

How to Calculate

To calculate the weighted mean, you multiply each value xi by its specified weight wi, sum those products, and divide by the sum of the weights.

As an example, say your professor administers four exams during the semester, but the last exam happened to be easier than the ones before it. The professor decides to give it less weight, so the weights for the exams go from 25% each, to 30% for the first three, and 10% for the last exam. Let’s say you scored 72, 81, 84, and 98 on each exam. Then the weighted average for the class is calculated as:

.3 * (72)= 21.6
.3 * (81)= 24.3
.3 * (84)= 25.2
.1 * (98)= 9.8
TOTAL (21.6 + 24.3 + 25.2 + 9.8)= 80.9

Then, divide by the sum of each weight, which adds to 1, so 80.9 / 1 = 80.9.

But if the weights don’t add to 1 (say the last exam is only worth 5%, so the weights sum to 0.95), then the last term becomes:

0.05 * (98)= 4.9

and the total is 21.6 + 24.3 + 25.2 + 4.9 = 76.0. Dividing by the sum of the weights gives 76.0 / 0.95 = 80. So 80 is the weighted average.
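The same calculation can be done with NumPy, whose np.average accepts a weights argument and normalizes by the sum of the weights for us. A small sketch using the exam scores from the example above:

import numpy as np

scores = [72, 81, 84, 98]

# Weights that sum to 1
print(np.average(scores, weights=[0.30, 0.30, 0.30, 0.10]))  # 80.9

# Weights that sum to 0.95; np.average still divides by their sum
print(np.average(scores, weights=[0.30, 0.30, 0.30, 0.05]))  # 80.0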

Be aware that this mean is NOT robust to extreme values, and is not always a good statistic to rely on. So, while the weighted mean is easy to compute and practical, it is not always the best value to use.

Median and Other Robust Estimates

The median is the middle number on a sorted list of the data. It depends only on the values at the center of the data, while the mean depends on all of the observations. While depending only on the center values might seem like a disadvantage, there are many situations where the median is the better estimate of location, since the mean is much more sensitive to extreme values.

Weighted Median

And just like one could use the weighted mean, there is also a WEIGHTED MEDIAN, and it is also robust against outliers. The weighted median is a value such that the sum of the weights is equal for the lower and upper halves of the sorted list.

To find this value, first sort the data in ascending order, then find the cumulative sum of the weights. Then, find the value associated with the weight whose cumulative sum crosses 50% of the total sum of weights.

Say we are given a set S with elements {4, 5, 7, 11} and weights {1, 2, 3, 5}. The total weight is 1 + 2 + 3 + 5 = 11, and the cumulative weight up through the value 7 is (1 + 2 + 3) / 11 = 0.5455, so the weighted median is 7 because it is the first value whose cumulative share of the weight crosses 50%.

Weights          Sum of weights    Cumulative sum / total sum of weights
1 + 2            3                 3 / 11 = 0.2727
1 + 2 + 3        6                 6 / 11 = 0.5455
1 + 2 + 3 + 5    11                11 / 11 = 1.00
Table of cumulative weights used to find the weighted median

Because the cumulative weight of 6 divided by the total weight of 11 equals 0.5455, which is the first share to cross 50%, the value corresponding to that weight, 7, is the weighted median.
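NumPy has no built-in weighted median, but a small sketch of the cumulative-weight approach described above could look like this (an illustrative helper, not a standard library function):

import numpy as np

def weighted_median(values, weights):
    # Sort the values (and their weights) in ascending order
    order = np.argsort(values)
    values = np.asarray(values)[order]
    weights = np.asarray(weights)[order]
    # Return the first value whose cumulative weight reaches half the total weight
    cum_weights = np.cumsum(weights)
    return values[np.searchsorted(cum_weights, weights.sum() / 2.0)]

print(weighted_median([4, 5, 7, 11], [1, 2, 3, 5]))  # 7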

Putting these together

These measures provide a snapshot of where the data tends to cluster, helping us gain insights into its overall behavior.

So, putting the mean, median, and other estimates together in a more realistic example, we have a dataset collected by the U.S. Army Corps of Engineers of fish contaminated by toxic waste from a chemical plant located on the banks of a river in Alabama. Because of the river’s location, the contamination could spread over a wide area to other animals that prey on the fish.

To view these estimates in a single view in Python, we can use the built-in pandas method .describe():

import pandas as pd

df = pd.read_excel(r'\...\DDT.xls')

# separate into numeric and categorical columns
df_num = df.select_dtypes(include='number')
df_cat = df.select_dtypes(include='object')

df_num.describe()
Output of df_num.describe(), showing the descriptive statistics for the numeric columns of the dataset

Here, we get a view of the count, mean, standard deviation, minimum value, maximum value, and different percentiles, such as the 50th percentile, which is also the median. To look at how you can customize this built-in method, check out the documentation.

To take a look at the dataset and accompanying code, check it out on my GitHub or Kaggle.

Estimates of Variability

Location is not the only summary statistic of a feature. We are also interested in estimates of variability, which measure the amount of dispersion in the data, in other words, whether the data points are tightly clustered or spread out. These estimates are based on the differences between the observed values and an estimate of location; we call these differences errors, deviations, or residuals. Estimates of variability are also referred to as “measures of dispersion.”

Graphic representation of small and large dispersion

Variance & Standard Deviation

The most commonly used estimates are the variance and the standard deviation. The variance is the average of the squared distances from the mean, and the standard deviation is the square root of the variance.

The symbols used to represent the variance and standard deviation are as follows:

          σ² – Variance
          σ – Standard Deviation

σ = √σ², that is, the standard deviation is equal to the square root of the variance.

The equation to find the variance in a population goes as follows:

σ² = Σ(xᵢ − μ)² / N

where μ is the population mean and N is the number of observations in the population.

When calculating the variance of a population, use the formula above. However, if you are estimating the population variance from a sample, remember to change the denominator to n − 1. This adjustment helps ensure an unbiased estimate that does not underestimate the population variance.

Because the variance is a squared value, it is not on the same scale as the original data. The standard deviation, being the square root of the variance, is expressed in the same units as the data, which makes it much easier to interpret and to compare directly against the observations themselves.
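As a small sketch of how the two are computed in Python, note that pandas defaults to the sample (n − 1) denominator while NumPy defaults to the population (n) denominator; the numbers here are just the toy set from earlier:

import numpy as np
import pandas as pd

data = pd.Series([50, 72, 98, 92, 83])

# pandas: sample variance and standard deviation (denominator n - 1)
print(data.var(), data.std())

# NumPy: population versions by default; pass ddof=1 for the sample versions
print(np.var(data), np.std(data, ddof=1))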

Percentiles & Quantiles

Another way to measure dispersion is through the use of quantiles and percentiles.

Quantiles

First, let’s define a quantile. Quantiles are cut points that relate to the rank order of values in a distribution: they divide the sorted data into equal-sized, adjacent subgroups. In other words, they divide a probability distribution into areas of equal probability.

Percentiles

Another way to measure spread is through the use of percentiles (which are special cases of quantiles), which are based on looking at the dispersion of the sorted data. Statistics that rely on sorted data are called order statistics.

The Pth percentile is the value such that at least P percent of the values are less than or equal to it, and at least (100 − P) percent are greater than or equal to it. To help visualize how to find a percentile, note that the 50th percentile is the median: we sort the data and find the value that is 50% of the way to the largest value. If we wanted the 75th percentile, we would find the value that is 75% of the way to the largest value in the sorted data set.

IQR: Interquartile Range

Commonly, we subtract the 25th percentile from the 75th percentile to get the interquartile range (IQR).

In Python, numpy.quantile uses linear interpolation by default to compute quantiles. When working with a data frame, pandas has built-in functionality to calculate these statistics as well. Below are some snippets of examples using these methods:

Pandas .describe() method to look at various estimates across the numerical columns in a dataset
Pandas .quantile() method to look at the quantile of 0.2 across the data set columns
Numpy.quantile() method to look at the quantile of 0.5 on a specific column of the dataset that has been converted to a numpy array
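For reference, a rough sketch of the calls behind those screenshots (using the df_num dataframe from earlier; the single column pulled out for numpy.quantile is a hypothetical name):

import numpy as np

# Summary statistics across the numeric columns
df_num.describe()

# The 0.2 quantile (20th percentile) of every numeric column
df_num.quantile(0.2)

# The 0.5 quantile (the median) of one column converted to a NumPy array
np.quantile(df_num['length'].to_numpy(), 0.5)

# Interquartile range: subtract the 25th percentile from the 75th
iqr = df_num.quantile(0.75) - df_num.quantile(0.25)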

Data Distribution & Graphical Methods

The previously mentioned statistics help describe important features of the data, but we may also want to look at the distribution of the entire data set. There are several ways one can visualize the distribution.

Distribution Visualizations

Visualizations play a crucial role in representing descriptive statistics.

Histograms provide a graphical representation of the data’s distribution, while boxplots showcase the distribution’s quartiles, outliers, and overall spread.

Histograms representing the distribution of each feature in the data set
Box plots visualizing the quantiles and outliers for each feature

When using histograms, always explore different bin widths. Below are histograms for the age distribution of the commonly used Titanic data set. This is a good example of how a bin width of 1 year is too small and a width of 15 years is too large, while widths of 3-5 years work well.

Histograms of varying bin width
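A sketch of how histograms like these can be drawn with seaborn (the 'titanic' sample dataset ships with seaborn; the bin widths are the ones discussed above):

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Same data, three different bin widths
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, width in zip(axes, [1, 4, 15]):
    sns.histplot(data=titanic, x='age', binwidth=width, ax=ax)
    ax.set_title(f'bin width = {width} years')
plt.show()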

Density plots

Histograms are popular because they are relatively easy to make. However, with the advanced computing resources that we have readily available, we also see the use of density plots.

To visualize the data distribution, we use a method called kernel density estimation. This involves drawing a smooth curve to estimate the shape of the data. An example is given below:

Kernel density estimation plot
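A minimal kernel density estimate in seaborn might look like this (again using the Titanic ages as a stand-in for the data in the figure):

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')

# Draw a smooth curve estimating the shape of the age distribution
sns.kdeplot(data=titanic, x='age', fill=True)
plt.show()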

Density plots vs histograms

There is quite a bit of debate on whether density plots or histograms are better for visualizing distributions. Personally, I believe the choice varies by use case, and I typically overlay a density curve on top of the histogram when I can’t decide which is better for the data at hand.

Histogram with a density curve overlaid, showing a bimodal distribution

Now, there are some cases where I firmly believe that density plots are better than histograms, and that is in the case of showing multiple distributions. Multiple histograms tend to look a bit messy, and less interpretable. I think that Claus O. Wilke does a great job of explaining this in the book Fundamentals of Data Visualization.

Visualizing multiple distributions

Below are some examples of when to choose density plots over histograms, and even better uses for easier interpretability.

Stacked bars / overlapping histograms, example of poor visualization
This is labeled as bad since this could easily be misinterpreted as a stacked bar plot instead of overlapping histograms
Example of bad use of histograms
This is also bad because it is hard to tell if all blue bars start at a count of 0.

The continuous lines on the density plots help visually keep the distributions separate, avoiding the issue of the histograms seen in the images directly above.

Example of density plots
Density estimates of male and female passengers on the titanic. The density curves were scaled such that the area under each curve corresponds to the total number of male and female passengers with known age

An even better use of density plots to visualize and interpret the distribution is shown below.

Density plots of gender compared to all
Male and female distributions compared to the distribution of the total overall distribution

To summarize, density plots will work better when visualizing more than one distribution, like we saw above when comparing two. Kernel density plots are also better for more than two distributions.

KDE plots for multiple distributions
Density estimates of the butter fat percentage in the milk of four different cattle breeds.
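A rough sketch of overlapping density curves with seaborn (using the Titanic ages split by sex as a stand-in, since the cattle-breed data is not part of this post's dataset):

import matplotlib.pyplot as plt
import seaborn as sns

titanic = sns.load_dataset('titanic')

# One density curve per group; common_norm=False scales each curve to its own group
sns.kdeplot(data=titanic, x='age', hue='sex', fill=True, common_norm=False, alpha=0.4)
plt.show()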

Visualizing relationships between two or more quantitative variables

Scatter plots illustrate the relationship between two variables. These visual tools enable us to interpret data patterns, identify outliers, and spot potential correlations quickly.

Scatter plots of each explanatory variable against the response variable, colored by the categorical variable
Scatter plots illustrate the relationship between different features. Here, three different length features, width, and height are plotted against the Weight feature, showing a non-linear relationship.

We can also show the comparison against each variable in a scatter plot matrix:

Scatter plot matrix / Correlogram

Sometimes we may want to visualize the correlation coefficients of the variables, and we do this via correlograms:

Correlations in mineral content for samples of glass fragments

One could also encode the magnitude of the correlation with size:

Correlations in mineral content for samples of glass fragments, shown as a bubble chart

The color scale is the same between the two correlation matrices; however, the magnitude of the correlation is additionally encoded in the size of the bubbles, so that variables with little to no correlation are suppressed and high correlations stand out.
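As a rough sketch, a basic correlogram and scatter plot matrix can be drawn with seaborn from a correlation matrix (here using the numeric DDT columns from earlier as a stand-in for the glass-fragment data):

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations between the numeric columns, drawn as a heatmap
corr = df_num.corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.show()

# Scatter plot matrix of the same columns
sns.pairplot(df_num)
plt.show()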

Binary & Categorical Data

Descriptive statistics are not limited to numerical data alone. They are also important for analyzing binary and categorical data. These data types are everywhere and are especially prevalent in the fields of marketing, healthcare, and social sciences.

Binary data consists of two distinct categories, usually represented as 0 or 1, or “Yes” and “No.”

Categorical data consists of more than two groups that are not inherently ordered such as demographic information like gender or ethnicity.

A fundamental practice for analyzing binary and categorical data is to look at the frequency distribution of the categories. This shows the occurrences of each group and presents them in a table or chart. This helps identify which categories are the most common, rare, or if the data is imbalanced.

Seaborn.countplot() of the number of medals by gender for each medal type

One common summary statistic used in analyzing these types of data is the mode, the category with the highest frequency. It provides insight into the predominant category. We can use a bar chart to see which value occurs most often:

Bar plot of species of fish in the DDT contamination data set; catfish is the most common species observed in the data.
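A sketch of the frequency table, mode, and bar chart with pandas and seaborn (the column name 'species' is assumed for illustration, along with the df and df_cat dataframes from earlier):

import seaborn as sns

# Frequency of each category; the first entry is the most common (the mode)
print(df_cat['species'].value_counts())
print(df_cat['species'].mode())

# Bar chart of the category frequencies
sns.countplot(data=df, x='species')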

The technique of cross-tabulation, also called a contingency table, is valuable for analyzing the relationship between two categorical variables. This provides insight about associations, dependencies, patterns and relationships among the categorical variables. Below is a simple example of this kind of table:

Simple cross-tabulation example
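In pandas, a contingency table can be built with pd.crosstab. A minimal sketch with a small, made-up survey (the column names and values are hypothetical):

import pandas as pd

# Hypothetical survey responses, just to illustrate the shape of a contingency table
survey = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'F', 'M'],
    'response': ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No'],
})
table = pd.crosstab(survey['gender'], survey['response'])
print(table)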

Chi-Square

The chi-square test is a statistical method used to determine whether there is a significant association between two categorical variables; in other words, it examines whether the two variables are independent of each other. There is a second use for this test, the chi-square goodness-of-fit test, but that is out of the scope of this post.

The chi-square test for independence compares two variables in a contingency table, which shows the frequency counts of the variables, to see if they are related. In other words, it tests whether the distributions of the categorical variables differ from each other.

χ²c = Σ (Oi − Ei)² / Ei

The subscript c represents the degrees of freedom. The O represents the observed value, and E is the expected value. There are several programs available to compute this statistic, since doing it by hand would become too tedious for large amounts of data.

It is important to note that the chi-square test has assumptions.

Chi-square Test Assumptions

These assumptions are: independence of the observations and the expected frequencies not being too low. If these assumptions are not met, then this test may not be valid. It is also important to keep in mind that while it does indicate a relationship, it does not indicate the magnitude of that relationship.

How to use the chi-square test

Below is a brief summary of the steps for using this test, followed by a short code sketch:

  1. Formulate hypotheses: state the null and the alternative hypothesis; the null usually states that there is no relationship between the variables, and the alternative that there is a relationship.
  2. Create contingency table: gather and organize the data into a contingency table
  3. Calculate expected frequencies
  4. Calculate the statistic using the formula above.
  5. Determine the degrees of freedom: these are determined by the formula: (rows – 1) x (columns – 1). This value is used to determine the critical value from the chi-square distribution.
  6. Find P-value (critical value): Using the calculated chi-square statistic and the degrees of freedom, find the critical value either from a chi-square distribution table or calculate the p-value associated with the chi-square statistic. The p-value represents the probability of obtaining a result as extreme as the one observed, assuming the null hypothesis is true.
  7. Compare this p-value to a significance level (usually 0.05, 0.01, or 0.001), which we call alpha and which is based on the desired level of confidence. If the p-value is less than alpha, you reject the null hypothesis, indicating that there is a significant relationship between the variables. If it is greater than alpha, then we fail to reject the null hypothesis, indicating that there is not a significant relationship.
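Putting these steps together, a minimal sketch with SciPy, which computes the expected frequencies, the statistic, the degrees of freedom, and the p-value in one call (the contingency table here is hypothetical):

from scipy.stats import chi2_contingency
import pandas as pd

# Hypothetical contingency table (rows: gender, columns: response)
table = pd.DataFrame({'Yes': [30, 45], 'No': [20, 5]}, index=['F', 'M'])

chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
if p_value < alpha:
    print(f'chi2={chi2:.2f}, p={p_value:.4f}: reject the null hypothesis')
else:
    print(f'chi2={chi2:.2f}, p={p_value:.4f}: fail to reject the null hypothesis')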

Visualizations

When looking at frequency distributions, the most commonly used charts are bar charts and pie charts. Bar charts present the frequency or proportion of each category as individual bars. Pie charts represent the categories as slices of a circle, like a pie, where the size of each slice represents the proportion of the corresponding category. More recently, some people have been using donut charts in lieu of pie charts, either for design aesthetics or because of their opinion on pie chart vs. donut chart interpretability.

Pie Chart vs Donut Chart

Here are some examples of when to use a pie chart vs donut chart:

A pie chart is best used for representing parts of a whole relationship. Make sure the total sum of percentages equals 100% for the chart to make sense. Also, be sure to not have too many categories of data or it may be difficult to read.

Example of pie chart use
image from beautiful.ai

A donut chart is like a pie chart, but with the center cut out. These charts are also used to show proportions of categories that make up the whole, and the center can also be used to show data. In my opinion, these are best to use to compare a handful of categories, and how they relate to the whole.

Example of donut chart use
image from beautiful.ai

At the end of the day, if you are dealing with just a few categories, either chart will work well. My rule of thumb is that with 2-4 categories I tend to stick with a donut chart, and with more than 4 I may use a pie chart. It really comes down to aesthetics and how you would like to display the data.
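A small matplotlib sketch of both chart types (the categories and values are made up for illustration; a donut is just a pie with the wedge width reduced):

import matplotlib.pyplot as plt

labels = ['A', 'B', 'C', 'D']
sizes = [40, 30, 20, 10]  # percentages that sum to 100

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Pie chart: plain wedges
ax1.pie(sizes, labels=labels, autopct='%1.0f%%')
ax1.set_title('Pie chart')

# Donut chart: same call, but shrink the wedge width to cut out the center
ax2.pie(sizes, labels=labels, autopct='%1.0f%%', wedgeprops=dict(width=0.4))
ax2.set_title('Donut chart')

plt.show()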

Interpreting Descriptive Statistics

We have gone over a plethora of ways to digest descriptive statistics and their meanings, but interpreting descriptive statistics is not a one-size-fits-all venture. It is crucial to consider the context of your analysis, the domain you are working in, and the specific objectives of the research and decision making.

As you look at measures of central tendency, dispersion, and the shape of the distribution, you would be doing so with the goal of identifying trends and patterns, making comparisons to identify unique characteristics or significant differences, and drawing conclusions.

Interpreting descriptive statistics is the cornerstone of extracting valuable insights from data. By comprehending measures of central tendency, dispersion, distribution shape, and other statistical tools, you can transform data summaries into actionable knowledge. Whether you’re conducting research, analyzing business trends, or making data-driven decisions, understanding how to interpret descriptive statistics is essential.

Limitations of Descriptive Statistics

Lack of inferential capabilities and sensitivity to outliers

While descriptive statistics provides valuable insights, it’s essential to acknowledge its limitations. Descriptive statistics alone cannot establish causation or make inferences about a larger population. It describes the observed data without venturing into statistical inference or hypothesis testing. For that, inferential statistics techniques are needed. Here are some other points to consider:

  • Simplification of data: complex data is summarized into a few key metrics like the measures of location and dispersion. While these metrics provide a snapshot of the data, they can also oversimplify the underlying patterns in the dataset. This oversimplification can also lead to a loss of detail such as hiding extreme values and outliers, leading to an incomplete understanding of the dataset.
  • Dependence on distribution: Typically, we assume the data follows a certain distribution, such as the normal distribution. If this assumption is not met, then we may get an inaccurate picture of the dataset.
  • No causation inference: Descriptive statistics can provide insights into correlations and associations between variables, but they cannot establish causation.
  • Contextual Interpretation Required: Descriptive statistics alone may not provide a complete picture. To derive meaningful insights, it’s essential to interpret the results in the context of the research question, domain knowledge, and the specific dataset.

Wrap Up

Descriptive statistics acts as a guiding light in the world of data analysis, allowing us to extract meaningful information and make informed decisions based on evidence. By harnessing measures of central tendency, dispersion, and visualization techniques, we can uncover patterns, summarize data, and gain a comprehensive understanding of its characteristics. Embracing descriptive statistics equips us with a solid foundation for further statistical analysis and empowers us to extract valuable insights from data.

How do you like to use descriptive statistics when analyzing data? Perhaps you can think of something not mentioned here!

References

  • Wilke, C. O. (2020). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.
  • Pie chart and donut chart images: https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart