The Power of Statistics and Probability in Data Science: A Gentle Intro to Statistics & Probability

Statistics is defined as the science of data. It is concerned with classifying, analyzing, summarizing and presenting data. In other words, it is a mathematically based field which involves collecting and interpreting quantitative data.

There are many different definitions of statistics out there, but they all agree on some common elements: they each imply that data is collected and that statistics is a theory of information, with inference as the objective.

Before You Start…

What should you know before reading this? Nothing! This is only the beginning! All that is “required” is a desire to learn more about the endlessly fascinating world of mathematics and statistics.

A basic statistics and probability course can go a long way in helping you with your statistics learning journey. However, it isn’t always applicable or accessible to every person out there. Whether you’re here for a quick refresher or are just starting out, an introduction to the basics is a great place to start.

Why is statistics important and why should you learn it?

With technology becoming an ever more integral part of our daily lives, more data is generated every day than ever before. Data is everywhere in the world around us: it is captured from stock trades, sensors, IoT devices, and more. And with more data generated every day, professions that work with this data are also growing at a rapid rate. These include:

  1. Data scientists
  2. Data analysts
  3. Statisticians
  4. Mathematicians
  5. Data engineers
  6. Computer scientists
  7. Data architects
  8. Machine learning engineers

There are several more not mentioned here, and new positions are created as new ways of utilizing data emerge.

Statistics is a field that can help us understand how to use data to do several things, such as:

  • To make decisions using data
  • To gain a better understanding of the world around us
  • To make predictions about the future, and to understand the models used in these predictions
  • To be cautious of misleading information
  • To understand the statistics reported in research and studies
  • To avoid improperly generalizing conclusions from a study to a larger population

The two most common types of statistics

There are two main branches of statistics: descriptive and inferential. Classical statistics focused almost entirely on inference, which consists of methods for quantifying properties of a population and drawing conclusions about it from a sample.

Descriptive statistics

Descriptive statistics refers to methods that aim to describe raw data using summary statistics, visualizations and frequency tables in order to understand the data.

Summary statistics include measures of central tendency like the mean and median, and measures of dispersion like the variance and standard deviation. Some of the visualizations one could use include box plots, scatter plots and histograms.
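As a quick sketch, these summary statistics can be computed with Python’s built-in statistics module (the visit counts below are made up purely for illustration):

```python
import statistics

# Hypothetical sample: daily visit counts at a library
visits = [12, 15, 9, 22, 15, 30, 18]

mean = statistics.mean(visits)          # measure of central tendency
median = statistics.median(visits)      # middle value of the sorted data
variance = statistics.variance(visits)  # sample variance (dispersion)
std_dev = statistics.stdev(visits)      # sample standard deviation

print(mean, median, variance, std_dev)
```

Note that the standard deviation is simply the square root of the variance, which is why the two are usually reported together.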

An example use case of descriptive statistics: suppose we want to understand the number of visits to a library in a given year. We could look into the average age of visitors, or other patterns within different age groups.

Another example scenario: we want to understand the different kinds of fish at a fish market by looking into characteristics such as weight and length. Using this data set from Kaggle, we explore some descriptive statistics with the code below:

import pandas as pd

df = pd.read_csv(r'C:\...\Fish.csv')
df.head(3)
Out:
Species Weight Length1 Length2 Length3 Height Width
0 Bream 242.0 23.2 25.4 30.0 11.5200 4.0200
1 Bream 290.0 24.0 26.3 31.2 12.4800 4.3056
2 Bream 340.0 23.9 26.5 31.1 12.3778 4.6961
df = df.rename({'Length1':'Vertical_Length', 'Length2':'Diagonal_Length', 'Length3':'Cross_Length'}, axis='columns')
df.describe()
Out:

We use df.head(3) to view the first few rows of the data set. I didn’t like the ambiguous column names for the lengths, so I renamed them according to the data set documentation on Kaggle.

We then use df.describe() to see different summary statistics for each numeric variable. This pandas method is very versatile, and you can include or exclude a variety of summary statistics and column types. View the pandas documentation for further exploration.
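As a small illustration of that versatility, describe() can be pointed at different column types via its include parameter. The tiny frame below is a made-up stand-in for the full Fish.csv:

```python
import pandas as pd

# A tiny stand-in for the fish data (values are illustrative only)
df = pd.DataFrame({
    'Species': ['Bream', 'Bream', 'Perch'],
    'Weight': [242.0, 290.0, 340.0],
})

print(df.describe())                  # numeric columns only, by default
print(df.describe(include='object'))  # only the categorical Species column
print(df.describe(include='all'))     # numeric and categorical together
```

With include='all', categorical columns report counts and most frequent values while numeric columns keep their mean, quartiles, and so on.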

View the completed notebook on my GitHub!

Inferential statistics

Common forms of inferential statistics you may have heard of include: hypothesis tests, regression, and confidence intervals.

For a use case of inferential statistics, suppose we want to know the preferred presidential candidate of people within a country before an election. It is very difficult to survey every single person in the country, so we would survey a smaller group of people (e.g. 1,000, 10,000 or 1,000,000 people) and use their answers to draw conclusions about the population as a whole.
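As a sketch of how such a survey turns into an inference, here is a normal-approximation confidence interval for a proportion; the poll numbers are hypothetical:

```python
import math

# Hypothetical poll: 540 of 1,000 respondents prefer candidate A
n, preferred = 1000, 540
p_hat = preferred / n  # sample proportion

# 95% confidence interval via the normal approximation (z is about 1.96)
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"Estimated support: {p_hat:.1%}, 95% CI: ({lower:.1%}, {upper:.1%})")
```

Under these assumptions the interval works out to roughly 51% to 57%, so the sample suggests majority support for candidate A in the wider population.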

The main difference between the two is that descriptive statistics are used to describe a data set, while inferential statistics are used to make inferences about a larger population from a sample.

Probability – what is it and how is it different from statistics?

I can think of several instances when I heard someone discussing statistics and/or probability and using the two interchangeably. But they are not the same. Generally, probability is a measure of one’s belief in the occurrence of a future event; in other words, a measure of the likelihood of future events. It is a theoretical branch of mathematics which studies the consequences of mathematical definitions in an idealized world, while statistics is an applied branch of mathematics which tries to understand observations from the real world. While the two subjects are related, important and useful, they are very different, and understanding this difference is pertinent to interpreting statistical and mathematical evidence.

One way to illustrate the difference between statistics and probability is through the mathematical lens of a coin flip.

If the mathematician were looking at this coin flip through a probability lens, she would assume it is a fair coin, so that heads and tails are equally likely. Each face then has probability 1/2, and she can determine the chances of getting heads vs. tails.

From a statistical perspective, the coin may seem fine at first, but how do we really know it is fair? We track how often heads and tails occur over many flips, and decide whether the observations are consistent with each face occurring with probability 1/2.
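One simple version of that check is a z-score: under the fairness assumption, the number of tails in n flips has mean n/2 and standard deviation sqrt(n)/2, so a z-score near 0 is consistent with a fair coin. The flip counts here are hypothetical:

```python
import math

# Hypothetical record: 52 tails in 100 flips
n, tails = 100, 52

# Under a fair coin, tails has mean n/2 and standard deviation sqrt(n)/2
z = (tails - n / 2) / (math.sqrt(n) / 2)

print(f"Observed tail frequency: {tails / n:.2f}, z-score: {z:.2f}")
# A |z| well below 2 gives no reason to doubt the fair-coin assumption
```

Here z = 0.4, so 52 tails in 100 flips is entirely consistent with a fair coin.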

Another way to define probability is thinking of it as the logic of uncertainty, while mathematics is the logic of certainty. Probability is useful in a variety of fields to explain variation, uncertainty, and model complex phenomena. Some fields that use probability include, but are not limited to: statistics, physics, biology and medicine.

Sample spaces

The mathematical framework for probability is built around set theory. Sets are very useful in probability because they provide a way to work with and understand events. Before an experiment is performed, it is unknown which one of the possible outcomes will result. The set of all possible outcomes of an experiment is known as the sample space (S). An event, A, is a subset of S.

Concepts of set theory are very general and abstract, so it is important to have an example to keep in mind.

Example: coin flip

A coin is flipped 2 times. Let H represent heads and T represent tails. The sample space is the set of all possible sequences of H’s and T’s across the 2 flips: S = {HH, HT, TH, TT}.
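This sample space can be enumerated directly, for instance with Python’s itertools:

```python
from itertools import product

# Every sequence of H's and T's over 2 coin flips
sample_space = [''.join(flips) for flips in product('HT', repeat=2)]

print(sample_space)  # ['HH', 'HT', 'TH', 'TT']
```

Changing repeat=2 to any n enumerates all 2**n outcomes of n flips.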

Different interpretations of probability

As with statistics, there are a few interpretations of probability. Typically, one first learns the classical and frequentist approaches.

Classical

In the classical approach, if an event can occur in r different ways out of a total of n possible ways, all of which are equally likely, then the probability of the event occurring is r/n.

For example, if we flip a coin 2 times, in how many ways can at least 1 tail occur? We approach this by writing out all the possible outcomes of 2 coin flips. Each flip of a fair coin has probability 1/2 of landing heads and 1/2 of landing tails.

The set of all possible outcomes is {HH, HT, TH, TT}. There are four possible outcomes in total, three of which contain at least 1 tail. Since each outcome occurs with equal probability 1/4, the probability of getting at least 1 tail in two flips of a coin is:

1/4 + 1/4 + 1/4 = 3/4 = 0.75.
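The same r/n computation can be spelled out in code by counting favourable outcomes:

```python
from itertools import product

# All outcomes of 2 coin flips, each equally likely
outcomes = [''.join(flips) for flips in product('HT', repeat=2)]
favourable = [o for o in outcomes if 'T' in o]  # at least 1 tail

probability = len(favourable) / len(outcomes)  # classical r/n
print(probability)  # 0.75
```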

Frequentist

In the frequentist approach, if an event occurs in r out of n repetitions of an experiment, where n is very large, then the probability of the event is r/n, and we call this the empirical probability of the event.

For example, if we toss a coin 100 times and it lands on tails 52 of those times, then the empirical probability of tails is 52/100 = 0.52.

So with this view of probability, if we say a coin has probability 1/2 of Heads, then this means that the coin would land on Heads 50% of the time if we tossed it repeatedly.

This is just meant as a soft introduction to probability. There are many important principles and axioms one must understand to properly apply probability in real-world applications. For example, in the frequentist approach above, the phrase “where n is very large” refers to an important fundamental concept known as the Law of Large Numbers. By this law, in the example where we flip the coin 100 times and get tails 52 times, as we flip more and more times (that is, as n gets very large), the observed frequency converges to the expected value: for a fair coin, 50% heads and 50% tails.
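A quick simulation illustrates the Law of Large Numbers: as the number of simulated fair-coin flips grows, the observed frequency of tails settles toward 1/2. The seed is fixed only to make this sketch reproducible:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Frequency of tails after n simulated flips of a fair coin
for n in (100, 10_000, 1_000_000):
    tails = sum(random.random() < 0.5 for _ in range(n))
    print(f"n = {n:>9,}: tail frequency = {tails / n:.4f}")
```

The frequencies typically wander noticeably for small n and hug 0.5 ever more tightly as n grows.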

There is also a third approach which is beyond the scope of this introduction, known as the Bayesian approach. This is a controversial part of probability, and will be covered later.

What’s next?

Hopefully this is a gentle (and interesting) enough introduction to spur your curiosity to learn more. Below are some additional sources of information and references.

References:

Spiegel, M. R., PhD, Schiller, J., & Srinivasan, R. A. (2013). Probability and Statistics (4th ed.). McGraw Hill.

Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical Statistics with Applications (7th ed.). Thomson Learning.

Bruce, P., Bruce, A., & Gedeck, P. (2019). Practical Statistics for Data Scientists (2nd ed.). O’Reilly.

Blitzstein, J. K., & Hwang, J. (2015). Introduction to probability. CRC Press.

Questions & Comments

If you have any questions or comments, or just want to chat, please email us at: [email protected], or leave one below!