A Basic Introduction to Statistics

Definition

Statistics is the science of data. It is concerned with collecting, classifying, summarizing, analyzing, interpreting and presenting information. In other words, it is a mathematically based field which involves collecting and interpreting quantitative (and sometimes qualitative) data.

There are many different definitions of statistics out there, but they all agree on a few common elements: they each imply that data is collected and that statistics is a theory of information, with inference as the objective.

There are two main branches of statistics, descriptive and inferential statistics (really there are three, but the third deserves a whole post of its own).

Descriptive Statistics

Descriptive statistics refers to methods that aim to describe raw data using summary statistics, visualizations and frequency tables, in order to understand the underlying data. Summary statistics are values that summarize the data to help paint a picture of the data set.

These include measures of central tendency like the mean and median, and measures of dispersion like variance and standard deviation. Some visualizations to describe the data include box plots, scatter plots and histograms.

Example: suppose we want to understand the relationship between fish weight and length measurements in the Fish market data set from Kaggle. Let's look at some Python code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Fish market data set downloaded from Kaggle
df = pd.read_csv(r'C:\...\Fish.csv')

# Preview the first three rows
df.head(3)

I do not like those ambiguous column names, so I will rename them according to the code documentation on Kaggle.

df = df.rename({'Length1':'Vertical_Length', 'Length2':'Diagonal_Length', 'Length3':'Cross_Length'}, axis='columns')

I would like to visualize a correlation plot for each variable.

sns.pairplot(df, kind="scatter", hue='Species')
plt.show()

Looking at this, I can see both linear and nonlinear patterns, and some correlation among the variables. If we wanted to use this data set to predict the weights of fish, then I would like to see each variable compared to the weight variable:

pp = sns.pairplot(df,
                  x_vars=['Vertical_Length', 'Diagonal_Length', 'Cross_Length', 'Height', 'Width'],
                  y_vars=['Weight'], hue='Species')
plt.show()

Now, I would like to see some summary statistics for each of the variables:

df.describe()

This summary gives a rough idea of what your data set looks like. You can look at minimum, maximum and percentile values. For example, in this data set, the ‘Weight’ variable has a minimum value of 0 g. How can a fish weigh 0 g? This value could be erroneous, so some further investigation is needed.
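
One quick way to follow up is to pull out the suspicious rows directly. A minimal sketch, assuming the df loaded above:

# Inspect rows with a non-positive weight, which is physically implausible
suspicious = df[df['Weight'] <= 0]
print(suspicious)

# If these turn out to be data-entry errors, one option is to drop them:
# df = df[df['Weight'] > 0]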

Estimates of Location

Some of the most important values in statistics are the mean and the median.

A basic step to explore the data set is to get an estimate of where most of the data is located, also referred to as central tendency. Generally, the first thing that may come to mind is to find the average value, or the mean.

Mean, median and mode

While the mean is fast and easy to calculate, it isn't always the best measure of central tendency, because it is sensitive to outliers. There are other types of means, such as the trimmed mean and the weighted mean.

One use of the trimmed mean is in scoring international diving, where the highest and lowest scores are dropped to limit the influence of extreme values. With a weighted mean, highly variable observations are given lower weights, since some values are intrinsically more variable than others. For example, when taking data from sensors, if one or two sensors are less accurate in their readings, we may want to down-weight the data from those sensors when taking average values. Weighted means can also be used when a group is underrepresented in a sample, to help the estimate accurately reflect all groups in the user base.
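
As a quick sketch of both ideas, using made-up sensor readings and weights rather than anything from the fish data set:

import numpy as np
from scipy import stats

# Five sensor readings; the last sensor is known to read high
readings = np.array([9.8, 10.1, 10.0, 9.9, 14.2])

# Trimmed mean: drop the top and bottom 20% of values before averaging
print(stats.trim_mean(readings, proportiontocut=0.2))  # 10.0

# Weighted mean: down-weight the unreliable fifth sensor
weights = np.array([1.0, 1.0, 1.0, 1.0, 0.2])
print(np.average(readings, weights=weights))  # ~10.15, vs. a plain mean of 10.8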

Median

Another measure of central tendency is the median. The median is the middle number in a sorted array of the data, and visually, it separates the density curve into two equal parts.

There are some cases where the median is a better choice than the mean. For example, suppose we look at salaries in the neighborhood Bill Gates lives in and in a separate but similar neighborhood next to it with no billionaires. The mean salaries of the two neighborhoods would look drastically different because of Bill Gates' unusually high income. Using the median, it wouldn't matter how rich he is; the middle value would remain the same.

In instances when the distribution of the data is skewed, or outliers are present, then it is best to use the median.
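
A tiny sketch of that effect, with a made-up salary list containing one extreme value:

import numpy as np

# Four typical salaries and one billionaire
salaries = np.array([55_000, 62_000, 58_000, 60_000, 1_000_000_000])

print(np.mean(salaries))    # about 200 million, dominated by the outlier
print(np.median(salaries))  # 60,000, unchanged no matter how rich the outlier is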

Mode

You might also have come across the mode in school. The mode is the value that occurs most often, also known as the value with the highest probability of occurrence, and it is especially useful for categorical data. When there are two, three, or more values with comparably high frequencies, the distribution is called bimodal, trimodal or multimodal, respectively.

(Figures: examples of bimodal, trimodal and multimodal distributions.)
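
For example, pandas can compute the mode of the categorical Species column in the fish data set directly; a minimal sketch assuming the df from earlier:

# Most frequent species in the data set (the mode of a categorical column)
print(df['Species'].mode())

# Full frequency table, sorted from most to least common
print(df['Species'].value_counts())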

Measures of Dispersion

When looking at a data set, we want to look at central tendency, but we also want to get an idea of how spread out the values are. For this, we use the variance, standard deviation, range, and values like percentiles and the interquartile range. The interquartile range (IQR) is the difference between the third and first quartiles, the cut points that split the sorted data into four equal parts.
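
A quick sketch of quartiles and the IQR for the fish weights, assuming the df loaded earlier:

import numpy as np

# First and third quartiles of the fish weights, and the IQR between them
q1, q3 = np.percentile(df['Weight'], [25, 75])
print(q1, q3, q3 - q1)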

Variance & standard deviation

The variance and standard deviation are measures of dispersion, or scatter of the values of the random variable about the mean. If the values concentrate near the mean, then the variance is small. If the values are far from the mean, then the variance is large. The standard deviation is the square root of the variance and another commonly used value to understand the spread.

(Figures: a distribution with small variance and a distribution with large variance.)
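
In Python, numpy computes both. A minimal sketch on the fish weights; note that ddof=1 gives the sample versions of these statistics:

import numpy as np

# Sample variance and standard deviation of the fish weights
# (ddof=1 gives the sample versions; ddof=0 would give the population versions)
print(np.var(df['Weight'], ddof=1))
print(np.std(df['Weight'], ddof=1))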

Visualizations

Visualizations help us understand the data at a glance, like the pair plots in the example above that painted a picture of how the variables in the fish data set relate to each other. Common types of plots used to visualize data include box plots, histograms, correlation heatmaps and scatter plots. Examples of these are shown below:

Box plot of the Fish market data set – these are commonly used to look at the distribution of values for each variable and to identify potential outliers
Histograms of the variables in the Fish market data set – these are used to examine counts and distributions for variables
Correlation heatmap of the Fish market data set variables – this is used to view collinearity in a data set; this one in particular shows high multicollinearity between the variables
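
Here is a sketch of how each of these plots could be produced with seaborn and matplotlib, assuming the renamed DataFrame from earlier:

# Box plot of every numeric column
sns.boxplot(data=df.select_dtypes('number'))
plt.show()

# Histograms of each variable
df.hist(figsize=(10, 8))
plt.show()

# Correlation heatmap of the numeric variables
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()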

Again, descriptive statistics help describe a data set and inferential statistics help make inferences from a sample about a larger population.

Inferential Statistics

Inferential statistics analyzes small samples of data and uses the results to draw inferences about the population as a whole. The process of obtaining these smaller samples is called sampling. Some common forms of inferential statistics are:

  1. Hypothesis testing
  2. Confidence intervals
  3. Regression

Consider a different use case: people who go to the library. We want to know what this population prefers to read, but we don't have the ability to ask every single person who comes to the library what they prefer. So we pick a representative sample and survey them to draw inferences about the larger population of people who go to the library.

Populations and samples

Keep in mind that the term population here does not always mean the same thing as the population of a country or city. Population in this context refers to the full set of individual observations and measurements of interest. The population can be infinite or finite, and for a finite population the number of observations is the population size, usually denoted N. The sample size (n) is usually finite, since the sample is taken from the larger population with the intention of using these specific measurements to infer information about the population.

X bar (x̄) represents the mean of a sample from a population, and mu (μ) represents the mean of the population. We make this distinction intentionally because information about samples is observed, while information about the population is inferred from those smaller samples. Using different symbols helps keep the two separate.
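
A small sketch of the distinction, drawing a random sample from a made-up finite population:

import numpy as np

rng = np.random.default_rng(42)

population = rng.normal(loc=500, scale=100, size=10_000)  # N = 10,000
sample = rng.choice(population, size=50, replace=False)   # n = 50

print(population.mean())  # mu: the population mean, rarely known in practice
print(sample.mean())      # x bar: the observed sample mean, our estimate of mu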

Hypothesis testing

An A/B test is an experiment with two groups used to determine which of the two produces the more effective outcome, or which hypothesis the data supports. One example use of a hypothesis test is to compare two prices to determine which yields the better net profit. Let's say we think price A will produce higher profit than price B. One might think to just run the experiment and use whichever price has the better outcome. However, we naturally underestimate the effect of random variation, and can often misinterpret it as a significant pattern rather than a random effect. Once we collect the data from the A/B test, we then determine whether the observed difference between the two groups is a random effect or a true difference between A and B.
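
A minimal sketch of such a test, using a two-sample t-test from scipy on made-up daily profit figures for each price:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical daily net profits observed under each price
profits_a = rng.normal(loc=105, scale=10, size=30)
profits_b = rng.normal(loc=100, scale=10, size=30)

# Two-sample t-test: is the difference in means a random effect or a real one?
t_stat, p_value = stats.ttest_ind(profits_a, profits_b)
print(t_stat, p_value)  # a small p-value suggests a true difference, not chance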

Confidence intervals

We have already discussed histograms, box plots and other methods to understand the potential error in sample estimates. Another way to do this is to use confidence intervals. A sample is unlikely to yield a perfect estimate for the population, which creates uncertainty in the point estimate. One way to quantify this uncertainty is to construct a confidence interval, a range of values that we are confident contains the true population value. Generally, we say an x% confidence interval constructed around the sample estimate should, on average, contain the true value x% of the time if the same (or a similar) sampling procedure is repeated.
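
One common way to construct such an interval in code is the bootstrap: resample the observed data with replacement many times and take percentiles of the resampled estimates. A sketch, assuming the fish data set's Weight column from earlier:

import numpy as np

rng = np.random.default_rng(1)
weights = df['Weight'].to_numpy()

# Means of 10,000 bootstrap resamples of the observed weights
boot_means = [rng.choice(weights, size=len(weights), replace=True).mean()
              for _ in range(10_000)]

# 95% confidence interval for the mean fish weight
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)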

Regression

A common goal of statistics is to understand the relationship between two variables. To do this, we look for answers to questions like: is X associated with Y? What is the relationship, and can we use it to predict Y? To answer these, we perform a regression analysis, a technique for quantifying the relationship between the variables. Predicting with one explanatory variable is simple linear regression, predicting with more than one is multiple linear regression, and logistic regression is prediction of a binary categorical variable, such as whether a person will default on a loan or whether a passenger survived the sinking of the Titanic. Regression is a complex process that deserves a post of its own. Check out the full notebook on my GitHub to see how I used multiple linear regression with the fish market data set.
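
As a taste, here is a minimal simple linear regression sketch with scipy, predicting Weight from Width in the fish data set (one explanatory variable; the notebook uses several):

from scipy import stats

# Fit Weight as a linear function of Width
result = stats.linregress(df['Width'], df['Weight'])
print(result.slope, result.intercept, result.rvalue**2)

# Predict the weight of a fish with a hypothetical width of 5 (same units as the data)
print(result.slope * 5 + result.intercept)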

Conclusion

There are many different definitions of statistics out there, but all point back to the same elements: statistics is a theory of information, and data is collected with inference as the objective.

There are two main branches of statistics: descriptive and inferential statistics. Descriptive statistics refers to methods that describe data using summary statistics, visualizations and frequency tables, in order to understand the underlying data and paint a picture of the data set. Classical statistics focused almost entirely on inference, the set of methods for quantifying properties of, and drawing conclusions about, a larger population from the sample taken from it.


References:

Spiegel, M. R., Schiller, J., & Srinivasan, R. A. (2013). Probability and Statistics (4th ed.). McGraw Hill.

Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical Statistics with Applications (7th ed.). Thomson Learning.

Bruce, P., Bruce, A., & Gedeck, P. (2019). Practical Statistics for Data Scientists (2nd ed.). O'Reilly.

Mendenhall, W., & Sincich, T. (1995). Statistics for Engineering and the Sciences. Prentice Hall.
