Book Review: Practical Statistics for Data Scientists

Bottom Line Up Front

Statistics is essential for doing data science, but it is often not readily available or easily digestible for those not formally trained in the field.

I like this book. It works well as a quick reference, but it is not the place to build an in-depth understanding of the statistics required for data science. I do like that it points to other sources, which are great resources in their own right. It is also handy as a quick R or Python reference, though the code could become outdated as the languages and packages are updated over time.

I call it a quick reference because I already knew these concepts from my education and practical experience in the field, so I cannot offer a reliable perspective from someone learning them for the first time. Nonetheless, I think it is a great introduction and provides a solid foundation for the reader to get started.

Book Overview and Structure

Overview & Goals

The book says it is aimed at data scientists with some familiarity with R or Python and a basic knowledge of statistics. The authors largely stay true to that assumption about the reader. In particular, the first couple of chapters are a great primer and refresher on basic statistics.

It explains two goals for the book:

  1. To lay out in a digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science.
  2. To explain which concepts are important and useful from a data science perspective, which are less so, and why.

To me, this book accomplishes those goals. But I do think some data “influencers” give it too much weight as a way to learn data science, and fail to mention that it is fairly high level: it is meant mostly as a primer on statistics from a data science perspective, not an in-depth explanation.

Even so, this overinflation of the book’s usefulness is by no means the authors’ fault. They explicitly state in the goals that the book is meant for easy reference and for explaining which concepts are important and useful in data science, not for an in-depth demonstration of those concepts.

Book Structure

The book is laid out in 7 chapters, each touching on important data science concepts.

Additionally, each chapter is split into sections and subsections, which I will outline below.

Each section starts with a short blurb introducing its topic, followed by a list of key vocabulary for the important concepts it covers.

The section then works through each concept in the vocabulary list in a little more detail, along with related topics not directly listed there, using definitions and examples to paint a clear picture of the topic. Each section ends with key ideas that recap the main points and with further reading suggestions for readers who want to learn more. Finally, each chapter wraps up with a short paragraph or two of summary.

Chapters

  • Chapter 1: Exploratory Data Analysis
  • Chapter 2: Data and Sampling Distributions
  • Chapter 3: Statistical Experiments and Significance Testing
  • Chapter 4: Regression and Prediction
  • Chapter 5: Classification
  • Chapter 6: Statistical Machine Learning
  • Chapter 7: Unsupervised Learning

Chapter 1 – Exploratory Data Analysis

Sections

  • Estimates of Location
    • Mean
    • Median and Other Robust Estimates
  • Estimates of Variability
    • Standard Deviation and Related Estimates
    • Estimates based on Percentiles
  • Exploring the Data Distribution
    • Percentiles and Box Plots
    • Frequency Tables and Histograms
    • Density Plots and Estimates
  • Exploring Binary and Categorical Data
    • Mode
    • Expected Values
    • Probability
  • Correlation
    • Scatter Plots
  • Exploring Two or More Variables
    • Hexagonal Binning and Contours
    • Two Categorical Variables
    • Categorical and Numerical Data
    • Visualizing Multiple Variables

Opinion

The first chapter offers a gentle introduction to descriptive statistics. I see many reviews saying this is one of the best books out there for data science, but I think that is an overstatement. The book is great, but only at a high level. It is not all-encompassing, and it covers only a small subset of the data realm. If you are looking into this book to learn data science, I suggest you also look into a more traditional, in-depth textbook such as Mathematical Statistics (reference below) for the statistics portion. That said, it is an accessible entry point into statistics for data science and good for a quick refresher; I have found myself picking it up from time to time.

One thing I really like in this chapter is the explanation of each visualization used, and the R and Python code provided.
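To make the "robust estimates" idea from this chapter concrete, here is a minimal sketch of my own (not code from the book) that uses only Python's standard library. The salary figures and the `trimmed_mean` helper are made up for illustration.

```python
import statistics

def trimmed_mean(values, trim_count):
    """Mean after dropping the `trim_count` smallest and largest values."""
    ordered = sorted(values)
    return statistics.fmean(ordered[trim_count:len(ordered) - trim_count])

# Hypothetical salaries in thousands; 400 is a deliberate outlier.
salaries = [48, 50, 51, 52, 53, 55, 400]

mean_salary = statistics.fmean(salaries)
median_salary = statistics.median(salaries)
trimmed_salary = trimmed_mean(salaries, trim_count=1)

print(f"mean:         {mean_salary:.1f}")
print(f"median:       {median_salary}")
print(f"trimmed mean: {trimmed_salary:.1f}")
```

The single outlier pulls the mean far above the typical value, while the median and the trimmed mean barely move.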

Chapter 2 – Data and Sampling Distributions

Sections

  • Random sampling and sample bias
    • Bias
    • Random selection
    • Size versus quality: when does size matter?
    • Sample mean versus population mean
  • Selection Bias
    • Regression to the mean
  • Sampling distribution of a statistic
    • Central limit theorem
    • Standard error
  • The Bootstrap
    • Resampling versus bootstrapping
  • Confidence Intervals
  • Normal distribution
    • Standard normal and QQ-Plots
  • Long-tailed distributions
  • Student’s t-Distribution
  • Binomial distribution
  • Chi-square distribution
  • F-Distribution
  • Poisson and related distributions
    • Poisson distributions
    • Exponential distributions
    • Estimating the failure rate
    • Weibull distribution

Opinion

Chapter 2, “Data and Sampling Distributions,” delves into the fundamental concepts of sampling and how sampling forms the basis for making inferences about populations from sample data. It gives simple, clear explanations that make complex concepts like sampling distributions understandable, laying the groundwork for statistical inference.

Specifically, it explores various methods of sampling and their implications for the validity of statistical conclusions, which helps readers understand how different sampling methods affect the generalizability of results. It clearly explains the concept of a sampling distribution, emphasizing its importance for understanding the behavior and variability of sample statistics: how a statistic such as the sample mean or sample proportion varies from one sample to another.

It illuminates the Central Limit Theorem (CLT) and its significance in statistical inference. It demonstrates how, largely regardless of the population’s underlying distribution, the distribution of sample means tends toward a normal distribution as the sample size grows, provided certain conditions hold (such as independent observations and finite variance). This theorem underpins many hypothesis tests and confidence intervals.
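A quick simulation helps me see the CLT in action. The sketch below is my own, not the book's: it draws many samples from a skewed exponential distribution (rate 1, so the population mean and standard deviation are both 1) and checks that the sample means cluster around 1 with a spread close to 1/sqrt(n). The sample size, number of samples, and seed are arbitrary choices.

```python
import random
import statistics

def sample_means(n, num_samples, rng):
    """Means of `num_samples` independent samples of size `n` drawn from Exp(1)."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 50
means = sample_means(n=n, num_samples=2000, rng=rng)

clt_center = statistics.fmean(means)   # should be near the population mean, 1.0
clt_spread = statistics.stdev(means)   # should be near the standard error
theoretical_se = 1.0 / n ** 0.5        # sigma / sqrt(n) with sigma = 1

print(f"mean of sample means:  {clt_center:.3f}")
print(f"stdev of sample means: {clt_spread:.3f}")
print(f"theoretical std error: {theoretical_se:.3f}")
```

Even though individual exponential draws are strongly right-skewed, a histogram of `means` would look roughly bell-shaped.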

Furthermore, it introduces bootstrapping as a resampling technique used to estimate the sampling distribution empirically, especially when analytical methods are challenging or unavailable.
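As an illustration of the bootstrap idea, here is a minimal sketch (mine, not taken from the book) that estimates the standard error of a median by resampling with replacement. The income values, the `bootstrap_se` helper, and the number of resamples are all assumptions for the example.

```python
import random
import statistics

def bootstrap_se(data, statistic, num_resamples, rng):
    """Standard deviation of `statistic` across resamples drawn with replacement."""
    estimates = [
        statistic(rng.choices(data, k=len(data)))  # one bootstrap resample
        for _ in range(num_resamples)
    ]
    return statistics.stdev(estimates)

# Hypothetical incomes in thousands.
incomes = [22, 25, 27, 31, 33, 36, 41, 48, 55, 62, 75, 110]

median_se = bootstrap_se(incomes, statistics.median, num_resamples=1000,
                         rng=random.Random(0))
print(f"sample median: {statistics.median(incomes)}")
print(f"bootstrap standard error of the median: {median_se:.2f}")
```

The appeal is that the same function works for any statistic you can compute on a sample, including ones with no convenient closed-form standard error.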

Finally, this chapter serves as a bridge between basic descriptive statistics and more advanced inferential statistics. It equips readers with the necessary foundation to understand how statistical analyses rely on sampling distributions and sets the stage for subsequent chapters on hypothesis testing, confidence intervals, and more sophisticated statistical techniques. However, it only touches on and briefly explains each concept, though it does suggest further reading for a better understanding of the topics.

Chapter 3 – Statistical Experiments and Significance Testing

Sections

  • A/B Testing
    • Why have a control group?
    • Why just A/B? Why not C, D, …?
  • Hypothesis Tests
    • The null hypothesis
    • Alternative hypothesis
    • One-way versus two-way hypothesis tests
  • Resampling
    • Permutation test
    • Example: Web stickiness
    • Exhaustive and Bootstrap Permutation Tests
    • Permutation Tests: The Bottom Line for Data Science
  • Statistical Significance and p-values
    • p-value
    • Alpha
    • Type 1 and Type 2 errors
    • Data science and p-values
  • t-Tests
  • Multiple testing
  • Degrees of Freedom
  • ANOVA
    • F-Statistic
    • Two-way ANOVA
  • Chi-square test
    • Chi-square test: A Resampling Approach
    • Chi-square test: Statistical Theory
    • Fisher’s Exact Test
    • Relevance for Data Science
  • Multi-Arm Bandit Algorithm
  • Power and Sample Size
    • Sample size

Opinion

This chapter offers insightful guidance on conducting experiments and interpreting significance tests, which is crucial for making informed, data-driven decisions. While not the most in-depth treatment, I think it is a great primer on statistical experiments and significance testing. It goes over how to design experiments to test a hypothesis, covers the important concepts, and explains their meaning and relevance to data science.

Personally, significance testing is something I frequently find myself needing to brush up on. I often reach for this book for its straightforward definitions of certain practices and concepts. It also offers great further reading suggestions and snippets of helpful Python and R code. This chapter alone makes buying the book worth it for me.
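Since the permutation test is the part of this chapter I revisit most, here is a small sketch of the idea in plain Python. It is my own illustration rather than the book's code: the session times are invented, and `permutation_p_value` is a hypothetical helper that computes a one-sided p-value for the difference in means.

```python
import random
import statistics

def permutation_p_value(group_a, group_b, num_permutations, rng):
    """One-sided p-value for mean(group_b) - mean(group_a) under random relabeling."""
    observed = statistics.fmean(group_b) - statistics.fmean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # randomly reassign observations to the two groups
        diff = statistics.fmean(pooled[n_a:]) - statistics.fmean(pooled[:n_a])
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / num_permutations

# Hypothetical session times (seconds) for two page designs.
page_a = [21, 35, 67, 85, 118, 147, 172, 198, 211, 240]
page_b = [43, 71, 87, 125, 154, 186, 203, 236, 251, 290]

p_value = permutation_p_value(page_a, page_b, num_permutations=5000,
                              rng=random.Random(1))
print(f"one-sided p-value: {p_value:.3f}")
```

The p-value is simply the share of random relabelings that produce a difference at least as large as the observed one, so no distributional formula is needed.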

Chapter 4 – Regression and Prediction

Sections

  • Simple Linear Regression
    • The Regression Equation
    • Fitted Values and Residuals
    • Least Squares
    • Prediction Versus Explanation (Profiling)
  • Multiple Linear Regression
    • Example: King County Housing Data
    • Assessing the Model
    • Cross-Validation
    • Model Selection and Stepwise Regression
    • Weighted Regression
  • Prediction Using Regression
    • The Dangers of Extrapolation
    • Confidence and Prediction Intervals
  • Factor Variables in Regression
    • Dummy Variables Representation
    • Factor Variables with Many Levels
    • Ordered Factor Variables
  • Interpreting the Regression Equation
    • Correlated Predictors
    • Multicollinearity
    • Confounding Variables
    • Interactions and Main Effects
  • Regression Diagnostics
    • Outliers
    • Influential Values
    • Heteroskedasticity, Non-Normality, and Correlated Errors
    • Partial Residual Plots and Nonlinearity
  • Polynomial and Spline Regression
    • Polynomial
    • Splines
    • Generalized Additive Models

Opinion

Chapter 4 dives deep into regression, one of the most fundamental tools in data science for predicting outcomes. It covers both simple and multiple linear regression models, and explains key concepts like least squares, multicollinearity, and interaction terms. The book also introduces more advanced topics such as splines and generalized additive models (GAMs), but it doesn’t provide enough depth to fully understand their practical implications without further reading.
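For readers who want to see what least squares, fitted values, and residuals amount to, here is a minimal sketch of simple linear regression in plain Python. It is my own example, not the book's King County code: the area and price numbers are invented, and `fit_simple_regression` is a hypothetical helper.

```python
import statistics

def fit_simple_regression(x, y):
    """Least-squares intercept and slope for the line y = b0 + b1 * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: living area (hundreds of sq ft) and price (thousands of dollars).
area = [8, 10, 12, 15, 18, 20, 24]
price = [210, 250, 275, 340, 390, 420, 510]

b0, b1 = fit_simple_regression(area, price)
fitted = [b0 + b1 * a for a in area]
residuals = [p - f for p, f in zip(price, fitted)]

print(f"intercept: {b0:.2f}, slope: {b1:.2f}")
print("residuals:", [round(r, 1) for r in residuals])
```

Looking at the residuals, for example plotting them against the fitted values, is the starting point for the diagnostics the chapter discusses, such as spotting outliers or heteroskedasticity.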

What I appreciate about this chapter is its balance between theory and application. The inclusion of real-world examples, like predicting housing prices in King County (although this example is not novel), makes the material more relatable. However, for those looking to implement models in a production environment, the book’s treatment of topics like cross-validation and model selection is somewhat limited. While the chapter does mention these concepts, it doesn’t explore them in great depth, making this more of a primer than a comprehensive guide.

I will say that it went more in-depth than I thought it would, and it still covered some of the fundamentals needed to really understand the models. In my opinion, though, it did not go deep enough into, or place enough emphasis on, the underlying assumptions the models make about the data and how to check and adjust for them.

This is another area where readers would benefit from supplementary texts, especially when it comes to nuanced subjects like heteroskedasticity or outlier detection, which require more sophisticated understanding than what is presented here. At the end of the chapter, there are some suggestions for further reading, which I think are great materials to start learning this topic in more detail. Still, for a beginner or someone in need of a quick refresher, this chapter serves its purpose well.

Chapter 5 – Classification

Sections

  • Naïve Bayes
    • Why Exact Bayesian Classification is Impractical
    • The Naïve Solution
    • Numeric Predictor Variables
  • Discriminant Analysis
    • Covariance Matrix
    • Fisher’s Linear Discriminant
    • A Simple Example
  • Logistic Regression
    • Logistic Response Function and Logit
    • Logistic Regression and the GLM
    • Generalized Linear Models
    • Predicted Values from Logistic Regression
    • Interpreting the Coefficients and Odds Ratios
    • Linear and Logistic Regression: Similarities and Differences
    • Assessing the Model
  • Evaluating Classification Models
    • Confusion Matrix
    • The Rare Class Problem
    • Precision, Recall and Specificity
    • ROC Curve
    • AUC
    • Lift
  • Strategies for Imbalanced Data
    • Undersampling
    • Oversampling and Up/Down Weighting
    • Data Generation
    • Cost-Based Classification
    • Exploring the Predictions

Opinion

Classification techniques are at the heart of many data science projects, especially when it comes to tasks like spam detection, credit scoring, and medical diagnoses. Chapter 5 offers a clear and approachable introduction to key classification methods such as Naïve Bayes, Logistic Regression, and Discriminant Analysis.

The chapter does a good job of explaining the differences between logistic regression and linear regression, which is often a point of confusion for beginners. It also explains how to evaluate classification models using metrics like accuracy, precision, recall, and AUC.
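The evaluation metrics are easy to mix up, so here is a minimal sketch that computes them from a confusion matrix by hand. This is my own illustration with invented labels; the helper names are hypothetical, and the code assumes binary 0/1 labels with at least one actual and one predicted case of each class (otherwise a division by zero would occur).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),    # of predicted positives, share that are correct
        "recall": tp / (tp + fn),       # of actual positives, share that are found
        "specificity": tn / (tn + fp),  # of actual negatives, share that are found
    }

# Hypothetical labels: 3 actual positives out of 10 records.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy here looks better than precision or recall, which is exactly why the chapter warns against relying on accuracy alone when the positive class is rare.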

One thing I found helpful is the discussion on strategies for imbalanced data, such as oversampling and cost-based classification. These are real-world challenges that many data scientists face. I still find myself referencing this section when I am in a pinch.
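To show what oversampling means in practice, here is a minimal sketch of random oversampling in plain Python. It is my own simplified illustration, not the book's code: `random_oversample` is a hypothetical helper, and the 95/5 class split is invented.

```python
import random
from collections import Counter

def random_oversample(rows, labels, rng):
    """Keep every original row and add resampled copies (with replacement)
    of the smaller classes until every class matches the largest one."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())

    new_rows, new_labels = [], []
    for label, members in by_class.items():
        extra = rng.choices(members, k=target - len(members))
        for row in members + extra:
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels

# Hypothetical data: 100 record ids, of which only the first 5 are positives.
rows = list(range(100))
labels = [1 if i < 5 else 0 for i in range(100)]

balanced_rows, balanced_labels = random_oversample(rows, labels, random.Random(3))
print("before:", Counter(labels))
print("after: ", Counter(balanced_labels))
```

In a real project I would apply this only to the training split, so that duplicated rows cannot leak into the data used for evaluation.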

If you’re aiming to build robust, industry-level classification models, you’ll likely need to consult more advanced texts or papers to gain a deeper understanding of areas like model tuning, imbalanced class handling, and overfitting in large datasets. At the end of each section, there are some great suggestions for further reading.

Chapter 6 – Statistical Machine Learning

Sections

  • K-Nearest Neighbors
    • A Small Example: Predicting Loan Default
    • Distance Metrics
    • One Hot Encoder
    • Standardization
      • Normalization
      • z-Scores
    • Choosing K
    • KNN as a feature engine
  • Tree Models
    • A Simple Example
    • The Recursive Partitioning Algorithm
    • Measuring Homogeneity or Impurity
    • Stopping the Tree from Growing
    • Predicting a Continuous Value
    • How Trees are Used
  • Bagging and the Random Forest
    • Bagging
    • Random Forest
    • Variable Importance
    • Hyperparameters
  • Boosting
    • The Boosting Algorithm
    • XGBoost
    • Regularization: Avoid Overfitting
    • Hyperparameters and Cross-Validation

Opinion

This chapter introduces machine learning from a statistical standpoint, focusing on techniques like k-nearest neighbors, decision trees, random forests, and boosting. These are core algorithms in the data science toolkit, and the authors do a commendable job of explaining them in a concise and approachable manner.

For those unfamiliar with machine learning, this chapter provides a solid foundation. It covers the intuition behind these models and explains how they fit into the broader scope of statistical learning. I also liked the use of more realistic dataset examples to help visualize these concepts. However, it again stays at a high level, stopping short of the depth required for real-world implementation. Concepts such as hyperparameter tuning and ensemble methods are only briefly touched upon. But these are still great, quick explanations for someone looking for a primer, refresher, or for someone encountering these for the first time. One part I do appreciate is the quick note on machine learning vs statistics, as this is something I have encountered recently in the workplace.

While the book is excellent for quick reference, this is another chapter where further reading is necessary if you’re looking to specialize in machine learning. Fortunately, the authors provide good suggestions for additional resources, such as “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman, which is much more comprehensive.

Chapter 7 – Unsupervised Learning

Sections

  • Principal Component Analysis
    • A Simple Example
    • Computing the Principal Components
    • Interpreting Principal Components
    • Correspondence Analysis
  • K-Means Clustering
    • A Simple Example
    • K-Means Algorithm
    • Interpreting the Clusters
    • Selecting the Number of Clusters
  • Hierarchical Clustering
    • A Simple Example
    • The Dendrogram
    • The Agglomerative Algorithm
    • Measures of Dissimilarity
  • Model-Based Clustering
    • Multivariate Normal Distribution
    • Mixtures of Normals
    • Selecting the Number of Clusters
  • Scaling and Categorical Variables
    • Scaling the Variables
    • Dominant Variables
    • Categorical Data and Gower’s Distance
    • Problems with Clustering Mixed Data

Opinion

The final chapter introduces unsupervised learning, focusing on principal component analysis (PCA) for dimensionality reduction and on clustering methods like k-means and hierarchical clustering. These techniques are essential for discovering patterns in unlabelled data, and the chapter serves as a good introduction.

I appreciate that the authors included a discussion on the practical challenges of unsupervised learning, such as determining the optimal number of clusters and interpreting PCA results. However, like previous chapters, the coverage is more cursory than in-depth. For example, more advanced techniques like t-SNE or DBSCAN, which would be useful for readers wanting to dive deeper into clustering and dimensionality reduction, are not included.
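To illustrate the "how many clusters?" question, here is a rough sketch of k-means in plain Python that reports the within-cluster sum of squares (WSS) for several values of k. It is my own simplified version, not the book's code: the three synthetic blobs are invented, and I use a deterministic farthest-first starting point instead of the random restarts a library implementation would typically use.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centroids(points, k):
    """Deterministic start: the first point, then repeatedly the point
    farthest from all centroids chosen so far (a simplification)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def k_means_wss(points, k, num_iterations=25):
    """Run basic Lloyd iterations and return the within-cluster sum of squares."""
    centroids = farthest_first_centroids(points, k)
    for _ in range(num_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Synthetic data: three well-separated 2-D blobs of 30 points each.
rng = random.Random(7)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for cx, cy in centers for _ in range(30)]

wss_by_k = {k: k_means_wss(points, k) for k in range(1, 5)}
for k, wss in wss_by_k.items():
    print(f"k={k}: WSS={wss:.1f}")
```

WSS drops sharply up to k=3 and then flattens; looking for that bend (the "elbow") is one common heuristic for picking k, though on messier real data the bend is often much less obvious.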

While the provided R and Python code snippets are helpful, they are somewhat basic and might not be sufficient for handling more complex real-world datasets, especially since the languages have evolved quite a bit since the time of the writing. Nonetheless, this chapter is a useful primer for someone just getting started with unsupervised learning, as I think it was intended.

Final Thoughts

Strengths

Overall, Practical Statistics for Data Scientists succeeds in its goal of providing a digestible and navigable reference for essential statistical concepts in data science. The book excels at breaking down complex topics into more manageable chunks and provides practical R and Python code snippets to demonstrate the methods discussed. For a data scientist who already has a foundation in statistics, this book can serve as a handy, go-to reference.

Weaknesses

However, the book’s high-level nature is both a strength and a limitation. While it provides a broad overview of important statistical techniques, it often lacks the depth needed to fully grasp more advanced topics. Readers looking for a comprehensive understanding of statistical methods or machine learning algorithms may need to supplement this book with more specialized and detailed resources. The authors partly make up for this by providing great suggestions for further reading.

Conclusion

In summary, Practical Statistics for Data Scientists is a valuable addition to a data scientist’s bookshelf, especially for those looking for a quick reference guide. It provides clear, concise explanations of key statistical concepts, but falls short if you’re aiming for in-depth mastery of the subject. As long as you recognize its limitations, this book will serve as a helpful companion in your data science journey.

Note for Spanish Speakers: There is a digital version written in Spanish. I was able to get a copy via Google Play Books.