Back Cover Blurb
- How can statistics help us understand the world?
- Can we come to reliable conclusions when data is imperfect?
- How is statistics changing in the age of data science?
- Sir David John Spiegelhalter is a British statistician and Chair of the Winton Centre for Risk and Evidence Communication in the Statistical Laboratory at the University of Cambridge. Spiegelhalter is one of the most cited and influential researchers in his field, and was elected as President of the Royal Statistical Society for 2017-18.
How can statistics help us understand the world? [1]
- Does going to University increase the risk of getting a brain tumour?
- An ambitious study conducted on over 4 million Swedish men and women whose tax and health records were linked over eighteen years enabled researchers to report that men with a higher socioeconomic position had a slightly increased rate of being diagnosed with a brain tumour.
- But did all that sweating in the library overheat the brain and lead to some strange cell mutations? The authors of the paper doubted it: ‘Completeness of cancer registration and detection bias are potential explanations for the findings.’ In other words, wealthy people with higher education are more likely to be diagnosed and get their tumour registered, an example of ascertainment bias.
- How many sexual partners have people in Britain really had?
- Plotting the responses from a recent UK survey revealed various features, including a (very) long tail, a tendency to use round numbers such as 10 and 20, and more partners reported by men than women. It is incredibly easy to just claim that what these respondents say accurately represents what is really going on in the country. Media surveys about sex, where people volunteer to say what they get up to behind closed doors, do this all the time.
- What is the risk of cancer from bacon sandwiches?
- An IARC report concluded that, normally, 6 in every 100 people who do not eat bacon daily would be expected to get bowel cancer. If 100 similar people ate a bacon sandwich every single day of their lives, the IARC would expect an 18% increase in cases of bowel cancer, i.e. a rise from 6 to 7 cases out of 100. That is one extra case in all those 100 lifetime bacon-eaters, which does not sound as impressive as the relative risk (an 18% increase) and might serve to put this hazard into perspective.
- Do busier hospitals have higher survival rates?
- There is considerable interest in the so-called ‘volume effect’ in surgery – the claim that busier hospitals get better survival rates, possibly since they achieve greater efficiency and have more experience.
- When considering English hospitals conducting children’s heart surgery in the 1990s, and plotting the number of cases against their survival, the high correlation showed that bigger hospitals were associated with lower mortality. But we cannot conclude that the higher survival rates were in any sense caused by the increased number of cases; in fact it could even be the other way round: better hospitals simply attracted more patients.
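The bacon-sandwich arithmetic above can be checked in a few lines. This is only a sketch using the figures quoted in the blurb (6 in 100 baseline, 18% relative increase); the variable names, and the "people per extra case" figure, are my own additions for illustration.

```python
# Figures taken from the bacon-sandwich example in the blurb.
baseline_risk = 6 / 100      # lifetime bowel-cancer risk without daily bacon
relative_increase = 0.18     # the 18% relative increase quoted from the IARC

# Risk for people who eat a bacon sandwich every day.
exposed_risk = baseline_risk * (1 + relative_increase)

# The absolute increase is the difference between the two risks.
absolute_increase = exposed_risk - baseline_risk

# Expected frequencies in a group of 100 people.
group_size = 100
extra_cases = absolute_increase * group_size

# Illustrative extra figure: how many daily bacon-eaters per one extra case.
people_per_extra_case = 1 / absolute_increase

print(f"risk with daily bacon: {exposed_risk:.2%}")
print(f"extra cases per {group_size} people: {extra_cases:.2f}")
print(f"daily bacon-eaters per extra case: {people_per_extra_case:.0f}")
```

The same 18% relative increase applied to a much rarer condition would give a far smaller absolute increase, which is why the absolute figure is the more informative one to report.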
List of Figures
List of Tables
Acknowledgements
Introduction
- Getting Things in Proportion: Categorical Data and Percentages
- Summarizing and Communicating Numbers. Lots of Numbers
- Why Are We Looking at Data Anyway? Populations and Measurement
- What Causes What?
- Modelling Relationships Using Regression
- Algorithms, Analytics and Prediction
- How Sure Can We Be About What Is Going On? Estimates and Intervals
- Probability - the Language of Uncertainty and Variability
- Putting Probability and Statistics Together
- Answering Questions and Claiming Discoveries
- Learning from Experience the Bayesian Way
- How Things Go Wrong
- How We Can Do Statistics Better
- In Conclusion
Glossary
Notes
Introduction
- Turning experiences into data is not straightforward, and data is inevitably limited in its capacity to describe the world.
- Statistical science has a long and successful history, but is now changing in the light of increased availability of data.
- Skill in statistical methods is an important part of being a data scientist.
- Teaching statistics is changing from a focus on mathematical methods to one based on an entire problem-solving cycle.
- The PPDAC cycle provides a convenient framework: Problem - Plan - Data - Analysis - Conclusion and communication.
- Data literacy is a key skill for the modern world.
Getting Things in Proportion: Categorical Data and Percentages
- Binary variables are yes/no questions, sets of which can be summarized as proportions.
- Positive or negative framing of proportions can change their emotional impact.
- Relative risks tend to convey an exaggerated importance, and absolute risks should be provided for clarity.
- Expected frequencies promote understanding and an appropriate sense of importance.
- Odds ratios arise from scientific studies but should not be used for general communication.
- Graphics need to be chosen with care and awareness of their impact.
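A small sketch of why odds ratios are awkward for general communication: for a rare outcome the odds ratio is close to the relative risk, but for a common outcome it looks much larger. The rare-outcome risks are the bacon figures from the blurb (6% and 7.08%); the common-outcome pair (60% and 70.8%) is hypothetical, chosen only to have the same relative risk.

```python
def relative_risk(p_exposed, p_unexposed):
    """Ratio of the two risks (probabilities)."""
    return p_exposed / p_unexposed


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(p_exposed, p_unexposed):
    """Ratio of the odds in the exposed and unexposed groups."""
    return odds(p_exposed) / odds(p_unexposed)


examples = [
    ("rare outcome (bacon figures)", 0.06, 0.0708),
    ("common outcome (hypothetical)", 0.60, 0.708),
]

for label, p_unexposed, p_exposed in examples:
    rr = relative_risk(p_exposed, p_unexposed)
    oratio = odds_ratio(p_exposed, p_unexposed)
    absolute = p_exposed - p_unexposed
    print(f"{label}: relative risk {rr:.2f}, odds ratio {oratio:.2f}, "
          f"absolute difference {absolute:.1%}")
```

Both rows have a relative risk of 1.18, yet the odds ratios differ, so an odds ratio reported on its own is easily misread as a relative risk.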
Summarizing and Communicating Numbers. Lots of Numbers
- A variety of statistics can be used to summarize the empirical distribution of data-points, including measures of location and spread.
- Skewed data distributions are common, and some summary statistics are very sensitive to outlying values.
- Data summaries always hide some detail, and care is required so that important information is not lost.
- Single sets of numbers can be visualized in strip-charts, box-and-whisker plots and histograms.
- Consider transformations to better reveal patterns, and use the eye to detect patterns, outliers, similarities and clusters.
- Look at pairs of numbers as scatter-plots, and time-series as line-graphs.
- When exploring data, a primary aim is to find factors that explain the overall variation.
- Graphics can be both interactive and animated.
- Infographics highlight interesting features and can guide the viewer through a story, but should be used with awareness of their purpose and their impact.
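To see how sensitive some summaries are to outlying values, here is a minimal sketch using Python's standard `statistics` module. The data are invented for illustration (a long-tailed set of counts with one extreme value) and are not taken from the book.

```python
import statistics

# Hypothetical, long-tailed data with one extreme value.
reported = [0, 1, 1, 2, 3, 3, 4, 5, 8, 10, 20, 500]

# Measures of location.
mean_all = statistics.mean(reported)
median_all = statistics.median(reported)

# A measure of spread that is robust to the extreme value.
q1, _, q3 = statistics.quantiles(reported, n=4)
iqr = q3 - q1

# Drop the single largest value and summarize again.
trimmed = sorted(reported)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with outlier:    mean {mean_all:.1f}, median {median_all}")
print(f"without outlier: mean {mean_trimmed:.1f}, median {median_trimmed}")
print(f"inter-quartile range (with outlier): {iqr:.1f}")
```

One extreme value moves the mean a long way while the median barely shifts, which is the reason the median is often preferred for skewed data.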
Why Are We Looking at Data Anyway? Populations and Measurement
- Inductive inference requires working from our data, through study sample and study population, to a target population.
- Problems and biases can crop up at each stage of this path.
- The best way to proceed from sample to study population is to have drawn a random sample.
- A population can be thought of as a group of individuals, but also as providing the probability distribution for a random observation drawn from that population.
- Populations can be summarized using parameters that mirror the summary statistics of sample data.
- Often data does not arise as a sample from a literal population. When we have all the data there is, then we can imagine it drawn from a metaphorical population of events that could have occurred, but didn’t.
What Causes What?
- Causation, in the statistical sense, means that when we intervene, the chances of different outcomes are systematically changed.
- Causation is difficult to establish statistically, but well-designed randomized trials are the best available framework.
- Principles of blinding, intention-to-treat and so on have enabled large-scale clinical trials to identify moderate but important effects.
- Observational data may have background factors influencing the apparent observed relationships between an exposure and an outcome, which may be either observed confounders or lurking factors.
- Statistical methods exist for adjusting for other factors, but judgement is always required as to the confidence with which causation can be claimed.
Modelling Relationships Using Regression
- Regression models provide a mathematical representation between a set of explanatory variables and a response variable.
- The coefficients in a regression model indicate how much we expect the response to change when the explanatory variable is observed to change.
- Regression-to-the-mean occurs when more extreme responses revert to nearer the long-term average, since a contribution to their previous extremeness was pure chance.
- Regression models can incorporate different types of response variable, explanatory variables and non-linear relationships.
- Caution is required in interpreting models, which should not be taken too literally: ‘All models are wrong, but some are useful.’
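As a rough illustration of what a regression coefficient is, the sketch below fits a straight line by ordinary least squares. The paired heights are hypothetical numbers I made up (parent and adult-child heights in cm), and `fit_line` is my own helper rather than anything from the book.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical paired data: parent height and adult-child height, in cm.
parent = [160, 165, 170, 175, 180, 185]
child = [166, 168, 171, 173, 176, 178]

intercept, slope = fit_line(parent, child)
predicted_for_tall_parent = intercept + slope * 185

print(f"slope: {slope:.3f} cm of child height per cm of parent height")
print(f"predicted child height for a 185 cm parent: {predicted_for_tall_parent:.1f} cm")
```

With these invented numbers the slope is below 1, so an unusually tall parent is predicted to have a child nearer the average, which is the regression-to-the-mean pattern. The slope describes how the response differs when the explanatory variable is observed to differ; it says nothing on its own about what would happen if we intervened.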
Algorithms, Analytics and Prediction
- Algorithms built from data can be used for classification and prediction in technological applications.
- It is important to guard against over-fitting an algorithm to training data, essentially fitting to noise rather than signal.
- Algorithms can be evaluated by the classification accuracy, their ability to discriminate between groups, and their overall predictive accuracy.
- Complex algorithms may lack transparency, and it may be worth trading off some accuracy for comprehension.
- The use of algorithms and artificial intelligence presents many challenges, and insight into both the power and limitations of machine-learning methods is vital.
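A sketch of how classification accuracy and the ability to discriminate between groups can tell different stories. The counts are hypothetical (1,000 cases, 100 of them truly positive), and `classification_summary` is an illustrative helper of my own.

```python
def classification_summary(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity from a 2x2 table of counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # proportion classified correctly
        "sensitivity": tp / (tp + fn),   # true positives correctly flagged
        "specificity": tn / (tn + fp),   # true negatives correctly cleared
    }


# Hypothetical algorithm evaluated on 1,000 cases, 100 of them positive.
algorithm = classification_summary(tp=60, fp=50, fn=40, tn=850)

# Trivial rule that predicts "negative" for every case.
always_negative = classification_summary(tp=0, fp=0, fn=100, tn=900)

for name, summary in [("algorithm", algorithm), ("always negative", always_negative)]:
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

The trivial rule is 90% accurate but identifies no positive cases at all, so accuracy alone is a poor guide when one group is much rarer than the other.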
How Sure Can We Be About What Is Going On? Estimates and Intervals
- Uncertainty intervals are an important part of communicating statistics.
- Bootstrapping a sample consists of creating new data sets of the same size by resampling the original data, with replacement.
- Sample statistics calculated from bootstrap resamples tend towards a normal distribution for larger data sets, regardless of the shape of the original data distribution.
- Uncertainty intervals based on bootstrapping take advantage of modern computer power, do not require assumptions about the mathematical form of the population and do not require complex probability theory.
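The bootstrap procedure summarized above can be sketched with the standard library alone. This is a simple percentile interval, one of several bootstrap variants; the data, the number of resamples and the fixed seed are assumptions made for the example.

```python
import random
import statistics


def bootstrap_interval(data, stat=statistics.mean, n_resamples=2000,
                       level=0.95, seed=0):
    """Percentile bootstrap interval for a summary statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement, same size as the original data set.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Hypothetical skewed sample.
sample = [0, 1, 1, 2, 3, 3, 4, 5, 8, 10, 20, 50]

lower, upper = bootstrap_interval(sample)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Passing `stat=statistics.median` gives an interval for the median instead, with no change to the resampling logic.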
Probability - the Language of Uncertainty and Variability
- The theory of probability provides a formal language and mathematics for dealing with chance phenomena.
- The implications of probability are not intuitive, but insights can be improved by using the idea of expected frequencies.
- The ideas of probability are useful even when there is no explicit use of a randomizing mechanism.
- Many social phenomena show a remarkable regularity in their overall pattern, while individual events are entirely unpredictable.
Putting Probability and Statistics Together
- Probability theory can be used to derive the sampling distribution of summary statistics, from which formulae for confidence intervals can be derived.
- A 95% confidence interval is the result of a procedure that, in 95% of cases in which its assumptions are correct, will contain the true parameter value. It cannot be claimed that a specific interval has 95% probability of containing the true value.
- The Central Limit Theorem implies that sample means and other summary statistics can be assumed to have a normal distribution for large samples.
- Margins of error usually do not incorporate systematic error due to non-random causes – external knowledge and judgement are required to assess these.
- Confidence intervals can be calculated even when we observe all the data, which then represent uncertainty about the parameters of an underlying metaphorical population.
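A minimal sketch of a confidence interval based on the normal approximation, plus a small simulation of what "95% of cases" means. The poll figures (520 'yes' answers out of 1,000), the simulation settings and both function names are hypothetical choices for illustration.

```python
import math
import random
from statistics import NormalDist


def proportion_interval(successes, n, level=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def coverage(true_p=0.5, n=400, n_polls=1000, seed=1):
    """Fraction of simulated polls whose interval contains true_p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_polls):
        successes = sum(rng.random() < true_p for _ in range(n))
        lower, upper = proportion_interval(successes, n)
        hits += lower <= true_p <= upper
    return hits / n_polls


# Hypothetical poll: 520 of 1,000 respondents say 'yes'.
lower, upper = proportion_interval(520, 1000)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
print(f"margin of error: +/- {(upper - lower) / 2:.3f}")
print(f"simulated coverage: {coverage():.3f}")
```

The coverage figure describes the long-run behaviour of the procedure across many polls; it is not a probability statement about any single interval. The margin of error here reflects sampling variation only, so any systematic bias in who answers the poll is not captured.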
Answering Questions and Claiming Discoveries
- Tests of null hypotheses - default assumptions about statistical models - form a major part of statistical practice.
- A P-value is a measure of the incompatibility between the observed data and a null hypothesis: formally it is the probability of observing such an extreme result, were the null hypothesis true.
- Traditionally, P-value thresholds of 0.05 and 0.01 have been set to declare ‘statistical significance’.
- These thresholds need to be adjusted if multiple tests are conducted, for example on different subsets of the data or multiple outcome measures.
- There is a precise correspondence between confidence intervals and P-values: if, say, the 95% interval excludes 0, we can reject the null hypothesis of 0 at P<0.05.
- Neyman-Pearson theory specifies an alternative hypothesis, and fixes Type I and Type II error rates for the two possible kinds of errors in a hypothesis test.
- Separate forms of hypothesis tests have been developed for sequential testing.
- P-values are often misinterpreted: in particular they do not convey the probability that the null hypothesis is true, nor does a non-significant result imply that the null hypothesis is true.
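To make the definition of a P-value concrete, here is a sketch of an exact two-sided binomial test together with a Bonferroni adjustment, which is one simple way of adjusting thresholds for multiple tests. The coin-flip data (15 heads in 20 flips) and the assumption of five tests are hypothetical.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided P-value: total probability, under the null, of all
    outcomes that are no more likely than the one observed."""
    probs = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = probs[successes]
    # Small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# Hypothetical data: 15 heads in 20 flips; null hypothesis is a fair coin.
p_value = binomial_two_sided_p(15, 20)

# Bonferroni adjustment if this were one of five tests on the same data.
n_tests = 5
adjusted_threshold = 0.05 / n_tests

print(f"P-value: {p_value:.4f}")
print(f"significant at 0.05: {p_value < 0.05}")
print(f"significant at the adjusted threshold {adjusted_threshold}: "
      f"{p_value < adjusted_threshold}")
```

The P-value is calculated assuming the null hypothesis is true, so it cannot be read as the probability that the coin is fair.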
- I was alerted to the book via excerpts in Aeon [3].
- While this is a new paperback, it's a fairly horrible edition – small and bound so that any attempt to open it flat risks snapping the spine.
- Having said that, it’s more robust than expected as I’ve successfully cc’d most of the Chapter Summaries. I expect it’ll get more brittle with age.
- From what I’ve read, the text is much better than the fabric. However, it can’t really be read like a novel. It’ll certainly – for me – require a second reading.
In-Page Footnotes ("Spiegelhalter (David) - The Art of Statistics: Learning from Data")
- “From the Publisher” – via Amazon.
- I’m adding these as I read the Chapters, though I had to do a catch-up after Chapter 4.
- I ought also to note any interesting snippets …
- One that immediately comes to mind is that the statistics for the effect of statins are based on prescription rather than actual use; i.e. they may appear less effective because some are not taken. There may be a 50% reduction in CHD rather than the published 25%.
Pelican (13 Feb. 2020)
Text by me © Theo Todman, 2022; text by correspondent(s) or other author(s) © the author(s)