Statistics
Start with the simplest version of the idea. If you can explain it to a friend in everyday language, with an example and one clear reason why it matters, you have moved from memorizing to understanding.
Statistics is the science of data: collecting, organizing, analyzing, and interpreting it. While a single data point tells you one story, statistics reveals patterns hidden within collections of data. This chapter introduces measures of central tendency (mean, median, mode) that describe where data clusters, and measures of dispersion (variance, standard deviation) that describe how spread out it is. Understanding statistics is essential for science (designing experiments), business (market analysis), medicine (clinical trials), and informed citizenship (interpreting news and policy claims). Statistics transforms raw data into actionable insights.
Data: Raw Material for Statistics
Data comes from observations or measurements. A dataset is a collection of data values, often organized in a table or list.
Types of data:
- Qualitative: Descriptive (colors, categories)
- Quantitative: Numerical (ages, heights, test scores)
- Discrete: Quantitative data that is countable (number of students)
- Continuous: Quantitative data measurable across a range (height, weight)
A population is the entire set of objects of interest. A sample is a subset of the population studied to estimate population characteristics.
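The population/sample distinction can be sketched in a few lines of Python. The scores below are randomly generated placeholders, not real data, and the names `population` and `sample` are chosen only for this illustration.

```python
import random
import statistics

# Hypothetical population: made-up test scores for 1,000 students.
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.randint(0, 100) for _ in range(1000)]

# A sample is a subset of the population: here, 50 scores drawn
# without replacement.
sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The sample mean will usually land near the population mean without matching it exactly, which is the sense in which a sample estimates a population characteristic.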
Frequency Distributions
Organizing data into a frequency distribution makes patterns visible.
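Create a table with class intervals (ranges) and count how many data points fall in each interval. Because neighboring intervals share an endpoint, fix a convention so that each score is counted once: here each interval includes its lower bound and excludes its upper bound, except the last, which includes 100. For test scores 0-100: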
- 0-20: 2 students
- 20-40: 5 students
- 40-60: 8 students
- 60-80: 10 students
- 80-100: 5 students
The frequency is the count. The relative frequency is the proportion (count/total). A histogram visualizes this as a bar chart with class intervals on the x-axis and frequencies on the y-axis.
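As a rough sketch, the relative frequencies for the table above can be computed as follows. The counts are copied from the table; the text "histogram" is only a stand-in for a real chart.

```python
# Class intervals and counts copied from the frequency table above.
frequencies = {
    "0-20": 2,
    "20-40": 5,
    "40-60": 8,
    "60-80": 10,
    "80-100": 5,
}

total = sum(frequencies.values())  # 30 students in all

# Relative frequency = count / total.
relative = {interval: count / total for interval, count in frequencies.items()}

for interval, count in frequencies.items():
    bar = "#" * count  # crude text histogram: one '#' per student
    print(f"{interval:>6}  {count:>2}  {relative[interval]:.3f}  {bar}")
```

The relative frequencies always sum to 1, which makes them a quick check that no data point was dropped or counted twice.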
Measures of Central Tendency
These numbers summarize where the center of the data lies.
Mean (Average): The sum of all values divided by the count.
Mean = Σx / n
For data {2, 4, 6, 8}, mean = (2+4+6+8)/4 = 5
The mean is sensitive to outliers. One extremely large value pulls it up.
Median: The middle value when data is arranged in order.
For {2, 4, 6, 8}, median = (4+6)/2 = 5 (average of two middle values for even-sized sets)
The median is robust: an outlier shifts it far less than it shifts the mean.
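To see the difference in sensitivity, here is a small sketch using Python's standard `statistics` module. The outlier value 100 is a hypothetical addition to the example data, chosen only to make the effect visible.

```python
import statistics

data = [2, 4, 6, 8]
with_outlier = data + [100]  # hypothetical extreme value, for illustration

mean_before = statistics.mean(data)              # 5
median_before = statistics.median(data)          # 5.0
mean_after = statistics.mean(with_outlier)       # 24
median_after = statistics.median(with_outlier)   # 6

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

One extreme value moves the mean from 5 to 24, while the median only moves from 5 to 6.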
Mode: The value that appears most frequently.
For {2, 2, 4, 6, 8, 8, 8}, mode = 8 (appears 3 times)
Data can have one mode (unimodal), two (bimodal), or no mode if no value repeats.
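The mode cases can be checked with the standard `statistics` module (`multimode` needs Python 3.8 or later). The `bimodal` list is a made-up example. Note one difference in convention: when no value repeats, `multimode` returns every value rather than reporting "no mode", so that case needs its own interpretation.

```python
import statistics

unimodal = [2, 2, 4, 6, 8, 8, 8]
bimodal = [1, 1, 2, 3, 3]      # hypothetical data with two modes
no_repeats = [2, 4, 6, 8]

single_mode = statistics.mode(unimodal)          # 8
two_modes = statistics.multimode(bimodal)        # [1, 3]
all_tied = statistics.multimode(no_repeats)      # [2, 4, 6, 8]

print(single_mode)
print(two_modes)
print(all_tied)

# Under this lesson's convention, a dataset where every value is tied
# (each appears once) is described as having no mode.
has_no_mode = len(all_tied) == len(set(no_repeats))
print(has_no_mode)
```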
Measures of Dispersion
These measure how spread out the data is around the center.
Range: Difference between maximum and minimum values.
For {2, 4, 6, 8}, range = 8 - 2 = 6
Simple but sensitive to outliers.
Variance (σ²): Average of squared deviations from the mean.
Variance = Σ(x - mean)² / n
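For each value, calculate how far it is from the mean, square it, then average those squares. Dividing by n gives the population variance; when the data is a sample used to estimate a population's variance, divide by n - 1 instead (the sample variance, s²).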
For {2, 4, 6, 8} with mean = 5: Variance = [(2-5)² + (4-5)² + (6-5)² + (8-5)²] / 4 = [9 + 1 + 1 + 9] / 4 = 5
Standard Deviation (σ): The square root of variance.
σ = √Variance
For the example above: σ = √5 ≈ 2.24
Standard deviation is in the original units of measurement, making it more interpretable than variance. A small standard deviation means data is tightly clustered around the mean.
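The range, variance, and standard deviation calculations above can be reproduced step by step. This sketch follows the divide-by-n formula used in this section and cross-checks it against the standard library's `pvariance` and `pstdev`; `statistics.variance` is shown only for contrast, since it divides by n - 1.

```python
import math
import statistics

data = [2, 4, 6, 8]
n = len(data)
mean = sum(data) / n                                # 5.0

data_range = max(data) - min(data)                  # 6

# Population variance: average of squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in data) / n   # 5.0
std_dev = math.sqrt(variance)                       # about 2.24

print(f"range = {data_range}, variance = {variance}, sd = {std_dev:.2f}")

# Cross-check with the standard library's population versions.
print(math.isclose(variance, statistics.pvariance(data)))
print(math.isclose(std_dev, statistics.pstdev(data)))

# For contrast: the sample variance divides by n - 1, giving 20/3.
sample_variance = statistics.variance(data)
print(f"sample variance = {sample_variance:.3f}")
```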
Coefficient of Variation
When comparing datasets with different scales, the coefficient of variation (CV) is useful:
CV = (Standard Deviation / Mean) × 100%
This expresses dispersion as a percentage of the mean, allowing comparison across different units.
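A minimal sketch of the comparison, assuming two small made-up datasets in different units (heights in cm and weights in kg). The helper name `coefficient_of_variation` is introduced here for illustration, and it uses the population standard deviation to match the formulas in this chapter.

```python
import statistics

def coefficient_of_variation(values):
    """Return the CV as a percentage: (population sd / mean) * 100."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.pstdev(values) / mean * 100

# Hypothetical measurements, invented for this example.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

print(f"heights: CV = {cv_heights:.1f}%")
print(f"weights: CV = {cv_weights:.1f}%")
```

Comparing the raw standard deviations (in cm versus kg) would be meaningless, whereas the two percentages can be compared directly: in this invented data the weights vary far more relative to their mean than the heights do.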
Probability and the Normal Distribution
The normal distribution (bell curve) is ubiquitous in nature. Many phenomena—heights, test scores, measurement errors—follow it approximately.
The normal distribution is characterized by:
- Symmetry: Mean = median = mode
- 68-95-99.7 Rule: About 68% of data falls within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations
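The 68-95-99.7 percentages can be checked numerically with `statistics.NormalDist` from the standard library (Python 3.8 or later). The helper `share_within` is a name made up for this sketch; it measures the area under a standard normal curve between -k and +k standard deviations.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The results come out close to 0.68, 0.95, and 0.997, which is where the rule's rounded figures come from.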
Correlation and Relationship Between Variables
Correlation measures how strongly, and in which direction, two variables move together in a straight-line (linear) relationship.
The correlation coefficient (r) ranges from -1 to +1:
- r = +1: Perfect positive correlation (one increases with the other)
- r = 0: No linear correlation (a curved relationship can still exist)
- r = -1: Perfect negative correlation (one decreases as the other increases)
Important: Correlation does not imply causation. Two variables might move together because of a third variable or pure chance.
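The text does not spell out how r is computed. A common choice is Pearson's formula, sketched below as one possible implementation. The function name `pearson_r` and all the data are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # perfect positive: r = 1
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # perfect negative: r = -1

# y = x squared on a symmetric range: clearly related, yet r = 0,
# because the relationship is not a straight line.
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)
```

The last case is a reminder that r close to 0 rules out a straight-line relationship only, and none of these values says anything about which variable, if either, causes the other.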
Real-World Applications
Medicine: Clinical trials use statistics to test drug effectiveness.
Quality Control: Manufacturing monitors product consistency using statistical samples.
Economics: Unemployment rates, inflation, and GDP growth are statistical measures guiding policy.
Psychology: Research findings rely on statistical significance testing.
Sports: Advanced analytics reveal player performance patterns.
Key Formulas
- Mean: x̄ = Σx / n
- Variance: σ² = Σ(x - x̄)² / n
- Standard Deviation: σ = √Variance
- Coefficient of Variation: CV = (σ / x̄) × 100%
- 68-95-99.7 Rule: In a normal distribution, about 68%, 95%, and 99.7% of data lie within 1, 2, and 3 standard deviations of the mean
Socratic Questions
- Why is the median often preferred over the mean for describing typical values when data contains outliers? Can you construct an example where they differ dramatically?
- Variance is the average of squared deviations from the mean. Why square the deviations instead of using absolute values? What would be lost if we used |x - mean| instead?
- The standard deviation measures spread. If one dataset has σ = 2 and another has σ = 10, what can you conclude about which is more clustered? How would this affect predictions?
- The correlation coefficient measures association between variables, but "correlation is not causation." Can you think of two variables that are correlated but not causally related? What other factors might explain their association?
- The normal distribution describes many real phenomena. Why is this distribution so common? What fundamental principles in nature lead different independent random processes to produce bell-shaped patterns?
