This lesson covers inferential statistics, which allow us to make predictions and draw conclusions about populations based on sample data.
Population: The complete set of individuals, items, or data points we want to study.
Sample: A subset of the population used to make inferences about the entire population.
Parameter: A numerical characteristic of a population (e.g., population mean μ).
Statistic: A numerical characteristic of a sample (e.g., sample mean x̄).
Practical Reasons: Studying an entire population is usually too expensive, too slow, or outright impossible (e.g., destructive testing of light bulbs), so we measure a sample and infer from it.
Example Scenarios:
Population: All customers of an e-commerce site
Sample: 1,000 randomly selected customers
Goal: Estimate average customer satisfaction score
Population: All manufactured light bulbs
Sample: 100 bulbs tested for lifespan
Goal: Estimate average bulb lifespan
Probability Sampling:
Simple Random Sampling
SELECT * FROM customers ORDER BY RANDOM() LIMIT 1000;
Stratified Sampling
-- Sample by region proportionally
SELECT * FROM customers
WHERE region = 'North' ORDER BY RANDOM() LIMIT 300
UNION ALL
SELECT * FROM customers
WHERE region = 'South' ORDER BY RANDOM() LIMIT 500
UNION ALL
SELECT * FROM customers
WHERE region = 'East' ORDER BY RANDOM() LIMIT 200;
Cluster Sampling
Divide the population into natural clusters (e.g., stores, cities), randomly select whole clusters, and measure every member of the chosen clusters.
Non-Probability Sampling:
Convenience, judgment, quota, and snowball sampling; these are easier to run, but because selection probabilities are unknown, they do not support valid statistical inference.
Definition: The probability distribution of a given statistic based on a random sample.
Central Limit Theorem: As sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the shape of the population distribution (provided it has finite variance).
Key Properties:
The mean of the sampling distribution equals the population mean μ.
Its standard deviation (the standard error) shrinks as the sample size grows.
Its shape approaches normal as n increases, per the Central Limit Theorem.
Standard Error Formula: $$SE = \frac{\sigma}{\sqrt{n}}$$
Where:
σ = population standard deviation (estimated by the sample standard deviation s when unknown)
n = sample size
SQL Example:
-- Calculate standard error of mean
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_survey_sample
)
SELECT
sample_std / SQRT(sample_size) as standard_error
FROM sample_stats;
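As a quick cross-check on the formula, here is a minimal Python sketch; the values 0.8 and 100 are the lesson's worked-example standard deviation and sample size, not real survey data.

```python
import math

def standard_error(sample_std: float, n: int) -> float:
    """SE = s / sqrt(n): estimated std dev of the sampling distribution of the mean."""
    return sample_std / math.sqrt(n)

# Worked-example values from this lesson: s = 0.8, n = 100
print(round(standard_error(0.8, 100), 4))
```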
Definition: A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
Interpretation: Strictly, if we repeated the sampling procedure many times, X% of the intervals constructed this way would contain the true population parameter; informally, we are X% confident that the parameter lies within this interval.
Components:
For Population Mean (known σ): $$CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$
For Population Mean (unknown σ): $$CI = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$
For Population Proportion: $$CI = \hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Where:
x̄ = sample mean, s = sample standard deviation, σ = population standard deviation
p̂ = sample proportion, n = sample size
z, t = critical values for the chosen confidence level
Common Confidence Levels and Z-values:
90% → z = 1.645
95% → z = 1.960
99% → z = 2.576
Example Calculation:
Sample: 100 customers, mean satisfaction = 4.2, std dev = 0.8
95% CI for population mean:
Standard Error = 0.8 / √100 = 0.08
Margin of Error = 1.96 × 0.08 = 0.157
CI = 4.2 ± 0.157 = [4.043, 4.357]
Interpretation: We are 95% confident that true mean satisfaction
is between 4.043 and 4.357.
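The arithmetic above can be reproduced with a few lines of Python (a sketch using the z-based formula; the inputs are the worked-example numbers):

```python
import math

def mean_ci(mean: float, std: float, n: int, z: float = 1.96):
    """z-based confidence interval for a mean: mean ± z * s / sqrt(n)."""
    margin = z * std / math.sqrt(n)
    return mean - margin, mean + margin

lo, hi = mean_ci(4.2, 0.8, 100)    # the lesson's worked example
print(round(lo, 3), round(hi, 3))  # matches [4.043, 4.357]
```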
-- Calculate confidence interval for mean satisfaction
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_survey_sample
),
confidence_interval AS (
SELECT
sample_mean,
sample_std,
sample_size,
sample_std / SQRT(sample_size) as standard_error,
1.96 * (sample_std / SQRT(sample_size)) as margin_of_error_95,
sample_mean - 1.96 * (sample_std / SQRT(sample_size)) as lower_bound_95,
sample_mean + 1.96 * (sample_std / SQRT(sample_size)) as upper_bound_95
FROM sample_stats
)
SELECT
sample_mean,
standard_error,
margin_of_error_95,
lower_bound_95,
upper_bound_95,
'95% CI: [' || ROUND(lower_bound_95, 3) || ', ' || ROUND(upper_bound_95, 3) || ']' as confidence_interval
FROM confidence_interval;
Trade-off Example:
Same data, different confidence levels:
90% CI: [4.068, 4.332] (width = 0.264)
95% CI: [4.043, 4.357] (width = 0.314)
99% CI: [3.994, 4.406] (width = 0.412)
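The widening can be reproduced directly from the critical values (a sketch using Python's `statistics.NormalDist`; widths are computed from unrounded z-values, so the last digit may differ slightly from the rounded bounds above):

```python
from statistics import NormalDist

se = 0.08  # standard error from the worked example (s = 0.8, n = 100)
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-tailed critical value
    width = 2 * z * se
    print(f"{level:.0%} CI width: {width:.3f}")
```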
Definition: A statistical method used to make decisions about population parameters based on sample data.
Purpose: To determine whether there is enough evidence to reject a claim about a population.
Null Hypothesis (H₀): The default assumption or claim to be tested.
Alternative Hypothesis (H₁ or Hₐ): The claim we're testing against the null.
Test Statistic: A value calculated from sample data used to make decisions.
Significance Level (α): Probability of rejecting H₀ when it's actually true.
One-Tailed Tests:
Two-Tailed Tests:
Example Setup:
Scenario: Testing if new website design increases conversion rate
One-tailed test:
H₀: μ_new ≤ μ_old (new design is not better)
H₁: μ_new > μ_old (new design is better)
Two-tailed test:
H₀: μ_new = μ_old (no difference)
H₁: μ_new ≠ μ_old (there is a difference)
Step 1: State Hypotheses
H₀: μ = 50 (population mean is 50)
H₁: μ ≠ 50 (population mean is not 50)
Step 2: Choose Significance Level
α = 0.05 (5% significance level)
Step 3: Calculate Test Statistic
Sample: n = 25, x̄ = 52, s = 8
t = (x̄ - μ₀) / (s / √n)
t = (52 - 50) / (8 / √25) = 2 / 1.6 = 1.25
Step 4: Determine Critical Value
Degrees of freedom = n - 1 = 24
Critical t-value (two-tailed, α = 0.05) = ±2.064
Step 5: Make Decision
|Test statistic| (1.25) < Critical value (2.064)
Fail to reject H₀
Step 6: Interpret Results
Conclusion: Insufficient evidence to conclude population mean differs from 50.
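Steps 3-5 above are a few lines of arithmetic; here is a minimal Python sketch of them (the critical value 2.064 comes from a t-table, as in Step 4):

```python
import math

# Worked-example inputs: n = 25, sample mean 52, sample std 8, H0 mean 50
n, x_bar, s, mu0 = 25, 52, 8, 50
t = (x_bar - mu0) / (s / math.sqrt(n))  # t = 1.25
critical = 2.064  # two-tailed critical t for df = 24, alpha = 0.05 (from a t-table)
print("reject H0" if abs(t) > critical else "fail to reject H0")
```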
Type I Error (α): Rejecting H₀ when it's actually true.
Type II Error (β): Failing to reject H₀ when it's actually false.
Power (1 - β): Probability of correctly rejecting H₀ when it's false.
Error Trade-off:
Decreasing α (more stringent) → Increases β (decreases power)
Increasing α (less stringent) → Decreases β (increases power)
Definition: The probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
Interpretation: Small p-values provide evidence against the null hypothesis.
Decision Rule:
If p-value ≤ α, reject H₀; if p-value > α, fail to reject H₀.
For t-test:
-- Calculate p-value for two-tailed t-test
WITH test_stats AS (
SELECT
AVG(conversion_rate) as sample_mean,
STDDEV(conversion_rate) as sample_std,
COUNT(*) as sample_size,
50 as hypothesized_mean
FROM ab_test_results
WHERE test_group = 'treatment'
),
t_statistic AS (
SELECT
sample_mean,
sample_std,
sample_size,
(sample_mean - hypothesized_mean) / (sample_std / SQRT(sample_size)) as t_value,
sample_size - 1 as degrees_of_freedom
FROM test_stats
)
SELECT
t_value,
degrees_of_freedom,
-- For a two-tailed test, double the upper-tail probability;
-- CUM_DIST_T is a placeholder for a Student's t CDF (typically a UDF)
2 * (1 - CUM_DIST_T(ABS(t_value), degrees_of_freedom)) as p_value
FROM t_statistic;
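The SQL above leans on a hypothetical `CUM_DIST_T` function, since most engines do not ship a t-distribution CDF. As a sanity check, a two-tailed p-value can be computed from the Student's t density directly; the sketch below uses simple trapezoidal integration (real code would use `scipy.stats.t.sf`):

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Student's t probability density function."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t: float, df: int, steps: int = 10_000) -> float:
    """p = 1 - P(-|t| < T < |t|); the central mass is found by the trapezoidal rule."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps
    area = 0.5 * (t_pdf(a, df) + t_pdf(b, df))
    for i in range(1, steps):
        area += t_pdf(a + i * h, df)
    return 1 - area * h

# Worked example: t = 1.25 with df = 24 is not significant at alpha = 0.05
print(round(two_tailed_p(1.25, 24), 4))
```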
Myth 1: "p-value is the probability that H₀ is true"
Myth 2: "p-value is the probability of being wrong"
Myth 3: "Small p-value means large effect"
Myth 4: "p = 0.05 is magical"
p-value Guidelines (conventions, not hard rules):
p < 0.01: strong evidence against H₀
0.01 ≤ p < 0.05: moderate evidence against H₀
p ≥ 0.05: weak or no evidence against H₀
Example Interpretation:
A/B Test Results:
- Control conversion: 3.2%
- Treatment conversion: 3.8%
- p-value: 0.023
- α = 0.05
Interpretation:
- Statistically significant difference (p < 0.05)
- Evidence suggests treatment improves conversion
- Practical significance: 0.6% absolute improvement
Decision Tree:
What type of data?
├── Categorical
│ ├── One sample → Chi-square goodness of fit
│ ├── Two samples → Chi-square test of independence
│ └── More than two → Chi-square test of independence
├── Ordinal
│ ├── Two samples → Mann-Whitney U test
│ ├── Paired samples → Wilcoxon signed-rank test
│ └── More than two → Kruskal-Wallis test
└── Continuous (Interval/Ratio)
├── One sample → One-sample t-test
├── Two independent samples → Independent t-test
├── Two paired samples → Paired t-test
└── More than two → ANOVA
Purpose: Test if population mean differs from a known value.
When to Use:
Example: Testing if average customer satisfaction differs from target of 4.0
-- One-sample t-test implementation
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_satisfaction
),
t_test AS (
SELECT
sample_mean,
sample_std,
sample_size,
4.0 as hypothesized_mean,
(sample_mean - 4.0) / (sample_std / SQRT(sample_size)) as t_statistic,
sample_size - 1 as degrees_of_freedom
FROM sample_stats
)
SELECT
sample_mean,
t_statistic,
degrees_of_freedom,
-- CUM_DIST_T: placeholder for a Student's t CDF (typically a UDF)
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant difference from 4.0'
ELSE 'No significant difference from 4.0'
END as conclusion
FROM t_test;
Purpose: Test if means of two independent groups differ.
When to Use:
Example: Comparing test scores between two teaching methods
-- Independent two-sample t-test
WITH group_stats AS (
SELECT
teaching_method,
AVG(test_score) as mean_score,
STDDEV(test_score) as std_score,
COUNT(*) as sample_size,
VARIANCE(test_score) as variance
FROM student_scores
GROUP BY teaching_method
),
pooled_variance AS (
SELECT
g1.teaching_method as group1,
g2.teaching_method as group2,
g1.mean_score as mean1,
g2.mean_score as mean2,
((g1.sample_size - 1) * g1.variance + (g2.sample_size - 1) * g2.variance) /
(g1.sample_size + g2.sample_size - 2) as pooled_var,
g1.sample_size as n1,
g2.sample_size as n2
FROM group_stats g1, group_stats g2
WHERE g1.teaching_method = 'traditional' AND g2.teaching_method = 'modern'
),
t_test AS (
SELECT
group1,
group2,
mean1,
mean2,
pooled_var,
n1,
n2,
(mean1 - mean2) / SQRT(pooled_var * (1.0/n1 + 1.0/n2)) as t_statistic,  -- 1.0 avoids integer division
n1 + n2 - 2 as degrees_of_freedom
FROM pooled_variance
)
SELECT
group1,
group2,
mean1,
mean2,
ABS(mean1 - mean2) as mean_difference,
t_statistic,
degrees_of_freedom,
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,  -- CUM_DIST_T: placeholder t CDF
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant difference between groups'
ELSE 'No significant difference between groups'
END as conclusion
FROM t_test;
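The pooled-variance computation in the SQL can be mirrored in a short Python sketch (the score lists are hypothetical stand-ins for the `student_scores` table):

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Independent two-sample t statistic with pooled variance (equal-variance form)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

# Hypothetical test scores for the two teaching methods
traditional = [70, 72, 68, 75, 71]
modern = [78, 74, 80, 76, 77]
t, df = pooled_t(traditional, modern)
print(round(t, 3), df)
```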
Purpose: Test if means of paired observations differ.
When to Use:
Example: Testing weight loss program effectiveness
-- Paired t-test for before/after measurements
WITH paired_data AS (
SELECT
participant_id,
before_weight,
after_weight,
before_weight - after_weight as difference  -- positive difference = weight lost
FROM weight_loss_program
),
t_test AS (
SELECT
AVG(difference) as mean_difference,
STDDEV(difference) as std_difference,
COUNT(*) as sample_size,
AVG(difference) / (STDDEV(difference) / SQRT(COUNT(*))) as t_statistic,
COUNT(*) - 1 as degrees_of_freedom
FROM paired_data
)
SELECT
mean_difference,
t_statistic,
degrees_of_freedom,
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,  -- CUM_DIST_T: placeholder t CDF
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05 AND mean_difference > 0
THEN 'Significant weight loss'
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant weight gain'
ELSE 'No significant weight change'
END as conclusion
FROM t_test;
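The same test reduces to a one-sample t-test on the per-participant differences, as this Python sketch shows (the weights are hypothetical):

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic on per-subject differences (positive t = values decreased)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical before/after weights for three participants
before = [80.0, 82.0, 78.0]
after = [78.0, 81.0, 77.0]
t, df = paired_t(before, after)
print(round(t, 2), df)
```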
Purpose: Test if two categorical variables are independent.
When to Use:
Example: Testing relationship between gender and product preference
-- Chi-square test of independence
WITH observed_counts AS (
SELECT
gender,
product_preference,
COUNT(*) as observed_count
FROM customer_survey
GROUP BY gender, product_preference
),
row_totals AS (
SELECT
gender,
SUM(observed_count) as row_total
FROM observed_counts
GROUP BY gender
),
col_totals AS (
SELECT
product_preference,
SUM(observed_count) as col_total
FROM observed_counts
GROUP BY product_preference
),
grand_total AS (
SELECT SUM(observed_count) as total_count
FROM observed_counts
),
expected_counts AS (
SELECT
o.gender,
o.product_preference,
o.observed_count,
(r.row_total * c.col_total / g.total_count) as expected_count,
POWER(o.observed_count - (r.row_total * c.col_total / g.total_count), 2) /
(r.row_total * c.col_total / g.total_count) as chi_square_contribution
FROM observed_counts o
JOIN row_totals r ON o.gender = r.gender
JOIN col_totals c ON o.product_preference = c.product_preference
CROSS JOIN grand_total g
),
chi_square_test AS (
SELECT
SUM(chi_square_contribution) as chi_square_statistic,
(SELECT COUNT(DISTINCT gender) FROM observed_counts) - 1 as df_rows,
(SELECT COUNT(DISTINCT product_preference) FROM observed_counts) - 1 as df_cols
FROM expected_counts
)
SELECT
chi_square_statistic,
(df_rows * df_cols) as degrees_of_freedom,
-- CUM_DIST_CHI_SQUARE: placeholder for a chi-square CDF (typically a UDF)
1 - CUM_DIST_CHI_SQUARE(chi_square_statistic, df_rows * df_cols) as p_value,
CASE
WHEN 1 - CUM_DIST_CHI_SQUARE(chi_square_statistic, df_rows * df_cols) < 0.05
THEN 'Variables are associated (not independent)'
ELSE 'Variables are independent'
END as conclusion
FROM chi_square_test;
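The observed-vs-expected logic of the SQL fits in a small Python function; the 2x2 table below is hypothetical (gender as rows, product preference as columns):

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table (list of rows of observed counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [[10, 20],
            [20, 10]]
chi2, df = chi_square_statistic(observed)
print(round(chi2, 3), df)
```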
Purpose: Test if means of three or more groups differ.
When to Use:
Example: Comparing sales performance across multiple regions
-- One-way ANOVA
WITH group_stats AS (
SELECT
region,
AVG(sales_amount) as group_mean,
COUNT(*) as group_size,
VARIANCE(sales_amount) as group_variance
FROM sales_data
GROUP BY region
),
overall_stats AS (
SELECT
AVG(sales_amount) as grand_mean,
COUNT(*) as total_n,
COUNT(DISTINCT region) as num_groups
FROM sales_data
),
anova_calculations AS (
SELECT
-- Between-group variability
SUM(group_size * POWER(group_mean - o.grand_mean, 2)) as ss_between,
-- Within-group variability
SUM((group_size - 1) * group_variance) as ss_within,
-- Degrees of freedom
o.num_groups - 1 as df_between,
o.total_n - o.num_groups as df_within,
o.total_n - 1 as df_total
FROM group_stats g, overall_stats o
-- Standard SQL requires grouping by the non-aggregated columns
GROUP BY o.num_groups, o.total_n, o.grand_mean
),
f_test AS (
SELECT
ss_between / df_between as ms_between,
ss_within / df_within as ms_within,
(ss_between / df_between) / (ss_within / df_within) as f_statistic,
df_between,
df_within
FROM anova_calculations
)
SELECT
f_statistic,
df_between,
df_within,
-- CUM_DIST_F: placeholder for the F-distribution CDF (typically a UDF)
1 - CUM_DIST_F(f_statistic, df_between, df_within) as p_value,
CASE
WHEN 1 - CUM_DIST_F(f_statistic, df_between, df_within) < 0.05
THEN 'Significant difference among group means'
ELSE 'No significant difference among group means'
END as conclusion
FROM f_test;
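The F statistic in the SQL is the ratio of between-group to within-group mean squares; a compact Python sketch (the region lists are hypothetical sales figures):

```python
import statistics

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group vs within-group mean squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical sales figures for three regions
regions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f, dfb, dfw = one_way_anova_f(regions)
print(round(f, 3), dfb, dfw)
```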
Scenario: Testing new website design impact on conversion rates
-- A/B test analysis
WITH test_results AS (
SELECT
test_group,
COUNT(*) as total_visitors,
SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions,
AVG(CASE WHEN converted = 1 THEN 1.0 ELSE 0 END) as conversion_rate  -- 1.0 forces a non-integer average
FROM ab_test_data
GROUP BY test_group
),
pooled AS (
SELECT
t1.test_group as group1,
t2.test_group as group2,
t1.conversion_rate as rate1,
t2.conversion_rate as rate2,
t1.total_visitors as n1,
t2.total_visitors as n2,
ABS(t1.conversion_rate - t2.conversion_rate) as rate_difference,
-- Pooled proportion (the 1.0 multiplier avoids integer division)
(t1.conversions + t2.conversions) * 1.0 / (t1.total_visitors + t2.total_visitors) as pooled_p
FROM test_results t1, test_results t2
WHERE t1.test_group = 'control' AND t2.test_group = 'treatment'
),
proportion_test AS (
-- Separate CTE because an alias (pooled_p) cannot be reused in the SELECT that defines it
SELECT
group1,
group2,
rate1,
rate2,
rate_difference,
-- Standard error of the difference under H0
SQRT(pooled_p * (1 - pooled_p) * (1.0/n1 + 1.0/n2)) as standard_error,
-- Z-statistic
rate_difference / SQRT(pooled_p * (1 - pooled_p) * (1.0/n1 + 1.0/n2)) as z_statistic
FROM pooled
)
SELECT
group1,
group2,
rate1,
rate2,
rate_difference,
z_statistic,
-- CUM_DIST_NORM: placeholder for the standard normal CDF (typically a UDF)
2 * (1 - CUM_DIST_NORM(z_statistic)) as p_value,
CASE
WHEN 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Significant difference in conversion rates'
ELSE 'No significant difference in conversion rates'
END as conclusion,
-- Business interpretation
CASE
WHEN rate2 > rate1 AND 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Treatment group performs better - implement new design'
WHEN rate2 < rate1 AND 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Control group performs better - keep current design'
ELSE 'No clear winner - consider larger sample or different approach'
END as recommendation
FROM proportion_test;
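The two-proportion z-test in the SQL can be verified in Python with the standard library's `NormalDist`. The conversion rates below match the earlier example (3.2% vs 3.8%), but the sample size of 10,000 visitors per group is an assumption for illustration:

```python
import math
from statistics import NormalDist

def two_proportion_z(conv1, n1, conv2, n2):
    """Two-proportion z-test: pooled proportion, standard error, z, two-tailed p."""
    p1, p2 = conv1 / n1, conv2 / n2
    pooled = (conv1 + conv2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    return z, p_value

# Hypothetical sample sizes; rates match the lesson's A/B example
z, p = two_proportion_z(320, 10_000, 380, 10_000)
print(round(z, 3), round(p, 4))
```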
Scenario: Monitoring manufacturing process quality
-- Process control with confidence intervals
WITH process_stats AS (
SELECT
AVG(measurement) as process_mean,
STDDEV(measurement) as process_std,
COUNT(*) as sample_size
FROM quality_measurements
WHERE measurement_date >= CURRENT_DATE - INTERVAL '7 days'
),
control_limits AS (
SELECT
process_mean,
process_std,
sample_size,
process_mean - 1.96 * (process_std / SQRT(sample_size)) as lower_control_limit,
process_mean + 1.96 * (process_std / SQRT(sample_size)) as upper_control_limit
FROM process_stats
),
current_measurements AS (
SELECT
measurement,
measurement_time,
CASE
WHEN measurement < (SELECT lower_control_limit FROM control_limits)
THEN 'Below Control Limit'
WHEN measurement > (SELECT upper_control_limit FROM control_limits)
THEN 'Above Control Limit'
ELSE 'In Control'
END as status
FROM quality_measurements
WHERE measurement_date = CURRENT_DATE
)
SELECT
COUNT(*) as total_measurements,
COUNT(CASE WHEN status = 'In Control' THEN 1 END) as in_control,
COUNT(CASE WHEN status != 'In Control' THEN 1 END) as out_of_control,
ROUND(COUNT(CASE WHEN status = 'In Control' THEN 1 END) * 100.0 / COUNT(*), 2) as in_control_percentage
FROM current_measurements;
In the next lesson, we'll explore correlation and causation to understand relationships between variables and avoid common analytical pitfalls.