This lesson covers inferential statistics, which allow us to make predictions and draw conclusions about populations based on sample data.
Population: The complete set of individuals, items, or data points we want to study.
Sample: A subset of the population used to make inferences about the entire population.
Parameter: A numerical characteristic of a population (e.g., population mean μ).
Statistic: A numerical characteristic of a sample (e.g., sample mean x̄).
Practical Reasons: Studying an entire population is usually too expensive, too slow, or outright impossible (e.g., destructive testing of light bulbs), so we measure a sample and infer from it.
Example Scenarios:
Population: All customers of an e-commerce site
Sample: 1,000 randomly selected customers
Goal: Estimate average customer satisfaction score
Population: All manufactured light bulbs
Sample: 100 bulbs tested for lifespan
Goal: Estimate average bulb lifespan
Probability Sampling:
Simple Random Sampling
SELECT * FROM customers ORDER BY RANDOM() LIMIT 1000;
Stratified Sampling
-- Sample by region proportionally
SELECT * FROM customers
WHERE region = 'North' ORDER BY RANDOM() LIMIT 300
UNION ALL
SELECT * FROM customers
WHERE region = 'South' ORDER BY RANDOM() LIMIT 500
UNION ALL
SELECT * FROM customers
WHERE region = 'East' ORDER BY RANDOM() LIMIT 200;
Cluster Sampling
Divide the population into natural clusters (e.g., stores, cities), randomly select whole clusters, and measure every member of the chosen clusters.
Non-Probability Sampling:
Convenience, judgment, quota, and snowball sampling; these are easier to run, but because selection probabilities are unknown, they do not support valid statistical inference.
Definition: The probability distribution of a given statistic based on a random sample.
Central Limit Theorem: As sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the shape of the population distribution (provided it has finite variance).
Key Properties:
The mean of the sampling distribution equals the population mean μ.
Its standard deviation (the standard error) shrinks as the sample size grows.
Its shape approaches normal as n increases, per the Central Limit Theorem.
Standard Error Formula: $$SE = \frac{\sigma}{\sqrt{n}}$$
Where:
σ = population standard deviation (estimated by the sample standard deviation s when unknown)
n = sample size
SQL Example:
-- Calculate standard error of mean
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_survey_sample
)
SELECT
sample_std / SQRT(sample_size) as standard_error
FROM sample_stats;
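As a quick cross-check on the formula, here is a minimal Python sketch; the values 0.8 and 100 are the lesson's worked-example standard deviation and sample size, not real survey data.

```python
import math

def standard_error(sample_std: float, n: int) -> float:
    """SE = s / sqrt(n): estimated std dev of the sampling distribution of the mean."""
    return sample_std / math.sqrt(n)

# Worked-example values from this lesson: s = 0.8, n = 100
print(round(standard_error(0.8, 100), 4))
```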
Definition: A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
Interpretation: Strictly, if we repeated the sampling procedure many times, X% of the intervals constructed this way would contain the true population parameter; informally, we are X% confident that the parameter lies within this interval.
Components:
For Population Mean (known σ): $$CI = \bar{x} \pm z \cdot \frac{\sigma}{\sqrt{n}}$$
For Population Mean (unknown σ): $$CI = \bar{x} \pm t \cdot \frac{s}{\sqrt{n}}$$
For Population Proportion: $$CI = \hat{p} \pm z \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Where:
x̄ = sample mean, s = sample standard deviation, σ = population standard deviation
p̂ = sample proportion, n = sample size
z, t = critical values for the chosen confidence level
Common Confidence Levels and Z-values:
90% → z = 1.645
95% → z = 1.960
99% → z = 2.576
Example Calculation:
Sample: 100 customers, mean satisfaction = 4.2, std dev = 0.8
95% CI for population mean:
Standard Error = 0.8 / √100 = 0.08
Margin of Error = 1.96 × 0.08 = 0.157
CI = 4.2 ± 0.157 = [4.043, 4.357]
Interpretation: We are 95% confident that true mean satisfaction
is between 4.043 and 4.357.
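The arithmetic above can be reproduced with a few lines of Python (a sketch using the z-based formula; the inputs are the worked-example numbers):

```python
import math

def mean_ci(mean: float, std: float, n: int, z: float = 1.96):
    """z-based confidence interval for a mean: mean ± z * s / sqrt(n)."""
    margin = z * std / math.sqrt(n)
    return mean - margin, mean + margin

lo, hi = mean_ci(4.2, 0.8, 100)    # the lesson's worked example
print(round(lo, 3), round(hi, 3))  # matches [4.043, 4.357]
```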
-- Calculate confidence interval for mean satisfaction
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_survey_sample
),
confidence_interval AS (
SELECT
sample_mean,
sample_std,
sample_size,
sample_std / SQRT(sample_size) as standard_error,
1.96 * (sample_std / SQRT(sample_size)) as margin_of_error_95,
sample_mean - 1.96 * (sample_std / SQRT(sample_size)) as lower_bound_95,
sample_mean + 1.96 * (sample_std / SQRT(sample_size)) as upper_bound_95
FROM sample_stats
)
SELECT
sample_mean,
standard_error,
margin_of_error_95,
lower_bound_95,
upper_bound_95,
'95% CI: [' || ROUND(lower_bound_95, 3) || ', ' || ROUND(upper_bound_95, 3) || ']' as confidence_interval
FROM confidence_interval;
Trade-off Example:
Same data, different confidence levels:
90% CI: [4.068, 4.332] (width = 0.264)
95% CI: [4.043, 4.357] (width = 0.314)
99% CI: [3.994, 4.406] (width = 0.412)
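The widening can be reproduced directly from the critical values (a sketch using Python's `statistics.NormalDist`; widths are computed from unrounded z-values, so the last digit may differ slightly from the rounded bounds above):

```python
from statistics import NormalDist

se = 0.08  # standard error from the worked example (s = 0.8, n = 100)
for level in (0.90, 0.95, 0.99):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-tailed critical value
    width = 2 * z * se
    print(f"{level:.0%} CI width: {width:.3f}")
```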
Definition: A statistical method used to make decisions about population parameters based on sample data.
Purpose: To determine whether there is enough evidence to reject a claim about a population.
Null Hypothesis (H₀): The default assumption or claim to be tested.
Alternative Hypothesis (H₁ or Hₐ): The claim we're testing against the null.
Test Statistic: A value calculated from sample data used to make decisions.
Significance Level (α): Probability of rejecting H₀ when it's actually true.
One-Tailed Tests:
Two-Tailed Tests:
Example Setup:
Scenario: Testing if new website design increases conversion rate
One-tailed test:
H₀: μ_new ≤ μ_old (new design is not better)
H₁: μ_new > μ_old (new design is better)
Two-tailed test:
H₀: μ_new = μ_old (no difference)
H₁: μ_new ≠ μ_old (there is a difference)
Step 1: State Hypotheses
H₀: μ = 50 (population mean is 50)
H₁: μ ≠ 50 (population mean is not 50)
Step 2: Choose Significance Level
α = 0.05 (5% significance level)
Step 3: Calculate Test Statistic
Sample: n = 25, x̄ = 52, s = 8
t = (x̄ - μ₀) / (s / √n)
t = (52 - 50) / (8 / √25) = 2 / 1.6 = 1.25
Step 4: Determine Critical Value
Degrees of freedom = n - 1 = 24
Critical t-value (two-tailed, α = 0.05) = ±2.064
Step 5: Make Decision
|Test statistic| (1.25) < Critical value (2.064)
Fail to reject H₀
Step 6: Interpret Results
Conclusion: Insufficient evidence to conclude population mean differs from 50.
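Steps 3-5 above are a few lines of arithmetic; here is a minimal Python sketch of them (the critical value 2.064 comes from a t-table, as in Step 4):

```python
import math

# Worked-example inputs: n = 25, sample mean 52, sample std 8, H0 mean 50
n, x_bar, s, mu0 = 25, 52, 8, 50
t = (x_bar - mu0) / (s / math.sqrt(n))  # t = 1.25
critical = 2.064  # two-tailed critical t for df = 24, alpha = 0.05 (from a t-table)
print("reject H0" if abs(t) > critical else "fail to reject H0")
```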
Type I Error (α): Rejecting H₀ when it's actually true.
Type II Error (β): Failing to reject H₀ when it's actually false.
Power (1 - β): Probability of correctly rejecting H₀ when it's false.
Error Trade-off:
Decreasing α (more stringent) → Increases β (decreases power)
Increasing α (less stringent) → Decreases β (increases power)
Definition: The probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
Interpretation: Small p-values provide evidence against the null hypothesis.
Decision Rule:
If p-value ≤ α, reject H₀; if p-value > α, fail to reject H₀.
For t-test:
-- Calculate p-value for two-tailed t-test
WITH test_stats AS (
SELECT
AVG(conversion_rate) as sample_mean,
STDDEV(conversion_rate) as sample_std,
COUNT(*) as sample_size,
50 as hypothesized_mean
FROM ab_test_results
WHERE test_group = 'treatment'
),
t_statistic AS (
SELECT
sample_mean,
sample_std,
sample_size,
(sample_mean - hypothesized_mean) / (sample_std / SQRT(sample_size)) as t_value,
sample_size - 1 as degrees_of_freedom
FROM test_stats
)
SELECT
t_value,
degrees_of_freedom,
-- For a two-tailed test, double the upper-tail probability;
-- CUM_DIST_T is a placeholder for a Student's t CDF (typically a UDF)
2 * (1 - CUM_DIST_T(ABS(t_value), degrees_of_freedom)) as p_value
FROM t_statistic;
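The SQL above leans on a hypothetical `CUM_DIST_T` function, since most engines do not ship a t-distribution CDF. As a sanity check, a two-tailed p-value can be computed from the Student's t density directly; the sketch below uses simple trapezoidal integration (real code would use `scipy.stats.t.sf`):

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Student's t probability density function."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t: float, df: int, steps: int = 10_000) -> float:
    """p = 1 - P(-|t| < T < |t|); the central mass is found by the trapezoidal rule."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps
    area = 0.5 * (t_pdf(a, df) + t_pdf(b, df))
    for i in range(1, steps):
        area += t_pdf(a + i * h, df)
    return 1 - area * h

# Worked example: t = 1.25 with df = 24 is not significant at alpha = 0.05
print(round(two_tailed_p(1.25, 24), 4))
```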
Myth 1: "p-value is the probability that H₀ is true"
Myth 2: "p-value is the probability of being wrong"
Myth 3: "Small p-value means large effect"
Myth 4: "p = 0.05 is magical"
p-value Guidelines (conventions, not hard rules):
p < 0.01: strong evidence against H₀
0.01 ≤ p < 0.05: moderate evidence against H₀
p ≥ 0.05: weak or no evidence against H₀
Example Interpretation:
A/B Test Results:
- Control conversion: 3.2%
- Treatment conversion: 3.8%
- p-value: 0.023
- α = 0.05
Interpretation:
- Statistically significant difference (p < 0.05)
- Evidence suggests treatment improves conversion
- Practical significance: 0.6% absolute improvement
Decision Tree:
What type of data?
├── Categorical
│ ├── One sample → Chi-square goodness of fit
│ ├── Two samples → Chi-square test of independence
│ └── More than two → Chi-square test of independence
├── Ordinal
│ ├── Two samples → Mann-Whitney U test
│ ├── Paired samples → Wilcoxon signed-rank test
│ └── More than two → Kruskal-Wallis test
└── Continuous (Interval/Ratio)
├── One sample → One-sample t-test
├── Two independent samples → Independent t-test
├── Two paired samples → Paired t-test
└── More than two → ANOVA
Purpose: Test if population mean differs from a known value.
When to Use:
Example: Testing if average customer satisfaction differs from target of 4.0
-- One-sample t-test implementation
WITH sample_stats AS (
SELECT
AVG(satisfaction_score) as sample_mean,
STDDEV(satisfaction_score) as sample_std,
COUNT(*) as sample_size
FROM customer_satisfaction
),
t_test AS (
SELECT
sample_mean,
sample_std,
sample_size,
4.0 as hypothesized_mean,
(sample_mean - 4.0) / (sample_std / SQRT(sample_size)) as t_statistic,
sample_size - 1 as degrees_of_freedom
FROM sample_stats
)
SELECT
sample_mean,
t_statistic,
degrees_of_freedom,
-- CUM_DIST_T: placeholder for a Student's t CDF (typically a UDF)
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant difference from 4.0'
ELSE 'No significant difference from 4.0'
END as conclusion
FROM t_test;
Purpose: Test if means of two independent groups differ.
When to Use:
Example: Comparing test scores between two teaching methods
-- Independent two-sample t-test
WITH group_stats AS (
SELECT
teaching_method,
AVG(test_score) as mean_score,
STDDEV(test_score) as std_score,
COUNT(*) as sample_size,
VARIANCE(test_score) as variance
FROM student_scores
GROUP BY teaching_method
),
pooled_variance AS (
SELECT
g1.teaching_method as group1,
g2.teaching_method as group2,
g1.mean_score as mean1,
g2.mean_score as mean2,
((g1.sample_size - 1) * g1.variance + (g2.sample_size - 1) * g2.variance) /
(g1.sample_size + g2.sample_size - 2) as pooled_var,
g1.sample_size as n1,
g2.sample_size as n2
FROM group_stats g1, group_stats g2
WHERE g1.teaching_method = 'traditional' AND g2.teaching_method = 'modern'
),
t_test AS (
SELECT
group1,
group2,
mean1,
mean2,
pooled_var,
n1,
n2,
(mean1 - mean2) / SQRT(pooled_var * (1.0/n1 + 1.0/n2)) as t_statistic,  -- 1.0 avoids integer division
n1 + n2 - 2 as degrees_of_freedom
FROM pooled_variance
)
SELECT
group1,
group2,
mean1,
mean2,
ABS(mean1 - mean2) as mean_difference,
t_statistic,
degrees_of_freedom,
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,  -- CUM_DIST_T: placeholder t CDF
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant difference between groups'
ELSE 'No significant difference between groups'
END as conclusion
FROM t_test;
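The pooled-variance computation in the SQL can be mirrored in a short Python sketch (the score lists are hypothetical stand-ins for the `student_scores` table):

```python
import math
import statistics

def pooled_t(sample1, sample2):
    """Independent two-sample t statistic with pooled variance (equal-variance form)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2  # t statistic and degrees of freedom

# Hypothetical test scores for the two teaching methods
traditional = [70, 72, 68, 75, 71]
modern = [78, 74, 80, 76, 77]
t, df = pooled_t(traditional, modern)
print(round(t, 3), df)
```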
Purpose: Test if means of paired observations differ.
When to Use:
Example: Testing weight loss program effectiveness
-- Paired t-test for before/after measurements
WITH paired_data AS (
SELECT
participant_id,
before_weight,
after_weight,
before_weight - after_weight as difference  -- positive difference = weight lost
FROM weight_loss_program
),
t_test AS (
SELECT
AVG(difference) as mean_difference,
STDDEV(difference) as std_difference,
COUNT(*) as sample_size,
AVG(difference) / (STDDEV(difference) / SQRT(COUNT(*))) as t_statistic,
COUNT(*) - 1 as degrees_of_freedom
FROM paired_data
)
SELECT
mean_difference,
t_statistic,
degrees_of_freedom,
2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) as p_value,  -- CUM_DIST_T: placeholder t CDF
CASE
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05 AND mean_difference > 0
THEN 'Significant weight loss'
WHEN 2 * (1 - CUM_DIST_T(ABS(t_statistic), degrees_of_freedom)) < 0.05
THEN 'Significant weight gain'
ELSE 'No significant weight change'
END as conclusion
FROM t_test;
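The same test reduces to a one-sample t-test on the per-participant differences, as this Python sketch shows (the weights are hypothetical):

```python
import math
import statistics

def paired_t(before, after):
    """Paired t statistic on per-subject differences (positive t = values decreased)."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical before/after weights for three participants
before = [80.0, 82.0, 78.0]
after = [78.0, 81.0, 77.0]
t, df = paired_t(before, after)
print(round(t, 2), df)
```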
Purpose: Test if two categorical variables are independent.
When to Use:
Example: Testing relationship between gender and product preference
-- Chi-square test of independence
WITH observed_counts AS (
SELECT
gender,
product_preference,
COUNT(*) as observed_count
FROM customer_survey
GROUP BY gender, product_preference
),
row_totals AS (
SELECT
gender,
SUM(observed_count) as row_total
FROM observed_counts
GROUP BY gender
),
col_totals AS (
SELECT
product_preference,
SUM(observed_count) as col_total
FROM observed_counts
GROUP BY product_preference
),
grand_total AS (
SELECT SUM(observed_count) as total_count
FROM observed_counts
),
expected_counts AS (
SELECT
o.gender,
o.product_preference,
o.observed_count,
(r.row_total * c.col_total / g.total_count) as expected_count,
POWER(o.observed_count - (r.row_total * c.col_total / g.total_count), 2) /
(r.row_total * c.col_total / g.total_count) as chi_square_contribution
FROM observed_counts o
JOIN row_totals r ON o.gender = r.gender
JOIN col_totals c ON o.product_preference = c.product_preference
CROSS JOIN grand_total g
),
chi_square_test AS (
SELECT
SUM(chi_square_contribution) as chi_square_statistic,
(SELECT COUNT(DISTINCT gender) FROM observed_counts) - 1 as df_rows,
(SELECT COUNT(DISTINCT product_preference) FROM observed_counts) - 1 as df_cols
FROM expected_counts
)
SELECT
chi_square_statistic,
(df_rows * df_cols) as degrees_of_freedom,
-- CUM_DIST_CHI_SQUARE: placeholder for a chi-square CDF (typically a UDF)
1 - CUM_DIST_CHI_SQUARE(chi_square_statistic, df_rows * df_cols) as p_value,
CASE
WHEN 1 - CUM_DIST_CHI_SQUARE(chi_square_statistic, df_rows * df_cols) < 0.05
THEN 'Variables are associated (not independent)'
ELSE 'Variables are independent'
END as conclusion
FROM chi_square_test;
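The observed-vs-expected logic of the SQL fits in a small Python function; the 2x2 table below is hypothetical (gender as rows, product preference as columns):

```python
def chi_square_statistic(table):
    """Chi-square statistic for a contingency table (list of rows of observed counts)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed = [[10, 20],
            [20, 10]]
chi2, df = chi_square_statistic(observed)
print(round(chi2, 3), df)
```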
Purpose: Test if means of three or more groups differ.
When to Use:
Example: Comparing sales performance across multiple regions
-- One-way ANOVA
WITH group_stats AS (
SELECT
region,
AVG(sales_amount) as group_mean,
COUNT(*) as group_size,
VARIANCE(sales_amount) as group_variance
FROM sales_data
GROUP BY region
),
overall_stats AS (
SELECT
AVG(sales_amount) as grand_mean,
COUNT(*) as total_n,
COUNT(DISTINCT region) as num_groups
FROM sales_data
),
anova_calculations AS (
SELECT
-- Between-group variability
SUM(group_size * POWER(group_mean - o.grand_mean, 2)) as ss_between,
-- Within-group variability
SUM((group_size - 1) * group_variance) as ss_within,
-- Degrees of freedom
o.num_groups - 1 as df_between,
o.total_n - o.num_groups as df_within,
o.total_n - 1 as df_total
FROM group_stats g, overall_stats o
-- Standard SQL requires grouping by the non-aggregated columns
GROUP BY o.num_groups, o.total_n, o.grand_mean
),
f_test AS (
SELECT
ss_between / df_between as ms_between,
ss_within / df_within as ms_within,
(ss_between / df_between) / (ss_within / df_within) as f_statistic,
df_between,
df_within
FROM anova_calculations
)
SELECT
f_statistic,
df_between,
df_within,
-- CUM_DIST_F: placeholder for the F-distribution CDF (typically a UDF)
1 - CUM_DIST_F(f_statistic, df_between, df_within) as p_value,
CASE
WHEN 1 - CUM_DIST_F(f_statistic, df_between, df_within) < 0.05
THEN 'Significant difference among group means'
ELSE 'No significant difference among group means'
END as conclusion
FROM f_test;
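The F statistic in the SQL is the ratio of between-group to within-group mean squares; a compact Python sketch (the region lists are hypothetical sales figures):

```python
import statistics

def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group vs within-group mean squares."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical sales figures for three regions
regions = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f, dfb, dfw = one_way_anova_f(regions)
print(round(f, 3), dfb, dfw)
```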
Scenario: Testing new website design impact on conversion rates
-- A/B test analysis
WITH test_results AS (
SELECT
test_group,
COUNT(*) as total_visitors,
SUM(CASE WHEN converted = 1 THEN 1 ELSE 0 END) as conversions,
AVG(CASE WHEN converted = 1 THEN 1.0 ELSE 0 END) as conversion_rate  -- 1.0 forces a non-integer average
FROM ab_test_data
GROUP BY test_group
),
pooled AS (
SELECT
t1.test_group as group1,
t2.test_group as group2,
t1.conversion_rate as rate1,
t2.conversion_rate as rate2,
t1.total_visitors as n1,
t2.total_visitors as n2,
ABS(t1.conversion_rate - t2.conversion_rate) as rate_difference,
-- Pooled proportion (the 1.0 multiplier avoids integer division)
(t1.conversions + t2.conversions) * 1.0 / (t1.total_visitors + t2.total_visitors) as pooled_p
FROM test_results t1, test_results t2
WHERE t1.test_group = 'control' AND t2.test_group = 'treatment'
),
proportion_test AS (
-- Separate CTE because an alias (pooled_p) cannot be reused in the SELECT that defines it
SELECT
group1,
group2,
rate1,
rate2,
rate_difference,
-- Standard error of the difference under H0
SQRT(pooled_p * (1 - pooled_p) * (1.0/n1 + 1.0/n2)) as standard_error,
-- Z-statistic
rate_difference / SQRT(pooled_p * (1 - pooled_p) * (1.0/n1 + 1.0/n2)) as z_statistic
FROM pooled
)
SELECT
group1,
group2,
rate1,
rate2,
rate_difference,
z_statistic,
-- CUM_DIST_NORM: placeholder for the standard normal CDF (typically a UDF)
2 * (1 - CUM_DIST_NORM(z_statistic)) as p_value,
CASE
WHEN 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Significant difference in conversion rates'
ELSE 'No significant difference in conversion rates'
END as conclusion,
-- Business interpretation
CASE
WHEN rate2 > rate1 AND 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Treatment group performs better - implement new design'
WHEN rate2 < rate1 AND 2 * (1 - CUM_DIST_NORM(z_statistic)) < 0.05
THEN 'Control group performs better - keep current design'
ELSE 'No clear winner - consider larger sample or different approach'
END as recommendation
FROM proportion_test;
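The two-proportion z-test in the SQL can be verified in Python with the standard library's `NormalDist`. The conversion rates below match the earlier example (3.2% vs 3.8%), but the sample size of 10,000 visitors per group is an assumption for illustration:

```python
import math
from statistics import NormalDist

def two_proportion_z(conv1, n1, conv2, n2):
    """Two-proportion z-test: pooled proportion, standard error, z, two-tailed p."""
    p1, p2 = conv1 / n1, conv2 / n2
    pooled = (conv1 + conv2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(z))
    return z, p_value

# Hypothetical sample sizes; rates match the lesson's A/B example
z, p = two_proportion_z(320, 10_000, 380, 10_000)
print(round(z, 3), round(p, 4))
```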
Scenario: Monitoring manufacturing process quality
-- Process control with confidence intervals
WITH process_stats AS (
SELECT
AVG(measurement) as process_mean,
STDDEV(measurement) as process_std,
COUNT(*) as sample_size
FROM quality_measurements
WHERE measurement_date >= CURRENT_DATE - INTERVAL '7 days'
),
control_limits AS (
SELECT
process_mean,
process_std,
sample_size,
process_mean - 1.96 * (process_std / SQRT(sample_size)) as lower_control_limit,
process_mean + 1.96 * (process_std / SQRT(sample_size)) as upper_control_limit
FROM process_stats
),
current_measurements AS (
SELECT
measurement,
measurement_time,
CASE
WHEN measurement < (SELECT lower_control_limit FROM control_limits)
THEN 'Below Control Limit'
WHEN measurement > (SELECT upper_control_limit FROM control_limits)
THEN 'Above Control Limit'
ELSE 'In Control'
END as status
FROM quality_measurements
WHERE measurement_date = CURRENT_DATE
)
SELECT
COUNT(*) as total_measurements,
COUNT(CASE WHEN status = 'In Control' THEN 1 END) as in_control,
COUNT(CASE WHEN status != 'In Control' THEN 1 END) as out_of_control,
ROUND(COUNT(CASE WHEN status = 'In Control' THEN 1 END) * 100.0 / COUNT(*), 2) as in_control_percentage
FROM current_measurements;
In the next lesson, we'll explore correlation and causation to understand relationships between variables and avoid common analytical pitfalls.