This lesson explores the critical distinction between correlation and causation, a fundamental concept in data analysis that helps avoid common analytical pitfalls.
Correlation (Definition): A statistical relationship between two variables in which changes in one variable are associated with changes in the other.
Key Characteristics:
- Symmetric: the correlation of X with Y equals the correlation of Y with X
- Unitless: unaffected by the scale or units of either variable
- Describes association only, not mechanism
Important Note: Correlation does NOT imply that one variable causes the other to change.
Correlation Coefficient (Definition): A numerical measure between -1 and +1 that quantifies the strength and direction of the linear relationship between two variables.
Formula (Pearson Correlation): $$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
Interpretation Scale:
Strength Guidelines:
| Correlation Value | Strength | Interpretation |
|---|---|---|
| 0.9 to 1.0 | Very Strong | Very strong linear relationship |
| 0.7 to 0.9 | Strong | Strong linear relationship |
| 0.5 to 0.7 | Moderate | Moderate linear relationship |
| 0.3 to 0.5 | Weak | Weak linear relationship |
| 0.0 to 0.3 | Very Weak | Little to no linear relationship |
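These cutoffs are conventions rather than strict rules, but they are easy to encode. A small Python helper mirroring the table above:

```python
def strength(r: float) -> str:
    """Map a correlation coefficient to the strength labels in the table above."""
    magnitude = abs(r)  # strength ignores direction
    if magnitude >= 0.9:
        return "Very Strong"
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.5:
        return "Moderate"
    if magnitude >= 0.3:
        return "Weak"
    return "Very Weak"

print(strength(0.85), strength(-0.4))  # Strong Weak
```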
Positive Correlation (r > 0):
Example: Height and weight
As height increases, weight generally increases
r ≈ 0.7 (moderate to strong positive correlation)
Negative Correlation (r < 0):
Example: Hours of sleep and stress levels
As sleep hours increase, stress levels generally decrease
r ≈ -0.6 (moderate negative correlation)
No Correlation (r ≈ 0):
Example: Shoe size and IQ scores
No meaningful relationship between shoe size and intelligence
r ≈ 0 (no correlation)
Manual Calculation Example:
Data: Study hours vs Test scores
Hours (X): [2, 4, 6, 8, 10]
Scores (Y): [65, 70, 75, 85, 90]
Step 1: Calculate means
X̄ = (2+4+6+8+10)/5 = 6
Ȳ = (65+70+75+85+90)/5 = 77
Step 2: Calculate deviations and products
(Xi-X̄)(Yi-Ȳ): (-4)(-12)=48, (-2)(-7)=14, (0)(-2)=0, (2)(8)=16, (4)(13)=52
Sum = 48+14+0+16+52 = 130
Step 3: Calculate squared deviations
Σ(Xi-X̄)² = (-4)²+(-2)²+0²+2²+4² = 16+4+0+4+16 = 40
Σ(Yi-Ȳ)² = (-12)²+(-7)²+(-2)²+8²+13² = 144+49+4+64+169 = 430
Step 4: Calculate correlation
r = 130 / √(40 × 430) = 130 / √17200 = 130 / 131.15 = 0.99
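The hand calculation above can be reproduced in a few lines of Python, using only the standard library:

```python
from math import sqrt

# Worked example from above: study hours vs. test scores
x = [2, 4, 6, 8, 10]
y = [65, 70, 75, 85, 90]

n = len(x)
mean_x = sum(x) / n  # 6.0
mean_y = sum(y) / n  # 77.0

# Numerator: sum of products of paired deviations
num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))  # 130.0

# Denominator: square root of the product of summed squared deviations
den = sqrt(sum((xi - mean_x) ** 2 for xi in x) *
           sum((yi - mean_y) ** 2 for yi in y))  # sqrt(40 * 430)

r = num / den
print(round(r, 2))  # 0.99
```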
SQL Implementation:
```sql
-- Calculate Pearson correlation coefficient
WITH correlation_calculation AS (
    SELECT
        AVG(study_hours) AS mean_hours,
        AVG(test_score) AS mean_score,
        COUNT(*) AS n
    FROM student_performance
),
correlation_components AS (
    SELECT
        (study_hours - (SELECT mean_hours FROM correlation_calculation)) *
        (test_score - (SELECT mean_score FROM correlation_calculation)) AS xy_product,
        POWER(study_hours - (SELECT mean_hours FROM correlation_calculation), 2) AS x_squared,
        POWER(test_score - (SELECT mean_score FROM correlation_calculation), 2) AS y_squared
    FROM student_performance
)
SELECT
    SUM(xy_product) /
        SQRT(SUM(x_squared) * SUM(y_squared)) AS correlation_coefficient,
    CASE
        WHEN ABS(SUM(xy_product) / SQRT(SUM(x_squared) * SUM(y_squared))) >= 0.7 THEN 'Strong'
        WHEN ABS(SUM(xy_product) / SQRT(SUM(x_squared) * SUM(y_squared))) >= 0.5 THEN 'Moderate'
        WHEN ABS(SUM(xy_product) / SQRT(SUM(x_squared) * SUM(y_squared))) >= 0.3 THEN 'Weak'
        ELSE 'Very Weak'
    END AS correlation_strength,
    CASE
        WHEN SUM(xy_product) / SQRT(SUM(x_squared) * SUM(y_squared)) > 0 THEN 'Positive'
        WHEN SUM(xy_product) / SQRT(SUM(x_squared) * SUM(y_squared)) < 0 THEN 'Negative'
        ELSE 'No Correlation'
    END AS correlation_direction
FROM correlation_components;
```
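The query above assumes a `student_performance` table exists. As a self-contained sanity check, the same two-pass logic can be run against an in-memory SQLite database (the square root is taken in Python for portability, since SQLite's SQRT and POWER functions are not always available):

```python
import sqlite3
from math import sqrt

# Build a tiny in-memory table with the worked-example data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student_performance (study_hours REAL, test_score REAL)")
conn.executemany(
    "INSERT INTO student_performance VALUES (?, ?)",
    [(2, 65), (4, 70), (6, 75), (8, 85), (10, 90)],
)

# Pass 1 (CTE): means; pass 2: sums of deviation products and squares
num, sx2, sy2 = conn.execute("""
    WITH stats AS (
        SELECT AVG(study_hours) AS mx, AVG(test_score) AS my
        FROM student_performance
    )
    SELECT
        SUM((study_hours - mx) * (test_score - my)),
        SUM((study_hours - mx) * (study_hours - mx)),
        SUM((test_score - my) * (test_score - my))
    FROM student_performance, stats
""").fetchone()

r = num / sqrt(sx2 * sy2)
print(round(r, 2))  # 0.99
```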
Correlation vs Covariance: Covariance measures how two variables vary together in their original units; correlation is covariance standardized by both standard deviations, making it unitless and bounded between -1 and +1.
Correlation vs Determination (r²): Squaring the correlation coefficient gives the coefficient of determination, the proportion of variance in one variable accounted for by its linear relationship with the other.
Example:
If r = 0.8, then r² = 0.64
Interpretation: 64% of the variance in Y is accounted for by its linear relationship with X
Correlation: Two variables are related or associated
Causation: One variable directly causes changes in another
"Correlation does not imply causation"
Why this matters:
Real-World Example: Ice Cream Sales and Drowning
Observation: Ice cream sales and drowning incidents are highly correlated (r ≈ 0.9)
Fallacy: Ice cream causes drowning
Reality: Both are caused by hot weather (confounding variable)
Bradford Hill Criteria (for assessing causal relationships):
- Strength of association
- Consistency across studies and populations
- Specificity
- Temporality (the cause precedes the effect)
- Biological gradient (dose-response)
- Plausibility
- Coherence with existing knowledge
- Experimental evidence
- Analogy
Methods to Establish Causation:
Randomized Controlled Trials (RCTs)
Natural Experiments
Instrumental Variables
Regression Discontinuity
Difference-in-Differences
Example 1: Website Redesign
Correlation: New design correlates with higher conversion
Causation: A randomized A/B test can establish that the new design causes higher conversion
Example 2: Training and Performance
Correlation: Training hours correlate with sales performance
Causation: Need to control for experience, motivation, territory quality
Example 3: Social Media and Sales
Correlation: Social media spending correlates with sales
Causation: Seasonal trends may drive both (confounding)
Definition: A variable that influences both the independent variable and dependent variable, creating a spurious association between them.
Characteristics:
- Associated with the independent (exposure) variable
- Independently influences the dependent (outcome) variable
- Not on the causal pathway between the two
Example 1: Coffee and Heart Disease
Observed: Coffee drinkers have higher heart disease rates
Confounding Variable: Smoking
Reality: Smokers tend to drink more coffee and have higher heart disease risk
Example 2: Private Schools and Test Scores
Observed: Private school students score higher on tests
Confounding Variables:
- Socioeconomic status
- Parental education
- Access to resources
Reality: Family background, not school type, drives performance
Example 3: Ice Cream and Crime
Observed: Ice cream sales correlate with crime rates
Confounding Variable: Temperature/Season
Reality: Hot weather increases both ice cream consumption and outdoor activity, which creates more opportunities for crime
Common Sources of Confounding:
Demographic Factors (age, gender, income, education)
Environmental Factors (season, weather, location)
Behavioral Factors (lifestyle, habits, motivation)
Systematic Factors (selection bias, measurement differences)
Detection Methods:
Stratified Analysis
```sql
-- Check if correlation holds across subgroups
SELECT
    age_group,
    CORR(income, spending) AS correlation
FROM customer_data
GROUP BY age_group
ORDER BY age_group;
```
Multivariate Analysis
```sql
-- Control for confounding variables: pull the outcome together with
-- the candidate confounders, then fit a regression on these columns
SELECT
    income,
    age,
    education_level,
    spending
FROM customer_data;
```
Directed Acyclic Graphs (DAGs)
Statistical Methods:
Regression Adjustment
```sql
-- Multiple regression to control for confounders
-- (typically fitted in statistical software, not SQL)
-- Concept: spending = β₀ + β₁(income) + β₂(age) + β₃(education) + ε
```
Propensity Score Matching
```sql
-- Match similar individuals across groups.
-- NOTE: standard SQL has no logistic regression function; propensity
-- scores (probability of treatment given the confounders) are usually
-- fitted in statistical software and stored back in the table.
WITH propensity_scores AS (
    SELECT
        customer_id,
        treatment_group,
        propensity_score  -- precomputed from age, income, education
    FROM customer_data
),
matched_pairs AS (
    SELECT
        t1.customer_id AS treated_customer,
        t2.customer_id AS control_customer,
        ABS(t1.propensity_score - t2.propensity_score) AS distance
    FROM propensity_scores t1
    JOIN propensity_scores t2
        ON t1.treatment_group = 'treated'
        AND t2.treatment_group = 'control'
    WHERE ABS(t1.propensity_score - t2.propensity_score) < 0.01
)
SELECT COUNT(*) AS matched_pairs
FROM matched_pairs;
```
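Once propensity scores exist, the matching step itself is simple. A greedy 1:1 nearest-neighbor match with a caliper, sketched in Python with made-up scores and IDs:

```python
# Greedy 1:1 nearest-neighbor matching on propensity scores.
# Scores and IDs here are invented for illustration; in practice the
# scores come from a fitted logistic model.
treated = {"t1": 0.31, "t2": 0.55, "t3": 0.72}
control = {"c1": 0.30, "c2": 0.52, "c3": 0.90, "c4": 0.56}

CALIPER = 0.05  # maximum allowed score distance for a valid match
pairs = []
available = dict(control)
for t_id, t_score in sorted(treated.items()):
    # nearest still-unmatched control
    best = min(available, key=lambda c: abs(available[c] - t_score), default=None)
    if best is not None and abs(available[best] - t_score) <= CALIPER:
        pairs.append((t_id, best))
        del available[best]  # each control is used at most once

print(pairs)  # [('t1', 'c1'), ('t2', 'c4')]  (t3 has no control within the caliper)
```

The caliper of 0.05 is an illustrative choice; treated units with no sufficiently close control are left unmatched rather than forced into a bad pair.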
Stratification
```sql
-- Analyze within homogeneous subgroups
SELECT
    income_quartile,
    AVG(CASE WHEN treatment = 1 THEN outcome END) AS treated_outcome,
    AVG(CASE WHEN treatment = 0 THEN outcome END) AS control_outcome,
    AVG(CASE WHEN treatment = 1 THEN outcome END) -
        AVG(CASE WHEN treatment = 0 THEN outcome END) AS treatment_effect
FROM study_data
GROUP BY income_quartile
ORDER BY income_quartile;
```
Design Methods:
Randomization: random assignment balances both known and unknown confounders across groups
Restriction: limit the study to a single level of the confounder
Matching: pair treated and control units with similar confounder values
Example 1: Marketing Campaign Effectiveness
Observed: Email campaign correlates with higher sales
Potential Confounders:
- Seasonal shopping patterns
- Competitor promotions
- Economic conditions
- Customer segmentation
Solution: A/B test with random assignment
Example 2: Employee Training and Productivity
Observed: Trained employees are more productive
Potential Confounders:
- Employee motivation
- Manager quality
- Department resources
- Pre-existing skill level
Solution: Pre-post measurement with control group
Example 3: Website Features and User Engagement
Observed: New feature correlates with higher engagement
Potential Confounders:
- User technical proficiency
- Device type
- Time of day
- User tenure
Solution: Cohort analysis with propensity matching
Definition: A graphical representation that displays the relationship between two quantitative variables using points on a Cartesian coordinate system.
Purpose: Visualize patterns, trends, and relationships in data.
Basic Elements: X-axis (independent variable), Y-axis (dependent variable), one point per observation, axis labels and title
Enhanced Elements: trend line, color coding by category, point size for a third variable, annotations
1. Positive Linear Relationship
Pattern: Points trend upward from left to right
Interpretation: As X increases, Y tends to increase
Example: Study hours vs Test scores
2. Negative Linear Relationship
Pattern: Points trend downward from left to right
Interpretation: As X increases, Y tends to decrease
Example: Price vs Demand
3. No Relationship
Pattern: Random scatter with no clear direction
Interpretation: No linear relationship between variables
Example: Height vs IQ scores
4. Non-linear Relationships
Pattern: Curved patterns (U-shaped, inverted U, exponential)
Interpretation: Relationship exists but not linear
Example: Experience vs Performance (diminishing returns)
5. Clusters
Pattern: Distinct groups of points
Interpretation: Multiple subpopulations exist
Example: Customer segments by spending and frequency
Visual Assessment of Correlation Strength:
Strong Positive (r ≈ 0.8 to 1.0): points fall tightly along an upward-sloping line
Moderate Positive (r ≈ 0.5 to 0.8): clear upward trend with visible scatter
Weak Positive (r ≈ 0.3 to 0.5): upward tendency barely distinguishable from noise
No Correlation (r ≈ 0): a cloud of points with no discernible direction
SQL for Scatter Plot Data:
```sql
-- Prepare data for scatter plot analysis
WITH scatter_data AS (
    SELECT
        advertising_spend AS x_variable,
        sales_revenue AS y_variable,
        product_category,
        region
    FROM marketing_performance
    WHERE date BETWEEN '2023-01-01' AND '2023-12-31'
),
summary_stats AS (
    SELECT
        AVG(x_variable) AS mean_x,
        AVG(y_variable) AS mean_y,
        STDDEV(x_variable) AS std_x,
        STDDEV(y_variable) AS std_y,
        COUNT(*) AS n,
        CORR(x_variable, y_variable) AS correlation
    FROM scatter_data
)
SELECT
    s.*,
    -- Z-scores for outlier detection
    (s.x_variable - ss.mean_x) / ss.std_x AS z_score_x,
    (s.y_variable - ss.mean_y) / ss.std_y AS z_score_y,
    CASE
        WHEN ABS((s.x_variable - ss.mean_x) / ss.std_x) > 3
          OR ABS((s.y_variable - ss.mean_y) / ss.std_y) > 3
        THEN 'Outlier'
        ELSE 'Normal'
    END AS outlier_status
FROM scatter_data s
CROSS JOIN summary_stats ss
ORDER BY s.x_variable;
```
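The same |z| > 3 flagging rule is easy to verify in Python on toy data with one planted outlier (note that the rule needs a reasonable sample size, since a single extreme point also inflates the standard deviation it is measured against):

```python
import statistics as st

# Toy data: a linear trend plus one planted outlier at the end
xs = list(range(30)) + [500]
ys = [2 * x + 1 for x in range(30)] + [5]

mx, sx = st.mean(xs), st.stdev(xs)
my, sy = st.mean(ys), st.stdev(ys)

# Same |z| > 3 rule as the SQL query above, applied to both axes
flags = ["Outlier" if abs((x - mx) / sx) > 3 or abs((y - my) / sy) > 3 else "Normal"
         for x, y in zip(xs, ys)]

print(flags.count("Outlier"))  # 1
```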
1. Bubble Charts
```sql
-- Add third dimension through point size
SELECT
    advertising_spend,
    sales_revenue,
    market_share AS bubble_size,
    product_category
FROM marketing_performance
ORDER BY advertising_spend;
```
2. Color-Coded Scatter Plots
```sql
-- Add categorical dimension through color
SELECT
    advertising_spend,
    sales_revenue,
    region AS color_group,
    product_category
FROM marketing_performance
ORDER BY region, advertising_spend;
```
3. Time Series Scatter Plots
```sql
-- Show temporal progression by labeling each point with its month
SELECT
    EXTRACT(MONTH FROM date) AS month_number,
    advertising_spend,
    sales_revenue
FROM marketing_performance
ORDER BY date;
```
1. Overinterpreting Random Patterns
Problem: Seeing patterns in random scatter
Solution: Test statistical significance
2. Ignoring Outliers
Problem: Outliers distort correlation
Solution: Identify and analyze separately
3. Scale Effects
Problem: Different scales create visual bias
Solution: Standardize variables or use appropriate scales
4. Non-linear Relationships
Problem: Linear correlation misses curved relationships
Solution: Use polynomial regression or transformation
5. Heteroscedasticity
Problem: Variance changes with X values
Solution: Transform data or use robust methods
Example 1: Customer Analysis
```sql
-- Customer lifetime value vs acquisition cost
WITH customer_metrics AS (
    SELECT
        customer_id,
        acquisition_cost,
        lifetime_value,
        acquisition_channel,
        customer_segment
    FROM customer_analytics
),
segment_analysis AS (
    SELECT
        acquisition_channel,
        customer_segment,
        CORR(acquisition_cost, lifetime_value) AS correlation,
        COUNT(*) AS customer_count
    FROM customer_metrics
    GROUP BY acquisition_channel, customer_segment
)
SELECT
    acquisition_channel,
    customer_segment,
    correlation,
    customer_count,
    CASE
        WHEN correlation > 0.7 THEN 'Strong Positive'
        WHEN correlation > 0.3 THEN 'Moderate Positive'
        WHEN correlation > -0.3 THEN 'Weak/No Correlation'
        WHEN correlation > -0.7 THEN 'Moderate Negative'
        ELSE 'Strong Negative'
    END AS relationship_type
FROM segment_analysis
ORDER BY correlation DESC;
```
Example 2: Quality Control
```sql
-- Process parameters vs product quality
WITH quality_data AS (
    SELECT
        batch_id,
        temperature,
        pressure,
        processing_time,
        quality_score
    FROM manufacturing_data
),
correlation_analysis AS (
    SELECT
        'Temperature vs Quality' AS relationship,
        CORR(temperature, quality_score) AS correlation,
        COUNT(*) AS sample_size
    FROM quality_data
    UNION ALL
    SELECT
        'Pressure vs Quality' AS relationship,
        CORR(pressure, quality_score) AS correlation,
        COUNT(*) AS sample_size
    FROM quality_data
    UNION ALL
    SELECT
        'Processing Time vs Quality' AS relationship,
        CORR(processing_time, quality_score) AS correlation,
        COUNT(*) AS sample_size
    FROM quality_data
)
SELECT
    relationship,
    correlation,
    sample_size,
    ABS(correlation) AS strength_magnitude,
    CASE
        WHEN ABS(correlation) >= 0.7 THEN 'Strong'
        WHEN ABS(correlation) >= 0.5 THEN 'Moderate'
        WHEN ABS(correlation) >= 0.3 THEN 'Weak'
        ELSE 'Very Weak'
    END AS strength,
    CASE
        WHEN correlation > 0 THEN 'Positive'
        WHEN correlation < 0 THEN 'Negative'
        ELSE 'None'
    END AS direction
FROM correlation_analysis
ORDER BY strength_magnitude DESC;
```
In the next lesson, we'll explore regression analysis and time series techniques to model relationships and make predictions from data.