This lesson covers regression analysis and time series forecasting, essential techniques for predictive analytics and understanding relationships in data.
Definition: A statistical method used to model and analyze the relationship between a dependent variable and one or more independent variables.
Purpose: quantify how changes in the predictors are associated with changes in the outcome, and use that relationship for prediction and explanation.
Dependent Variable (Y): the outcome being predicted or explained.
Independent Variables (X): the predictors used to explain or forecast Y.
Regression Equation: $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n + \epsilon$$
Where:
- β₀ is the intercept (the expected value of Y when all X equal zero)
- β₁ ... βₙ are the coefficients (the change in Y per one-unit change in each X)
- ε is the error term (variation in Y not explained by the model)
Based on Variable Types: simple regression uses one independent variable; multiple regression uses two or more.
Based on Relationship Type: linear regression models a straight-line relationship with a continuous outcome; logistic regression models a binary outcome through the sigmoid function.
For Linear Regression:
- Linearity: the relationship between X and Y is linear
- Independence: observations (and their errors) are independent of each other
- Homoscedasticity: residuals have constant variance across values of X
- Normality: residuals are approximately normally distributed
SQL Example for Checking Assumptions:
-- Check for linearity and outliers
WITH regression_data AS (
SELECT
advertising_spend,
sales_revenue,
-- Calculate residuals (would need actual regression coefficients)
sales_revenue - (intercept + coefficient * advertising_spend) as residual
FROM sales_data
),
assumption_checks AS (
SELECT
-- Check homoscedasticity (constant variance)
CORR(advertising_spend, ABS(residual)) as heteroscedasticity_check,
-- Check normality of residuals (simplified)
AVG(residual) as mean_residual,
STDDEV(residual) as std_residual,
COUNT(*) as sample_size
FROM regression_data
)
SELECT
heteroscedasticity_check,
mean_residual,
std_residual,
sample_size,
CASE
WHEN ABS(heteroscedasticity_check) < 0.3 THEN 'Homoscedasticity likely satisfied'
ELSE 'Potential heteroscedasticity issue'
END as variance_assumption,
CASE
WHEN ABS(mean_residual) < 0.1 * std_residual THEN 'Mean near zero (good)'
ELSE 'Mean not near zero (potential issue)'
END as mean_assumption
FROM assumption_checks;
Definition: Models the relationship between one independent variable and one dependent variable using a straight line.
Equation: $$Y = \beta_0 + \beta_1X + \epsilon$$
Coefficient Interpretation: β₀ is the predicted value of Y when X = 0; β₁ is the change in Y for a one-unit increase in X.
Example: Predicting house prices based on square footage
Y = house price
X = square footage
β₀ = $50,000 (base price)
β₁ = $150 (price per square foot)
Equation: Price = 50,000 + 150 × SquareFeet
Interpretation: Each additional square foot adds $150 to house price
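The closed-form least-squares formulas behind the SQL implementation below can be checked with a short Python sketch. The data points are generated exactly from the house-price example above, so the fit recovers its coefficients:

```python
# Minimal sketch of closed-form simple linear regression:
# slope = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)², intercept = ȳ - slope · x̄
def fit_simple_regression(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Illustrative points generated from Price = 50,000 + 150 × SquareFeet
square_feet = [1000, 1500, 2000, 2500]
prices = [50_000 + 150 * sf for sf in square_feet]
intercept, slope = fit_simple_regression(square_feet, prices)
print(intercept, slope)  # → 50000.0 150.0
```

Because the points lie exactly on the line, the formulas return the original β₀ and β₁; with noisy data they return the best-fitting line instead.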
SQL Implementation for Simple Regression:
-- Calculate simple linear regression coefficients
WITH regression_stats AS (
SELECT
AVG(advertising_spend) as mean_x,
AVG(sales_revenue) as mean_y,
COUNT(*) as n
FROM sales_data
),
regression_components AS (
SELECT
(advertising_spend - (SELECT mean_x FROM regression_stats)) *
(sales_revenue - (SELECT mean_y FROM regression_stats)) as xy_product,
POWER(advertising_spend - (SELECT mean_x FROM regression_stats), 2) as x_squared
FROM sales_data
),
coefficients AS (
SELECT
-- Calculate slope (β₁)
SUM(xy_product) / SUM(x_squared) as slope,
-- Calculate intercept (β₀)
(SELECT mean_y FROM regression_stats) -
(SUM(xy_product) / SUM(x_squared)) * (SELECT mean_x FROM regression_stats) as intercept
FROM regression_components
)
SELECT
slope as beta_1,
intercept as beta_0,
'Sales Revenue = ' || ROUND(intercept, 2) || ' + ' || ROUND(slope, 2) || ' × Advertising Spend' as regression_equation,
-- R-squared in a scalar subquery, so the aggregate does not
-- mix with the non-aggregated coefficient columns
(SELECT POWER(CORR(advertising_spend, sales_revenue), 2) FROM sales_data) as r_squared
FROM coefficients;
Definition: Models the relationship between multiple independent variables and one dependent variable.
Equation: $$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n + \epsilon$$
Coefficient Interpretation: each βᵢ is the change in Y for a one-unit increase in Xᵢ, holding all other variables constant.
Example: Predicting employee salary
Y = salary
X₁ = years of experience
X₂ = education level (years)
X₃ = performance score
Equation: Salary = 30,000 + 2,000×Experience + 3,000×Education + 1,500×Performance
Interpretation:
- Each additional year of experience adds $2,000 to salary
- Each additional year of education adds $3,000 to salary
- Each point in performance adds $1,500 to salary
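The salary equation above can be recovered numerically with ordinary least squares. A minimal Python sketch (the six employee records are invented for illustration, and numpy's least-squares solver stands in for a statistics package):

```python
import numpy as np

# Hypothetical data generated exactly from
# Salary = 30,000 + 2,000·Experience + 3,000·Education + 1,500·Performance
experience  = np.array([1, 3, 5, 2, 8, 4.0])
education   = np.array([12, 16, 14, 18, 16, 12.0])
performance = np.array([3, 4, 5, 2, 4, 3.0])
salary = 30_000 + 2_000 * experience + 3_000 * education + 1_500 * performance

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(experience), experience, education, performance])
beta, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(np.round(beta))  # beta ≈ [30000, 2000, 3000, 1500]
```

Since the synthetic data contains no noise, the solver returns the original coefficients; real data would yield estimates with standard errors around them.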
SQL Example for Multiple Regression Preparation:
-- Prepare data for multiple regression analysis
WITH multiple_regression_data AS (
SELECT
salary as dependent_variable,
years_experience as x1_experience,
education_years as x2_education,
performance_score as x3_performance,
-- Create interaction terms if needed
years_experience * education_years as interaction_exp_edu
FROM employee_data
),
correlation_matrix AS (
-- Check for multicollinearity
SELECT
'Experience vs Education' as pair,
CORR(x1_experience, x2_education) as correlation
FROM multiple_regression_data
UNION ALL
SELECT
'Experience vs Performance' as pair,
CORR(x1_experience, x3_performance) as correlation
FROM multiple_regression_data
UNION ALL
SELECT
'Education vs Performance' as pair,
CORR(x2_education, x3_performance) as correlation
FROM multiple_regression_data
),
descriptive_stats AS (
SELECT
AVG(dependent_variable) as mean_salary,
STDDEV(dependent_variable) as std_salary,
AVG(x1_experience) as mean_experience,
STDDEV(x1_experience) as std_experience,
AVG(x2_education) as mean_education,
STDDEV(x2_education) as std_education,
AVG(x3_performance) as mean_performance,
STDDEV(x3_performance) as std_performance,
COUNT(*) as sample_size
FROM multiple_regression_data
)
SELECT
pair,
correlation,
CASE
WHEN ABS(correlation) > 0.8 THEN 'High multicollinearity concern'
WHEN ABS(correlation) > 0.6 THEN 'Moderate multicollinearity concern'
ELSE 'Low multicollinearity concern'
END as multicollinearity_warning
FROM correlation_matrix
UNION ALL
SELECT
'Sample Statistics' as pair,
NULL as correlation,
'n=' || sample_size || ', Mean Salary=$' || ROUND(mean_salary, 0) as multicollinearity_warning
FROM descriptive_stats;
R-squared (R²): the proportion of variance in Y explained by the model, ranging from 0 to 1. $$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Adjusted R-squared: R² penalized for the number of predictors k, so adding uninformative variables does not inflate the score. $$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
Root Mean Square Error (RMSE): the square root of the average squared residual, in the same units as Y; penalizes large errors more heavily.
Mean Absolute Error (MAE): the average absolute residual; more robust to outliers than RMSE.
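These fit metrics are simple enough to compute by hand. A short Python sketch on a tiny made-up actual/predicted pair (one predictor assumed for the adjusted version):

```python
import math

# Illustrative data: four actuals and their model predictions
actual    = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

n = len(actual)
residuals = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(r) for r in residuals) / n                 # mean absolute error
rmse = math.sqrt(sum(r * r for r in residuals) / n)       # root mean square error
mean_a = sum(actual) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r_squared = 1 - ss_res / ss_tot
k = 1  # number of predictors assumed for this example
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(mae, rmse, r_squared)  # → 1.0 1.0 0.8
```

Note that the adjusted value (0.7 here) is always at or below plain R², and the gap widens as predictors are added relative to the sample size.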
SQL Example for Model Evaluation:
-- Evaluate regression model performance
WITH model_predictions AS (
SELECT
actual_sales,
predicted_sales,
actual_sales - predicted_sales as residual,
ABS(actual_sales - predicted_sales) as absolute_error,
POWER(actual_sales - predicted_sales, 2) as squared_error
FROM sales_predictions
),
performance_metrics AS (
SELECT
AVG(absolute_error) as mae,
SQRT(AVG(squared_error)) as rmse,
-- R-squared: 1 - SS_res / SS_tot (the overall mean comes from a
-- scalar subquery to avoid nesting aggregates)
1 - (SUM(squared_error) /
SUM(POWER(actual_sales - (SELECT AVG(actual_sales) FROM model_predictions), 2))) as r_squared,
COUNT(*) as n,
COUNT(*) - 3 - 1 as df_adjusted -- n - k - 1, assuming k = 3 predictors
FROM model_predictions
)
SELECT
mae as mean_absolute_error,
rmse as root_mean_square_error,
r_squared as r_squared,
-- Adjusted R-squared
1 - ((1 - r_squared) * (n - 1) / df_adjusted) as adjusted_r_squared,
CASE
WHEN r_squared > 0.8 THEN 'Excellent fit'
WHEN r_squared > 0.6 THEN 'Good fit'
WHEN r_squared > 0.4 THEN 'Moderate fit'
ELSE 'Poor fit'
END as model_quality
FROM performance_metrics;
Business Example 1: Sales Forecasting
-- Multiple regression for sales prediction
WITH sales_factors AS (
SELECT
monthly_sales,
advertising_spend,
competitor_price,
seasonality_index,
economic_indicator
FROM monthly_business_data
),
regression_analysis AS (
SELECT
CORR(advertising_spend, monthly_sales) as ad_correlation,
CORR(competitor_price, monthly_sales) as price_correlation,
CORR(seasonality_index, monthly_sales) as season_correlation,
CORR(economic_indicator, monthly_sales) as economic_correlation
FROM sales_factors
)
SELECT
'Advertising Impact' as factor,
ad_correlation as correlation,
CASE
WHEN ad_correlation > 0.5 THEN 'Strong positive driver'
WHEN ad_correlation > 0.3 THEN 'Moderate positive driver'
WHEN ad_correlation > -0.3 THEN 'Weak/no relationship'
ELSE 'Negative relationship'
END as interpretation
FROM regression_analysis
UNION ALL
SELECT
'Price Competition' as factor,
price_correlation as correlation,
CASE
WHEN price_correlation < -0.5 THEN 'Strong negative impact'
WHEN price_correlation < -0.3 THEN 'Moderate negative impact'
WHEN price_correlation < 0.3 THEN 'Weak/no relationship'
ELSE 'Positive relationship (unusual)'
END as interpretation
FROM regression_analysis;
Definition: A statistical method used when the dependent variable is binary (0/1, yes/no, true/false).
Purpose: estimate the probability that a binary outcome occurs, and classify observations based on that probability.
Key Difference from Linear Regression: the output is a probability bounded between 0 and 1 (via the sigmoid function), rather than an unbounded continuous value.
Sigmoid Function: $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Logistic Regression Equation: $$P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n)}}$$
Logit Transformation: $$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n$$
Odds Ratio: e^βᵢ gives the multiplicative change in the odds of Y = 1 for a one-unit increase in Xᵢ (values above 1 raise the odds; values below 1 lower them).
Example: Customer churn prediction
P(Churn=1) = 1 / (1 + e^(-(-2.5 + 0.1×MonthlyCharges - 0.3×Tenure)))
Interpretation:
- Intercept (-2.5): Baseline log-odds when all X = 0
- MonthlyCharges (0.1): Each $1 increase increases odds by e^0.1 = 1.105 (10.5% increase)
- Tenure (-0.3): Each additional month decreases odds by e^-0.3 = 0.741 (25.9% decrease)
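The arithmetic in this interpretation is easy to verify. A small Python sketch using the worked example's coefficients (these are illustrative numbers from the text, not a fitted model):

```python
import math

def churn_probability(monthly_charges, tenure):
    # Linear predictor from the worked example, passed through the sigmoid
    z = -2.5 + 0.1 * monthly_charges - 0.3 * tenure
    return 1 / (1 + math.exp(-z))

# Odds ratios: e^beta per one-unit increase in each predictor
print(round(math.exp(0.1), 3))   # → 1.105  (+10.5% odds per $1 of charges)
print(round(math.exp(-0.3), 3))  # → 0.741  (-25.9% odds per month of tenure)
# Example probability for $50 monthly charges and 12 months of tenure
print(round(churn_probability(50, 12), 3))  # → 0.25
```

The key point: coefficients act additively on the log-odds, so exponentiating them yields multiplicative effects on the odds.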
Confusion Matrix:
                Predicted Yes    Predicted No
Actual Yes      TP               FN
Actual No       FP               TN
Metrics:
- Accuracy = (TP + TN) / Total: overall share of correct predictions
- Precision = TP / (TP + FP): of predicted positives, how many were correct
- Recall (Sensitivity) = TP / (TP + FN): of actual positives, how many were found
- Specificity = TN / (TN + FP): of actual negatives, how many were found
- F1-score = 2 × (Precision × Recall) / (Precision + Recall): harmonic mean of precision and recall
ROC Curve and AUC: the ROC curve plots the true positive rate against the false positive rate across all classification thresholds; AUC (area under the curve) summarizes it, with 0.5 indicating random guessing and 1.0 a perfect classifier.
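Before turning to SQL, the classification metrics can be sketched directly from the four confusion-matrix cells. The counts below are invented for illustration:

```python
# Illustrative confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 30
total = tp + fp + fn + tn

accuracy    = (tp + tn) / total
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # a.k.a. sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, round(recall, 3), round(f1, 3))  # → 0.7 0.8 0.667 0.727
```

With these counts, precision exceeds recall: the model rarely cries wolf but misses a third of the true positives, which is exactly the trade-off F1 balances.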
SQL Example for Logistic Regression Evaluation:
-- Evaluate logistic regression model performance
WITH model_predictions AS (
SELECT
actual_outcome,
predicted_probability,
CASE
WHEN predicted_probability > 0.5 THEN 1
ELSE 0
END as predicted_outcome
FROM churn_predictions
),
confusion_matrix AS (
SELECT
SUM(CASE WHEN actual_outcome = 1 AND predicted_outcome = 1 THEN 1 ELSE 0 END) as true_positive,
SUM(CASE WHEN actual_outcome = 0 AND predicted_outcome = 1 THEN 1 ELSE 0 END) as false_positive,
SUM(CASE WHEN actual_outcome = 1 AND predicted_outcome = 0 THEN 1 ELSE 0 END) as false_negative,
SUM(CASE WHEN actual_outcome = 0 AND predicted_outcome = 0 THEN 1 ELSE 0 END) as true_negative,
COUNT(*) as total
FROM model_predictions
),
performance_metrics AS (
SELECT
true_positive,
false_positive,
false_negative,
true_negative,
total,
-- Calculate metrics (NULLIF guards against division by zero)
(true_positive + true_negative)::DECIMAL / total as accuracy,
true_positive::DECIMAL / NULLIF(true_positive + false_positive, 0) as precision,
true_positive::DECIMAL / NULLIF(true_positive + false_negative, 0) as recall,
true_negative::DECIMAL / NULLIF(true_negative + false_positive, 0) as specificity,
-- F1-score via the equivalent identity 2TP / (2TP + FP + FN)
2 * true_positive::DECIMAL /
NULLIF(2 * true_positive + false_positive + false_negative, 0) as f1_score
FROM confusion_matrix
)
SELECT
accuracy,
precision,
recall,
specificity,
f1_score,
CASE
WHEN accuracy > 0.8 THEN 'Excellent accuracy'
WHEN accuracy > 0.7 THEN 'Good accuracy'
WHEN accuracy > 0.6 THEN 'Fair accuracy'
ELSE 'Poor accuracy'
END as accuracy_assessment,
CASE
WHEN f1_score > 0.8 THEN 'Excellent balance'
WHEN f1_score > 0.6 THEN 'Good balance'
WHEN f1_score > 0.4 THEN 'Fair balance'
ELSE 'Poor balance'
END as f1_assessment
FROM performance_metrics;
Business Example 1: Customer Churn Prediction
-- Logistic regression for churn analysis
WITH customer_features AS (
SELECT
customer_id,
churned as target_variable,
monthly_charges,
tenure_months,
contract_type,
payment_method,
customer_service_calls
FROM customer_data
),
feature_analysis AS (
SELECT
contract_type,
AVG(CASE WHEN target_variable = 1 THEN 1.0 ELSE 0.0 END) as churn_rate,
COUNT(*) as customer_count,
AVG(monthly_charges) as avg_charges,
AVG(tenure_months) as avg_tenure
FROM customer_features
GROUP BY contract_type
ORDER BY churn_rate DESC
)
SELECT
contract_type,
ROUND(churn_rate * 100, 2) as churn_percentage,
customer_count,
ROUND(avg_charges, 2) as average_monthly_charges,
ROUND(avg_tenure, 1) as average_tenure_months,
CASE
WHEN churn_rate > 0.3 THEN 'High churn risk segment'
WHEN churn_rate > 0.2 THEN 'Medium churn risk segment'
ELSE 'Low churn risk segment'
END as risk_level
FROM feature_analysis;
Business Example 2: Loan Default Prediction
-- Logistic regression for credit risk
WITH loan_applications AS (
SELECT
loan_id,
defaulted as target,
credit_score,
debt_to_income_ratio,
loan_amount,
employment_years,
home_ownership
FROM loan_data
),
risk_factors AS (
SELECT
CASE
WHEN credit_score >= 750 THEN 'Excellent'
WHEN credit_score >= 700 THEN 'Good'
WHEN credit_score >= 650 THEN 'Fair'
ELSE 'Poor'
END as credit_category,
AVG(CASE WHEN target = 1 THEN 1.0 ELSE 0.0 END) as default_rate,
COUNT(*) as loan_count,
AVG(debt_to_income_ratio) as avg_dti,
AVG(loan_amount) as avg_loan_amount
FROM loan_applications
GROUP BY credit_category
ORDER BY default_rate DESC
)
SELECT
credit_category,
ROUND(default_rate * 100, 2) as default_percentage,
loan_count,
ROUND(avg_dti, 3) as avg_debt_to_income,
ROUND(avg_loan_amount, 0) as avg_loan_amount,
CASE
WHEN default_rate > 0.2 THEN 'Very high risk'
WHEN default_rate > 0.1 THEN 'High risk'
WHEN default_rate > 0.05 THEN 'Moderate risk'
ELSE 'Low risk'
END as risk_assessment
FROM risk_factors;
Definition: A sequence of data points collected at successive time intervals.
Characteristics: observations are ordered in time, typically collected at regular intervals, and successive values are often correlated (autocorrelation).
Examples: daily sales, monthly revenue, quarterly GDP, hourly website traffic.
1. Trend Component: the long-term upward or downward direction of the series.
2. Seasonal Component: regular patterns that repeat over a fixed period, such as monthly or weekly cycles.
3. Cyclical Component: longer-term rises and falls without a fixed period, often tied to business cycles.
4. Irregular/Random Component: the unpredictable noise left after the other components are removed.
Decomposition Model: $$Y_t = Trend_t + Seasonal_t + Cyclical_t + Irregular_t$$
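The additive decomposition can be demonstrated on a small synthetic series. This Python sketch assumes a period of 4 and noise-free data so the components are exactly recoverable; the centered moving average (half-weighting the endpoints for an even window) is the classical trend estimator:

```python
# Synthetic series: linear trend plus a period-4 seasonal pattern that sums to zero
period = 4
trend_true    = [100 + 2 * t for t in range(12)]
seasonal_true = [10, -5, 0, -5] * 3
series = [tr + s for tr, s in zip(trend_true, seasonal_true)]

def centered_ma(xs, window):
    # Centered moving average over one full season; endpoints half-weighted
    # because the window length is even
    half = window // 2
    out = []
    for i in range(half, len(xs) - half):
        vals = xs[i - half:i + half + 1]
        out.append((vals[0] / 2 + sum(vals[1:-1]) + vals[-1] / 2) / window)
    return out

trend = centered_ma(series, period)
detrended = [series[i + period // 2] - t for i, t in enumerate(trend)]
print([round(t, 1) for t in trend[:3]])      # → [104.0, 106.0, 108.0]
print([round(d, 1) for d in detrended[:4]])  # → [0.0, -5.0, 10.0, -5.0]
```

The moving average recovers the true trend (104, 106, 108, ...), and subtracting it leaves the seasonal pattern; averaging the detrended values by season position would give the seasonal component estimate.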
Visual Analysis: plot the series along with its rolling mean and variance; a stationary series fluctuates around a stable level with roughly constant spread.
Statistical Tests: the Augmented Dickey-Fuller (ADF) and KPSS tests formally assess stationarity.
SQL Example for Time Series Analysis:
-- Analyze time series components
WITH time_series_data AS (
SELECT
date,
sales_amount,
EXTRACT(YEAR FROM date) as year,
EXTRACT(MONTH FROM date) as month,
EXTRACT(QUARTER FROM date) as quarter
FROM daily_sales
WHERE date >= '2022-01-01' AND date <= '2023-12-31'
),
trend_analysis AS (
-- Calculate moving average for trend
SELECT
date,
sales_amount,
AVG(sales_amount) OVER (
ORDER BY date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as moving_avg_30day,
-- Year-over-year comparison
sales_amount - LAG(sales_amount, 365) OVER (ORDER BY date) as yoy_change
FROM time_series_data
),
seasonal_analysis AS (
-- Calculate seasonal patterns
SELECT
month,
AVG(sales_amount) as avg_monthly_sales,
STDDEV(sales_amount) as std_monthly_sales,
COUNT(*) as observations
FROM time_series_data
GROUP BY month
ORDER BY month
),
stationarity_test AS (
-- Simple stationarity check (mean and variance over time)
SELECT
AVG(sales_amount) as overall_mean,
STDDEV(sales_amount) as overall_std,
-- Split data to check stability
AVG(CASE WHEN date < '2023-01-01' THEN sales_amount END) as first_half_mean,
STDDEV(CASE WHEN date < '2023-01-01' THEN sales_amount END) as first_half_std,
AVG(CASE WHEN date >= '2023-01-01' THEN sales_amount END) as second_half_mean,
STDDEV(CASE WHEN date >= '2023-01-01' THEN sales_amount END) as second_half_std
FROM time_series_data
)
SELECT
'Trend Analysis' as analysis_type,
COUNT(*) as data_points,
ROUND(AVG(yoy_change), 2) as avg_yoy_change,
CASE
WHEN AVG(yoy_change) > 0 THEN 'Upward trend'
WHEN AVG(yoy_change) < 0 THEN 'Downward trend'
ELSE 'No clear trend'
END as trend_direction
FROM trend_analysis
UNION ALL
SELECT
'Seasonal Analysis' as analysis_type,
COUNT(*) as months_analyzed,
ROUND(MAX(avg_monthly_sales) - MIN(avg_monthly_sales), 2) as seasonal_range,
CASE
WHEN MAX(avg_monthly_sales) / MIN(avg_monthly_sales) > 1.5 THEN 'Strong seasonality'
WHEN MAX(avg_monthly_sales) / MIN(avg_monthly_sales) > 1.2 THEN 'Moderate seasonality'
ELSE 'Weak seasonality'
END as seasonal_strength
FROM seasonal_analysis
UNION ALL
SELECT
'Stationarity Check' as analysis_type,
NULL as data_points,
ROUND(ABS(first_half_mean - second_half_mean), 2) as mean_difference,
CASE
WHEN ABS(first_half_mean - second_half_mean) / overall_mean < 0.1 THEN 'Stationary (stable mean)'
ELSE 'Non-stationary (unstable mean)'
END as stationarity_assessment
FROM stationarity_test;
1. Naive Methods: forecast the next value as the last observed value (or the value from the same period last year).
2. Moving Average: forecast using the average of the most recent k observations.
3. Exponential Smoothing: a weighted average that gives exponentially more weight to recent observations.
4. ARIMA Models: combine autoregression, differencing, and moving-average terms to model autocorrelated series.
5. Prophet: an additive model (open-sourced by Meta) designed for business series with strong seasonality and holiday effects.
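Simple exponential smoothing (method 3) is compact enough to show in full. A minimal Python sketch on made-up demand figures; alpha controls how strongly recent observations dominate:

```python
def exponential_smoothing(series, alpha):
    # Level update: new level = alpha * observation + (1 - alpha) * old level
    level = series[0]            # initialize at the first observation
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level                 # the one-step-ahead forecast

demand = [100, 110, 105, 115, 120]
print(exponential_smoothing(demand, alpha=0.5))  # → 115.0
```

With alpha = 0.5 each observation carries half the weight of everything before it; alpha near 1 approaches the naive forecast, while alpha near 0 approaches a long-run average.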
SQL Example for Basic Forecasting:
-- Simple time series forecasting using moving averages
WITH historical_data AS (
SELECT
date,
sales_amount,
-- Calculate moving averages
AVG(sales_amount) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) as ma_7day,
AVG(sales_amount) OVER (
ORDER BY date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as ma_30day,
-- Seasonal baseline: value from (approximately) the same date last year
LAG(sales_amount, 365) OVER (ORDER BY date) as same_month_last_year
FROM daily_sales
WHERE date < CURRENT_DATE
),
forecast_preparation AS (
SELECT
date,
sales_amount,
ma_7day,
ma_30day,
same_month_last_year,
-- Simple forecast components
ma_30day as trend_component,
(sales_amount / same_month_last_year) as seasonal_factor
FROM historical_data
WHERE same_month_last_year IS NOT NULL
),
forecast_model AS (
-- Calculate average seasonal factors by month
SELECT
EXTRACT(MONTH FROM date) as month,
AVG(seasonal_factor) as avg_seasonal_factor,
COUNT(*) as observations
FROM forecast_preparation
GROUP BY EXTRACT(MONTH FROM date)
),
future_forecast AS (
-- Generate forecasts for the next 30 days
-- (generate_series is aliased so its output column can be referenced)
SELECT
CURRENT_DATE + gs.n * INTERVAL '1 day' as forecast_date,
EXTRACT(MONTH FROM CURRENT_DATE + gs.n * INTERVAL '1 day') as forecast_month,
-- Get latest trend
(SELECT AVG(ma_30day) FROM forecast_preparation
WHERE date >= CURRENT_DATE - INTERVAL '7 days') as base_trend,
-- Apply seasonal adjustment
fm.avg_seasonal_factor
FROM generate_series(1, 30) AS gs(n)
JOIN forecast_model fm ON EXTRACT(MONTH FROM CURRENT_DATE + gs.n * INTERVAL '1 day') = fm.month
)
SELECT
forecast_date,
forecast_month,
ROUND(base_trend * avg_seasonal_factor, 2) as forecasted_sales,
ROUND(base_trend, 2) as trend_base,
ROUND(avg_seasonal_factor, 3) as seasonal_multiplier,
CASE
WHEN avg_seasonal_factor > 1.1 THEN 'High season'
WHEN avg_seasonal_factor < 0.9 THEN 'Low season'
ELSE 'Normal season'
END as seasonal_classification
FROM future_forecast
ORDER BY forecast_date;
Common Metrics:
- MAE (Mean Absolute Error): average absolute forecast error, in the units of the series
- RMSE (Root Mean Square Error): square root of the average squared error; penalizes large misses
- MAPE (Mean Absolute Percentage Error): average absolute error as a percentage of the actual value
- MASE (Mean Absolute Scaled Error): MAE scaled by the MAE of a naive forecast; values below 1 mean the model beats the naive benchmark
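These forecast metrics can be checked on a toy example. A Python sketch with four illustrative actual/forecast pairs; MASE uses the previous-value (naive) forecast as its scaling benchmark:

```python
import math

# Illustrative actuals and model forecasts
actual   = [100.0, 110.0, 105.0, 120.0]
forecast = [ 98.0, 112.0, 100.0, 118.0]

errors = [a - f for a, f in zip(actual, forecast)]
mae  = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
mape = 100 * sum(abs(e) / a for e, a in zip(errors, actual)) / len(errors)
# Naive benchmark: forecast each value with the previous actual
naive_mae = sum(abs(actual[i] - actual[i - 1])
                for i in range(1, len(actual))) / (len(actual) - 1)
mase = mae / naive_mae
print(mae, round(mape, 2), round(mase, 3))  # → 2.75 2.56 0.275
```

Here MASE is well below 1, meaning the model's average error is about a quarter of what simply repeating the last value would produce.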
SQL Example for Forecast Evaluation:
-- Evaluate time series forecast accuracy
WITH forecast_comparison AS (
SELECT
date,
actual_sales,
forecasted_sales,
actual_sales - forecasted_sales as forecast_error,
ABS(actual_sales - forecasted_sales) as absolute_error,
ABS((actual_sales - forecasted_sales) / actual_sales) as percentage_error,
-- Naive forecast (last period actual)
LAG(actual_sales, 1) OVER (ORDER BY date) as naive_forecast
FROM sales_forecast_evaluation
WHERE date >= '2023-01-01' AND date < CURRENT_DATE
),
error_metrics AS (
SELECT
AVG(absolute_error) as mae,
SQRT(AVG(POWER(forecast_error, 2))) as rmse,
AVG(percentage_error) * 100 as mape,
-- MASE calculation
AVG(absolute_error) / AVG(ABS(actual_sales - naive_forecast)) as mase,
COUNT(*) as forecast_periods
FROM forecast_comparison
WHERE naive_forecast IS NOT NULL
)
SELECT
mae as mean_absolute_error,
rmse as root_mean_square_error,
mape as mean_absolute_percentage_error,
mase as mean_absolute_scaled_error,
forecast_periods,
CASE
WHEN mape < 5 THEN 'Excellent forecast accuracy'
WHEN mape < 10 THEN 'Good forecast accuracy'
WHEN mape < 20 THEN 'Moderate forecast accuracy'
ELSE 'Poor forecast accuracy'
END as accuracy_assessment,
CASE
WHEN mase < 1 THEN 'Better than naive forecast'
ELSE 'Worse than naive forecast'
END as mase_assessment
FROM error_metrics;
Business Example 1: Sales Forecasting
-- Comprehensive sales forecasting analysis
WITH sales_decomposition AS (
SELECT
date,
sales_amount,
EXTRACT(YEAR FROM date) as year,
EXTRACT(MONTH FROM date) as month,
EXTRACT(DOW FROM date) as day_of_week,
-- Trend component (30-day moving average)
AVG(sales_amount) OVER (
ORDER BY date
ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
) as trend,
-- Year-over-year growth
sales_amount / LAG(sales_amount, 365) OVER (ORDER BY date) - 1 as yoy_growth
FROM daily_sales
),
seasonal_patterns AS (
-- Analyze seasonal patterns by month and day of week
SELECT
month,
day_of_week,
AVG(sales_amount) as avg_sales,
STDDEV(sales_amount) as std_sales,
COUNT(*) as observations
FROM sales_decomposition
GROUP BY month, day_of_week
ORDER BY month, day_of_week
),
forecast_accuracy AS (
-- Backtest forecasting accuracy (not referenced by the final SELECT below;
-- query this CTE on its own to see the error metrics)
SELECT
'30-day MA Forecast' as method,
AVG(ABS(sales_amount - trend)) as mae,
SQRT(AVG(POWER(sales_amount - trend, 2))) as rmse,
AVG(ABS((sales_amount - trend) / sales_amount)) * 100 as mape
FROM sales_decomposition
WHERE date >= '2023-01-01'
)
SELECT
month,
day_of_week,
ROUND(avg_sales, 2) as average_sales,
ROUND(std_sales, 2) as sales_volatility,
observations,
CASE
WHEN day_of_week IN (0, 6) THEN 'Weekend'
ELSE 'Weekday'
END as day_type,
CASE
WHEN month IN (11, 12) THEN 'Holiday Season'
WHEN month IN (6, 7, 8) THEN 'Summer Season'
ELSE 'Regular Season'
END as season_type
FROM seasonal_patterns
WHERE observations >= 4 -- At least 4 weeks of data
ORDER BY month, day_of_week;
Business Example 2: Inventory Demand Forecasting
-- Inventory demand forecasting with seasonality
WITH product_demand AS (
SELECT
product_id,
product_category,
date,
units_sold,
inventory_level,
-- Calculate demand patterns
AVG(units_sold) OVER (
PARTITION BY product_id
ORDER BY date
RANGE BETWEEN INTERVAL '30 days' PRECEDING AND CURRENT ROW
) as avg_demand_30d,
-- Demand relative to this product's average for the same calendar month
units_sold / AVG(units_sold) OVER (
PARTITION BY product_id, EXTRACT(MONTH FROM date)
) as seasonal_index
FROM product_sales
),
reorder_analysis AS (
-- Calculate reorder points based on forecast
SELECT
product_id,
product_category,
AVG(units_sold) as daily_demand,
STDDEV(units_sold) as demand_volatility,
AVG(avg_demand_30d) as forecasted_demand,
-- Safety stock (2 standard deviations)
2 * STDDEV(units_sold) as safety_stock,
-- Reorder point (lead time demand + safety stock)
AVG(avg_demand_30d) * 7 + 2 * STDDEV(units_sold) as reorder_point,
COUNT(*) as data_points
FROM product_demand
GROUP BY product_id, product_category
)
SELECT
product_category,
COUNT(*) as products_in_category,
ROUND(AVG(daily_demand), 2) as avg_daily_demand,
ROUND(AVG(forecasted_demand), 2) as avg_forecasted_demand,
ROUND(AVG(safety_stock), 2) as avg_safety_stock,
ROUND(AVG(reorder_point), 2) as avg_reorder_point,
CASE
WHEN AVG(demand_volatility) / AVG(daily_demand) > 0.5 THEN 'High volatility'
WHEN AVG(demand_volatility) / AVG(daily_demand) > 0.3 THEN 'Medium volatility'
ELSE 'Low volatility'
END as demand_volatility_level
FROM reorder_analysis
GROUP BY product_category
ORDER BY avg_daily_demand DESC;
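The reorder-point rule embedded in the SQL above (lead-time demand plus two standard deviations of safety stock) reduces to a few lines of arithmetic. A Python sketch on invented daily unit sales, assuming a 7-day lead time as in the query:

```python
import statistics

# Hypothetical daily unit sales for one product over a week
daily_units = [20, 25, 18, 30, 22, 27, 24]
lead_time_days = 7

avg_demand = statistics.mean(daily_units)
volatility = statistics.stdev(daily_units)   # sample standard deviation
safety_stock = 2 * volatility                # "two sigma" buffer, as in the SQL
reorder_point = avg_demand * lead_time_days + safety_stock
print(round(reorder_point, 1))  # → 174.2
```

Reordering when inventory falls to this level covers expected demand during the lead time plus a buffer against above-average demand; a larger multiplier on the standard deviation trades higher holding cost for fewer stockouts.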
In the next lesson, we'll explore data visualization fundamentals to effectively communicate insights from regression and time series analyses.