Whylitics

As part of our Market Forecasting services at Whylitics, we empower organizations to anticipate customer demand, revenue fluctuations, and regional growth potential. To showcase our approach while safeguarding client privacy, we leverage an anonymized retail dataset from Kaggle.

This dataset covers weekly sales from 45 stores across various departments and includes three data sources:

Stores: Store type and size
Features: Regional factors like temperature, CPI, unemployment, and promotional markdowns
Sales: Weekly sales per department from 2010–2012
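
As a rough illustration, the three sources could be combined into one modeling table along the following lines. This is a minimal sketch rather than our production pipeline: the column names (Store, Dept, Date, Weekly_Sales, Type, Size, Temperature, CPI) and the sample values are assumptions made for the example.

```python
import pandas as pd

# Made-up stand-in rows; the column names are assumptions for illustration.
week_dates = pd.to_datetime(["2010-02-05", "2010-02-12"])

stores = pd.DataFrame({"Store": [1, 2], "Type": ["A", "B"], "Size": [150000, 90000]})

features = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Date": list(week_dates) * 2,
    "Temperature": [42.3, 38.5, 40.2, 45.1],
    "CPI": [211.1, 211.2, 210.8, 210.9],
})

sales = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "Dept": [1, 1, 1, 1],
    "Date": list(week_dates) * 2,
    "Weekly_Sales": [24924.5, 46039.5, 35034.1, 60076.1],
})


def build_modeling_table(sales, features, stores):
    """Attach weekly regional features and static store attributes to each sales row."""
    # Each sales row should match at most one (Store, Date) feature row.
    merged = sales.merge(features, on=["Store", "Date"], how="left", validate="many_to_one")
    # Each sales row should match at most one store record.
    return merged.merge(stores, on="Store", how="left", validate="many_to_one")


table = build_modeling_table(sales, features, stores)
print(table)
```

Left joins keep every sales row even when a feature is missing, and the `validate` argument raises an error if a join would silently duplicate rows.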

The goal was to predict future weekly sales. Part of the challenge was that some variables are partially obscured: store types and markdown categories appear only as anonymized labels with no stated meaning, which mirrors the ambiguity often present in real-world datasets.

Exploratory Insights

Seasonal trends were the first to emerge. Sales spiked sharply between mid-November and the New Year, emphasizing the influence of holiday events. We then compared holiday weeks with regular weeks by calculating average daily revenue for each group. Although holiday weeks bring surges, regular weeks came out slightly ahead on average daily revenue.
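
The holiday comparison can be sketched as follows. It assumes a boolean IsHoliday flag per week and converts weekly totals to a daily figure by dividing by 7. Both the column names and the numbers are made up for the example, so the output says nothing about the real dataset.

```python
import pandas as pd

# Made-up weekly totals; "IsHoliday" and "Weekly_Sales" are assumed column names.
weekly = pd.DataFrame({
    "Weekly_Sales": [7700.0, 7000.0, 7000.0, 7000.0],
    "IsHoliday": [False, False, True, True],
})


def avg_daily_revenue_by_week_type(df):
    """Average daily revenue (weekly total / 7) for regular and holiday weeks."""
    daily = df["Weekly_Sales"] / 7
    by_type = daily.groupby(df["IsHoliday"]).mean()
    return by_type.rename(index={False: "regular", True: "holiday"})


comparison = avg_daily_revenue_by_week_type(weekly)
print(comparison)
```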

Markdown analysis revealed a counterintuitive trend: higher weekly sales often occurred alongside lower markdown values. This likely reflects pricing tiers—lower-priced, high-demand goods require minimal discounts to move volume.

Environmental and macroeconomic factors were also examined:

Fuel Price: No strong linear correlation with sales
CPI: Minimal direct influence on sales
Temperature: Peak sales occurred in moderate weather (40–60°F)
Unemployment: Slight negative trend—higher unemployment may reduce consumer spending
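
Checks of this kind can be approximated with a linear correlation per factor plus a binned average for temperature, since a peak in the middle of the range would not show up in a linear coefficient. The sketch below uses made-up values and assumed column names, and the band edges are arbitrary choices for the example.

```python
import pandas as pd

# Made-up observations; the column names are assumptions for illustration.
observations = pd.DataFrame({
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "Temperature": [25.0, 45.0, 55.0, 50.0, 75.0, 90.0],
    "Unemployment": [9.0, 8.0, 7.0, 6.0, 5.0, 4.0],
})


def linear_correlations(df, target="Weekly_Sales"):
    """Pearson correlation of every other numeric column with the target."""
    return df.corr()[target].drop(target)


def mean_sales_by_temperature_band(df, edges=(0, 40, 60, 80, 120)):
    """Average sales per temperature band, to expose non-linear patterns."""
    bands = pd.cut(df["Temperature"], bins=list(edges))
    return df.groupby(bands, observed=True)["Weekly_Sales"].mean()


correlations = linear_correlations(observations)
by_band = mean_sales_by_temperature_band(observations)
print(correlations)
print(by_band)
```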

We then identified the top 10 stores by total revenue—each labeled as type "A." Although anonymized, this consistent label likely indicates high-traffic, flagship stores.
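
A ranking like this takes only a few lines of pandas. The sketch below is illustrative: the function name, column names, and figures are assumptions, and it ranks the top 2 of 3 made-up stores rather than the top 10 of 45.

```python
import pandas as pd

# Made-up data; column names are assumptions for illustration.
store_sales = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3, 3],
    "Weekly_Sales": [100.0, 200.0, 50.0, 60.0, 300.0, 400.0],
})
store_info = pd.DataFrame({"Store": [1, 2, 3], "Type": ["A", "B", "A"]})


def top_stores_by_revenue(sales, stores, n=10):
    """Rank stores by total revenue and attach their (anonymized) type label."""
    totals = sales.groupby("Store")["Weekly_Sales"].sum().nlargest(n)
    ranked = totals.reset_index(name="Total_Sales")
    return ranked.merge(stores[["Store", "Type"]], on="Store", how="left")


top = top_stores_by_revenue(store_sales, store_info, n=2)
print(top)
```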

Modeling Sales Performance

We selected LightGBM for modeling due to its efficiency, scalability, and native support for categorical features. The initial model achieved:

RMSE: 5,162.96 (root mean squared error, in the same units as weekly sales)
R² Score: 0.9489 (94.89% of variance explained)
Accuracy (within ±10%): 24.53% (share of predictions within 10% of the true value)
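
For reference, the three metrics can be computed as in the sketch below. RMSE and R² follow their standard definitions. The "within ±10%" measure is our reading of the metric (the share of predictions whose relative error is at most 10%, skipping rows where the true value is zero), so treat that part as an assumption. The sample numbers are made up.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def share_within_pct(y_true, y_pred, pct=0.10):
    """Share of predictions within +/- pct of the true value (assumed definition)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0  # relative error is undefined when the true value is zero
    rel_err = np.abs(y_pred[nonzero] - y_true[nonzero]) / np.abs(y_true[nonzero])
    return float(np.mean(rel_err <= pct))


# Made-up values for illustration only.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [105.0, 150.0, 310.0, 400.0]

print(rmse(actual, predicted))
print(r2_score(actual, predicted))
print(share_within_pct(actual, predicted))
```

A high R² can coexist with a modest ±10% share. R² is dominated by the largest-volume series, while the ±10% share counts every row equally, including small departments where a modest absolute error is a large relative one.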

To improve this baseline, we introduced two key temporal features:

Sales_last_week: Previous week's sales for each department
Sales_4_weeks_avg: Rolling average over the prior 4 weeks
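
A minimal sketch of how these two features might be built with pandas follows. It assumes one row per store, department, and week with no missing weeks, and it assumes the rolling average excludes the current week so that the feature never sees the value being predicted. The Store, Dept, Date, and Weekly_Sales column names and the data are assumptions for the example.

```python
import pandas as pd

# Made-up history: department 1 has six weeks, department 2 has two.
dates = pd.date_range("2010-02-05", periods=6, freq="7D")
history = pd.DataFrame({
    "Store": [1] * 8,
    "Dept": [1] * 6 + [2] * 2,
    "Date": list(dates) + list(dates[:2]),
    "Weekly_Sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 100.0, 200.0],
})


def add_lag_features(df):
    """Add previous-week sales and a trailing 4-week mean per store/department."""
    df = df.sort_values(["Store", "Dept", "Date"]).copy()
    grouped = df.groupby(["Store", "Dept"])["Weekly_Sales"]
    # Previous week's sales within the same store/department.
    df["Sales_last_week"] = grouped.shift(1)
    # Mean of the four weeks before the current one (shift first, then roll).
    df["Sales_4_weeks_avg"] = grouped.transform(lambda s: s.shift(1).rolling(4).mean())
    return df


with_lags = add_lag_features(history)
print(with_lags)
```

The first rows of each series have no history and come out as NaN. They can be dropped or left as missing values, which LightGBM handles natively.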

These additions provided temporal memory and sales context. Feature importance plots then guided retraining with the 10 most influential features.
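
The selection step can be sketched independently of the model. The helper below is hypothetical, not part of LightGBM: it ranks features by an importance score and keeps the top k. With LightGBM's scikit-learn interface, the scores would typically come from a fitted model's `feature_importances_` attribute. The feature names and scores here are made up.

```python
import pandas as pd


def top_k_features(feature_names, importances, k=10):
    """Return the k feature names with the highest importance scores."""
    ranked = pd.Series(list(importances), index=list(feature_names))
    return ranked.sort_values(ascending=False).head(k).index.tolist()


# Made-up scores for illustration; a real run would read them from the fitted model.
names = ["Sales_last_week", "Sales_4_weeks_avg", "Dept", "Size", "CPI"]
scores = [420, 380, 150, 90, 12]

selected = top_k_features(names, scores, k=3)
print(selected)
```

The model is then retrained on the selected columns only and evaluated on the same held-out data as the baseline.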

The enhanced model significantly outperformed the original:

RMSE: 3,233.51 (about 37% lower than the baseline's 5,162.96)
R² Score: 0.9799 (nearly 98% of sales variance explained)
Accuracy (within ±10%): 48.87% (nearly half of predictions fell within 10% of true values)

Conclusion: Forecasting Sales with Confidence

The results from our enhanced LightGBM model illustrate the real potential of data-driven retail forecasting. With nearly 98% of variance explained and the share of predictions within ±10% of actual sales roughly doubling (from 24.53% to 48.87%) after adding temporal features and retraining on the most influential ones, this approach proves not just feasible but effective.

While this study relied on open data, using your organization’s own sales history, campaign structure, and store metadata can yield even more precise insights. Integrating internal KPIs and business events can further elevate the model’s predictive power.

This example was designed to simulate our forecasting workflow without exposing real customer data. It demonstrates how Whylitics combines EDA, feature engineering, and machine learning to make forecasting actionable and business-aligned.

The takeaway: Predictive insights are not magic—they are measurable, testable, and extremely valuable. We help you unlock them.

Retail Sales Forecasting Using Machine Learning: A Data-Driven Approach
