Applying machine learning to open retail data for accurate, demand-driven inventory forecasting.
This article demonstrates Whylitics' forecasting methodology using a public dataset to protect client confidentiality.
In today’s competitive retail environment, effective inventory and demand planning is critical to profitability. As part of Whylitics’ commitment to delivering Optimal Production & Inventory Planning services, we conducted a comprehensive analysis using publicly available sales and inventory data from a liquor retail chain. The dataset, sourced from Kaggle, provided detailed insights into store-level operations, covering over a year of sales, purchases, and inventory records. Our objective was to explore inventory inefficiencies, predict future demand, and develop a scalable decision-support tool for store managers.
The dataset included several interconnected tables covering sales transactions, purchase records, and inventory levels.
Through initial data preparation, we harmonized units of measurement (e.g., converting bottle sizes like "750mL", "1.75L", "5.0 Oz", "128.0 Gal" or "50mL 5 Pk" into consistent milliliter values), cleaned and converted timestamps, and encoded categorical variables efficiently. The entire dataset spanned sales and inventory movements across dozens of stores, covering a wide range of alcoholic beverages.
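The unit-harmonization step described above can be sketched as follows. This is an illustrative reconstruction, not the exact logic used in the analysis: the conversion table and parsing rules are assumptions, and real data would need additional edge-case handling.

```python
import re

# Assumed conversion factors to milliliters; the exact table used in
# the original analysis is not shown in the article.
ML_PER_UNIT = {"ml": 1.0, "l": 1000.0, "oz": 29.5735, "gal": 3785.41}

def size_to_ml(size: str) -> float:
    """Convert a free-text bottle size (e.g. "750mL", "1.75L",
    "50mL 5 Pk") into total milliliters, multiplying out packs."""
    s = size.lower()
    m = re.search(r"([\d.]+)\s*(ml|l|oz|gal)\b", s)
    if not m:
        raise ValueError(f"Unrecognized size: {size!r}")
    qty = float(m.group(1)) * ML_PER_UNIT[m.group(2)]
    pack = re.search(r"([\d.]+)\s*pk", s)  # handles "50mL 5 Pk"
    return qty * (float(pack.group(1)) if pack else 1.0)
```

A normalized volume column like this is what makes cross-product comparisons and aggregation meaningful later in the pipeline.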
One of the most critical goals was identifying operational inefficiencies across stores. We developed a financial metrics table per store, calculating:

- PurchaseSpend
- FreightSpend
- ExciseTax
- Revenue
- TotalOnHand (ending inventory)
- UnsoldInventoryValue
- TotalExpense
- Profit
- EfficiencyIndex = Profit / UnsoldInventoryValue
- ProfitMargin = Profit / Revenue
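A minimal pandas sketch of the per-store metrics table is shown below on toy data. The column names, the toy figures, and the TotalExpense definition (purchases + freight + excise) are assumptions for illustration; the EfficiencyIndex and ProfitMargin formulas follow the definitions given in the text.

```python
import pandas as pd

# Toy transaction-level and purchase-level data (assumed schema).
sales = pd.DataFrame({
    "Store": [1, 1, 2],
    "Revenue": [1200.0, 800.0, 500.0],
    "ExciseTax": [60.0, 40.0, 25.0],
})
purchases = pd.DataFrame({
    "Store": [1, 2],
    "PurchaseSpend": [1400.0, 600.0],
    "FreightSpend": [50.0, 20.0],
    "UnsoldInventoryValue": [300.0, 400.0],
})

# Aggregate sales per store, then join the purchase-side figures.
metrics = (
    sales.groupby("Store", as_index=False)[["Revenue", "ExciseTax"]].sum()
    .merge(purchases, on="Store")
)
# Assumed expense definition: purchases + freight + excise.
metrics["TotalExpense"] = (
    metrics["PurchaseSpend"] + metrics["FreightSpend"] + metrics["ExciseTax"]
)
metrics["Profit"] = metrics["Revenue"] - metrics["TotalExpense"]
metrics["EfficiencyIndex"] = metrics["Profit"] / metrics["UnsoldInventoryValue"]
metrics["ProfitMargin"] = metrics["Profit"] / metrics["Revenue"]
```

One row per store then supports exactly the comparison described next: high-revenue stores with bloated leftovers score a low EfficiencyIndex even when their raw profit looks healthy.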
Key Findings:
The Efficiency Index served as a balanced metric to evaluate stores. While some stores had minimal leftovers, their revenues were negligible. Others made significant purchases but couldn’t convert them into efficient sales, suggesting a mismatch between purchasing and customer demand.
Our exploration identified both the top-selling and the most overstocked products across stores.
Interestingly, certain products like Absolut appeared on both the overstock and high-sales lists. This indicated misallocated distribution — some stores may overstock fast-movers beyond actual demand.
We aggregated transaction data into monthly buckets per product-store combination. The dataset totaled nearly 500,000 monthly records. Key statistics revealed high variance and skewness in sales, price, volume, and excise taxes.
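The monthly bucketing step can be sketched as below. The schema and toy values are assumptions; the point is the groupby pattern that produces one row per product-store-month.

```python
import pandas as pd

# Assumed transaction-level schema: one row per sale.
tx = pd.DataFrame({
    "Store": [1, 1, 1, 2],
    "Brand": [100, 100, 100, 100],
    "SalesDate": pd.to_datetime(
        ["2016-03-02", "2016-03-18", "2016-04-05", "2016-03-09"]),
    "SalesQuantity": [4, 6, 3, 5],
})

# Bucket each transaction into its calendar month, then sum quantities
# per store-product-month combination.
tx["Month"] = tx["SalesDate"].dt.to_period("M")
monthly = (
    tx.groupby(["Store", "Brand", "Month"], as_index=False)["SalesQuantity"]
      .sum()
)
```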
To prepare the data for modeling, we applied outlier control and feature engineering to tame this variance before training.
To enable forward-looking inventory optimization, we developed a predictive model to estimate monthly product demand at each store. This model relied on a combination of cleaned and engineered features, where past behavior was leveraged to predict future needs. A key feature introduced was the lagged variable PreviousMonthlySales, which captured the momentum of product-level demand from the prior month—a strong predictor in retail environments.
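The lagged PreviousMonthlySales feature can be built with a grouped shift, sketched below on toy data (column names are assumptions). Each store-product series is shifted by one month so that every row carries the prior month's sales alongside the current month's target.

```python
import pandas as pd

# Assumed monthly table: one row per store-product-month.
monthly = pd.DataFrame({
    "Store": [1, 1, 1],
    "Brand": [100, 100, 100],
    "Month": pd.period_range("2016-03", periods=3, freq="M"),
    "SalesQuantity": [10, 3, 7],
})

# Shift within each store-product group so the lag never leaks across
# series boundaries.
monthly = monthly.sort_values(["Store", "Brand", "Month"])
monthly["PreviousMonthlySales"] = (
    monthly.groupby(["Store", "Brand"])["SalesQuantity"].shift(1)
)
# The first month of each series has no history; drop (or impute) it.
monthly = monthly.dropna(subset=["PreviousMonthlySales"])
```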
To enhance model performance and reduce computational load, we limited the dataset to a subset of representative stores—specifically stores 50, 73, 67, 34, 76, and 69. This allowed us to preserve data diversity while streamlining processing. Categorical variables such as store ID, brand, classification, and vendor number were then encoded using one-hot encoding to make them compatible with machine learning algorithms. After completing feature preparation, we trained a Random Forest Regressor on the structured dataset, using an 80/20 train-test split to validate model accuracy.
The results were highly promising. The model achieved an R² score of 0.9967, indicating an excellent fit between predicted and actual sales. The Mean Absolute Error (MAE) was just 0.0019, and the Root Mean Squared Error (RMSE) stood at 0.0295—both suggesting exceptional accuracy in forecasting monthly sales quantities. This level of precision was largely attributed to thoughtful feature engineering, rigorous outlier control, and the strong temporal signal captured through historical sales data.
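The encoding, training, and evaluation steps described above can be sketched end-to-end. This runs on synthetic data (the feature names, store subset, and data-generating process are assumptions), so its scores will not match the figures reported for the real dataset; it only illustrates the pipeline shape.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned monthly table (assumed schema):
# sales driven mostly by the previous month's sales, plus noise.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Store": rng.choice([50, 73, 67, 34, 76, 69], n),
    "Brand": rng.choice([101, 202, 303], n),
    "PreviousMonthlySales": rng.poisson(8, n).astype(float),
})
df["SalesQuantity"] = df["PreviousMonthlySales"] * 0.9 + rng.normal(0, 1, n)

# One-hot encode the categoricals, then an 80/20 train-test split.
X = pd.get_dummies(df.drop(columns="SalesQuantity"),
                   columns=["Store", "Brand"])
y = df["SalesQuantity"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

print(f"R²:   {r2_score(y_te, pred):.4f}")
print(f"MAE:  {mean_absolute_error(y_te, pred):.4f}")
print(f"RMSE: {mean_squared_error(y_te, pred) ** 0.5:.4f}")
```

Note that when a lag of the target is the dominant feature, held-out scores can look very strong; in production it is worth validating on a chronologically later period as well.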
To operationalize the forecasting model, we developed a user-friendly Decision-support Tool tailored for store managers. This tool allows managers to input key product and store attributes—such as store ID, brand ID, classification, vendor number, and previous month’s sales—and receive a precise recommendation for the optimal purchase quantity for the upcoming month.
By translating complex predictive outputs into actionable guidance, the tool empowers managers to make informed, data-driven procurement decisions aligned with their store’s specific demand patterns. This localized intelligence enhances purchasing accuracy, reduces overstock risk, and supports leaner, more efficient inventory management across the retail network.
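The tool's core step, mapping a manager's inputs onto the model's feature columns and returning a purchase recommendation, can be sketched as below. The helper, its inputs, and the stand-in model are all hypothetical; in the real tool the trained Random Forest and its full one-hot column list would be used instead.

```python
import math

import pandas as pd
from sklearn.dummy import DummyRegressor

# Stand-in for the trained model and its (abbreviated) feature columns;
# both are assumptions for this sketch.
FEATURE_COLUMNS = ["PreviousMonthlySales", "Store_50", "Store_73"]
model = DummyRegressor(strategy="constant", constant=12.4)
model.fit(pd.DataFrame([[0, 0, 0]], columns=FEATURE_COLUMNS), [0])

def recommend_purchase(store: int, prev_sales: float) -> int:
    """Build a single feature row from manager inputs and return a
    whole-unit purchase recommendation."""
    row = pd.DataFrame([{c: 0.0 for c in FEATURE_COLUMNS}])
    row["PreviousMonthlySales"] = prev_sales
    col = f"Store_{store}"
    if col in row.columns:          # one-hot flag for a known store
        row[col] = 1.0
    forecast = model.predict(row[FEATURE_COLUMNS])[0]
    return math.ceil(forecast)      # round up so forecast demand is covered

print(recommend_purchase(50, 9.0))  # -> 13 with the stand-in model
```

Rounding the forecast up is one simple policy choice; a real deployment might instead add a safety-stock buffer sized to the forecast error.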
This analysis underscores the transformative potential of data science in retail operations—revealing hidden inefficiencies, quantifying financial leakage, and enabling more strategic, evidence-based purchasing decisions. For the liquor retail chain under study, the central insight was a recurring mismatch between purchasing volumes and actual store-level demand: some stores overstocked even fast-moving brands, leaving working capital tied up in unsold inventory.
To address these challenges, we advocate for the adoption of predictive analytics tools that transform historical sales patterns into forward-looking recommendations. When applied consistently, such tools allow organizations to align purchase quantities with forecasted demand, reduce overstock risk, and operate leaner inventories across the network.
Importantly, while the model provides robust monthly demand forecasts, it is based on one year of historical data. To avoid distortion caused by seasonal spikes, we excluded January and February—months that showed unusually high sales likely tied to holidays and New Year events. Managers using the tool should remain mindful of these seasonal peaks and proactively adjust their purchase plans for such periods, anticipating higher-than-average demand.
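The seasonal exclusion described above amounts to a simple month filter before training, sketched here on toy data (column names are assumptions):

```python
import pandas as pd

# Assumed monthly table; January and February show holiday-driven
# spikes, so they are excluded from the training window.
monthly = pd.DataFrame({
    "Month": pd.period_range("2016-01", periods=6, freq="M"),
    "SalesQuantity": [40, 35, 12, 11, 13, 12],
})
trainable = monthly[~monthly["Month"].dt.month.isin([1, 2])]
```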
In conclusion, this initiative demonstrates that inventory planning is no longer a function of guesswork or static rules. With machine learning and store-level analytics, retail organizations can shift from reactive stock management to a predictive, agile approach—ensuring better financial outcomes and stronger customer satisfaction.
View the full code behind this analysis in our Google Colab notebook.