A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.
Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.
SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.
To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.
The data contains different attributes of the various products and stores. The detailed data dictionary is given below.
#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q
# import libraries for reading and manipulation of data
import os
import numpy as np
import pandas as pd
# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter
# import libraries to split datasets into training and testing sets
from sklearn.model_selection import train_test_split
# import ensemble regressors
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
# import libraries to compute regression metrics
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error,
)
from sklearn.metrics import mean_squared_error as mse
# import libraries to create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline
# import library to tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder
# import library to serialize the model
import joblib
# import library for API requests
import requests
# import library for hugging face space authentication to upload files
from huggingface_hub import login, HfApi
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)
# import library to suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

# run the following lines for Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# read the dataset from the Google Drive Python Course folder
products = pd.read_csv('/content/drive/MyDrive/Python Course/SuperKart.csv')

# creating a copy of the data
data = products.copy()

# pull the first 5 rows of data from the dataset
data.head(5)

| | Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FD6114 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 2842.40 |
| 1 | FD7839 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | OUT003 | 1999 | Medium | Tier 1 | Departmental Store | 4830.02 |
| 2 | FD5075 | 14.28 | Regular | 0.031 | Canned | 162.08 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4130.16 |
| 3 | FD8233 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4132.18 |
| 4 | NC1180 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2279.36 |
# pull the last 5 rows of the data from the dataset
data.tail(5)

| | Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8758 | NC7546 | 14.80 | No Sugar | 0.016 | Health and Hygiene | 140.53 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 3806.53 |
| 8759 | NC584 | 14.06 | No Sugar | 0.142 | Household | 144.51 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 5020.74 |
| 8760 | NC2471 | 13.48 | No Sugar | 0.017 | Health and Hygiene | 88.58 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 2443.42 |
| 8761 | NC7187 | 13.89 | No Sugar | 0.193 | Household | 168.44 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4171.82 |
| 8762 | FD306 | 14.73 | Low Sugar | 0.177 | Snack Foods | 224.93 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2186.08 |
# view the number of rows and columns that are present in the data
data.shape
(8763, 12)

Observations: The dataset has 8763 rows and 12 columns.
# display the datatype and non-null count for each column in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Product_Id 8763 non-null object
1 Product_Weight 8763 non-null float64
2 Product_Sugar_Content 8763 non-null object
3 Product_Allocated_Area 8763 non-null float64
4 Product_Type 8763 non-null object
5 Product_MRP 8763 non-null float64
6 Store_Id 8763 non-null object
7 Store_Establishment_Year 8763 non-null int64
8 Store_Size 8763 non-null object
9 Store_Location_City_Type 8763 non-null object
10 Store_Type 8763 non-null object
11 Product_Store_Sales_Total 8763 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB
Observations: Every column has 8763 non-null entries, so there is no missing data. Seven columns are of object (categorical) type, four are float64, and one (Store_Establishment_Year) is int64.
# check each column for missing values
data.isnull().sum()

| | 0 |
|---|---|
| Product_Id | 0 |
| Product_Weight | 0 |
| Product_Sugar_Content | 0 |
| Product_Allocated_Area | 0 |
| Product_Type | 0 |
| Product_MRP | 0 |
| Store_Id | 0 |
| Store_Establishment_Year | 0 |
| Store_Size | 0 |
| Store_Location_City_Type | 0 |
| Store_Type | 0 |
| Product_Store_Sales_Total | 0 |
Observations: There are no null values in this dataset.
# check for duplicate values in the dataset
data.duplicated().sum()
np.int64(0)
Observations: There are no duplicates in the data.
# check the statistical information for each numeric variable (column) in the dataset
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Product_Weight | 8763.0 | 12.653792 | 2.217320 | 4.000 | 11.150 | 12.660 | 14.180 | 22.000 |
| Product_Allocated_Area | 8763.0 | 0.068786 | 0.048204 | 0.004 | 0.031 | 0.056 | 0.096 | 0.298 |
| Product_MRP | 8763.0 | 147.032539 | 30.694110 | 31.000 | 126.160 | 146.740 | 167.585 | 266.000 |
| Store_Establishment_Year | 8763.0 | 2002.032751 | 8.388381 | 1987.000 | 1998.000 | 2009.000 | 2009.000 | 2009.000 |
| Product_Store_Sales_Total | 8763.0 | 3464.003640 | 1065.630494 | 33.000 | 2761.715 | 3452.340 | 4145.165 | 8000.000 |
# setup function to create combined boxplot and histogram for univariate analysis of numerical variables in dataset
# data - dataframe dataset; feature - column in dataset; figsize - figure size; kde - density curve displayed; bins - interval of groups in the histogram
def histogram_boxplot(data, feature, figsize=(20, 10), kde=False, bins=None):
    # create the subplots
    # nrows - number of rows in the subplot grid; sharex - x-axis will be shared among all subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}, figsize=figsize,
    )
    # create the boxplot, which displays a triangle to indicate the mean value of the variable
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="aquamarine")
    # create the histogram, with a dashed line for the mean and a solid line for the median
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, color="violet")
    ax_hist2.axvline(data[feature].mean(), color="black", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="gold", linestyle="-")

# setup function to create barplot with the percentage on top for univariate analysis of category variables in dataset
# data - dataframe dataset; feature - column in dataset; perc - display of percentages instead of count (set to False);
# n - display the top n category levels (set to display all levels)
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data, x=feature, palette="pastel",
        order=data[feature].value_counts().index[:n],
    )
    # annotate each bar with its percentage of the total, or with its raw count
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()
        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        ax.annotate(
            label, (x, y), ha="center", va="center",
            size=12, xytext=(0, 5), textcoords="offset points",
        )
    plt.show()

histogram_boxplot(data, "Product_Weight")
Observations: The Product Weight distribution looks mildly left skewed, with multiple outliers on both the lower and upper ends. The average product weight is ~12.7.
histogram_boxplot(data, "Product_Allocated_Area")
Observations: The Product Allocated Area variable distribution is heavily right skewed with all of the outliers in the upper quartiles. The average product allocated area of the dataset is ~0.07.
histogram_boxplot(data, "Product_MRP")
Observations: The Product MRP distribution looks mildly left skewed, with multiple outliers on both the lower and upper ends. The average product MRP is ~147.
histogram_boxplot(data, "Product_Store_Sales_Total")
Observations: The Product Store Sales Total distribution looks fairly symmetric, with multiple outliers on both the lower and upper ends. The average product store sales total is ~3464.
labeled_barplot(data, "Product_Sugar_Content", perc=True)
Observations: The products with Low Sugar content make up the majority of the product population at 57% (almost 5000 products) while Regular (25.7%) and No Sugar (17.3%) come in second and third respectively. 'reg' which makes up 1.2% most likely refers to 'Regular' sugar content so that will need to be adjusted.
labeled_barplot(data, "Product_Type", perc=True)
Observations: Fruits (14.3%) and Snack Foods (13.1%) are the top 2 product types in this dataset. They are also the only two product types that are double digit in percentage as well. Starchy Foods (1.6%), Breakfast (1.2%), and Seafood (0.9%) round out the bottom 3.
labeled_barplot(data, "Store_Id", perc=True)
Observations: The vast majority of the data comes from Store ID OUT004 at 53.4%, while the other stores (OUT001 - 18.1%, OUT003 - 15.4%, OUT002 - 13.1%) contribute significantly less. This raises several questions: Is there a data reporting issue at the other three stores? Is OUT004 a significantly larger store? Where are these stores located? This needs to be explored further.
labeled_barplot(data, "Store_Size", perc=True)
Observations: The vast majority of the data comes from Medium size stores at 68.8%, while the other sizes (High - 18.1% and Small - 13.1%) contribute significantly less. This raises several questions: Is there a data reporting issue for the other two store sizes? Where are these stores located? This needs to be explored further.
labeled_barplot(data, "Store_Location_City_Type", perc=True)
Observations: The vast majority of the data comes from Tier 2 stores at 71.5%, while the other tiers (Tier 1 - 15.4% and Tier 3 - 13.1%) contribute significantly less. This raises several questions: Is there a data reporting issue for the other two city tiers? Where are these stores located? This needs to be explored further.
labeled_barplot(data, "Store_Type", perc=True)
Observations: The vast majority of the data comes from Supermarket Type2 at 53.4%, while the other store types (Supermarket Type1 - 18.1%, Departmental Store - 15.4%, Food Mart - 13.1%) contribute significantly less. This raises several questions: Is there a data reporting issue at the other three store types? Is the Supermarket Type2 store significantly larger? This needs to be explored further.
# setup function to create category counts and plot a stacked bar chart for bivariate analysis of variables in dataset
# data - dataframe dataset; predictor - independent variable, target - target variable
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    # a single legend call; a second call would simply replace the first
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

# setup function to create scatterplot to see how one variable relates to another and whether the predictor categories show distinct behaviors on the target variable
def scatterplot_distribution(data, predictor, target):
    plt.figure(figsize=(12, 6))
    sns.scatterplot(data=data, x=predictor, y=target, hue=predictor)
    plt.title(f"Scatterplot for {predictor} vs {target}")
    plt.show()

# setup function to create boxplot to show the data's median, spread, range, and outlier points
def boxplot_distribution(data, predictor, target):
    plt.figure(figsize=[12, 6])
    sns.boxplot(data=data, x=predictor, y=target, hue=predictor)
    plt.xticks(rotation=90)
    plt.title(f"Boxplot for {predictor} vs {target}")
    plt.show()

# setup function to create grouped barplot to compare multiple related categories side by side within each main category, revealing patterns, differences, and trends in the dataset
def grouped_barplot(data, group_cols, value_col, x, y, hue=None,
                    agg_func='sum', figsize=(12, 6), title='Grouped Bar Plot'):
    grouped = data.groupby(group_cols)[value_col].agg(agg_func).reset_index()
    plt.figure(figsize=figsize)
    ax = sns.barplot(data=grouped, x=x, y=y, hue=hue)
    ax.set(xlabel=x, ylabel=y, title=title)
    ax.ticklabel_format(style='plain', axis='y')
    ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f'{v:,.0f}'))
    plt.xticks(rotation=90)
    if hue:
        plt.legend(title=hue, loc='upper left')
    plt.tight_layout()
    plt.show()

# setup function to create barplot to compare the sum of the revenue to other variables
def revenue_barplot(data, predictor, target):
    agg_data = data.groupby([predictor])[target].sum().reset_index()
    plt.figure(figsize=(12, 6))
    ax = sns.barplot(data=agg_data, x=predictor, y=target)
    ax.set(xlabel=predictor, ylabel=f'Total {target}', title=f'Total {target} by {predictor}')
    ax.ticklabel_format(style='plain', axis='y')
    ax.yaxis.set_major_formatter(FuncFormatter(lambda v, _: f'{v:,.0f}'))
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

# create a heatmap for the correlation of the numeric features
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Store_Establishment_Year')
plt.figure(figsize=(20, 10))
sns.heatmap(
data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap='coolwarm'
)
plt.show()
Observations: The most highly correlated pair is Product MRP and Product Store Sales Total at 0.79; the second highest is Product Weight and Product Store Sales Total at 0.74.
scatterplot_distribution(data, 'Product_Weight', 'Product_Store_Sales_Total')
Observation: This scatterplot shows a positive correlation between Product Weight and Product Store Sales Total: as product weight increases, total store sales also rise. This could be the result of bulk purchases or greater value for heavier products, and it is consistent with the heatmap results.
scatterplot_distribution(data, 'Product_Allocated_Area', 'Product_Store_Sales_Total')
Observation: This scatterplot of Product Allocated Area against Product Store Sales Total shows a tight vertical band with no clear trend; allocated area does not strongly influence sales. This is consistent with the heatmap results.
scatterplot_distribution(data, 'Product_MRP', 'Product_Store_Sales_Total')
Observation: This scatterplot shows a positive correlation between Product MRP (maximum retail price) and Product Store Sales Total: as MRP increases, total store sales also rise. This could be the result of a number of factors ranging from product quality to bulk items to consumer behavior, and it is consistent with the heatmap results.
boxplot_distribution(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')
Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Sales Total (3300-3500) so this suggests that sugar content does not meaningfully affect sales.
boxplot_distribution(data, 'Product_Type', 'Product_Store_Sales_Total')
Observations: The medians for each product type hover around the same Sales Total (3300-3500), which suggests that product type does not meaningfully affect sales.
boxplot_distribution(data, 'Store_Id', 'Product_Store_Sales_Total')
Observations: Store OUT003 has the highest median at about 4900, while the lowest median of 1800 belongs to Store OUT002, so Store ID does meaningfully affect sales. OUT002 may warrant research into how sales could be boosted. OUT001 seems fairly stable. OUT004 has a large number of outliers in the upper and lower quartiles, with the greatest density in the upper quartile; its outliers could be reviewed to develop strategies for the other stores.
boxplot_distribution(data, 'Store_Size', 'Product_Store_Sales_Total')
Observations: High store size has the highest median at about 4000, while the lowest median of 1800 belongs to Small store size, so Store Size does meaningfully affect sales. Small stores may warrant research into how sales could be boosted.
boxplot_distribution(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')
Observations: Tier 1 location city type has the highest median at about 4000, while the lowest median of 1800 belongs to Tier 3, so Store Location City Type does meaningfully affect sales. Tier 3 stores may warrant research into how sales could be boosted.
boxplot_distribution(data, 'Store_Type', 'Product_Store_Sales_Total')
Observations: Departmental Store type has the highest median at about 4000, while the lowest median of 1800 belongs to Food Mart, so Store Type does meaningfully affect sales. Food Mart stores may warrant research into how sales could be boosted.
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Sugar_Content', y='Product_Weight', hue='Product_Sugar_Content')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Sugar Content')
plt.show()
Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Product Weight (12.5) so this suggests that sugar content does not meaningfully affect product weight.
plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Type', y='Product_Weight', hue='Product_Type')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Type')
plt.show()
Observations: The medians for each product type hover around the same Product Weight (12.5) so this suggests that product type does not meaningfully affect product weight.
store_ids = ['OUT001', 'OUT002', 'OUT003', 'OUT004']
cols_list = ['Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']

for store in store_ids:
    print(f'\n**** Statistics for Store ID: {store} ****')
    display(data.loc[data['Store_Id'] == store, cols_list].describe(include='all').T)
**** Statistics for Store ID: OUT001 ****
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Store_Establishment_Year | 1586.0 | NaN | NaN | NaN | 1987.0 | 0.0 | 1987.0 | 1987.0 | 1987.0 | 1987.0 | 1987.0 |
| Store_Size | 1586 | 1 | High | 1586 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Location_City_Type | 1586 | 1 | Tier 2 | 1586 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Type | 1586 | 1 | Supermarket Type1 | 1586 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
**** Statistics for Store ID: OUT002 ****
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Store_Establishment_Year | 1152.0 | NaN | NaN | NaN | 1998.0 | 0.0 | 1998.0 | 1998.0 | 1998.0 | 1998.0 | 1998.0 |
| Store_Size | 1152 | 1 | Small | 1152 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Location_City_Type | 1152 | 1 | Tier 3 | 1152 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Type | 1152 | 1 | Food Mart | 1152 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
**** Statistics for Store ID: OUT003 ****
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Store_Establishment_Year | 1349.0 | NaN | NaN | NaN | 1999.0 | 0.0 | 1999.0 | 1999.0 | 1999.0 | 1999.0 | 1999.0 |
| Store_Size | 1349 | 1 | Medium | 1349 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Location_City_Type | 1349 | 1 | Tier 1 | 1349 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Type | 1349 | 1 | Departmental Store | 1349 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
**** Statistics for Store ID: OUT004 ****
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Store_Establishment_Year | 4676.0 | NaN | NaN | NaN | 2009.0 | 0.0 | 2009.0 | 2009.0 | 2009.0 | 2009.0 | 2009.0 |
| Store_Size | 4676 | 1 | Medium | 4676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Location_City_Type | 4676 | 1 | Tier 2 | 4676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Store_Type | 4676 | 1 | Supermarket Type2 | 4676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Observations:

OUT001 - Established 1987; Store Size: High; Store Location City Type: Tier 2; Store Type: Supermarket Type1
OUT002 - Established 1998; Store Size: Small; Store Location City Type: Tier 3; Store Type: Food Mart
OUT003 - Established 1999; Store Size: Medium; Store Location City Type: Tier 1; Store Type: Departmental Store
OUT004 - Established 2009; Store Size: Medium; Store Location City Type: Tier 2; Store Type: Supermarket Type2

Each Store ID maps to exactly one store size, location city type, and store type, so the earlier store size, location, and store type distributions simply mirror the record counts per store. After reviewing these statistics, the previous observations become clearer.
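The claim that each Store ID carries a single, fixed set of store attributes can be checked programmatically rather than by eyeballing four describe() tables. A minimal sketch with groupby().nunique(), using a hypothetical toy frame that reuses the dataset's column names:

```python
import pandas as pd

# toy frame standing in for the SuperKart data (hypothetical rows; column names from the dataset)
toy = pd.DataFrame({
    'Store_Id': ['OUT001', 'OUT001', 'OUT002'],
    'Store_Size': ['High', 'High', 'Small'],
    'Store_Location_City_Type': ['Tier 2', 'Tier 2', 'Tier 3'],
    'Store_Type': ['Supermarket Type1', 'Supermarket Type1', 'Food Mart'],
})

# if every store carries exactly one value per attribute, every nunique is 1
per_store = toy.groupby('Store_Id')[['Store_Size', 'Store_Location_City_Type', 'Store_Type']].nunique()
print((per_store == 1).all().all())  # True
```

Running the same check on the full dataset would confirm in one line that the store-level columns are redundant with Store_Id.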
store_revenue = (
    data.groupby('Store_Id')['Product_Store_Sales_Total']
    .sum()
    .loc[store_ids]
)
for store, revenue in store_revenue.items():
    print(f'Store ID: {store}, Total Revenue: ${revenue:,.2f}')

Store ID: OUT001, Total Revenue: $6,223,113.18
Store ID: OUT002, Total Revenue: $2,030,909.72
Store ID: OUT003, Total Revenue: $6,673,457.57
Store ID: OUT004, Total Revenue: $15,427,583.43
Observations: OUT004 reports the most revenue, but it also has the most records (4676). OUT002 has the lowest revenue but also the fewest records (1152).
revenue_barplot(data, 'Store_Id', 'Product_Store_Sales_Total')
Observations: The bar chart confirms that OUT004 reports the most revenue and OUT002 the least, mirroring their record counts of 4676 and 1152.
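Because total revenue scales with the number of records per store, comparing the per-record mean alongside the sum helps separate "sells more per product" from "simply has more rows". A sketch on hypothetical values:

```python
import pandas as pd

# toy revenue data (hypothetical values): one store with many records, one with few
toy = pd.DataFrame({
    'Store_Id': ['OUT004', 'OUT004', 'OUT004', 'OUT002'],
    'Product_Store_Sales_Total': [100.0, 110.0, 90.0, 120.0],
})

# sum() is confounded by record counts; mean() gives revenue per record
per_store = toy.groupby('Store_Id')['Product_Store_Sales_Total'].agg(['count', 'sum', 'mean'])
print(per_store)
```

Here OUT004 dominates on sum (300 vs 120) purely through record count, while OUT002 actually has the higher per-record mean; the same `agg(['count', 'sum', 'mean'])` call applied to the real data would make the store comparison fairer.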
revenue_barplot(data, 'Store_Size', 'Product_Store_Sales_Total')
Observations: Medium stores report the most revenue, but they also have the most records. Small stores have the lowest revenue but also the fewest records.
revenue_barplot(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')
Observations: Tier 2 reports the most revenue, but it also has the most records. Tier 3 has the lowest revenue but also the fewest records.
revenue_barplot(data, 'Store_Type', 'Product_Store_Sales_Total')
Observations: Supermarket Type2 reports the most revenue, but it also has the most records. Food Mart has the lowest revenue but also the fewest records.
revenue_barplot(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')
Observations: Low Sugar content dominates the other content types with over 17,000,000 in revenue while No Sugar content is lowest in revenue with around 5,000,000.
revenue_product = data.groupby('Product_Type')['Product_Store_Sales_Total'].sum().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(data=revenue_product, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.xlabel('Product Types')
plt.ylabel('Product Store Sales Total')
plt.title('Revenue by Product Type')
plt.show()
Observations: Fruits and Vegetables and Snack Foods are the product types that generate the most revenue. Breakfast and Seafood products generate the least amount.
grouped_barplot(
data=data,
group_cols=['Store_Id', 'Product_Type'],
value_col='Product_Store_Sales_Total',
x='Store_Id',
y='Product_Store_Sales_Total',
hue='Product_Type',
title='Revenue generated by each Store ID for each Product Type'
)
Observations: Fruits and Vegetables and Snacks are the best revenue generators for each Store ID.
grouped_barplot(
data=data,
group_cols=['Product_Sugar_Content', 'Store_Id'],
value_col='Product_Store_Sales_Total',
x='Product_Sugar_Content',
y='Product_Store_Sales_Total',
hue='Store_Id',
title='Revenue generated by each Store ID for each Product Sugar Content'
)
Observations: Low Sugar content generates the most revenue for each store ID while No Sugar contents generates the least. The 'reg' product sugar content needs to be normalized at this point.
# updating the product sugar content to move 'reg' to 'Regular'
# (assigning the result back avoids the chained inplace replace deprecated in pandas 2.x)
data['Product_Sugar_Content'] = data['Product_Sugar_Content'].replace('reg', 'Regular')
data['Product_Sugar_Content'].value_counts()

| Product_Sugar_Content | count |
|---|---|
| Low Sugar | 4885 |
| Regular | 2359 |
| No Sugar | 1519 |
Observations: Normalized 'reg' Product Sugar Content to 'Regular'
# create a variable for number of years that store has been in operation
data['Store_Years_In_Operation'] = 2025 - data.Store_Establishment_Year
data.head()

| | Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | Store_Years_In_Operation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FD6114 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 2842.40 | 16 |
| 1 | FD7839 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | OUT003 | 1999 | Medium | Tier 1 | Departmental Store | 4830.02 | 26 |
| 2 | FD5075 | 14.28 | Regular | 0.031 | Canned | 162.08 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4130.16 | 38 |
| 3 | FD8233 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4132.18 | 38 |
| 4 | NC1180 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2279.36 | 27 |
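Hardcoding 2025 means this feature silently goes stale if the notebook is rerun in a later year. One alternative (a sketch, not how the notebook computes it) is to take the reference year from the system clock, at the cost of making the feature values depend on when the code runs:

```python
from datetime import date
import pandas as pd

# toy column of establishment years (values taken from the stores above)
toy = pd.DataFrame({'Store_Establishment_Year': [1987, 1998, 2009]})

# derive the reference year at run time instead of hardcoding 2025
reference_year = date.today().year
toy['Store_Years_In_Operation'] = reference_year - toy['Store_Establishment_Year']
print(toy)
```

If exact reproducibility across reruns matters more than freshness, keeping a fixed reference year (as the notebook does) is the safer choice; it just needs to be documented.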
# create a variable for product codes to reduce number of product ids for model
data['Product_Code'] = data['Product_Id'].str[:2]
data['Product_Code'].unique()
array(['FD', 'NC', 'DR'], dtype=object)
# list the product types that fall under each product code
codes = ['FD', 'NC', 'DR']
for code in codes:
    types = data.loc[data['Product_Code'] == code, 'Product_Type'].unique()
    print(f'{code}: {types}')

FD: ['Frozen Foods' 'Dairy' 'Canned' 'Baking Goods' 'Snack Foods' 'Meat'
'Fruits and Vegetables' 'Breads' 'Breakfast' 'Starchy Foods' 'Seafood']
NC: ['Health and Hygiene' 'Household' 'Others']
DR: ['Hard Drinks' 'Soft Drinks']
# create a variable for product categories to reduce number of product types for model
food = [
'Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Snack Foods', 'Meat', 'Fruits and Vegetables', 'Breads', 'Breakfast', 'Starchy Foods', 'Seafood'
]
data['Product_Category'] = np.where(data['Product_Type'].isin(food), 'Food', 'Non Food')
data.head()

| | Product_Id | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_Type | Product_MRP | Store_Id | Store_Establishment_Year | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | Store_Years_In_Operation | Product_Code | Product_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FD6114 | 12.66 | Low Sugar | 0.027 | Frozen Foods | 117.08 | OUT004 | 2009 | Medium | Tier 2 | Supermarket Type2 | 2842.40 | 16 | FD | Food |
| 1 | FD7839 | 16.54 | Low Sugar | 0.144 | Dairy | 171.43 | OUT003 | 1999 | Medium | Tier 1 | Departmental Store | 4830.02 | 26 | FD | Food |
| 2 | FD5075 | 14.28 | Regular | 0.031 | Canned | 162.08 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4130.16 | 38 | FD | Food |
| 3 | FD8233 | 12.10 | Low Sugar | 0.112 | Baking Goods | 186.31 | OUT001 | 1987 | High | Tier 2 | Supermarket Type1 | 4132.18 | 38 | FD | Food |
| 4 | NC1180 | 9.57 | No Sugar | 0.010 | Health and Hygiene | 123.67 | OUT002 | 1998 | Small | Tier 3 | Food Mart | 2279.36 | 27 | NC | Non Food |
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove('Store_Establishment_Year')
numeric_columns.remove('Store_Years_In_Operation')
plt.figure(figsize=(15, 10))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Observations: All four numeric variables show outliers beyond the 1.5×IQR whiskers, with Product_Allocated_Area the most heavily affected on the upper end. No rows are dropped at this stage.
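The whiskers in these boxplots follow the 1.5×IQR rule (whis=1.5). Counting the flagged points directly can complement the visual check; a minimal sketch on a toy series:

```python
import pandas as pd

# toy numeric column; the IQR rule flags points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
s = pd.Series([10, 11, 12, 12, 13, 14, 40])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [40]
```

Applying the same bounds to each column in `numeric_columns` would give a per-variable outlier count to accompany the plots.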
# drop some of the categorical features for modeling as new concise variables have been created
data = data.drop(columns=['Product_Id', 'Product_Type', 'Store_Id', 'Store_Establishment_Year'])
data.shape
(8763, 11)
data.head()

|   | Product_Weight | Product_Sugar_Content | Product_Allocated_Area | Product_MRP | Store_Size | Store_Location_City_Type | Store_Type | Product_Store_Sales_Total | Store_Years_In_Operation | Product_Code | Product_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.66 | Low Sugar | 0.027 | 117.08 | Medium | Tier 2 | Supermarket Type2 | 2842.40 | 16 | FD | Food |
| 1 | 16.54 | Low Sugar | 0.144 | 171.43 | Medium | Tier 1 | Departmental Store | 4830.02 | 26 | FD | Food |
| 2 | 14.28 | Regular | 0.031 | 162.08 | High | Tier 2 | Supermarket Type1 | 4130.16 | 38 | FD | Food |
| 3 | 12.10 | Low Sugar | 0.112 | 186.31 | High | Tier 2 | Supermarket Type1 | 4132.18 | 38 | FD | Food |
| 4 | 9.57 | No Sugar | 0.010 | 123.67 | Small | Tier 3 | Food Mart | 2279.36 | 27 | NC | Non Food |
# define the independent and dependent variables
X = data.drop(['Product_Store_Sales_Total'], axis=1)
y = data['Product_Store_Sales_Total']

# split the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# print the number of rows of each dataset
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 7010
Number of rows in test data = 1753
Observations: The training set (7,010 rows) is 80% of the original 8,763 rows, and the test set (1,753 rows) is the remaining 20%.
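The split sizes can be verified arithmetically; scikit-learn computes the test-set size as the ceiling of `n * test_size` and assigns the remainder to the training set:

```python
import math

# sanity-check the 80/20 split sizes reported above:
# test size = ceil(n * test_size), train size = remainder
n, test_size = 8763, 0.2
n_test = math.ceil(n * test_size)
n_train = n - n_test
print(n_train, n_test)  # 7010 1753
```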
# create a list of the categorical column names
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features
['Product_Sugar_Content',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Code',
 'Product_Category']
# create a preprocessing pipeline for the categorical features
# NOTE: make_column_transformer defaults to remainder='drop', so any columns
# not listed here (the numeric features) are excluded from the model's inputs
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
)

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
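A standalone numeric check of the adjusted R-squared formula (the toy arrays and the `n`, `k` values below are illustrative, not from the SuperKart data):

```python
# adjusted R-squared penalizes R-squared for the number of predictors:
# adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.7, 13.1])

r2 = r2_score(y_true, y_pred)
n, k = 6, 2  # 6 observations, 2 predictors (assumed for illustration)
adj_r2 = 1 - ((1 - r2) * (n - 1) / (n - k - 1))
print(adj_r2 < r2)  # True: the adjustment lowers the score when r2 < 1
```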
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)  # predict using the independent variables

    r2 = r2_score(target, pred)  # R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # RMSE
    mae = mean_absolute_error(target, pred)  # MAE
    mape = mean_absolute_percentage_error(target, pred)  # MAPE

    # collect the metrics in a one-row dataframe
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf

Two tree-based ensemble models, Random Forest and XGBoost, were chosen for this project exercise.
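A small standalone illustration of the metrics computed by the function above, on toy arrays (the numbers are illustrative, not from the SuperKart data):

```python
# RMSE, MAE, and MAPE on toy data where every prediction is off by 10
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

target = np.array([100.0, 200.0, 300.0, 400.0])
pred = np.array([110.0, 190.0, 310.0, 390.0])

rmse = np.sqrt(mean_squared_error(target, pred))  # sqrt of mean squared error
mae = mean_absolute_error(target, pred)           # mean absolute error
mape = mean_absolute_percentage_error(target, pred)  # mean of |error| / |target|
print(rmse, mae, round(mape, 4))  # 10.0 10.0 0.0521
```

Note that MAPE weights the same absolute error more heavily on small targets, which is why it is not simply `mae / mean(target)`.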
# fitting the random forest model
rf_estimator = RandomForestRegressor(random_state=42)
rf_estimator = make_pipeline(preprocessor, rf_estimator)
rf_estimator.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])
# calculating the training metrics
rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train, y_train)
print("Training performance \n", rf_estimator_model_train_perf)
Training performance
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0   604.135272  475.762704    0.67816          0.6777  0.173159
# calculating the testing metrics
rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test, y_test)
print("Testing performance \n", rf_estimator_model_test_perf)
Testing performance
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0   597.595427  469.084204   0.687017         0.68522  0.168082
Observations: R-squared is consistent between training (0.678) and testing (0.687), so there is no sign of overfitting. Note, however, that R² alone is not a complete measure of accuracy; RMSE, MAE, and MAPE should be read alongside it.
### XGBoost Regressor Model
# fitting the XGBoost model
xgb_estimator = XGBRegressor(random_state=42)
xgb_estimator = make_pipeline(preprocessor, xgb_estimator)
xgb_estimator.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(random_state=42, ...))])
# calculating the training metrics
xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train, y_train)
print("Training performance \n", xgb_estimator_model_train_perf)
Training performance
          RMSE        MAE  R-squared  Adj. R-squared      MAPE
0   604.129941  475.48628   0.678166        0.677706  0.173112
# calculating the testing metrics
xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test, y_test)
print("Testing performance \n", xgb_estimator_model_test_perf)
Testing performance
         RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.658808  468.827931    0.68695        0.685153  0.168053
Observations: As with the Random Forest, R-squared is consistent between training and testing, so there is no sign of overfitting. Out of the box, the two models perform almost identically.
# initialize the Random Forest regressor model
rf_tuned = RandomForestRegressor(random_state=42)
rf_tuned = make_pipeline(preprocessor, rf_tuned)
# set the grid of parameters to choose from
param_grid = {
    'randomforestregressor__n_estimators': [80, 90, 100, 110],
    'randomforestregressor__max_depth': [4, 6, 8, None],
    'randomforestregressor__max_features': ['sqrt', 'log2', None],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}
# running the grid search; scoring takes a scorer name (or a make_scorer object),
# not the raw metric function
grid_search_rf = GridSearchCV(rf_tuned, param_grid, scoring='r2', cv=3, n_jobs=-1)
grid_search_rf = grid_search_rf.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
rf_tuned = grid_search_rf.best_estimator_
Best parameters for Random Forest: {'randomforestregressor__max_depth': 4, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__n_estimators': 80}
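The scoring convention can be checked on a tiny, self-contained example with synthetic data (shapes, parameter values, and the `rng` setup below are illustrative):

```python
# a minimal GridSearchCV sketch showing scoring passed as the scorer
# name 'r2', which scikit-learn resolves to a proper scorer internally
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(120, 3))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=120)

grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    {'n_estimators': [10, 20], 'max_depth': [2, 4]},
    scoring='r2',
    cv=3,
)
grid.fit(X_toy, y_toy)
print(sorted(grid.best_params_))  # ['max_depth', 'n_estimators']
```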
# calculating the training metrics
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
print("Training performance \n", rf_tuned_model_train_perf)
Training performance
         RMSE         MAE  R-squared  Adj. R-squared     MAPE
0  605.301232  479.987872   0.676917        0.676455  0.17419
# calculating the testing metrics
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
print("Testing performance \n", rf_tuned_model_test_perf)
Testing performance
         RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.359834  471.864334   0.687263        0.685468  0.168637
Observations: Training R² dipped slightly with tuning while test R² improved marginally (0.6873 vs. 0.6870), and the two remain consistent, so there is no sign of overfitting.
# initialize the XGBoost regressor model
xgb_tuned = XGBRegressor(random_state=42)
xgb_tuned = make_pipeline(preprocessor, xgb_tuned)
# set the grid of parameters to choose from
param_grid = {
    'xgbregressor__n_estimators': [75, 100, 125],
    'xgbregressor__subsample': [0.7, 0.8, 0.9],
    'xgbregressor__gamma': [0, 1, 3],
    'xgbregressor__colsample_bytree': [0.7, 0.8, 0.9],
    'xgbregressor__colsample_bylevel': [0.7, 0.8, 0.9]
}
# running the grid search; again, scoring takes the scorer name 'r2'
grid_search_xgb = GridSearchCV(xgb_tuned, param_grid, scoring='r2', cv=3, n_jobs=-1)
grid_search_xgb = grid_search_xgb.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")
xgb_tuned = grid_search_xgb.best_estimator_
Best parameters for XGBoost: {'xgbregressor__colsample_bylevel': 0.7, 'xgbregressor__colsample_bytree': 0.7, 'xgbregressor__gamma': 0, 'xgbregressor__n_estimators': 75, 'xgbregressor__subsample': 0.7}
# calculating the training metrics
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
print("Training performance \n", xgb_tuned_model_train_perf)
Training performance
         RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  604.197753  475.240343   0.678094        0.677634  0.172907
# calculating the testing metrics
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
print("Testing performance \n", xgb_tuned_model_test_perf)
Testing performance
         RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.659713  468.865812   0.686949        0.685152  0.168031
Observations: Test R² is essentially unchanged by tuning (0.68695 before and after), and training and testing scores remain consistent, so there is no sign of overfitting.
# calculating the training model performance comparison
models_train_comp_df = pd.concat(
    [
        rf_tuned_model_train_perf.T,
        xgb_tuned_model_train_perf.T
    ],
    axis=1,
)
models_train_comp_df.columns = ['Random Forest', 'XGBoost']
print('Training performance comparison:')
models_train_comp_df
Training performance comparison:

|   | Random Forest | XGBoost |
|---|---|---|
| RMSE | 605.301232 | 604.197753 |
| MAE | 479.987872 | 475.240343 |
| R-squared | 0.676917 | 0.678094 |
| Adj. R-squared | 0.676455 | 0.677634 |
| MAPE | 0.174190 | 0.172907 |
# calculating the testing model performance comparison
models_test_comp_df = pd.concat(
    [
        rf_tuned_model_test_perf.T,
        xgb_tuned_model_test_perf.T
    ],
    axis=1,
)
models_test_comp_df.columns = ['Random Forest', 'XGBoost']
print('Testing performance comparison:')
models_test_comp_df
Testing performance comparison:

|   | Random Forest | XGBoost |
|---|---|---|
| RMSE | 597.359834 | 597.659713 |
| MAE | 471.864334 | 468.865812 |
| R-squared | 0.687263 | 0.686949 |
| Adj. R-squared | 0.685468 | 0.685152 |
| MAPE | 0.168637 | 0.168031 |
# difference in R-squared between training and testing for each model
(models_train_comp_df - models_test_comp_df).iloc[2]

|   | R-squared |
|---|---|
| Random Forest | -0.010347 |
| XGBoost | -0.008856 |
The R² differences show that both the Random Forest and XGBoost models behave consistently, with only minor variance in R-squared between train and test, suggesting that both are reliable and generalize well.
Model selection: The tuned XGBoost model is chosen as the best model; its test R² is essentially tied with the Random Forest's, while its test MAE and MAPE are slightly lower.
# create a folder to store the files that will be used for the backend server deployment
import os
os.makedirs("backend_files", exist_ok=True)

# define the file path to save (serialize) the trained regression model
model_path = "backend_files/sales_prediction_model_v1_0.joblib"

# save the trained regression model (the full pipeline, including the preprocessor) using joblib
joblib.dump(xgb_tuned, model_path)
print(f'Model saved successfully at {model_path}')
Model saved successfully at backend_files/sales_prediction_model_v1_0.joblib

# reload the model to verify the serialized file
saved_model = joblib.load('backend_files/sales_prediction_model_v1_0.joblib')
print('Model loaded successfully.')
Model loaded successfully.
saved_model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(colsample_bylevel=0.7, colsample_bytree=0.7,
                              gamma=0, n_estimators=75, subsample=0.7,
                              random_state=42, ...))])

saved_model.predict(X_test)
array([3283.6877, 3282.5544, 3995.6821, ..., 3859.6404, 3282.5544,
       3995.6821], dtype=float32)
# import the login and repository-management functions from the huggingface_hub library
from huggingface_hub import login, HfApi, create_repo

# access the secret key in Python
from google.colab import userdata
secret_value = userdata.get('BE_Token')

# login to HuggingFace with the access token
login(token=secret_value)

# create the repository for the HuggingFace Space
try:
    create_repo("BigGnTX/superkart_backend",
                repo_type="space",
                space_sdk="docker",
                private=False
    )
except Exception as e:
    # handle any potential errors
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Repository not created.")
    else:
        print(f"Error: {e}")

# create the app.py file in the backend_files folder
%%writefile backend_files/app.py
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify

# initialize the Flask app with a name
sales_forecast_api = Flask("Sales Forecast Predictor")

# load the trained sales forecast model
model = joblib.load("sales_prediction_model_v1_0.joblib")

# define the route for the home page
@sales_forecast_api.get('/')
def home():
    return "Welcome to the Sales Forecast Prediction API!"

# define the endpoint to predict the sales forecast
@sales_forecast_api.post('/v1/predict')
def predict_sales():
    # get the JSON data from the request
    predict_data = request.get_json()

    # extract the relevant features from the input data
    sample = {
        'Product_Weight': predict_data['Product_Weight'],
        'Product_Sugar_Content': predict_data['Product_Sugar_Content'],
        'Product_Allocated_Area': predict_data['Product_Allocated_Area'],
        'Product_MRP': predict_data['Product_MRP'],
        'Store_Size': predict_data['Store_Size'],
        'Store_Location_City_Type': predict_data['Store_Location_City_Type'],
        'Store_Type': predict_data['Store_Type'],
        'Store_Years_In_Operation': predict_data['Store_Years_In_Operation'],
        'Product_Code': predict_data['Product_Code'],
        'Product_Category': predict_data['Product_Category']
    }

    # convert the extracted data into a one-row DataFrame
    input_data = pd.DataFrame([sample])

    # make a sales forecast prediction using the trained model
    prediction = model.predict(input_data).tolist()[0]

    # return the prediction as a JSON response
    return jsonify({'Sales': prediction})

# run the Flask app in debug mode
if __name__ == '__main__':
    sales_forecast_api.run(debug=True)
Writing backend_files/app.py
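Once the Space is live, the `/v1/predict` endpoint can be exercised from any client; a minimal sketch is below (the feature values are illustrative, and the request line is commented out since it requires the deployed backend to be running):

```python
# hypothetical client-side call to the deployed /v1/predict endpoint;
# the payload mirrors the ten features the Flask app extracts
import requests

payload = {
    'Product_Weight': 12.66,
    'Product_Sugar_Content': 'Low Sugar',
    'Product_Allocated_Area': 0.027,
    'Product_MRP': 117.08,
    'Store_Size': 'Medium',
    'Store_Location_City_Type': 'Tier 2',
    'Store_Type': 'Supermarket Type2',
    'Store_Years_In_Operation': 16,
    'Product_Code': 'FD',
    'Product_Category': 'Food',
}

def get_forecast(url, data):
    """POST the feature payload and return the predicted sales value."""
    response = requests.post(url, json=data, timeout=30)
    response.raise_for_status()
    return response.json()['Sales']

# example (requires the Space to be running):
# print(get_forecast('https://biggntx-superkart-backend.hf.space/v1/predict', payload))
```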
# create the requirements.txt file in the backend_files folder
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
seaborn==0.13.2
joblib==1.4.2
xgboost==2.1.4
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.32.4
Writing backend_files/requirements.txt
# create the Dockerfile in the backend_files folder
%%writefile backend_files/Dockerfile
FROM python:3.9-slim
# Set the working directory inside the container
WORKDIR /app
# Copy all files from the current directory to the container's working directory
COPY . .
# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt
# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: binds the server to port 7860 on all network interfaces
# - `app:sales_forecast_api`: runs the Flask instance named `sales_forecast_api` from `app.py`
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:sales_forecast_api"]
Writing backend_files/Dockerfile
# for Hugging Face Space authentication to upload files
from huggingface_hub import HfApi
repo_id = "BigGnTX/superkart_backend"

# initialize the API
api = HfApi()

# upload the backend app files stored in the folder called backend_files
api.upload_folder(
    folder_path="backend_files",
    repo_id=repo_id,
    repo_type="space"
)

Creating Spaces and Adding Secrets in Hugging Face from Week 1

# create a folder for storing the files needed for frontend UI deployment
os.makedirs('frontend_files', exist_ok=True)

# create the app.py file in the frontend_files folder
%%writefile frontend_files/app.py
import requests
import streamlit as st
import pandas as pd
st.title('Sales Forecast Prediction')
# input fields for store and product data
Product_Weight = st.slider('Product Weight', min_value=0.0, max_value=30.0, value=12.66)
Product_Sugar_Content = st.selectbox('Product Sugar Content', ['Low Sugar', 'Regular', 'No Sugar'])
Product_Allocated_Area = st.slider('Product Allocated Area', min_value=0.0, max_value=1.0, value = 0.027)
Product_MRP = st.slider('Product MRP', min_value=0.0, max_value=300.0, value = 117.08)
Store_Size = st.selectbox('Store Size', ['Small', 'Medium', 'High'])
Store_Location_City_Type = st.selectbox('Store Location City Type', ['Tier 1', 'Tier 2', 'Tier 3'])
Store_Type = st.selectbox('Store Type', ['Supermarket Type1', 'Supermarket Type2', 'Departmental Store', 'Food Mart'])  # values must match the training data categories exactly
Store_Years_In_Operation = st.slider('Store Years In Operation', min_value=1, max_value=50, value = 20)
Product_Code = st.selectbox('Product Code', ['FD', 'NC', 'DR'])
Product_Category = st.selectbox('Product Category', ['Food', 'Non Food'])
# converting user input into a DataFrame
forecast_data = {
'Product_Weight': Product_Weight,
'Product_Sugar_Content': Product_Sugar_Content,
'Product_Allocated_Area': Product_Allocated_Area,
'Product_MRP': Product_MRP,
'Store_Size': Store_Size,
'Store_Location_City_Type': Store_Location_City_Type,
'Store_Type': Store_Type,
'Store_Years_In_Operation': Store_Years_In_Operation,
'Product_Code': Product_Code,
'Product_Category': Product_Category
}
# making a prediction when the "Predict" button is clicked
if st.button('Predict', type='primary'):
    response = requests.post('https://biggntx-superkart-backend.hf.space/v1/predict', json=forecast_data)
    if response.status_code == 200:
        result = response.json()
        sales_prediction = result['Sales']
        st.write(f'The Predicted Product Store Sales Total is: ${sales_prediction:.2f}.')
    else:
        st.error(f"Error in API request: Status code {response.status_code}\n{response.text}")
Writing frontend_files/app.py
#create the requirements.txt file in the frontend_files folder
%%writefile frontend_files/requirements.txt
pandas==2.2.2
requests==2.32.4
streamlit==1.45.0
Writing frontend_files/requirements.txt
#create a Dockerfile in the frontend_files folder
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim
# set the working directory inside the container to /app
WORKDIR /app
# copy all files from the current directory on the host to the container's /app directory
COPY . .
# install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt
# define the command to run the Streamlit app on port 8501 and make it accessible externally
# NOTE: XSRF protection is disabled for easier external access when making batch predictions
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]
Writing frontend_files/Dockerfile
# setting the access key and repo_id of the front-end HuggingFace app
# access the front-end forecast at https://huggingface.co/spaces/BigGnTX/superkart_forecast
repo_id = "BigGnTX/superkart_forecast"
access_key = userdata.get('HF_Token')  # retrieve the token from Colab secrets rather than hardcoding it

# login to HuggingFace with the access token
login(token=access_key)

# initialize the API
api = HfApi()

# uploading Streamlit app files stored in the folder called frontend_files
api.upload_folder(
    folder_path='/content/frontend_files',
    repo_id=repo_id,
    repo_type='space',
)