Project: Model Deployment: SuperKart by Gabriel Hinojos

Problem Statement

Business Context

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description

The data contains the different attributes of the various products and stores.The detailed data dictionary is given below.

Installing and Importing the necessary libraries

#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 kB 6.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223.6/223.6 MB 5.7 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 558.7/558.7 kB 9.9 MB/s eta 0:00:00

Note:

# import libraries for reading and manipulation of data
import os
import numpy as np
import pandas as pd

# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter

# import libraries to split datasets into training and testing sets
from sklearn.model_selection import train_test_split

# import libraries to import ensemble classifiers
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor


# import library to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)
from sklearn.metrics import mean_squared_error as mse

# import libraries to create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# import library to tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# import library to serialize the model
import joblib

# import library for API requests
import requests

# import library for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)

# import library to suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

Loading the dataset

# run the following lines for Google Colab
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# read the dataset from the Google Colab drive Python Course
products = pd.read_csv('/content/drive/MyDrive/Python Course/SuperKart.csv')
# creating a copy of the data
data = products.copy()

Data Overview

View the first and last 5 rows of the dataset

# pull the first 5 rows of data from dataset
data.head(5)
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36
# pull the last 5 rows of the data from the dataset
data.tail(5)
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total
8758 NC7546 14.80 No Sugar 0.016 Health and Hygiene 140.53 OUT004 2009 Medium Tier 2 Supermarket Type2 3806.53
8759 NC584 14.06 No Sugar 0.142 Household 144.51 OUT004 2009 Medium Tier 2 Supermarket Type2 5020.74
8760 NC2471 13.48 No Sugar 0.017 Health and Hygiene 88.58 OUT001 1987 High Tier 2 Supermarket Type1 2443.42
8761 NC7187 13.89 No Sugar 0.193 Household 168.44 OUT001 1987 High Tier 2 Supermarket Type1 4171.82
8762 FD306 14.73 Low Sugar 0.177 Snack Foods 224.93 OUT002 1998 Small Tier 3 Food Mart 2186.08

Understand the shape of the dataset

# view the number of rows and columns that are present in the data
data.shape
(8763, 12)

Observations: The dataset has 8763 rows and 12 columns

Check the data types of the columns for the dataset

# pull the datatypes for each column and entries for each column in the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Observations:

Checking for missing values

# check data for any records with no data entered for the column
data.isnull().sum()
0
Product_Id 0
Product_Weight 0
Product_Sugar_Content 0
Product_Allocated_Area 0
Product_Type 0
Product_MRP 0
Store_Id 0
Store_Establishment_Year 0
Store_Size 0
Store_Location_City_Type 0
Store_Type 0
Product_Store_Sales_Total 0

Observations: There are no null values in this dataset as displayed by this request.

Checking for duplicate values

# check for duplicate values in the dataset
data.duplicated().sum()
np.int64(0)

Observations: There are no duplicates in the data.

Checking the statistical summary

# check the statistical information for each varaiable (column) in the dataset
data.describe().T
count mean std min 25% 50% 75% max
Product_Weight 8763.0 12.653792 2.217320 4.000 11.150 12.660 14.180 22.000
Product_Allocated_Area 8763.0 0.068786 0.048204 0.004 0.031 0.056 0.096 0.298
Product_MRP 8763.0 147.032539 30.694110 31.000 126.160 146.740 167.585 266.000
Store_Establishment_Year 8763.0 2002.032751 8.388381 1987.000 1998.000 2009.000 2009.000 2009.000
Product_Store_Sales_Total 8763.0 3464.003640 1065.630494 33.000 2761.715 3452.340 4145.165 8000.000

Exploratory Data Analysis (EDA)

Univariate Analysis

# setup function to create combined boxplot and histogram for univariate analysis of numerical variables in dataset
# data - dataframe dataset; feature - column in dataset; figsize - figure size; kde - density curve displayed; bins - interval of groups in the histogram
def histogram_boxplot(data, feature, figsize=(20, 10), kde=False, bins=None):
    # create the subplots
    # nrows - Number of rows in the subplot grid; sharex - x-axis will be shared among all subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}, figsize=figsize,)
    # create the boxplot which will display a triangle to indicate the mean value of the variable
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="aquamarine")
    # create the histogram which will display a straight line for the mean of the variable and dotted line for the median of the variable
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins,
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, color="violet")
    ax_hist2.axvline(data[feature].mean(), color="black", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="gold", linestyle="-")
# setup function to create barplot with the percentage on top for univariate analysis of category variables in dataset
# data - dataframe dataset; feature - column in dataset; perc - display of percentages instead of count (set to False);
# n - display the top n category levels (set to display all levels)
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(data=data, x=feature, palette="pastel",
        order=data[feature].value_counts().index[:n],)
    # set percentage of each class of category, count of each category,
    # set width and height of the plot
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()

        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        # annotate the percentage
        ax.annotate(label, (x, y), ha="center", va="center",
            size=12, xytext=(0, 5), textcoords="offset points",)

    plt.show()

Distribution of numerical variables

Product Weight

histogram_boxplot(data, "Product_Weight")

Observations: The Product Weight variable distribution looks mildy left skewed with multiple outliers in lower and upper quartiles. The average product weight of the dataset is ~12.7.

Product Allocated Area

histogram_boxplot(data, "Product_Allocated_Area")

Observations: The Product Allocated Area variable distribution is heavily right skewed with all of the outliers in the upper quartiles. The average product allocated area of the dataset is ~0.07.

Product MRP

histogram_boxplot(data, "Product_MRP")

Observations: The Product Weight variable distribution looks mildly left skewed with multiple outliers in lower and upper quartiles. The average product MRP of the dataset is ~147.

Product Store Sales Total

histogram_boxplot(data, "Product_Store_Sales_Total")

Observations: The Product Allocated Area variable distribution looks equally distributed with multiple outliers in lower and upper quartiles. The average product store sales total of the dataset is ~3500.

Distribution of categorical variables

Product Sugar Content

labeled_barplot(data, "Product_Sugar_Content", perc=True)

Observations: The products with Low Sugar content make up the majority of the product population at 57% (almost 5000 products) while Regular (25.7%) and No Sugar (17.3%) come in second and third respectively. 'reg' which makes up 1.2% most likely refers to 'Regular' sugar content so that will need to be adjusted.

Product Type

labeled_barplot(data, "Product_Type", perc=True)

Observations: Fruits (14.3%) and Snack Foods (13.1%) are the top 2 product types in this dataset. They are also the only two product types that are double digit in percentage as well. Starchy Foods (1.6%), Breakfast (1.2%), and Seafood (0.9%) round out the bottom 3.

Store ID

labeled_barplot(data, "Store_Id", perc=True)

Observations: The vast majority of the data comes from Store ID OUT004 at 53.4% while the others stores (OUT001 - 18.1%, OUT003 - 15.4%, OUT002 - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 3 stores? Is store 4 a significantly larger store? Where are these stores located? This needs to be explored further.

Store Size

labeled_barplot(data, "Store_Size", perc=True)

Observations: The vast majority of the data comes from Medium size stores at 68.8% while the others stores (High - 18.1% and Small - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 2 store sizes? Where are these stores located? This needs to be explored further.

Store Location City Type

labeled_barplot(data, "Store_Location_City_Type", perc=True)

Observations: The vast majority of the data comes from Tier 2 stores at 71.5% while the others stores (Tier 1 - 15.4% and Tier 3 - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 2 store location city types? Where are these stores located? This needs to be explored further.

Store Type

labeled_barplot(data, "Store_Type", perc=True)

Observations: The vast majority of the data comes from Supermarket Type2 at 53.4% while the others stores (Supermarket Type 2 - 18.1%, Departmental Store - 15.4%, Food Mart - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 3 stores? Is store 4 a significantly larger store? Where are these stores located? This needs to be explored further.

Bivariate Analysis

Setup Functions for Bivariate Analysis

# setup function to create category counts and plot a stacked bar chart for bivariate analysis of variables in dataset
# data - dataframe dataset; predictor - independent variable, target - target variable
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="lower left", frameon=False,)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
# setup function to create scatterplot to see how one variable relates to another and whether the predictor categories show distinct behaviors on the target variable
def scatterplot_distribution(data, predictor, target):
  plt.figure(figsize=(12, 6))
  sns.scatterplot(data, x=predictor, y=target, hue=predictor)
  plt.title(f"Scatterplot for {predictor}  Vs {target}")
  plt.show()
# setup function to create boxplot to show the data's median, spread, range, and outlier points
def boxplot_distribution(data, predictor, target):
  plt.figure(figsize=[12, 6])
  sns.boxplot(data, x=predictor, y=target, hue=predictor)
  plt.xticks(rotation=90)
  plt.title(f"Boxplot for {predictor}  Vs {target}")
  plt.show()
# setup function to create grouped barplot to compare multiple related categories side by side within each main category, revealing patterns, differences, and trends in the dataset.
def grouped_barplot(data, group_cols, value_col, x, y, hue=None,
                     agg_func='sum', figsize=(12, 6), title='Grouped Bar Plot'):

    grouped = data.groupby(group_cols)[value_col].agg(agg_func).reset_index()

    plt.figure(figsize=figsize)
    ax = sns.barplot(data=grouped, x=x, y=y, hue=hue)
    ax.set(xlabel=x, ylabel=y, title=title)
    ax.ticklabel_format(style='plain', axis='y')
    ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:,.0f}'))
    plt.xticks(rotation=90)
    if hue:
        plt.legend(title=hue, loc='upper left')
    plt.tight_layout()
    plt.show()
# setup function to create barplot to compare the sum of the revenue to other variables
def revenue_barplot(data, predictor, target):
  group_cols = [predictor]
  agg_data = data.groupby(group_cols)[target].sum().reset_index()
  plt.figure(figsize=(12, 6))
  ax = sns.barplot(data=agg_data, x=predictor, y=target)
  ax.set(xlabel=predictor, ylabel=f'Total {target}', title=f'Total {target} by {predictor}')
  ax.ticklabel_format(style='plain', axis='y')
  ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:,.0f}'))
  plt.xticks(rotation=90)
  plt.tight_layout()
  plt.show()

Correlation Check - Heatmap

# create a heatmap for the correlation of the numeric features
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Store_Establishment_Year')
plt.figure(figsize=(20, 10))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap='coolwarm'
)
plt.show()

Observations: The highest correlated features are Product MRP and Product Sales Total 0.79 and the second highest is Product Weight and Product Store Sales Total at 0.74.

Distribution check for relationship/pattern between Product Stores Sales Total and other numeric variables

Product Weight vs Product Store Sales Total

scatterplot_distribution(data, 'Product_Weight', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Weight and Product Store Sales Total a Positive correlation which means as as product weight increases the total store sales will also rise. This could be the result of bulk purchases or greater value for heavier products. This is consistent with the Heat Map results.

Product Allocated Area vs Product Store Sales Total

scatterplot_distribution(data, 'Product_Allocated_Area', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Allocated Area and Product Store Sales Total shows a tight vertical density correlation. There is not a strong relationship between the two variables. In other words, Allocated Area does not strongly influence Sales. This is consistent with the Heat Map results.

Product Maximum Retail Price (MRP) vs Product Store Sales Total

scatterplot_distribution(data, 'Product_MRP', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Maximum Retail Price and Product Store Sales Total a Positive correlation which means as as product Maximum Retail Price increases the total store sales will also rise. This could be the result of a number of factors randing from product quality, bulk items, or consumer behavior. This is consistent with the Heat Map results.

Distribution check for relationship/pattern between Product Stores Sales Total and other catergorical variables

Product Sugar Content vs Product Store Sales Total

boxplot_distribution(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')

Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Sales Total (3300-3500) so this suggests that sugar content does not meaningfully affect sales.

Product Type vs Product Store Sales Total

boxplot_distribution(data, 'Product_Type', 'Product_Store_Sales_Total')

Observations: The medians for each product type hover around the same Sales Total (3300-3500) so this suggests that product does not meaningfully affect sales.

Store ID vs Product Store Sales Total

boxplot_distribution(data, 'Store_Id', 'Product_Store_Sales_Total')

Observations: The Store ID OUT003 has the highest median at about 4900 while the lowest median of 1800 is at Store OUT002. So Store ID does meaningfully affect sales. May need to research OUT002 to see sales could be boosted. OUT001 seems to pretty stable. OUT004 has a large number of outliers in the upper and lower quartiles with the greatest density in the upper quartile. Could research OUT004 to develop strategies for other stores or review the outliers.

Store Size vs Product Store Sales Total

boxplot_distribution(data, 'Store_Size', 'Product_Store_Sales_Total')

Observations: High store zize has the highest median at about 4000 while the lowest median of 1800 is at Small store size. So Store Size does meaningfully affect sales. May need to research Small store to see sales could be boosted.

Store Location City Type vs Product Store Sales Total

boxplot_distribution(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')

Observations: Tier 1 location city type has the highest median at about 4000 while the lowest median of 1800 is at Tier 3 location city type. So Store Location City Type does meaningfully affect sales. May need to research Tier 3 store to see sales could be boosted.

Store Type vs Product Store Sales Totals

boxplot_distribution(data, 'Store_Type', 'Product_Store_Sales_Total')

Observations: Departmental Store type has the highest median at about 4000 while the lowest median of 1800 is at Food Mart type. So Store Type does meaningfully affect sales. May need to research Food Mart store to see sales could be boosted.

Observations of product weight vs other variables

Product Sugar Content vs Product Weight

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Sugar_Content', y='Product_Weight', hue='Product_Sugar_Content')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Sugar Content')
plt.show()

Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Product Weight (12.5) so this suggests that sugar content does not meaningfully affect product weight.

Observations:

Product Type vs Product Weight

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Type', y='Product_Weight', hue='Product_Type')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Type')
plt.show()

Observations: The medians for each product type hover around the same Product Weight (12.5) so this suggests that product type does not meaningfully affect product weight.

Statistics on each of the stores

store_ids = ['OUT001', 'OUT002', 'OUT003', 'OUT004']
cols_list = ['Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
for store in store_ids:
    print(f'\n**** Statistics for Store ID: {store} ****')
    display(data.loc[data['Store_Id'] == store, cols_list].describe(include='all').T)

**** Statistics for Store ID: OUT001 ****
count unique top freq mean std min 25% 50% 75% max
Store_Establishment_Year 1586.0 NaN NaN NaN 1987.0 0.0 1987.0 1987.0 1987.0 1987.0 1987.0
Store_Size 1586 1 High 1586 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1586 1 Tier 2 1586 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1586 1 Supermarket Type1 1586 NaN NaN NaN NaN NaN NaN NaN

**** Statistics for Store ID: OUT002 ****
count unique top freq mean std min 25% 50% 75% max
Store_Establishment_Year 1152.0 NaN NaN NaN 1998.0 0.0 1998.0 1998.0 1998.0 1998.0 1998.0
Store_Size 1152 1 Small 1152 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1152 1 Tier 3 1152 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1152 1 Food Mart 1152 NaN NaN NaN NaN NaN NaN NaN

**** Statistics for Store ID: OUT003 ****
count unique top freq mean std min 25% 50% 75% max
Store_Establishment_Year 1349.0 NaN NaN NaN 1999.0 0.0 1999.0 1999.0 1999.0 1999.0 1999.0
Store_Size 1349 1 Medium 1349 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 1349 1 Tier 1 1349 NaN NaN NaN NaN NaN NaN NaN
Store_Type 1349 1 Departmental Store 1349 NaN NaN NaN NaN NaN NaN NaN

**** Statistics for Store ID: OUT004 ****
count unique top freq mean std min 25% 50% 75% max
Store_Establishment_Year 4676.0 NaN NaN NaN 2009.0 0.0 2009.0 2009.0 2009.0 2009.0 2009.0
Store_Size 4676 1 Medium 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Location_City_Type 4676 1 Tier 2 4676 NaN NaN NaN NaN NaN NaN NaN
Store_Type 4676 1 Supermarket Type2 4676 NaN NaN NaN NaN NaN NaN NaN

Observations:

OUT001 Established 1987; Store Size: High; Store Location City Type: Tier 2; Store Type: Supermarket Type 1

OUT002 Established 1998; Store Size: Small; Store Location City Type: Tier 3; Store Type: Food Mart

OUT003 Established 1999; Store Size: Medium; Store Location City Type: Tier 1; Store Type: Departmental Store

OUT004 Established 2009; Store Size: Medium; Store Location City Type: Tier 2; Store Type: Supermarket Type 2

STORE SIZE

STORE LOCATION CITY TYPE

STORE TYPE

COUNTS

After reviewing the statistics for each store, some of the previous observations start to become a little more clear.

Observations on total revenue for each Store ID

store_revenue = (
    data.groupby('Store_Id')['Product_Store_Sales_Total']
    .sum()
    .loc[store_ids]
)

for store, revenue in store_revenue.items():
    print(f'Store ID: {store}, Total Revenue: ${revenue:,.2f}')
Store ID: OUT001, Total Revenue: $6,223,113.18
Store ID: OUT002, Total Revenue: $2,030,909.72
Store ID: OUT003, Total Revenue: $6,673,457.57
Store ID: OUT004, Total Revenue: $15,427,583.43

Observations: OUT004 reports the most revenue but it also has a larger number of reported data at a count of 4676. OUT002 has the lowest revenue but also has the lowest counts of 1152.

Observations on product store sales total (revenue) generated vs other variables

Store ID vs Product Store Sales Total

revenue_barplot(data, 'Store_Id', 'Product_Store_Sales_Total')

Observations: OUT004 reports the most revenue but it also has a larger number of reported data at a count of 4676. OUT002 has the lowest revenue but also has the lowest counts of 1152.

Store Size vs Product Store Sales Total

revenue_barplot(data, 'Store_Size', 'Product_Store_Sales_Total')

Observations: Medium reports the most revenue but it also has a larger number of reported data of counts. Small has the lowest revenue but also has the lowest counts.

Store Location City Type vs Product Store Sales Total

revenue_barplot(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')

Observations: Tier 2 reports the most revenue but it also has a larger number of reported data of counts. Tier 3 has the lowest revenue but also has the lowest counts.

Store Type vs Product Store Sales Total

revenue_barplot(data, 'Store_Type', 'Product_Store_Sales_Total')

Observations: Supermarket reports the most revenue but it also has a larger number of reported data of counts. Food Mart has the lowest revenue but also has the lowest counts.

Product Sugar Content vs Product Store Sales Total

revenue_barplot(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')

Observations: Low Sugar content dominates the other content types with over 17,000,000 in revenue while No Sugar content is lowest in revenue with around 5,000,000.

Observations on product store sales total (revenue) generated vs other variables per product type

Revenue by Product Type

revenue_product = data.groupby('Product_Type')['Product_Store_Sales_Total'].sum().reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(data=revenue_product, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.xlabel('Product Types')
plt.ylabel('Product Store Sales Total')
plt.title('Revenue by Product Type')
plt.show()

Observations: Fruits and Vegetables and Snack Foods are the product types that generate the most revenue. Breakfast and Seafood products generate the least amount.

Store ID vs Product Store Sales Total (for each product type)

grouped_barplot(
   data=data,
    group_cols=['Store_Id', 'Product_Type'],
    value_col='Product_Store_Sales_Total',
    x='Store_Id',
    y='Product_Store_Sales_Total',
    hue='Product_Type',
    title='Revenue generated by each Store ID for each Product Type'
)

Observations: Fruits and Vegetables and Snacks are the best revenue generators for each Store ID.

Product Type vs Product Store Sales Total (for each Store ID)

grouped_barplot(
    data=data,
    group_cols=['Product_Sugar_Content', 'Store_Id'],
    value_col='Product_Store_Sales_Total',
    x='Product_Sugar_Content',
    y='Product_Store_Sales_Total',
    hue='Store_Id',
    title='Revenue generated by each Store Type for each Product Type'
)

Observations: Low Sugar content generates the most revenue for each store ID while No Sugar contents generates the least. The 'reg' product sugar content needs to be normalized at this point.

Data Preprocessing

Feature Engineering

# updating the product sugar content to move reg to Regular
data.Product_Sugar_Content.replace(to_replace=['reg'], value='Regular', inplace=True)
data.Product_Sugar_Content.value_counts()
count
Product_Sugar_Content
Low Sugar 4885
Regular 2359
No Sugar 1519

Observations: Normalized 'reg' Product Sugar Content to 'Regular'

# create a variable for number of years that store has been in operation
data['Store_Years_In_Operation'] = 2025 - data.Store_Establishment_Year
data.head()
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Store_Years_In_Operation
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40 16
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02 26
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16 38
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18 38
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36 27
# create a variable for product codes to reduce number of product ids for model
data['Product_Code'] = data['Product_Id'].str[:2]
data['Product_Code'].unique()
array(['FD', 'NC', 'DR'], dtype=object)
# print product code arrays
codes = ['FD', 'NC', 'DR']
for code in codes:
    types = data.loc[data['Product_Code'] == code, 'Product_Type'].unique()
    print(f'{code}: {types}')
FD: ['Frozen Foods' 'Dairy' 'Canned' 'Baking Goods' 'Snack Foods' 'Meat'
 'Fruits and Vegetables' 'Breads' 'Breakfast' 'Starchy Foods' 'Seafood']
NC: ['Health and Hygiene' 'Household' 'Others']
DR: ['Hard Drinks' 'Soft Drinks']
# create a variable for product categories to reduce number of product types for model
food = [
    'Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Snack Foods', 'Meat', 'Fruits and Vegetables', 'Breads', 'Breakfast', 'Starchy Foods', 'Seafood'
]
data['Product_Category'] = np.where(data['Product_Type'].isin(food), 'Food', 'Non Food')
data.head()
Product_Id Product_Weight Product_Sugar_Content Product_Allocated_Area Product_Type Product_MRP Store_Id Store_Establishment_Year Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Store_Years_In_Operation Product_Code Product_Category
0 FD6114 12.66 Low Sugar 0.027 Frozen Foods 117.08 OUT004 2009 Medium Tier 2 Supermarket Type2 2842.40 16 FD Food
1 FD7839 16.54 Low Sugar 0.144 Dairy 171.43 OUT003 1999 Medium Tier 1 Departmental Store 4830.02 26 FD Food
2 FD5075 14.28 Regular 0.031 Canned 162.08 OUT001 1987 High Tier 2 Supermarket Type1 4130.16 38 FD Food
3 FD8233 12.10 Low Sugar 0.112 Baking Goods 186.31 OUT001 1987 High Tier 2 Supermarket Type1 4132.18 38 FD Food
4 NC1180 9.57 No Sugar 0.010 Health and Hygiene 123.67 OUT002 1998 Small Tier 3 Food Mart 2279.36 27 NC Non Food

Outlier Check

# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove('Store_Establishment_Year')
numeric_columns.remove('Store_Years_In_Operation')

plt.figure(figsize=(15, 10))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations

Data Preparation for modeling

# drop some of the categorical features for modeling as new concise variables have been created
data = data.drop(columns=['Product_Id', 'Product_Type', 'Store_Id', 'Store_Establishment_Year'])
data.shape
(8763, 11)
data.head()
Product_Weight Product_Sugar_Content Product_Allocated_Area Product_MRP Store_Size Store_Location_City_Type Store_Type Product_Store_Sales_Total Store_Years_In_Operation Product_Code Product_Category
0 12.66 Low Sugar 0.027 117.08 Medium Tier 2 Supermarket Type2 2842.40 16 FD Food
1 16.54 Low Sugar 0.144 171.43 Medium Tier 1 Departmental Store 4830.02 26 FD Food
2 14.28 Regular 0.031 162.08 High Tier 2 Supermarket Type1 4130.16 38 FD Food
3 12.10 Low Sugar 0.112 186.31 High Tier 2 Supermarket Type1 4132.18 38 FD Food
4 9.57 No Sugar 0.010 123.67 Small Tier 3 Food Mart 2279.36 27 NC Non Food
# define the independent and dependent variables
X = data.drop(['Product_Store_Sales_Total'], axis=1)
y = data['Product_Store_Sales_Total']
# splitting data into training and test set

# split data into 2 parts: Train + Temp (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
# print the number of rows of each dataset
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 7010
Number of rows in test data = 1753

Observations: The train data of 7010 is 80% of the original 8763. And the test data of 1753 is 20% of the original 8763.

Data Preprocessing Pipeline

# create a categorical feature that stores a list of the categorical column names
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features
['Product_Sugar_Content',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Code',
 'Product_Category']
# create a preprocessing pipeline for the categorical features
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
)

Model Building

Define functions for Model Evaluation

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

Chose the ML models below randomly as part of this project exercise:

  1. Random Forest
  2. XGBoost

Random Forest Model

# fitting the random forest model
rf_estimator = RandomForestRegressor(random_state=42)
rf_estimator = make_pipeline(preprocessor, rf_estimator)
rf_estimator.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# calculating the training metric
rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train,y_train)
print("Training performance \n",rf_estimator_model_train_perf)
Training performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  604.135272  475.762704    0.67816          0.6777  0.173159
# calculating the testing metric
rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test,y_test)
print("Testing performance \n", rf_estimator_model_test_perf)
Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.595427  469.084204   0.687017         0.68522  0.168082

Observations: There is some consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

###XGBoost Regressor Model

# fitting the XGBoost model
xgb_estimator = XGBRegressor(random_state=42)
xgb_estimator = make_pipeline(preprocessor, xgb_estimator)
xgb_estimator.fit(X_train, y_train)
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
# calculating the training metric
xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train,y_train)
print("Training performance \n",xgb_estimator_model_train_perf)
Training performance 
          RMSE        MAE  R-squared  Adj. R-squared      MAPE
0  604.129941  475.48628   0.678166        0.677706  0.173112
# calculating the testing metric
xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test,y_test)
print("Testing performance \n", xgb_estimator_model_test_perf)
Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.658808  468.827931    0.68695        0.685153  0.168053

Observations: There is some consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

Model Performance Improvement - Hyperparameter Tuning

Random Forest Model Hyperparameter Tuning

# initialize the Random Forest regressor model
rf_tuned = RandomForestRegressor(random_state=42)
rf_tuned = make_pipeline(preprocessor, rf_tuned)

# set grid of parameters to choose from
param_grid = {
    'randomforestregressor__n_estimators': [80, 90, 100, 110],
    'randomforestregressor__max_depth': [4, 6, 8, None],
    'randomforestregressor__max_features': ['sqrt', 'log2', None],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}

# running the grid search
grid_search_rf = GridSearchCV(rf_tuned, param_grid, scoring=r2_score, cv=3, n_jobs=-1)
grid_search_rf = grid_search_rf.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
rf_tuned = grid_search_rf.best_estimator_
Best parameters for Random Forest: {'randomforestregressor__max_depth': 4, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__n_estimators': 80}
# calculating the training metric
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
print("Training performance \n", rf_tuned_model_train_perf)
Training performance 
          RMSE         MAE  R-squared  Adj. R-squared     MAPE
0  605.301232  479.987872   0.676917        0.676455  0.17419
# calculating the testing metric
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
print("Testing performance \n", rf_tuned_model_test_perf)
Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.359834  471.864334   0.687263        0.685468  0.168637

Observations: The R2 went down slightly with tuning, but there is still consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

XGBoost Model Hyperparameter Tuning

# initialize the XGBoost regressor model
xgb_tuned = XGBRegressor(random_state=42)
xgb_tuned = make_pipeline(preprocessor, xgb_tuned)

# set grid of parameters to choose from
param_grid = {
    'xgbregressor__n_estimators': [75, 100, 125],
    'xgbregressor__subsample': [0.7, 0.8, 0.9],
    'xgbregressor__gamma': [0, 1, 3],
    'xgbregressor__colsample_bytree':[0.7, 0.8, 0.9],
    'xgbregressor__colsample_bylevel':[0.7, 0.8, 0.9]
}

# running the grid search
grid_search_xgb = GridSearchCV(xgb_tuned, param_grid, scoring=r2_score, cv=3, n_jobs=-1)
grid_search_xgb = grid_search_xgb.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")
xgb_tuned = grid_search_xgb.best_estimator_
Best parameters for XGBoost: {'xgbregressor__colsample_bylevel': 0.7, 'xgbregressor__colsample_bytree': 0.7, 'xgbregressor__gamma': 0, 'xgbregressor__n_estimators': 75, 'xgbregressor__subsample': 0.7}
# calculating the training metric
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
print("Training performance \n", xgb_tuned_model_train_perf)
Training performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  604.197753  475.240343   0.678094        0.677634  0.172907
# calculating the testing metric
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
print("Testing performance \n", xgb_tuned_model_test_perf)
Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.659713  468.865812   0.686949        0.685152  0.168031

Observations: The R2 went up slightly with tuning, but there is still consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

Model Performance Comparison, Final Model Selection, and Serialization

# calculating the training model performance comparison
models_train_comp_df = pd.concat(
    [
        rf_tuned_model_train_perf.T,
        xgb_tuned_model_train_perf.T
    ],
    axis=1,
)
models_train_comp_df.columns = ['Random Forest', 'XGBoost']
print('Training performance comparison:')
models_train_comp_df
Training performance comparison:
Random Forest XGBoost
RMSE 605.301232 604.197753
MAE 479.987872 475.240343
R-squared 0.676917 0.678094
Adj. R-squared 0.676455 0.677634
MAPE 0.174190 0.172907
# calculating the training model performance comparison
models_test_comp_df = pd.concat(
    [
        rf_tuned_model_test_perf.T,
        xgb_tuned_model_test_perf.T
    ],
    axis=1,
)
models_test_comp_df.columns = ['Random Forest', 'XGBoost']
print('Testing performance comparison:')
models_test_comp_df
Testing performance comparison:
Random Forest XGBoost
RMSE 597.359834 597.659713
MAE 471.864334 468.865812
R-squared 0.687263 0.686949
Adj. R-squared 0.685468 0.685152
MAPE 0.168637 0.168031
# calculating the difference in the training model performance comparison
(models_train_comp_df - models_test_comp_df).iloc[2]
R-squared
Random Forest -0.010347
XGBoost -0.008856

The R2 differences show that both the Random Forest and XGBoost models have very stable, consistent performance with very minor variance in R-squared between train and test models. As a result this would suggest that these models are reliable and generalize well.

Model selection: Will choose the the tuned XGBoost model as the best model as the R2 score is just slightly better.

Model Serialization

# create a folder to store the files that will be used for the backend server deployment
import os
os.makedirs("backend_files", exist_ok=True)
# define the file path to save (serialize) the trained regression model
model_path = "backend_files/sales_prediction_model_v1_0.joblib"
# save the trained regression model and preprocessor using joblibe
joblib.dump(xgb_tuned, model_path)

print(f'Model saved successfully at {model_path}')
Model saved successfully at backend_files/sales_prediction_model_v1_0.joblib
saved_model = joblib.load('backend_files/sales_prediction_model_v1_0.joblib')

print('Model loaded successfully.')
Model loaded successfully.
saved_model
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=0, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=75, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
saved_model.predict(X_test)
array([3283.6877, 3282.5544, 3995.6821, ..., 3859.6404, 3282.5544,
       3995.6821], dtype=float32)

Deployment - Backend

# import the login function from the huggingface_hub library
from huggingface_hub import login, HfApi

# import the create_repo function from the huggingface_hub library
from huggingface_hub import create_repo
# access the secret key in Python
from google.colab import userdata
secret_value = userdata.get('BE_Token')
# login to HuggingFace with access token
login(token=secret_value)
# create the repository for the HuggingFace Space
try:
    create_repo("BigGnTX/superkart_backend",
        repo_type="space",
        space_sdk="docker",
        private=False
    )
except Exception as e:
    # handle any potential errors
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Respository not created.")
    else:
        print(f"Error: {e}")

Flask Web Framework

# create the app.py file in the backend_files folder
%%writefile backend_files/app.py
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify

# initialize the flask with a name
sales_forecast_api = Flask("Sales Forecast Predictor")

# load the trained sales forecast model
model = joblib.load("sales_prediction_model_v1_0.joblib")

# define the route for the home page
@sales_forecast_api.get('/')
def home():
    return "Welcome to the Sales Forecast Prediction API!"

# define the endpoint to predict sales forecast
@sales_forecast_api.post('/v1/predict')
def predict_sales():
    # get the JSON data from the request
    predict_data = request.get_json()


    # extract relevant features from the input data.
    sample = {
        'Product_Weight': predict_data['Product_Weight'],
        'Product_Sugar_Content': predict_data['Product_Sugar_Content'],
        'Product_Allocated_Area': predict_data['Product_Allocated_Area'],
        'Product_MRP': predict_data['Product_MRP'],
        'Store_Size': predict_data['Store_Size'],
        'Store_Location_City_Type': predict_data['Store_Location_City_Type'],
        'Store_Type': predict_data['Store_Type'],
        'Store_Years_In_Operation': predict_data['Store_Years_In_Operation'],
        'Product_Code': predict_data['Product_Code'],
        'Product_Category': predict_data['Product_Category']

    }

    # convert the extracted data into a DataFrame
    input_data = pd.DataFrame([sample])

    # make a sales forecast prediction using the trained model
    prediction = model.predict(input_data).tolist()[0]

    # return the prediction as a JSON response
    return jsonify({'Sales': prediction})

# Run the Flask app in debug mode
if __name__ == '__main__':
    sales_forecast_api.run(debug=True)
Writing backend_files/app.py

Dependencies File

# create the requirements.txt file in the backend_files folder
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
seaborn==0.13.2
joblib==1.4.2
xgboost==2.1.4
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.32.4
Writing backend_files/requirements.txt

Dockerfile

# create the Dockerfile in the backend_files folder
%%writefile backend_files/Dockerfile
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy all files from the current directory to the container's working directory
COPY . .

# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:app`: Runs the Flask app (assuming `app.py` contains the Flask instance named `app`)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:sales_forecast_api"]
Writing backend_files/Dockerfile

Setting up a Hugging Face Docker Space for the Backend

Uploading Files to Hugging Face Space (Docker Space)

# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "BigGnTX/superkart_backend"

# initialize the API
api = HfApi()

# upload Streamlit app files stored in the folder called backend_files
api.upload_folder(
    folder_path="backend_files",
    repo_id=repo_id,
    repo_type="space"
)
{"model_id":"971072dbb57c4b33993b9f5b96cbbce0","version_major":2,"version_minor":0}
{"model_id":"7fc82ecaa43246f2a6663fddeb689e0b","version_major":2,"version_minor":0}
{"model_id":"da2fba831f3a44d98829a87e915d3133","version_major":2,"version_minor":0}
{"type":"string"}

Deployment - Frontend

Points to note before executing the below cells

Streamlit for Interactive UI

# Create a folder for storing the files needed for frontend UI deployment
os.makedirs('frontend_files', exist_ok=True)
#create the app.py file in the frontend_files folder
%%writefile frontend_files/app.py
import requests
import streamlit as st
import pandas as pd

st.title('Sales Forecast Prediction')

# input fields for store and product data
Product_Weight = st.slider('Product Weight', min_value=0.0, max_value=30.0, value=12.66)
Product_Sugar_Content = st.selectbox('Product Sugar Content', ['Low Sugar', 'Regular', 'No Sugar'])
Product_Allocated_Area = st.slider('Product Allocated Area', min_value=0.0, max_value=1.0, value = 0.027)
Product_MRP = st.slider('Product MRP', min_value=0.0, max_value=300.0, value = 117.08)
Store_Size = st.selectbox('Store Size', ['Small', 'Medium', 'High'])
Store_Location_City_Type = st.selectbox('Store Location City Type', ['Tier 1', 'Tier 2', 'Tier 3'])
Store_Type = st.selectbox('Store Type', ['Supermarket Type 1', 'Supermarket Type 2', 'Departmental Store', 'Food Mart'])
Store_Years_In_Operation = st.slider('Store Years In Operation', min_value=1, max_value=50, value = 20)
Product_Code = st.selectbox('Product Code', ['FD', 'NC', 'DR'])
Product_Category = st.selectbox('Product Category', ['Food', 'Non Food'])

# converting user input into a DataFrame
forecast_data = {
    'Product_Weight': Product_Weight,
    'Product_Sugar_Content': Product_Sugar_Content,
    'Product_Allocated_Area': Product_Allocated_Area,
    'Product_MRP': Product_MRP,
    'Store_Size': Store_Size,
    'Store_Location_City_Type': Store_Location_City_Type,
    'Store_Type': Store_Type,
    'Store_Years_In_Operation': Store_Years_In_Operation,
    'Product_Code': Product_Code,
    'Product_Category': Product_Category
}

# making prediction when the "Predict" button is clicked
if st.button('Predict', type='primary'):
  response = requests.post('https://biggntx-superkart-backend.hf.space/v1/predict', json=forecast_data)
  if response.status_code == 200:
    result = response.json()
    sales_prediction = result['Sales']
    st.write(f'The Predicted Product Store Sales Total is: ${sales_prediction:.2f}.')
  else:
    st.error(f"Error in API request: Status code {response.status_code}\n{response.text}")
Writing frontend_files/app.py

Dependencies File

#create the requirements.txt file in the frontend_files folder
%%writefile frontend_files/requirements.txt
pandas==2.2.2
requests==2.32.4
streamlit==1.45.0
Writing frontend_files/requirements.txt

DockerFile

#create a Dockerfile in the frontend_files folder
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# set the working directory inside the container to /app
WORKDIR /app

# copy all files from the current directory on the host to the container's /app directory
COPY . .

# install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions
Writing frontend_files/Dockerfile

Uploading Files to Hugging Face Space (Streamlit Space)

# setting access key and repo_id of the front end HuggingFace app
# access the front end forecast at https://huggingface.co/spaces/BigGnTX/superkart_forecast
repo_id = "BigGnTX/superkart_forecast"
access_key = "HF_Token"


# login to HuggingFace with access token
login(token=access_key)

# initialize the API
api = HfApi()

# uploading Streamlit app files stored in the folder called frontend_files
api.upload_folder(
    folder_path='/content/frontend_files',
    repo_id=repo_id,
    repo_type='space',
)
{"type":"string"}

Actionable Insights and Business Recommendations