Project: Model Deployment: SuperKart by Gabriel Hinojos

Problem Statement

Business Context

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

Objective

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

Data Description

The data contains the different attributes of the various products and stores.The detailed data dictionary is given below.

Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
Product_Weight - weight of each product
Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
Product_MRP - maximum retail price of each product
Store_Id - unique identifier of each store
Store_Establishment_Year - year in which the store was established
Store_Size - size of the store depending on sq. feet like high, medium and low
Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store

Installing and Importing the necessary libraries

#Installing the libraries with the specified versions
!pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.4 huggingface_hub==0.34.0 -q

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 301.8/301.8 kB 6.2 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 223.6/223.6 MB 5.7 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 558.7/558.7 kB 9.9 MB/s eta 0:00:00

Note:

After running the above cell, I restarted the runtime in Google Colab.

# import libraries for reading and manipulation of data
import os
import numpy as np
import pandas as pd

# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter
from matplotlib.ticker import FuncFormatter

# import libraries to split datasets into training and testing sets
from sklearn.model_selection import train_test_split

# import libraries to import ensemble classifiers
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor


# import library to compute classification metrics
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)
from sklearn.metrics import mean_squared_error as mse

# import libraries to create the pipeline
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline,Pipeline

# import library to tune different models and standardize
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler,OneHotEncoder

# import library to serialize the model
import joblib

# import library for API requests
import requests

# import library for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)

# import library to suppress unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

Loading the dataset

# run the following lines for Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

# read the dataset from the Google Colab drive Python Course
products = pd.read_csv('/content/drive/MyDrive/Python Course/SuperKart.csv')

# creating a copy of the data
data = products.copy()

Data Overview

View the first and last 5 rows of the dataset

# pull the first 5 rows of data from dataset
data.head(5)

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total
0	FD6114	12.66	Low Sugar	0.027	Frozen Foods	117.08	OUT004	2009	Medium	Tier 2	Supermarket Type2	2842.40
1	FD7839	16.54	Low Sugar	0.144	Dairy	171.43	OUT003	1999	Medium	Tier 1	Departmental Store	4830.02
2	FD5075	14.28	Regular	0.031	Canned	162.08	OUT001	1987	High	Tier 2	Supermarket Type1	4130.16
3	FD8233	12.10	Low Sugar	0.112	Baking Goods	186.31	OUT001	1987	High	Tier 2	Supermarket Type1	4132.18
4	NC1180	9.57	No Sugar	0.010	Health and Hygiene	123.67	OUT002	1998	Small	Tier 3	Food Mart	2279.36

# pull the last 5 rows of the data from the dataset
data.tail(5)

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total
8758	NC7546	14.80	No Sugar	0.016	Health and Hygiene	140.53	OUT004	2009	Medium	Tier 2	Supermarket Type2	3806.53
8759	NC584	14.06	No Sugar	0.142	Household	144.51	OUT004	2009	Medium	Tier 2	Supermarket Type2	5020.74
8760	NC2471	13.48	No Sugar	0.017	Health and Hygiene	88.58	OUT001	1987	High	Tier 2	Supermarket Type1	2443.42
8761	NC7187	13.89	No Sugar	0.193	Household	168.44	OUT001	1987	High	Tier 2	Supermarket Type1	4171.82
8762	FD306	14.73	Low Sugar	0.177	Snack Foods	224.93	OUT002	1998	Small	Tier 3	Food Mart	2186.08

Understand the shape of the dataset

# view the number of rows and columns that are present in the data
data.shape

(8763, 12)

Observations: The dataset has 8763 rows and 12 columns

Check the data types of the columns for the dataset

# pull the datatypes for each column and entries for each column in the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8763 entries, 0 to 8762
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Product_Id                 8763 non-null   object 
 1   Product_Weight             8763 non-null   float64
 2   Product_Sugar_Content      8763 non-null   object 
 3   Product_Allocated_Area     8763 non-null   float64
 4   Product_Type               8763 non-null   object 
 5   Product_MRP                8763 non-null   float64
 6   Store_Id                   8763 non-null   object 
 7   Store_Establishment_Year   8763 non-null   int64  
 8   Store_Size                 8763 non-null   object 
 9   Store_Location_City_Type   8763 non-null   object 
 10  Store_Type                 8763 non-null   object 
 11  Product_Store_Sales_Total  8763 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 821.7+ KB

Observations:

The object datatypes: Product ID, Product Sugar Content, Product Type, Store ID, Store Size, Store Location City Type, and Store Type.
The float/int datatypes: Product Weight, Product Allocated Area, Product MRP, Store Establishment Year, and Product Store Sales Total.
The datatypes are assigned as expected.
There are no missing values identified as shown in the info veiw.

Checking for missing values

# check data for any records with no data entered for the column
data.isnull().sum()

	0
Product_Id	0
Product_Weight	0
Product_Sugar_Content	0
Product_Allocated_Area	0
Product_Type	0
Product_MRP	0
Store_Id	0
Store_Establishment_Year	0
Store_Size	0
Store_Location_City_Type	0
Store_Type	0
Product_Store_Sales_Total	0

dtype: int64

Observations: There are no null values in this dataset as displayed by this request.

Checking for duplicate values

# check for duplicate values in the dataset
data.duplicated().sum()

np.int64(0)

Observations: There are no duplicates in the data.

Checking the statistical summary

# check the statistical information for each varaiable (column) in the dataset
data.describe().T

	count	mean	std	min	25%	50%	75%	max
Product_Weight	8763.0	12.653792	2.217320	4.000	11.150	12.660	14.180	22.000
Product_Allocated_Area	8763.0	0.068786	0.048204	0.004	0.031	0.056	0.096	0.298
Product_MRP	8763.0	147.032539	30.694110	31.000	126.160	146.740	167.585	266.000
Store_Establishment_Year	8763.0	2002.032751	8.388381	1987.000	1998.000	2009.000	2009.000	2009.000
Product_Store_Sales_Total	8763.0	3464.003640	1065.630494	33.000	2761.715	3452.340	4145.165	8000.000

Exploratory Data Analysis (EDA)

Univariate Analysis

# setup function to create combined boxplot and histogram for univariate analysis of numerical variables in dataset
# data - dataframe dataset; feature - column in dataset; figsize - figure size; kde - density curve displayed; bins - interval of groups in the histogram
def histogram_boxplot(data, feature, figsize=(20, 10), kde=False, bins=None):
    # create the subplots
    # nrows - Number of rows in the subplot grid; sharex - x-axis will be shared among all subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, sharex=True, gridspec_kw={"height_ratios": (0.25, 0.75)}, figsize=figsize,)
    # create the boxplot which will display a triangle to indicate the mean value of the variable
    sns.boxplot(data=data, x=feature, ax=ax_box2, showmeans=True, color="aquamarine")
    # create the histogram which will display a straight line for the mean of the variable and dotted line for the median of the variable
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins,
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, color="violet")
    ax_hist2.axvline(data[feature].mean(), color="black", linestyle="--")
    ax_hist2.axvline(data[feature].median(), color="gold", linestyle="-")

# setup function to create barplot with the percentage on top for univariate analysis of category variables in dataset
# data - dataframe dataset; feature - column in dataset; perc - display of percentages instead of count (set to False);
# n - display the top n category levels (set to display all levels)
def labeled_barplot(data, feature, perc=False, n=None):
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(data=data, x=feature, palette="pastel",
        order=data[feature].value_counts().index[:n],)
    # set percentage of each class of category, count of each category,
    # set width and height of the plot
    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(100 * p.get_height() / total)
        else:
            label = p.get_height()

        x = p.get_x() + p.get_width() / 2
        y = p.get_height()
        # annotate the percentage
        ax.annotate(label, (x, y), ha="center", va="center",
            size=12, xytext=(0, 5), textcoords="offset points",)

    plt.show()

Distribution of numerical variables

Product Weight

histogram_boxplot(data, "Product_Weight")

Observations: The Product Weight variable distribution looks mildy left skewed with multiple outliers in lower and upper quartiles. The average product weight of the dataset is ~12.7.

Product Allocated Area

histogram_boxplot(data, "Product_Allocated_Area")

Observations: The Product Allocated Area variable distribution is heavily right skewed with all of the outliers in the upper quartiles. The average product allocated area of the dataset is ~0.07.

Product MRP

histogram_boxplot(data, "Product_MRP")

Observations: The Product Weight variable distribution looks mildly left skewed with multiple outliers in lower and upper quartiles. The average product MRP of the dataset is ~147.

Product Store Sales Total

histogram_boxplot(data, "Product_Store_Sales_Total")

Observations: The Product Allocated Area variable distribution looks equally distributed with multiple outliers in lower and upper quartiles. The average product store sales total of the dataset is ~3500.

Distribution of categorical variables

Product Sugar Content

labeled_barplot(data, "Product_Sugar_Content", perc=True)

Observations: The products with Low Sugar content make up the majority of the product population at 57% (almost 5000 products) while Regular (25.7%) and No Sugar (17.3%) come in second and third respectively. 'reg' which makes up 1.2% most likely refers to 'Regular' sugar content so that will need to be adjusted.

Product Type

labeled_barplot(data, "Product_Type", perc=True)

Observations: Fruits (14.3%) and Snack Foods (13.1%) are the top 2 product types in this dataset. They are also the only two product types that are double digit in percentage as well. Starchy Foods (1.6%), Breakfast (1.2%), and Seafood (0.9%) round out the bottom 3.

Store ID

labeled_barplot(data, "Store_Id", perc=True)

Observations: The vast majority of the data comes from Store ID OUT004 at 53.4% while the others stores (OUT001 - 18.1%, OUT003 - 15.4%, OUT002 - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 3 stores? Is store 4 a significantly larger store? Where are these stores located? This needs to be explored further.

Store Size

labeled_barplot(data, "Store_Size", perc=True)

Observations: The vast majority of the data comes from Medium size stores at 68.8% while the others stores (High - 18.1% and Small - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 2 store sizes? Where are these stores located? This needs to be explored further.

Store Location City Type

labeled_barplot(data, "Store_Location_City_Type", perc=True)

Observations: The vast majority of the data comes from Tier 2 stores at 71.5% while the others stores (Tier 1 - 15.4% and Tier 3 - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 2 store location city types? Where are these stores located? This needs to be explored further.

Store Type

labeled_barplot(data, "Store_Type", perc=True)

Observations: The vast majority of the data comes from Supermarket Type2 at 53.4% while the others stores (Supermarket Type 2 - 18.1%, Departmental Store - 15.4%, Food Mart - 13.1%) are significantly lower in reporting. There are a number of questions that probably need to be asked about this datapoint including: Is there a reporting of data issue from the other 3 stores? Is store 4 a significantly larger store? Where are these stores located? This needs to be explored further.

Bivariate Analysis

Setup Functions for Bivariate Analysis

# setup function to create category counts and plot a stacked bar chart for bivariate analysis of variables in dataset
# data - dataframe dataset; predictor - independent variable, target - target variable
def stacked_barplot(data, predictor, target):
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False)
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="lower left", frameon=False,)
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()

# setup function to create scatterplot to see how one variable relates to another and whether the predictor categories show distinct behaviors on the target variable
def scatterplot_distribution(data, predictor, target):
  plt.figure(figsize=(12, 6))
  sns.scatterplot(data, x=predictor, y=target, hue=predictor)
  plt.title(f"Scatterplot for {predictor}  Vs {target}")
  plt.show()

# setup function to create boxplot to show the data's median, spread, range, and outlier points
def boxplot_distribution(data, predictor, target):
  plt.figure(figsize=[12, 6])
  sns.boxplot(data, x=predictor, y=target, hue=predictor)
  plt.xticks(rotation=90)
  plt.title(f"Boxplot for {predictor}  Vs {target}")
  plt.show()

# setup function to create grouped barplot to compare multiple related categories side by side within each main category, revealing patterns, differences, and trends in the dataset.
def grouped_barplot(data, group_cols, value_col, x, y, hue=None,
                     agg_func='sum', figsize=(12, 6), title='Grouped Bar Plot'):

    grouped = data.groupby(group_cols)[value_col].agg(agg_func).reset_index()

    plt.figure(figsize=figsize)
    ax = sns.barplot(data=grouped, x=x, y=y, hue=hue)
    ax.set(xlabel=x, ylabel=y, title=title)
    ax.ticklabel_format(style='plain', axis='y')
    ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:,.0f}'))
    plt.xticks(rotation=90)
    if hue:
        plt.legend(title=hue, loc='upper left')
    plt.tight_layout()
    plt.show()

# setup function to create barplot to compare the sum of the revenue to other variables
def revenue_barplot(data, predictor, target):
  group_cols = [predictor]
  agg_data = data.groupby(group_cols)[target].sum().reset_index()
  plt.figure(figsize=(12, 6))
  ax = sns.barplot(data=agg_data, x=predictor, y=target)
  ax.set(xlabel=predictor, ylabel=f'Total {target}', title=f'Total {target} by {predictor}')
  ax.ticklabel_format(style='plain', axis='y')
  ax.yaxis.set_major_formatter(FuncFormatter(lambda x, _: f'{x:,.0f}'))
  plt.xticks(rotation=90)
  plt.tight_layout()
  plt.show()

Correlation Check - Heatmap

# create a heatmap for the correlation of the numeric features
cols_list = data.select_dtypes(include=np.number).columns.tolist()
cols_list.remove('Store_Establishment_Year')
plt.figure(figsize=(20, 10))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap='coolwarm'
)
plt.show()

Observations: The highest correlated features are Product MRP and Product Sales Total 0.79 and the second highest is Product Weight and Product Store Sales Total at 0.74.

Distribution check for relationship/pattern between Product Stores Sales Total and other numeric variables

Product Weight vs Product Store Sales Total

scatterplot_distribution(data, 'Product_Weight', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Weight and Product Store Sales Total a Positive correlation which means as as product weight increases the total store sales will also rise. This could be the result of bulk purchases or greater value for heavier products. This is consistent with the Heat Map results.

Product Allocated Area vs Product Store Sales Total

scatterplot_distribution(data, 'Product_Allocated_Area', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Allocated Area and Product Store Sales Total shows a tight vertical density correlation. There is not a strong relationship between the two variables. In other words, Allocated Area does not strongly influence Sales. This is consistent with the Heat Map results.

Product Maximum Retail Price (MRP) vs Product Store Sales Total

scatterplot_distribution(data, 'Product_MRP', 'Product_Store_Sales_Total')

Observation: This scatterplot shows the relationship between Product Maximum Retail Price and Product Store Sales Total a Positive correlation which means as as product Maximum Retail Price increases the total store sales will also rise. This could be the result of a number of factors randing from product quality, bulk items, or consumer behavior. This is consistent with the Heat Map results.

Distribution check for relationship/pattern between Product Stores Sales Total and other catergorical variables

Product Sugar Content vs Product Store Sales Total

boxplot_distribution(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')

Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Sales Total (3300-3500) so this suggests that sugar content does not meaningfully affect sales.

Product Type vs Product Store Sales Total

boxplot_distribution(data, 'Product_Type', 'Product_Store_Sales_Total')

Observations: The medians for each product type hover around the same Sales Total (3300-3500) so this suggests that product does not meaningfully affect sales.

Store ID vs Product Store Sales Total

boxplot_distribution(data, 'Store_Id', 'Product_Store_Sales_Total')

Observations: The Store ID OUT003 has the highest median at about 4900 while the lowest median of 1800 is at Store OUT002. So Store ID does meaningfully affect sales. May need to research OUT002 to see sales could be boosted. OUT001 seems to pretty stable. OUT004 has a large number of outliers in the upper and lower quartiles with the greatest density in the upper quartile. Could research OUT004 to develop strategies for other stores or review the outliers.

Store Size vs Product Store Sales Total

boxplot_distribution(data, 'Store_Size', 'Product_Store_Sales_Total')

Observations: High store zize has the highest median at about 4000 while the lowest median of 1800 is at Small store size. So Store Size does meaningfully affect sales. May need to research Small store to see sales could be boosted.

Store Location City Type vs Product Store Sales Total

boxplot_distribution(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')

Observations: Tier 1 location city type has the highest median at about 4000 while the lowest median of 1800 is at Tier 3 location city type. So Store Location City Type does meaningfully affect sales. May need to research Tier 3 store to see sales could be boosted.

Store Type vs Product Store Sales Totals

boxplot_distribution(data, 'Store_Type', 'Product_Store_Sales_Total')

Observations: Departmental Store type has the highest median at about 4000 while the lowest median of 1800 is at Food Mart type. So Store Type does meaningfully affect sales. May need to research Food Mart store to see sales could be boosted.

Observations of product weight vs other variables

Product Sugar Content vs Product Weight

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Sugar_Content', y='Product_Weight', hue='Product_Sugar_Content')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Sugar Content')
plt.show()

Observations: There is 'reg' sugar content which will need to be normalized to 'Regular'. The medians for each product sugar content hover around the same Product Weight (12.5) so this suggests that sugar content does not meaningfully affect product weight.

Observations:

Product Type vs Product Weight

plt.figure(figsize=(12, 6))
sns.boxplot(data=data, x='Product_Type', y='Product_Weight', hue='Product_Type')
plt.xticks(rotation=90)
plt.title('Boxplot of Product Weight vs Product Type')
plt.show()

Observations: The medians for each product type hover around the same Product Weight (12.5) so this suggests that product type does not meaningfully affect product weight.

Statistics on each of the stores

store_ids = ['OUT001', 'OUT002', 'OUT003', 'OUT004']

cols_list = ['Store_Establishment_Year', 'Store_Size', 'Store_Location_City_Type', 'Store_Type']
for store in store_ids:
    print(f'\n**** Statistics for Store ID: {store} ****')
    display(data.loc[data['Store_Id'] == store, cols_list].describe(include='all').T)


**** Statistics for Store ID: OUT001 ****

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Store_Establishment_Year	1586.0	NaN	NaN	NaN	1987.0	0.0	1987.0	1987.0	1987.0	1987.0	1987.0
Store_Size	1586	1	High	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1586	1	Tier 2	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1586	1	Supermarket Type1	1586	NaN	NaN	NaN	NaN	NaN	NaN	NaN


**** Statistics for Store ID: OUT002 ****

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Store_Establishment_Year	1152.0	NaN	NaN	NaN	1998.0	0.0	1998.0	1998.0	1998.0	1998.0	1998.0
Store_Size	1152	1	Small	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1152	1	Tier 3	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1152	1	Food Mart	1152	NaN	NaN	NaN	NaN	NaN	NaN	NaN


**** Statistics for Store ID: OUT003 ****

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Store_Establishment_Year	1349.0	NaN	NaN	NaN	1999.0	0.0	1999.0	1999.0	1999.0	1999.0	1999.0
Store_Size	1349	1	Medium	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	1349	1	Tier 1	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	1349	1	Departmental Store	1349	NaN	NaN	NaN	NaN	NaN	NaN	NaN


**** Statistics for Store ID: OUT004 ****

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
Store_Establishment_Year	4676.0	NaN	NaN	NaN	2009.0	0.0	2009.0	2009.0	2009.0	2009.0	2009.0
Store_Size	4676	1	Medium	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Location_City_Type	4676	1	Tier 2	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Store_Type	4676	1	Supermarket Type2	4676	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Observations:

OUT001 Established 1987; Store Size: High; Store Location City Type: Tier 2; Store Type: Supermarket Type 1

OUT002 Established 1998; Store Size: Small; Store Location City Type: Tier 3; Store Type: Food Mart

OUT003 Established 1999; Store Size: Medium; Store Location City Type: Tier 1; Store Type: Departmental Store

OUT004 Established 2009; Store Size: Medium; Store Location City Type: Tier 2; Store Type: Supermarket Type 2

STORE SIZE

OUT001 has store size High
OUT002 has store size Small
OUT003 and OUT004 has store size Medium

STORE LOCATION CITY TYPE

OUT001 and OUT004 has store location city type Tier 2
OUT002 has store city location city type Tier 3
OUT003 has store location city type Tier 1

STORE TYPE

OUT001 has store type Supermarket Type 1
OUT002 has store type Food Mart
OUT003 has store type Departmental Store
OUT004 has store type Supermarket Type 2

COUNTS

OUT001 has a count of 1586
OUT002 has a count of 1152
OUT003 has a count of 1349
OUT004 has a count of 4676

After reviewing the statistics for each store, some of the previous observations start to become a little more clear.

Observations on total revenue for each Store ID

store_revenue = (
    data.groupby('Store_Id')['Product_Store_Sales_Total']
    .sum()
    .loc[store_ids]
)

for store, revenue in store_revenue.items():
    print(f'Store ID: {store}, Total Revenue: ${revenue:,.2f}')

Store ID: OUT001, Total Revenue: $6,223,113.18
Store ID: OUT002, Total Revenue: $2,030,909.72
Store ID: OUT003, Total Revenue: $6,673,457.57
Store ID: OUT004, Total Revenue: $15,427,583.43

Observations: OUT004 reports the most revenue but it also has a larger number of reported data at a count of 4676. OUT002 has the lowest revenue but also has the lowest counts of 1152.

Observations on product store sales total (revenue) generated vs other variables

Store ID vs Product Store Sales Total

revenue_barplot(data, 'Store_Id', 'Product_Store_Sales_Total')

Observations: OUT004 reports the most revenue but it also has a larger number of reported data at a count of 4676. OUT002 has the lowest revenue but also has the lowest counts of 1152.

Store Size vs Product Store Sales Total

revenue_barplot(data, 'Store_Size', 'Product_Store_Sales_Total')

Observations: Medium reports the most revenue but it also has a larger number of reported data of counts. Small has the lowest revenue but also has the lowest counts.

Store Location City Type vs Product Store Sales Total

revenue_barplot(data, 'Store_Location_City_Type', 'Product_Store_Sales_Total')

Observations: Tier 2 reports the most revenue but it also has a larger number of reported data of counts. Tier 3 has the lowest revenue but also has the lowest counts.

Store Type vs Product Store Sales Total

revenue_barplot(data, 'Store_Type', 'Product_Store_Sales_Total')

Observations: Supermarket reports the most revenue but it also has a larger number of reported data of counts. Food Mart has the lowest revenue but also has the lowest counts.

Product Sugar Content vs Product Store Sales Total

revenue_barplot(data, 'Product_Sugar_Content', 'Product_Store_Sales_Total')

Observations: Low Sugar content dominates the other content types with over 17,000,000 in revenue while No Sugar content is lowest in revenue with around 5,000,000.

Observations on product store sales total (revenue) generated vs other variables per product type

Revenue by Product Type

revenue_product = data.groupby('Product_Type')['Product_Store_Sales_Total'].sum().reset_index()

plt.figure(figsize=(12, 6))
sns.barplot(data=revenue_product, x='Product_Type', y='Product_Store_Sales_Total')
plt.xticks(rotation=90)
plt.xlabel('Product Types')
plt.ylabel('Product Store Sales Total')
plt.title('Revenue by Product Type')
plt.show()

Observations: Fruits and Vegetables and Snack Foods are the product types that generate the most revenue. Breakfast and Seafood products generate the least amount.

Store ID vs Product Store Sales Total (for each product type)

grouped_barplot(
   data=data,
    group_cols=['Store_Id', 'Product_Type'],
    value_col='Product_Store_Sales_Total',
    x='Store_Id',
    y='Product_Store_Sales_Total',
    hue='Product_Type',
    title='Revenue generated by each Store ID for each Product Type'
)

Observations: Fruits and Vegetables and Snacks are the best revenue generators for each Store ID.

Product Type vs Product Store Sales Total (for each Store ID)

grouped_barplot(
    data=data,
    group_cols=['Product_Sugar_Content', 'Store_Id'],
    value_col='Product_Store_Sales_Total',
    x='Product_Sugar_Content',
    y='Product_Store_Sales_Total',
    hue='Store_Id',
    title='Revenue generated by each Store Type for each Product Type'
)

Observations: Low Sugar content generates the most revenue for each store ID while No Sugar contents generates the least. The 'reg' product sugar content needs to be normalized at this point.

Data Preprocessing

Feature Engineering

# updating the product sugar content to move reg to Regular
data.Product_Sugar_Content.replace(to_replace=['reg'], value='Regular', inplace=True)
data.Product_Sugar_Content.value_counts()

	count
Product_Sugar_Content
Low Sugar	4885
Regular	2359
No Sugar	1519

dtype: int64

Observations: Normalized 'reg' Product Sugar Content to 'Regular'

# create a variable for number of years that store has been in operation
data['Store_Years_In_Operation'] = 2025 - data.Store_Establishment_Year
data.head()

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total	Store_Years_In_Operation
0	FD6114	12.66	Low Sugar	0.027	Frozen Foods	117.08	OUT004	2009	Medium	Tier 2	Supermarket Type2	2842.40	16
1	FD7839	16.54	Low Sugar	0.144	Dairy	171.43	OUT003	1999	Medium	Tier 1	Departmental Store	4830.02	26
2	FD5075	14.28	Regular	0.031	Canned	162.08	OUT001	1987	High	Tier 2	Supermarket Type1	4130.16	38
3	FD8233	12.10	Low Sugar	0.112	Baking Goods	186.31	OUT001	1987	High	Tier 2	Supermarket Type1	4132.18	38
4	NC1180	9.57	No Sugar	0.010	Health and Hygiene	123.67	OUT002	1998	Small	Tier 3	Food Mart	2279.36	27

# create a variable for product codes to reduce number of product ids for model
data['Product_Code'] = data['Product_Id'].str[:2]
data['Product_Code'].unique()

array(['FD', 'NC', 'DR'], dtype=object)

# print product code arrays
codes = ['FD', 'NC', 'DR']
for code in codes:
    types = data.loc[data['Product_Code'] == code, 'Product_Type'].unique()
    print(f'{code}: {types}')

FD: ['Frozen Foods' 'Dairy' 'Canned' 'Baking Goods' 'Snack Foods' 'Meat'
 'Fruits and Vegetables' 'Breads' 'Breakfast' 'Starchy Foods' 'Seafood']
NC: ['Health and Hygiene' 'Household' 'Others']
DR: ['Hard Drinks' 'Soft Drinks']

# create a variable for product categories to reduce number of product types for model
food = [
    'Frozen Foods', 'Dairy', 'Canned', 'Baking Goods', 'Snack Foods', 'Meat', 'Fruits and Vegetables', 'Breads', 'Breakfast', 'Starchy Foods', 'Seafood'
]
data['Product_Category'] = np.where(data['Product_Type'].isin(food), 'Food', 'Non Food')
data.head()

	Product_Id	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_Type	Product_MRP	Store_Id	Store_Establishment_Year	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total	Store_Years_In_Operation	Product_Code	Product_Category
0	FD6114	12.66	Low Sugar	0.027	Frozen Foods	117.08	OUT004	2009	Medium	Tier 2	Supermarket Type2	2842.40	16	FD	Food
1	FD7839	16.54	Low Sugar	0.144	Dairy	171.43	OUT003	1999	Medium	Tier 1	Departmental Store	4830.02	26	FD	Food
2	FD5075	14.28	Regular	0.031	Canned	162.08	OUT001	1987	High	Tier 2	Supermarket Type1	4130.16	38	FD	Food
3	FD8233	12.10	Low Sugar	0.112	Baking Goods	186.31	OUT001	1987	High	Tier 2	Supermarket Type1	4132.18	38	FD	Food
4	NC1180	9.57	No Sugar	0.010	Health and Hygiene	123.67	OUT002	1998	Small	Tier 3	Food Mart	2279.36	27	NC	Non Food

Outlier Check

# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove('Store_Establishment_Year')
numeric_columns.remove('Store_Years_In_Operation')

plt.figure(figsize=(15, 10))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Observations

There are a number of outliers in the data.
Will not treat them as they are proper values.

Data Preparation for modeling

# drop some of the categorical features for modeling as new concise variables have been created
data = data.drop(columns=['Product_Id', 'Product_Type', 'Store_Id', 'Store_Establishment_Year'])
data.shape

(8763, 11)

data.head()

	Product_Weight	Product_Sugar_Content	Product_Allocated_Area	Product_MRP	Store_Size	Store_Location_City_Type	Store_Type	Product_Store_Sales_Total	Store_Years_In_Operation	Product_Code	Product_Category
0	12.66	Low Sugar	0.027	117.08	Medium	Tier 2	Supermarket Type2	2842.40	16	FD	Food
1	16.54	Low Sugar	0.144	171.43	Medium	Tier 1	Departmental Store	4830.02	26	FD	Food
2	14.28	Regular	0.031	162.08	High	Tier 2	Supermarket Type1	4130.16	38	FD	Food
3	12.10	Low Sugar	0.112	186.31	High	Tier 2	Supermarket Type1	4132.18	38	FD	Food
4	9.57	No Sugar	0.010	123.67	Small	Tier 3	Food Mart	2279.36	27	NC	Non Food

# define the independent and dependent variables
X = data.drop(['Product_Store_Sales_Total'], axis=1)
y = data['Product_Store_Sales_Total']

# splitting data into training and test set

# split data into 2 parts: Train + Temp (80%) and Test (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)

# print the number of rows of each dataset
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])

Number of rows in train data = 7010
Number of rows in test data = 1753

Observations: The train data of 7010 is 80% of the original 8763. And the test data of 1753 is 20% of the original 8763.

Data Preprocessing Pipeline

# create a categorical feature that stores a list of the categorical column names
categorical_features = data.select_dtypes(include=['object', 'category']).columns.tolist()
categorical_features

['Product_Sugar_Content',
 'Store_Size',
 'Store_Location_City_Type',
 'Store_Type',
 'Product_Code',
 'Product_Category']

# create a preprocessing pipeline for the categorical features
preprocessor = make_column_transformer(
    (Pipeline([('encoder', OneHotEncoder(handle_unknown='ignore'))]), categorical_features)
)

Model Building

Define functions for Model Evaluation

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

Chose the ML models below randomly as part of this project exercise:

Random Forest
XGBoost

Random Forest Model

# fitting the random forest model
rf_estimator = RandomForestRegressor(random_state=42)
rf_estimator = make_pipeline(preprocessor, rf_estimator)
rf_estimator.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('randomforestregressor',
                 RandomForestRegressor(random_state=42))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

# calculating the training metric
rf_estimator_model_train_perf = model_performance_regression(rf_estimator, X_train,y_train)
print("Training performance \n",rf_estimator_model_train_perf)

Training performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  604.135272  475.762704    0.67816          0.6777  0.173159

# calculating the testing metric
rf_estimator_model_test_perf = model_performance_regression(rf_estimator, X_test,y_test)
print("Testing performance \n", rf_estimator_model_test_perf)

Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.595427  469.084204   0.687017         0.68522  0.168082

Observations: There is some consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

###XGBoost Regressor Model

# fitting the XGBoost model
xgb_estimator = XGBRegressor(random_state=42)
xgb_estimator = make_pipeline(preprocessor, xgb_estimator)
xgb_estimator.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

?Documentation for PipelineiFitted

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=None, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=None, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])

columntransformer: ColumnTransformer

?Documentation for columntransformer: ColumnTransformer

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Code', 'Product_Category'])])

pipeline

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Code', 'Product_Category']

OneHotEncoder

?Documentation for OneHotEncoder

OneHotEncoder(handle_unknown='ignore')

XGBRegressor

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)

# calculating the training metric
xgb_estimator_model_train_perf = model_performance_regression(xgb_estimator, X_train,y_train)
print("Training performance \n",xgb_estimator_model_train_perf)

Training performance 
          RMSE        MAE  R-squared  Adj. R-squared      MAPE
0  604.129941  475.48628   0.678166        0.677706  0.173112

# calculating the testing metric
xgb_estimator_model_test_perf = model_performance_regression(xgb_estimator, X_test,y_test)
print("Testing performance \n", xgb_estimator_model_test_perf)

Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.658808  468.827931    0.68695        0.685153  0.168053

Observations: There is some consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

Model Performance Improvement - Hyperparameter Tuning

Random Forest Model Hyperparameter Tuning

# initialize the Random Forest regressor model
rf_tuned = RandomForestRegressor(random_state=42)
rf_tuned = make_pipeline(preprocessor, rf_tuned)

# set grid of parameters to choose from
param_grid = {
    'randomforestregressor__n_estimators': [80, 90, 100, 110],
    'randomforestregressor__max_depth': [4, 6, 8, None],
    'randomforestregressor__max_features': ['sqrt', 'log2', None],
    'randomforestregressor__min_samples_split': [2, 5, 10],
}

# running the grid search
grid_search_rf = GridSearchCV(rf_tuned, param_grid, scoring=r2_score, cv=3, n_jobs=-1)
grid_search_rf = grid_search_rf.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
rf_tuned = grid_search_rf.best_estimator_

Best parameters for Random Forest: {'randomforestregressor__max_depth': 4, 'randomforestregressor__max_features': 'sqrt', 'randomforestregressor__min_samples_split': 2, 'randomforestregressor__n_estimators': 80}

# calculating the training metric
rf_tuned_model_train_perf = model_performance_regression(rf_tuned, X_train, y_train)
print("Training performance \n", rf_tuned_model_train_perf)

Training performance 
          RMSE         MAE  R-squared  Adj. R-squared     MAPE
0  605.301232  479.987872   0.676917        0.676455  0.17419

# calculating the testing metric
rf_tuned_model_test_perf = model_performance_regression(rf_tuned, X_test, y_test)
print("Testing performance \n", rf_tuned_model_test_perf)

Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.359834  471.864334   0.687263        0.685468  0.168637

Observations: The R2 went down slightly with tuning, but there is still consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

XGBoost Model Hyperparameter Tuning

# initialize the XGBoost regressor model
xgb_tuned = XGBRegressor(random_state=42)
xgb_tuned = make_pipeline(preprocessor, xgb_tuned)

# set grid of parameters to choose from
param_grid = {
    'xgbregressor__n_estimators': [75, 100, 125],
    'xgbregressor__subsample': [0.7, 0.8, 0.9],
    'xgbregressor__gamma': [0, 1, 3],
    'xgbregressor__colsample_bytree':[0.7, 0.8, 0.9],
    'xgbregressor__colsample_bylevel':[0.7, 0.8, 0.9]
}

# running the grid search
grid_search_xgb = GridSearchCV(xgb_tuned, param_grid, scoring=r2_score, cv=3, n_jobs=-1)
grid_search_xgb = grid_search_xgb.fit(X_train, y_train)

# printing the best combination of parameters
print(f"Best parameters for XGBoost: {grid_search_xgb.best_params_}")
xgb_tuned = grid_search_xgb.best_estimator_

Best parameters for XGBoost: {'xgbregressor__colsample_bylevel': 0.7, 'xgbregressor__colsample_bytree': 0.7, 'xgbregressor__gamma': 0, 'xgbregressor__n_estimators': 75, 'xgbregressor__subsample': 0.7}

# calculating the training metric
xgb_tuned_model_train_perf = model_performance_regression(xgb_tuned, X_train, y_train)
print("Training performance \n", xgb_tuned_model_train_perf)

Training performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  604.197753  475.240343   0.678094        0.677634  0.172907

# calculating the testing metric
xgb_tuned_model_test_perf = model_performance_regression(xgb_tuned, X_test, y_test)
print("Testing performance \n", xgb_tuned_model_test_perf)

Testing performance 
          RMSE         MAE  R-squared  Adj. R-squared      MAPE
0  597.659713  468.865812   0.686949        0.685152  0.168031

Observations: The R2 went up slightly with tuning, but there is still consistent behavior on R-squared for both training and testing. There is no overfitting observed in these metrics. The R2 score is solid predictor of accuracy.

Model Performance Comparison, Final Model Selection, and Serialization

# calculating the training model performance comparison
models_train_comp_df = pd.concat(
    [
        rf_tuned_model_train_perf.T,
        xgb_tuned_model_train_perf.T
    ],
    axis=1,
)
models_train_comp_df.columns = ['Random Forest', 'XGBoost']
print('Training performance comparison:')
models_train_comp_df

Training performance comparison:

	Random Forest	XGBoost
RMSE	605.301232	604.197753
MAE	479.987872	475.240343
R-squared	0.676917	0.678094
Adj. R-squared	0.676455	0.677634
MAPE	0.174190	0.172907

# calculating the training model performance comparison
models_test_comp_df = pd.concat(
    [
        rf_tuned_model_test_perf.T,
        xgb_tuned_model_test_perf.T
    ],
    axis=1,
)
models_test_comp_df.columns = ['Random Forest', 'XGBoost']
print('Testing performance comparison:')
models_test_comp_df

Testing performance comparison:

	Random Forest	XGBoost
RMSE	597.359834	597.659713
MAE	471.864334	468.865812
R-squared	0.687263	0.686949
Adj. R-squared	0.685468	0.685152
MAPE	0.168637	0.168031

# calculating the difference in the training model performance comparison
(models_train_comp_df - models_test_comp_df).iloc[2]

	R-squared
Random Forest	-0.010347
XGBoost	-0.008856

dtype: float64

The R2 differences show that both the Random Forest and XGBoost models have very stable, consistent performance with very minor variance in R-squared between train and test models. As a result this would suggest that these models are reliable and generalize well.

Model selection: Will choose the the tuned XGBoost model as the best model as the R2 score is just slightly better.

Model Serialization

# create a folder to store the files that will be used for the backend server deployment
import os
os.makedirs("backend_files", exist_ok=True)

# define the file path to save (serialize) the trained regression model
model_path = "backend_files/sales_prediction_model_v1_0.joblib"

# save the trained regression model and preprocessor using joblibe
joblib.dump(xgb_tuned, model_path)

print(f'Model saved successfully at {model_path}')

Model saved successfully at backend_files/sales_prediction_model_v1_0.joblib

saved_model = joblib.load('backend_files/sales_prediction_model_v1_0.joblib')

print('Model loaded successfully.')

Model loaded successfully.

saved_model

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=0, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=75, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Pipeline

?Documentation for PipelineiFitted

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('encoder',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  ['Product_Sugar_Content',
                                                   'Store_Size',
                                                   'Store_Location_City_Type',
                                                   'Store_Type', 'Product_Code',
                                                   'Product_Category'])])),
                ('xgbregressor',
                 XGBRegressor(base_score=None, booster=None, callbacks=None,
                              colsample_...
                              feature_types=None, gamma=0, grow_policy=None,
                              importance_type=None,
                              interaction_constraints=None, learning_rate=None,
                              max_bin=None, max_cat_threshold=None,
                              max_cat_to_onehot=None, max_delta_step=None,
                              max_depth=None, max_leaves=None,
                              min_child_weight=None, missing=nan,
                              monotone_constraints=None, multi_strategy=None,
                              n_estimators=75, n_jobs=None,
                              num_parallel_tree=None, random_state=42, ...))])

columntransformer: ColumnTransformer

?Documentation for columntransformer: ColumnTransformer

ColumnTransformer(transformers=[('pipeline',
                                 Pipeline(steps=[('encoder',
                                                  OneHotEncoder(handle_unknown='ignore'))]),
                                 ['Product_Sugar_Content', 'Store_Size',
                                  'Store_Location_City_Type', 'Store_Type',
                                  'Product_Code', 'Product_Category'])])

pipeline

['Product_Sugar_Content', 'Store_Size', 'Store_Location_City_Type', 'Store_Type', 'Product_Code', 'Product_Category']

OneHotEncoder

?Documentation for OneHotEncoder

OneHotEncoder(handle_unknown='ignore')

XGBRegressor

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=0.7, colsample_bynode=None, colsample_bytree=0.7,
             device=None, early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=0, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=75,
             n_jobs=None, num_parallel_tree=None, random_state=42, ...)

saved_model.predict(X_test)

array([3283.6877, 3282.5544, 3995.6821, ..., 3859.6404, 3282.5544,
       3995.6821], dtype=float32)

Deployment - Backend

# import the login function from the huggingface_hub library
from huggingface_hub import login, HfApi

# import the create_repo function from the huggingface_hub library
from huggingface_hub import create_repo

# access the secret key in Python
from google.colab import userdata
secret_value = userdata.get('BE_Token')

# login to HuggingFace with access token
login(token=secret_value)

# create the repository for the HuggingFace Space
try:
    create_repo("BigGnTX/superkart_backend",
        repo_type="space",
        space_sdk="docker",
        private=False
    )
except Exception as e:
    # handle any potential errors
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Respository not created.")
    else:
        print(f"Error: {e}")

Flask Web Framework

# create the app.py file in the backend_files folder
%%writefile backend_files/app.py
import numpy as np
import joblib
import pandas as pd
from flask import Flask, request, jsonify

# initialize the flask with a name
sales_forecast_api = Flask("Sales Forecast Predictor")

# load the trained sales forecast model
model = joblib.load("sales_prediction_model_v1_0.joblib")

# define the route for the home page
@sales_forecast_api.get('/')
def home():
    return "Welcome to the Sales Forecast Prediction API!"

# define the endpoint to predict sales forecast
@sales_forecast_api.post('/v1/predict')
def predict_sales():
    # get the JSON data from the request
    predict_data = request.get_json()


    # extract relevant features from the input data.
    sample = {
        'Product_Weight': predict_data['Product_Weight'],
        'Product_Sugar_Content': predict_data['Product_Sugar_Content'],
        'Product_Allocated_Area': predict_data['Product_Allocated_Area'],
        'Product_MRP': predict_data['Product_MRP'],
        'Store_Size': predict_data['Store_Size'],
        'Store_Location_City_Type': predict_data['Store_Location_City_Type'],
        'Store_Type': predict_data['Store_Type'],
        'Store_Years_In_Operation': predict_data['Store_Years_In_Operation'],
        'Product_Code': predict_data['Product_Code'],
        'Product_Category': predict_data['Product_Category']

    }

    # convert the extracted data into a DataFrame
    input_data = pd.DataFrame([sample])

    # make a sales forecast prediction using the trained model
    prediction = model.predict(input_data).tolist()[0]

    # return the prediction as a JSON response
    return jsonify({'Sales': prediction})

# Run the Flask app in debug mode
if __name__ == '__main__':
    sales_forecast_api.run(debug=True)

Writing backend_files/app.py

Dependencies File

# create the requirements.txt file in the backend_files folder
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
seaborn==0.13.2
joblib==1.4.2
xgboost==2.1.4
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.32.4

Writing backend_files/requirements.txt

Dockerfile

# create the Dockerfile in the backend_files folder
%%writefile backend_files/Dockerfile
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy all files from the current directory to the container's working directory
COPY . .

# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:app`: Runs the Flask app (assuming `app.py` contains the Flask instance named `app`)
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:sales_forecast_api"]

Writing backend_files/Dockerfile

Setting up a Hugging Face Docker Space for the Backend

Uploading Files to Hugging Face Space (Docker Space)

# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "BigGnTX/superkart_backend"

# initialize the API
api = HfApi()

# upload Streamlit app files stored in the folder called backend_files
api.upload_folder(
    folder_path="backend_files",
    repo_id=repo_id,
    repo_type="space"
)

{"model_id":"971072dbb57c4b33993b9f5b96cbbce0","version_major":2,"version_minor":0}

{"model_id":"7fc82ecaa43246f2a6663fddeb689e0b","version_major":2,"version_minor":0}

{"model_id":"da2fba831f3a44d98829a87e915d3133","version_major":2,"version_minor":0}

{"type":"string"}

Deployment - Frontend

Points to note before executing the below cells

Create a Streamlit space on Hugging Face by following the instructions provided on the content page titled Creating Spaces and Adding Secrets in Hugging Face from Week 1

Streamlit for Interactive UI

# Create a folder for storing the files needed for frontend UI deployment
os.makedirs('frontend_files', exist_ok=True)

#create the app.py file in the frontend_files folder
%%writefile frontend_files/app.py
import requests
import streamlit as st
import pandas as pd

st.title('Sales Forecast Prediction')

# input fields for store and product data
Product_Weight = st.slider('Product Weight', min_value=0.0, max_value=30.0, value=12.66)
Product_Sugar_Content = st.selectbox('Product Sugar Content', ['Low Sugar', 'Regular', 'No Sugar'])
Product_Allocated_Area = st.slider('Product Allocated Area', min_value=0.0, max_value=1.0, value = 0.027)
Product_MRP = st.slider('Product MRP', min_value=0.0, max_value=300.0, value = 117.08)
Store_Size = st.selectbox('Store Size', ['Small', 'Medium', 'High'])
Store_Location_City_Type = st.selectbox('Store Location City Type', ['Tier 1', 'Tier 2', 'Tier 3'])
Store_Type = st.selectbox('Store Type', ['Supermarket Type 1', 'Supermarket Type 2', 'Departmental Store', 'Food Mart'])
Store_Years_In_Operation = st.slider('Store Years In Operation', min_value=1, max_value=50, value = 20)
Product_Code = st.selectbox('Product Code', ['FD', 'NC', 'DR'])
Product_Category = st.selectbox('Product Category', ['Food', 'Non Food'])

# converting user input into a DataFrame
forecast_data = {
    'Product_Weight': Product_Weight,
    'Product_Sugar_Content': Product_Sugar_Content,
    'Product_Allocated_Area': Product_Allocated_Area,
    'Product_MRP': Product_MRP,
    'Store_Size': Store_Size,
    'Store_Location_City_Type': Store_Location_City_Type,
    'Store_Type': Store_Type,
    'Store_Years_In_Operation': Store_Years_In_Operation,
    'Product_Code': Product_Code,
    'Product_Category': Product_Category
}

# making prediction when the "Predict" button is clicked
if st.button('Predict', type='primary'):
  response = requests.post('https://biggntx-superkart-backend.hf.space/v1/predict', json=forecast_data)
  if response.status_code == 200:
    result = response.json()
    sales_prediction = result['Sales']
    st.write(f'The Predicted Product Store Sales Total is: ${sales_prediction:.2f}.')
  else:
    st.error(f"Error in API request: Status code {response.status_code}\n{response.text}")

Writing frontend_files/app.py

Dependencies File

#create the requirements.txt file in the frontend_files folder
%%writefile frontend_files/requirements.txt
pandas==2.2.2
requests==2.32.4
streamlit==1.45.0

Writing frontend_files/requirements.txt

DockerFile

#create a Dockerfile in the frontend_files folder
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# set the working directory inside the container to /app
WORKDIR /app

# copy all files from the current directory on the host to the container's /app directory
COPY . .

# install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions

Writing frontend_files/Dockerfile

Uploading Files to Hugging Face Space (Streamlit Space)

# setting access key and repo_id of the front end HuggingFace app
# access the front end forecast at https://huggingface.co/spaces/BigGnTX/superkart_forecast
repo_id = "BigGnTX/superkart_forecast"
access_key = "HF_Token"


# login to HuggingFace with access token
login(token=access_key)

# initialize the API
api = HfApi()

# uploading Streamlit app files stored in the folder called frontend_files
api.upload_folder(
    folder_path='/content/frontend_files',
    repo_id=repo_id,
    repo_type='space',
)

{"type":"string"}

Actionable Insights and Business Recommendations

Chose the tuned XGBoost model as it had the highest R-squared score of the two models. It displayed consistent behavior on R-squared for both training and testing.
The R2 score is solid predictor of accuracy and the metric is consistent with no obseravble overfitting.
Seafood and breakfast product types were low revenue generators. The stores may consider sales/rebates/other incentives to increase the sales of those items.
OUT004 is a Medium size Supermarket Type 2 store located in a moderate standard of living area that had 4676 rows of data while the closest other store was OUT001 with a count of 1586. OUT004 reported over 53% (8763) of the totals of the 4 stores. There should be further analysis as to why this store had more sales for accuaracy of reporting from other stores or to learn best practices on why that store was able to be such a high performer.
Most of the EDA analysis is indicative of the high performance revenue and reporting from OUT004.
One feature that could be added for the model is a Standard of Living factor to account for the various costs of products based on location.
Another factor that should be further analyzed is the difference in costs based on convenience at Food Marts which may or may not have different pricing criteria for specific products.
Future enhancements to the code could include batch process to handle batch files.
The forecast app can be located at https://huggingface.co/spaces/BigGnTX/superkart_forecast