About:
This project performs clustering on products according to their product characteristics, sales trends and shipping details, in order to optimise stock levels and enable targeted marketing and pricing strategies.
Details of the dataset selected for this application:
Name of dataset: E-Commerce Dataset
Source: Kaggle
(https://www.kaggle.com/datasets/malaiarasugraj/e-commerce-dataset)
Instances: 1,000,000 rows, 16 columns
Description of dataset: Based on the information given at the source, the selected dataset consists of 1,000,000 rows, where each row corresponds to a simulated snapshot of e-commerce operations, including information on products, customers, pricing, sales trends and shipping details.
| Variable | Description | Data Type | Missing Value |
|---|---|---|---|
| Product ID | Unique identifier for each product. | String | no |
| Product Name | The name or title of the product listed in the catalog. | Categorical | no |
| Category | The category or type of the product (e.g., Electronics, Clothing, Home Decor). | Categorical | no |
| Price | The price of the product in USD. | Float | no |
| Discount | The discount applied to the product as a percentage of the original price. | Integer | no |
| Tax Rate | The applicable tax rate for the product as a percentage. | Integer | no |
| Stock Level | The number of units currently available in inventory. | Integer | no |
| Supplier ID | A unique identifier for the supplier of the product. | String | no |
| Customer Age Group | The age group of customers who frequently purchase this product (e.g., Teens, Adults, Seniors). | Categorical | no |
| Customer Location | The geographical location of customers (e.g., Country, State, or City). | Categorical | no |
| Customer Gender | The gender(s) of customers most likely to purchase this product (e.g., Male, Female, Both). | Categorical | no |
| Shipping Cost | The cost of shipping the product in USD. | Float | no |
| Shipping Method | The method of shipping used (e.g., Standard, Express, Overnight). | Categorical | no |
| Return Rate | The percentage of orders for this product that are returned by customers. | Float | no |
| Seasonality | The season(s) during which the product is most popular (e.g., Winter, Summer, All-Year). | Categorical | no |
| Popularity Index | A score indicating the product's popularity on a scale of 0 to 100. | Integer | no |
The objective is to cluster products according to their product characteristics, sales trends and shipping details, in order to optimise stock levels and enable targeted marketing and pricing strategies.
Model used: K-Means
- Scales well with large datasets: The dataset used in this study is large, with 1,000,000 rows and 5 candidate features. To keep the study feasible under limited computing power, K-Means is selected.
- No outliers: A drawback of K-Means is its sensitivity to outliers; fortunately, no outliers are observed in the features used for clustering in this dataset.
DBSCAN is not selected in this study because its time complexity is high on large datasets, and it is sensitive to its hyperparameter values, which requires more computational effort for fine-tuning.
Parameters used: The two key parameters are selected based on an experiment to determine the best number of clusters for the dataset, using inertia as the criterion.
- K (number of clusters): 5
- Metric: Euclidean (commonly used, and chosen for simplicity)
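A minimal sketch of fitting K-Means with these parameters is shown below. It uses synthetic random data as a stand-in for the scaled product features (the real dataset is not bundled here), and scikit-learn's `KMeans`, which uses Euclidean distance by default:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled product features
# (the real dataset has 1,000,000 rows; 10,000 here for brevity).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 4))

# K-Means with the parameters chosen in this study: K=5,
# Euclidean distance (scikit-learn's default for KMeans).
km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(labels.shape)   # one cluster label per product
print(km.inertia_)    # sum of squared distances to the centroids
```

On the real data, `X` would be the standardised feature matrix rather than random noise.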
Clustering Steps
These are the steps performed in the clustering pipeline:
Step 1: Data Cleaning
The following sanity checks have been performed on the data:
- Missing values: no missing values are found.
- Inconsistent values: no inconsistencies are found in the categorical variables.
- Extreme values: no abnormally extreme values are found in the numerical variables.
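These checks can be sketched with pandas as below. The tiny DataFrame is a hypothetical stand-in; the column names follow the dataset description above:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "Price": [19.99, 5.50, 120.00],
    "Category": ["Electronics", "Clothing", "Home Decor"],
    "Stock Level": [10, 250, 3],
})

# Missing values: count of NaNs per column.
print(df.isna().sum())

# Inconsistent values: inspect the unique categorical labels.
print(df["Category"].unique())

# Extreme values: summary statistics of the numerical variables.
print(df[["Price", "Stock Level"]].describe())
```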
Step 2: Feature Engineering (Data Encoding)
In this step, the numerical variables used for clustering are standardised: the mean is removed and the data is scaled to unit variance so that all features are on a similar scale. This avoids biasing distance-based algorithms such as K-Means towards features with large magnitudes.
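Standardisation of this kind is typically done with scikit-learn's `StandardScaler`; a sketch with synthetic stand-in features on very different scales (e.g. Price in USD vs Popularity Index on 0-100):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in numerical features on very different scales.
X = np.column_stack([
    rng.uniform(1, 2000, 1000),   # e.g. Price
    rng.uniform(0, 100, 1000),    # e.g. Popularity Index
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardisation each feature has ~zero mean and unit variance.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```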
Step 3: Feature Selection
A quick experiment is conducted to select the features by checking the overall distortion score (inertia) of the model trained on each candidate feature set. The features below are initially selected because they relate directly to the product, based on the following hypotheses:
| Features | Justification |
|---|---|
| Price | Reflects the product's affordability. |
| Discount | Reflects promotional activity and attractiveness to price-sensitive customers. |
| Stock Level | Reflects product availability and inventory trends (demand). |
| Shipping Cost | Reflects customer satisfaction and product profitability (a higher shipping cost may indicate larger or premium products). |
| Popularity Index | Reflects customer preferences and product demand. |
Based on the plot above, using the input features ['Price', 'Discount', 'Stock Level', 'Shipping Cost', 'Popularity Index'], the model gives an overall inertia of 3161661, with the best number of clusters at 5.
By excluding 'Discount', the inertia improves. Therefore, the final input features used for clustering are ['Price', 'Stock Level', 'Shipping Cost', 'Popularity Index'].
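The feature-set comparison can be sketched as below. The DataFrame is synthetic stand-in data with the same column names and rough value ranges as the dataset description, so the inertia values themselves are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for the product table (5,000 rows for brevity).
df = pd.DataFrame({
    "Price": rng.uniform(1, 2000, 5000),
    "Discount": rng.integers(0, 50, 5000),
    "Stock Level": rng.integers(0, 500, 5000),
    "Shipping Cost": rng.uniform(2, 40, 5000),
    "Popularity Index": rng.integers(0, 101, 5000),
})

def inertia_for(features, k=5):
    """Standardise the chosen features, fit K-Means, return inertia."""
    X = StandardScaler().fit_transform(df[features])
    return KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_

all_feats = ["Price", "Discount", "Stock Level", "Shipping Cost", "Popularity Index"]
without_discount = [f for f in all_feats if f != "Discount"]

i_all = inertia_for(all_feats)
i_reduced = inertia_for(without_discount)
print(i_all, i_reduced)
```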
Step 4: Model Training
In this stage, an experiment is conducted to search for the optimal number of clusters for K-Means, using the elbow method. Details of the experiment:
- Perform K-Means clustering with a varying number of clusters K.
- Plot the distortion score (metric used: sum of squared distances, i.e. inertia) against the number of clusters, and locate the "elbow" in the graph.
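The elbow search above can be sketched as follows. Synthetic, well-separated blobs are used so that a clear elbow appears at the true cluster count (here 3), which is the pattern one looks for in the real plot:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Three well-separated blobs so an elbow appears near k=3.
X = np.concatenate([
    rng.normal(loc=c, scale=0.3, size=(300, 2)) for c in (0, 5, 10)
])

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
# Inertia drops sharply up to the true cluster count, then flattens;
# the "elbow" marks the chosen K.
```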
Step 5: Model Evaluation
Since there is no ground-truth labelling against which to check the cluster assignments, principal component analysis is applied to reduce the input features to two dimensions so that they can be visualised in a scatterplot and inspected for distinct clusters. Based on the plot shown below, the clusters appear distinguishable; hence, the cluster quality is good.
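This PCA-based check can be sketched as below, again on synthetic separated blobs standing in for the four scaled input features; the actual scatterplot call is left as a comment so the sketch runs headless:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic stand-in for the 4 scaled input features: 5 separated groups.
X = np.concatenate([
    rng.normal(loc=m, scale=0.5, size=(500, 4))
    for m in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0],
              [0, 0, 3, 0], [0, 0, 0, 3])
])

labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(X)

# Project to 2D for visual inspection of cluster separation.
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)

# A scatterplot would then colour points by cluster label, e.g.:
# import matplotlib.pyplot as plt
# plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=2); plt.show()
```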
Table below shows the summary statistics for the clusters:
| Cluster No. | Price (USD) | Stock Level | Shipping Cost (USD) | Popularity Index |
|---|---|---|---|---|
| 0 | 1380.14 | 239.87 | 13.99 | 76.83 |
| 1 | 464.83 | 387.98 | 24.09 | 52.70 |
| 2 | 1508.33 | 331.34 | 35.63 | 36.39 |
| 3 | 734.45 | 123.38 | 37.15 | 60.52 |
| 4 | 969.26 | 162.06 | 13.44 | 23.19 |
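A per-cluster summary of this kind is typically produced with a pandas groupby over the cluster labels; the sketch below uses synthetic stand-in data, so its numbers will differ from the table above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Synthetic stand-in for the final input features.
df = pd.DataFrame({
    "Price": rng.uniform(1, 2000, 3000),
    "Stock Level": rng.integers(0, 500, 3000),
    "Shipping Cost": rng.uniform(2, 40, 3000),
    "Popularity Index": rng.integers(0, 101, 3000),
})

# Cluster on the standardised features, then summarise in original units.
X = StandardScaler().fit_transform(df)
df["Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=5).fit_predict(X)

# Mean of each feature per cluster, in the original (unscaled) units.
summary = df.groupby("Cluster").mean().round(2)
print(summary)
```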
From the summary statistics table, we can characterise the clusters as follows:
| Cluster No. | Product Characteristics |
|---|---|
| Cluster 0 | High price, low shipping cost, very high popularity. |
| Cluster 1 | Lowest price, highest stock level, moderate popularity. |
| Cluster 2 | Highest price, high stock level, high shipping cost, low popularity. |
| Cluster 3 | Mid-range price, lowest stock level, highest shipping cost, fairly high popularity. |
| Cluster 4 | Mid-to-high price, low stock level, lowest shipping cost, lowest popularity. |
Based on the clusters, it can be deduced that clusters 0 and 2 represent high-value and premium products, clusters 3 and 4 represent mid-range products, while cluster 1 represents lower-value products.
Specific marketing and pricing strategies can then be tailored to each cluster, for example:
| Cluster | Marketing & Pricing Strategy |
|---|---|
| Cluster 0 | Popular high-value products: maintain premium pricing, highlight the low shipping cost, and cross-sell to engaged customers. |
| Cluster 1 | Low-value, heavily stocked products: run volume discounts and bundle promotions to move the high stock level. |
| Cluster 2 | Expensive but unpopular products: reassess pricing, consider free or discounted shipping, and target niche premium segments. |
| Cluster 3 | Popular products with low stock: replenish inventory promptly, and consider modest price increases given strong demand. |
| Cluster 4 | Low-popularity mid-range products: use targeted promotions and markdowns to stimulate demand before stock ages. |


