chenlamchan/Unsupervised-Learning-Clustering-ECommerce

Unsupervised-Learning-Clustering-ECommerce

About:

This project clusters products by their product characteristics, sales trends and shipping attributes, in order to optimise stock levels and enable targeted marketing and pricing strategies.

Dataset:

Details of the dataset selected for this application:

Name of dataset: E-Commerce Dataset
Source: Kaggle (https://www.kaggle.com/datasets/malaiarasugraj/e-commerce-dataset)
Instances: 1,000,000 rows, 16 columns
Description of dataset: According to the source, the dataset consists of 1,000,000 rows, each giving a simulated overview of e-commerce operations, including information on products, customers, pricing, sales trends and shipping details.

| Variable | Description | Data Type | Missing Value |
| --- | --- | --- | --- |
| Product ID | Unique identifier for each product. | String | No |
| Product Name | The name or title of the product listed in the catalog. | Categorical | No |
| Category | The category or type of the product (e.g., Electronics, Clothing, Home Decor). | Categorical | No |
| Price | The price of the product in USD. | Float | No |
| Discount | The discount applied to the product as a percentage of the original price. | Integer | No |
| Tax Rate | The applicable tax rate for the product as a percentage. | Integer | No |
| Stock Level | The number of units currently available in inventory. | Integer | No |
| Supplier ID | A unique identifier for the supplier of the product. | String | No |
| Customer Age Group | The age group of customers who frequently purchase this product (e.g., Teens, Adults, Seniors). | Categorical | No |
| Customer Location | The geographical location of customers (e.g., Country, State, or City). | Categorical | No |
| Customer Gender | The gender(s) of customers most likely to purchase this product (e.g., Male, Female, Both). | Categorical | No |
| Shipping Cost | The cost of shipping the product in USD. | Float | No |
| Shipping Method | The method of shipping used (e.g., Standard, Express, Overnight). | Categorical | No |
| Return Rate | The percentage of orders for this product that are returned by customers. | Float | No |
| Seasonality | The season(s) during which the product is most popular (e.g., Winter, Summer, All-Year). | Categorical | No |
| Popularity Index | A score indicating the product's popularity on a scale of 0 to 100. | Integer | No |

Task:

To cluster products by their product characteristics, sales trends and shipping attributes, in order to optimise stock levels and enable targeted marketing and pricing strategies.

Model Development

Model used: K-Means

  • Scales well with large datasets: The dataset used in this study is large, with 1,000,000 rows and 5 features. K-Means is selected so that the study can be completed within the available computing power.
  • No outliers: A drawback of K-Means is its sensitivity to outliers. In this dataset, no outliers were observed in the features used for clustering.

DBSCAN is not selected in this study because its time complexity is high on large datasets and it is sensitive to its hyperparameter values, so fine-tuning it requires more computational power.

Parameters used: The two key parameters are listed below. The number of clusters is chosen from an experiment that determines the best number of clusters for the dataset based on the inertia.

  • K (number of clusters): 5
  • Metric: Euclidean (commonly used, and chosen for simplicity)

Code Execution

Clustering Steps
These are the steps performed for clustering:
Step 1: Data Cleaning
Sanity checks on the data have been performed:

  • Missing values: No missing values are found.
  • Inconsistent values: No inconsistency is found in the categorical values.
  • Extreme values: No abnormal extreme values are found in the numerical variables.
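
The three checks above can be sketched in plain Python as follows. This is a minimal illustration and not the project's own code: the helper names (`count_missing`, `category_counts`, `numeric_range`) and the sample rows are hypothetical, and the rows are deliberately flawed so that each check has something to report.

```python
from collections import Counter

# Hypothetical, deliberately flawed sample rows (not taken from the real dataset).
rows = [
    {"Category": "Electronics", "Price": 199.99, "Stock Level": 120},
    {"Category": "Clothing", "Price": 35.50, "Stock Level": 300},
    {"Category": "electronics", "Price": None, "Stock Level": 80},
]


def count_missing(rows):
    """Count None or empty-string values per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    return dict(missing)


def category_counts(rows, column):
    """Count each distinct label so that spelling or casing variants stand out."""
    return Counter(row[column] for row in rows if row[column] is not None)


def numeric_range(rows, column):
    """Return (min, max) of a numeric column, ignoring missing values."""
    values = [row[column] for row in rows if row[column] is not None]
    return min(values), max(values)


missing = count_missing(rows)
labels = category_counts(rows, "Category")
price_range = numeric_range(rows, "Price")
print(missing, dict(labels), price_range)
```

On a clean dataset such as the one described here, `count_missing` would return an empty dictionary and `category_counts` would show only the expected labels.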

Step 2: Feature Engineering (Data Encoding)
In this step, the numerical variables used for clustering are transformed through standardisation. This removes the mean and scales each feature to unit variance, so that all features have a similar scale and none of them dominates a distance-based algorithm such as K-Means.
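
As a minimal sketch of standardisation, each value x is replaced by (x - mean) / standard deviation. The README does not name the library used, so the function below is a hypothetical stand-in written with the standard library, and the prices are made-up values. It assumes the population standard deviation.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values, mu=mean)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


prices = [10.0, 20.0, 30.0, 40.0]  # made-up values for illustration
scaled = standardise(prices)
print(scaled)
```

After this transformation a feature measured in the thousands (such as Price) and one measured in tens (such as Shipping Cost) contribute on a comparable scale to the Euclidean distance.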

Step 3: Feature Selection
A quick experiment is conducted to select the features by checking the overall distortion score (inertia) of the model trained on them. Initially, the features below are selected because they relate directly to the product, based on the following hypotheses:

| Features | Justification |
| --- | --- |
| Price | Reflects the product's affordability. |
| Discount | Reflects promotional activity and the product's attractiveness to price-sensitive customers. |
| Stock Level | Reflects product availability and inventory trends (demand). |
| Shipping Cost | Reflects overall customer satisfaction and product profitability. (A higher shipping cost may indicate larger, premium products.) |
| Popularity Index | Reflects customer preferences and product demand. |

[Figure: distortion-1]

Based on the plot above, with the input features ['Price', 'Discount', 'Stock Level', 'Shipping Cost', 'Popularity Index'], the model gives an overall inertia of 3161661, and the best number of clusters is 5.

[Figure: distortion-2]

After excluding 'Discount', the inertia decreased. Therefore, the final input features used for clustering are ['Price', 'Stock Level', 'Shipping Cost', 'Popularity Index'].
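
For reference, the inertia used as the distortion score can be written out as below. The symbols are assumptions introduced here, not notation from the project: x_i is the i-th standardised data point, mu_k the k-th centroid, c(i) the index of the centroid nearest to x_i, n the number of rows, K the number of clusters and d the number of features.

```latex
\text{inertia}
  = \sum_{i=1}^{n} \min_{1 \le k \le K} \lVert x_i - \mu_k \rVert^2
  = \sum_{i=1}^{n} \sum_{j=1}^{d} \left( x_{ij} - \mu_{c(i),j} \right)^2
```

Each feature contributes a non-negative term to the inner sum, so inertia values computed on different numbers of standardised features are not on exactly the same scale. This is worth keeping in mind when comparing feature sets by inertia alone.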

Step 4: Model Training
In this stage, an experiment is conducted to search for the optimal number of clusters for K-Means, using the elbow method. Details of the experiment are as follows:

  • Perform K-Means clustering with a varying number of clusters K.
  • Plot the distortion score (metric used: sum of squared distances, i.e. inertia) against the number of clusters and locate the “elbow” in the graph.
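
The sweep described above can be sketched as follows. This is a minimal pure-Python version of Lloyd's algorithm run on toy data and not the project's code. The function names are hypothetical, and a deterministic farthest-point initialisation is swapped in for reproducibility, since the README does not state which initialisation was used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest from chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans_inertia(points, k, max_iterations=100):
    """Run Lloyd's algorithm and return the final inertia (sum of squared distances)."""
    centroids = farthest_point_init(points, k)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    return sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))


# Toy data: three tight groups of four points each (made up for illustration).
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5)]
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx, dy in offsets]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 6)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

On this toy data the inertia drops sharply up to K = 3 and only slightly afterwards, which is the "elbow" the method looks for. In the project the same kind of curve is read from the distortion plots.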

Step 5: Model Evaluation
Since there is no ground truth against which to check the cluster assignments, principal component analysis (PCA) is applied to reduce the input features to two dimensions, so that the clusters can be visualised in a scatterplot and checked for separation. In the plot shown below, the clusters appear distinguishable, which suggests that the clustering quality is good.

[Figure: cluster]
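
The 2D projection used for this visual check can be sketched as below. This is an illustration only: it computes the two leading principal axes with power iteration and deflation instead of a library PCA routine, all function names are hypothetical, and the input rows are made-up values. In the project the input would be the standardised clustering features.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def centre(rows):
    """Subtract the column means from every row."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centred):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(centred), len(centred[0])
    return [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]


def leading_eigenpair(matrix, iterations=1000, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    vector = [rng.uniform(0.5, 1.0) for _ in matrix]
    norm = math.sqrt(dot(vector, vector))
    vector = [x / norm for x in vector]
    for _ in range(iterations):
        product = [dot(row, vector) for row in matrix]
        norm = math.sqrt(dot(product, product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    eigenvalue = dot(vector, [dot(row, vector) for row in matrix])
    return eigenvalue, vector


def pca_2d(rows):
    """Project rows onto their first two principal axes."""
    centred = centre(rows)
    cov = covariance(centred)
    value1, axis1 = leading_eigenpair(cov)
    # Deflation: remove the first component so power iteration finds the second.
    d = len(cov)
    deflated = [
        [cov[a][b] - value1 * axis1[a] * axis1[b] for b in range(d)]
        for a in range(d)
    ]
    _, axis2 = leading_eigenpair(deflated, seed=1)
    return [(dot(row, axis1), dot(row, axis2)) for row in centred]


# Made-up rows with three features, for illustration only.
features = [
    [1.0, 2.0, 0.0],
    [2.0, 4.1, 1.0],
    [3.0, 5.9, 0.0],
    [4.0, 8.2, 1.0],
    [5.0, 9.8, 0.0],
    [6.0, 12.1, 1.0],
]
projected = pca_2d(features)
for x, y in projected:
    print(round(x, 3), round(y, 3))
```

Plotting the two projected coordinates and colouring each point by its cluster label gives the kind of scatterplot referred to above. A 2D projection can hide or exaggerate separation, so it is a visual sanity check and not a formal quality metric.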

Result

Table below shows the summary statistics for the clusters:

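| Cluster No. | Price | Stock Level | Shipping Cost | Popularity Index |
| --- | --- | --- | --- | --- |
| 0 | 1380.14 | 239.87 | 13.99 | 76.83 |
| 1 | 464.83 | 387.98 | 24.09 | 52.70 |
| 2 | 1508.33 | 331.34 | 35.63 | 36.39 |
| 3 | 734.45 | 123.38 | 37.15 | 60.52 |
| 4 | 969.26 | 162.06 | 13.44 | 23.19 |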
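
A table of this shape can be produced by grouping the rows by cluster label and averaging each feature. The sketch below assumes that the summary statistics are per-cluster means taken on the original (unscaled) values, which the README does not state explicitly. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict


def summarise_clusters(rows, labels, columns):
    """Mean of each column per cluster label, rounded to 2 decimals."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    summary = {}
    for label, members in sorted(grouped.items()):
        summary[label] = {
            column: round(sum(m[column] for m in members) / len(members), 2)
            for column in columns
        }
    return summary


# Made-up rows and labels for illustration only.
rows = [
    {"Price": 100.0, "Stock Level": 10},
    {"Price": 200.0, "Stock Level": 30},
    {"Price": 50.0, "Stock Level": 400},
]
labels = [0, 0, 1]
summary = summarise_clusters(rows, labels, ["Price", "Stock Level"])
print(summary)
```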

From the summary statistics table, we can characterise the clusters as follows:

Cluster No. Product Characteristics
Cluster 0
  • Mid-high price point
  • Low-mid shipping cost
  • Highest popularity index
  • Mid stock level
  • Can be premium and popular products that are optimised for shipping
Cluster 1
  • Lowest price point
  • Mid shipping cost
  • Mid popularity index
  • Highest stock level
  • Can be budget-friendly, high-inventory products
Cluster 2
  • Highest price point
  • Mid-high shipping cost
  • Low-mid popularity index
  • Mid-high stock level
  • Represents high-end, expensive products with lower popularity
Cluster 3
  • Low-mid price point
  • Highest shipping cost
  • Mid-high popularity index
  • Lowest stock level
  • Represents popular mid-range products with higher shipping costs
Cluster 4
  • Mid-range price point
  • Lowest shipping cost
  • Lowest popularity index
  • Low-mid stock level
  • Indicates mid-range niche products
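
The relative labels above (highest, lowest, mid) follow from ranking the clusters on each feature. The short sketch below reproduces that ranking from the summary statistics reported in this README. The helper `rank_clusters` is a hypothetical name introduced for illustration.

```python
# Cluster means copied from the summary statistics table in this README.
table = {
    0: {"Price": 1380.14, "Stock Level": 239.87, "Shipping Cost": 13.99, "Popularity Index": 76.83},
    1: {"Price": 464.83, "Stock Level": 387.98, "Shipping Cost": 24.09, "Popularity Index": 52.70},
    2: {"Price": 1508.33, "Stock Level": 331.34, "Shipping Cost": 35.63, "Popularity Index": 36.39},
    3: {"Price": 734.45, "Stock Level": 123.38, "Shipping Cost": 37.15, "Popularity Index": 60.52},
    4: {"Price": 969.26, "Stock Level": 162.06, "Shipping Cost": 13.44, "Popularity Index": 23.19},
}


def rank_clusters(table, feature):
    """Cluster numbers ordered from the highest to the lowest mean of `feature`."""
    return sorted(table, key=lambda cluster: table[cluster][feature], reverse=True)


features = ["Price", "Stock Level", "Shipping Cost", "Popularity Index"]
rankings = {feature: rank_clusters(table, feature) for feature in features}
for feature, order in rankings.items():
    print(feature, order)
```

The first entry of each list is the cluster labelled "highest" for that feature and the last entry is the one labelled "lowest".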

Based on the clusters, it can be deduced that clusters 0 and 2 represent high-value, premium products, clusters 3 and 4 are mid-range products, and cluster 1 contains lower-value products.

Specific marketing and pricing strategies can be optimised for each cluster, for example:

Cluster Marketing & Pricing Strategy
Cluster 0
  • Implement flash sales during peak seasons to boost sales
  • Keep prices stable but offer exclusive deals to loyal customers
Cluster 1
  • Create bulk-buy sales events or volume-based promotional campaigns to clear some stock
  • Apply dynamic pricing based on inventory levels
Cluster 2
  • Develop targeted advertising for individuals with high purchasing power
  • Apply seasonal pricing adjustments to boost stock clearance
Cluster 3
  • Focus on fast delivery options despite high shipping costs
  • Develop waitlist systems for out-of-stock items
  • Offer free shipping thresholds to offset high shipping costs
Cluster 4
  • Develop targeted advertising through specialised channels or communities
  • Offer package deals with related niche products
