About:
This project performs clustering on products according to their product characteristics, sales trends and shipping details, in order to optimise stock levels and enable targeted marketing and pricing strategies.
Details of the dataset selected for this application:
Name of dataset: E-Commerce Dataset
Source: Kaggle
(https://www.kaggle.com/datasets/malaiarasugraj/e-commerce-dataset)
Instances: 1,000,000 rows, 16 columns
Description of dataset: Based on the information given at the source, the selected dataset consists of 1,000,000 rows, where each row corresponds to a simulated snapshot of e-commerce operations, including information on products, customers, pricing, sales trends and shipping details.
| Variable | Description | Data Type | Missing Value |
|---|---|---|---|
| Product ID | Unique identifier for each product. | String | no |
| Product Name | The name or title of the product listed in the catalog. | Categorical | no |
| Category | The category or type of the product (e.g., Electronics, Clothing, Home Decor). | Categorical | no |
| Price | The price of the product in USD. | Float | no |
| Discount | The discount applied to the product as a percentage of the original price. | Integer | no |
| Tax Rate | The applicable tax rate for the product as a percentage. | Integer | no |
| Stock Level | The number of units currently available in inventory. | Integer | no |
| Supplier ID | A unique identifier for the supplier of the product. | String | no |
| Customer Age Group | The age group of customers who frequently purchase this product (e.g., Teens, Adults, Seniors). | Categorical | no |
| Customer Location | The geographical location of customers (e.g., Country, State, or City). | Categorical | no |
| Customer Gender | The gender(s) of customers most likely to purchase this product (e.g., Male, Female, Both). | Categorical | no |
| Shipping Cost | The cost of shipping the product in USD. | Float | no |
| Shipping Method | The method of shipping used (e.g., Standard, Express, Overnight). | Categorical | no |
| Return Rate | The percentage of orders for this product that are returned by customers. | Float | no |
| Seasonality | The season(s) during which the product is most popular (e.g., Winter, Summer, All-Year). | Categorical | no |
| Popularity Index | A score indicating the product's popularity on a scale of 0 to 100. | Integer | no |
The objective is to cluster products according to their product characteristics, sales trends and shipping details, in order to optimise stock levels and enable targeted marketing and pricing strategies.
Model used: K-Means
- Scales well with large datasets: The dataset used in this study is large, with 1,000,000 rows and 5 candidate features. To keep the study feasible under limited computing power, K-Means is selected.
- No outliers: A drawback of K-Means is its sensitivity to outliers; fortunately, no outliers are observed in the features used for clustering in this dataset.
DBSCAN is not selected in this study because its time complexity is high on large datasets, and it is sensitive to its hyperparameter values, which requires more computational effort for fine-tuning.
Parameters used: The two key parameters are selected based on an experiment to determine the best number of clusters for the dataset, using inertia as the criterion.
- K (number of clusters): 5
- Metric: Euclidean (commonly used, and chosen for simplicity)
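A minimal sketch of fitting K-Means with these parameters is shown below. It uses synthetic random data as a stand-in for the scaled product features (the real dataset is not bundled here), and scikit-learn's `KMeans`, which uses Euclidean distance by default:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled product features
# (the real dataset has 1,000,000 rows; 10,000 here for brevity).
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 4))

# K-Means with the parameters chosen in this study: K=5,
# Euclidean distance (scikit-learn's default for KMeans).
km = KMeans(n_clusters=5, n_init=10, random_state=42)
labels = km.fit_predict(X)

print(labels.shape)   # one cluster label per product
print(km.inertia_)    # sum of squared distances to the centroids
```

On the real data, `X` would be the standardised feature matrix rather than random noise.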
Clustering Steps
These are the steps performed in the clustering pipeline:
Step 1: Data Cleaning
The following sanity checks have been performed on the data:
- Missing values: no missing values are found.
- Inconsistent values: no inconsistencies are found in the categorical variables.
- Extreme values: no abnormally extreme values are found in the numerical variables.
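These checks can be sketched with pandas as below. The tiny DataFrame is a hypothetical stand-in; the column names follow the dataset description above:

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "Price": [19.99, 5.50, 120.00],
    "Category": ["Electronics", "Clothing", "Home Decor"],
    "Stock Level": [10, 250, 3],
})

# Missing values: count of NaNs per column.
print(df.isna().sum())

# Inconsistent values: inspect the unique categorical labels.
print(df["Category"].unique())

# Extreme values: summary statistics of the numerical variables.
print(df[["Price", "Stock Level"]].describe())
```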
Step 2: Feature Engineering (Data Encoding)
In this step, the numerical variables used for clustering are standardised: the mean is removed and the data is scaled to unit variance so that all features are on a similar scale. This avoids biasing distance-based algorithms such as K-Means towards features with large magnitudes.
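Standardisation of this kind is typically done with scikit-learn's `StandardScaler`; a sketch with synthetic stand-in features on very different scales (e.g. Price in USD vs Popularity Index on 0-100):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in numerical features on very different scales.
X = np.column_stack([
    rng.uniform(1, 2000, 1000),   # e.g. Price
    rng.uniform(0, 100, 1000),    # e.g. Popularity Index
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After standardisation each feature has ~zero mean and unit variance.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```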
Step 3: Feature Selection
A quick experiment is conducted to select the features by checking the overall distortion score (inertia) of the model trained on each candidate feature set. The features below are initially selected because they relate directly to the product, based on the following hypotheses:
| Features | Justification |
|---|---|
| Price | Reflects the product's affordability. |
| Discount | Reflects promotional activity and attractiveness to price-sensitive customers. |
| Stock Level | Reflects product availability and inventory trends (demand). |
| Shipping Cost | Reflects customer satisfaction and product profitability (a higher shipping cost may indicate larger or premium products). |
| Popularity Index | Reflects customer preferences and product demand. |
Based on the plot above, using the input features ['Price', 'Discount', 'Stock Level', 'Shipping Cost', 'Popularity Index'], the model gives an overall inertia of 3161661, with the best number of clusters at 5.
By excluding 'Discount', the inertia improves. Therefore, the final input features used for clustering are ['Price', 'Stock Level', 'Shipping Cost', 'Popularity Index'].
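The feature-set comparison can be sketched as below. The DataFrame is synthetic stand-in data with the same column names and rough value ranges as the dataset description, so the inertia values themselves are illustrative only:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-in for the product table (5,000 rows for brevity).
df = pd.DataFrame({
    "Price": rng.uniform(1, 2000, 5000),
    "Discount": rng.integers(0, 50, 5000),
    "Stock Level": rng.integers(0, 500, 5000),
    "Shipping Cost": rng.uniform(2, 40, 5000),
    "Popularity Index": rng.integers(0, 101, 5000),
})

def inertia_for(features, k=5):
    """Standardise the chosen features, fit K-Means, return inertia."""
    X = StandardScaler().fit_transform(df[features])
    return KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_

all_feats = ["Price", "Discount", "Stock Level", "Shipping Cost", "Popularity Index"]
without_discount = [f for f in all_feats if f != "Discount"]

i_all = inertia_for(all_feats)
i_reduced = inertia_for(without_discount)
print(i_all, i_reduced)
```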
Step 4: Model Training
In this stage, an experiment is conducted to search for the optimal number of clusters for K-Means, using the elbow method. Details of the experiment:
- Perform K-Means clustering with a varying number of clusters K.
- Plot the distortion score (metric used: sum of squared distances, i.e. inertia) against the number of clusters, and locate the "elbow" in the graph.
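The elbow search above can be sketched as follows. Synthetic, well-separated blobs are used so that a clear elbow appears at the true cluster count (here 3), which is the pattern one looks for in the real plot:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# Three well-separated blobs so an elbow appears near k=3.
X = np.concatenate([
    rng.normal(loc=c, scale=0.3, size=(300, 2)) for c in (0, 5, 10)
])

inertias = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 1))
# Inertia drops sharply up to the true cluster count, then flattens;
# the "elbow" marks the chosen K.
```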
Step 5: Model Evaluation
Since there is no ground-truth labelling against which to check the cluster assignments, principal component analysis is applied to reduce the input features to two dimensions so that they can be visualised in a scatterplot and inspected for distinct clusters. Based on the plot shown below, the clusters appear distinguishable; hence, the cluster quality is good.
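This PCA-based check can be sketched as below, again on synthetic separated blobs standing in for the four scaled input features; the actual scatterplot call is left as a comment so the sketch runs headless:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Synthetic stand-in for the 4 scaled input features: 5 separated groups.
X = np.concatenate([
    rng.normal(loc=m, scale=0.5, size=(500, 4))
    for m in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0],
              [0, 0, 3, 0], [0, 0, 0, 3])
])

labels = KMeans(n_clusters=5, n_init=10, random_state=3).fit_predict(X)

# Project to 2D for visual inspection of cluster separation.
X2 = PCA(n_components=2).fit_transform(X)
print(X2.shape)

# A scatterplot would then colour points by cluster label, e.g.:
# import matplotlib.pyplot as plt
# plt.scatter(X2[:, 0], X2[:, 1], c=labels, s=2); plt.show()
```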
Table below shows the summary statistics for the clusters:
| Cluster No. | Price (USD) | Stock Level | Shipping Cost (USD) | Popularity Index |
|---|---|---|---|---|
| 0 | 1380.14 | 239.87 | 13.99 | 76.83 |
| 1 | 464.83 | 387.98 | 24.09 | 52.70 |
| 2 | 1508.33 | 331.34 | 35.63 | 36.39 |
| 3 | 734.45 | 123.38 | 37.15 | 60.52 |
| 4 | 969.26 | 162.06 | 13.44 | 23.19 |
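A per-cluster summary of this kind is typically produced with a pandas groupby over the cluster labels; the sketch below uses synthetic stand-in data, so its numbers will differ from the table above:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# Synthetic stand-in for the final input features.
df = pd.DataFrame({
    "Price": rng.uniform(1, 2000, 3000),
    "Stock Level": rng.integers(0, 500, 3000),
    "Shipping Cost": rng.uniform(2, 40, 3000),
    "Popularity Index": rng.integers(0, 101, 3000),
})

# Cluster on the standardised features, then summarise in original units.
X = StandardScaler().fit_transform(df)
df["Cluster"] = KMeans(n_clusters=5, n_init=10, random_state=5).fit_predict(X)

# Mean of each feature per cluster, in the original (unscaled) units.
summary = df.groupby("Cluster").mean().round(2)
print(summary)
```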
From the summary statistics table, we can characterise the clusters as follows:
| Cluster No. | Product Characteristics |
|---|---|
| Cluster 0 | High price, low shipping cost, very high popularity. |
| Cluster 1 | Lowest price, highest stock level, moderate popularity. |
| Cluster 2 | Highest price, high stock level, high shipping cost, low popularity. |
| Cluster 3 | Mid-range price, lowest stock level, highest shipping cost, fairly high popularity. |
| Cluster 4 | Mid-to-high price, low stock level, lowest shipping cost, lowest popularity. |
Based on the clusters, it can be deduced that clusters 0 and 2 represent high-value and premium products, clusters 3 and 4 represent mid-range products, while cluster 1 represents lower-value products.
Specific marketing and pricing strategies can then be tailored to each cluster, for example:
| Cluster | Marketing & Pricing Strategy |
|---|---|
| Cluster 0 | Popular high-value products: maintain premium pricing, highlight the low shipping cost, and cross-sell to engaged customers. |
| Cluster 1 | Low-value, heavily stocked products: run volume discounts and bundle promotions to move the high stock level. |
| Cluster 2 | Expensive but unpopular products: reassess pricing, consider free or discounted shipping, and target niche premium segments. |
| Cluster 3 | Popular products with low stock: replenish inventory promptly, and consider modest price increases given strong demand. |
| Cluster 4 | Low-popularity mid-range products: use targeted promotions and markdowns to stimulate demand before stock ages. |


