Real-time E-commerce Data Pipeline: Kafka to SQL Server

A robust end-to-end data engineering project that simulates a real-time e-commerce order processing system. It demonstrates advanced relational database management (RDBMS) techniques, message queuing with Apache Kafka, and data visualization with Power BI.
I. Key Features
- Real-time Data Streaming: Utilizes Apache Kafka (KRaft mode) as a message broker to handle high-velocity order streams.
- Advanced SQL Server Logic:
  - Stored Procedures: centralized business logic for parsing JSON payloads and managing the order flow.
  - ACID Transactions: BEGIN TRANSACTION, COMMIT, and ROLLBACK guarantee data integrity (no overselling, no orphan records).
  - Find-or-Create Logic: during order processing, users are looked up and inserted automatically if they do not yet exist, keeping user data normalized.
  - Triggers: stock changes are automatically audit-logged into inventory_logs.
- 3NF Database Design: a fully normalized schema of 7 tables that minimizes data redundancy.
- Interactive Analytics: A comprehensive Power BI Dashboard connected via SQL Views for real-time business insights.
II. System Architecture

The pipeline follows a modern data stack architecture:
- Producer (Python): Simulates a storefront generating random orders (JSON format) every 4 seconds.
- Kafka Broker: Buffers and queues orders in the ecommerce_orders topic.
- Consumer (Python): Listens to Kafka and invokes SQL Stored Procedures via pyodbc.
- SQL Server (The Brain): Executes business logic, maintains integrity, and stores data.
- Power BI: Visualizes sales performance, customer demographics, and shipping efficiency.
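The producer step above can be sketched as follows. This is a minimal illustration, not the project's actual `python/producer.py`: the order fields, topic payload, and `generate_order` helper are assumptions (the real producer uses `faker` for richer user data).

```python
import json
import random
import time

def generate_order(order_id):
    """Build one random order as a JSON-serializable dict (illustrative fields)."""
    return {
        "order_id": order_id,
        "user_name": random.choice(["alice", "bob", "carol"]),
        "product_id": random.randint(1, 50),
        "quantity": random.randint(1, 5),
        "order_ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }

if __name__ == "__main__":
    # Requires a running Kafka broker and the kafka-python package.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    order_id = 1
    while True:
        producer.send("ecommerce_orders", generate_order(order_id))
        order_id += 1
        time.sleep(4)  # one order every 4 seconds, as described above
```

Keeping order generation in a pure function makes the simulated payload easy to unit-test without a broker.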
III. Database Schema (3NF)

The database is designed following Third Normal Form (3NF) principles to ensure scalability and integrity. Core tables:
- orders & order_details: Fact tables storing transaction data.
- products, categories, users, shippers: Dimension tables for detailed attributes.
- inventory_logs: Audit table populated by SQL Triggers.
IV. Tech Stack
- Database: Microsoft SQL Server (T-SQL)
- Streaming: Apache Kafka (KRaft mode)
- Language: Python 3.x (kafka-python, pyodbc, faker)
- BI Tool: Microsoft Power BI
V. Getting Started
- Database Setup: Execute /sql/init_database.sql in SQL Server Management Studio (SSMS). This creates the database, tables, stored procedures, and triggers.
- Kafka Configuration: With Kafka running in KRaft mode, create the required topic:

```shell
.\bin\windows\kafka-topics.bat --create --topic ecommerce_orders --bootstrap-server localhost:9092
```
- Run the Pipeline: Start the consumer first so it is already listening, then start the producer:

```shell
# terminal 1: listen for incoming orders
python python/consumer.py

# terminal 2: generate orders
python python/producer.py
```
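The consumer side can be sketched like this. It is a hedged outline, not the project's actual `python/consumer.py`: the stored-procedure name `usp_process_order`, the connection string, and the `parse_order` / `process_message` helpers are assumptions for illustration.

```python
import json

# Assumed stored-procedure name; the real one lives in /sql/init_database.sql.
PROC_CALL = "{CALL usp_process_order (?)}"

def parse_order(raw_bytes):
    """Decode one Kafka message payload (UTF-8 JSON) into an order dict."""
    return json.loads(raw_bytes.decode("utf-8"))

def process_message(conn, raw_bytes):
    """Hand the raw JSON to SQL Server; the stored procedure parses it and
    enforces the ACID commit/rollback logic described above."""
    cursor = conn.cursor()
    cursor.execute(PROC_CALL, raw_bytes.decode("utf-8"))
    conn.commit()

if __name__ == "__main__":
    # Requires Kafka, SQL Server, and the kafka-python + pyodbc packages.
    import pyodbc
    from kafka import KafkaConsumer

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=ecommerce;Trusted_Connection=yes;"
    )
    consumer = KafkaConsumer("ecommerce_orders",
                             bootstrap_servers="localhost:9092")
    for message in consumer:
        process_message(conn, message.value)
```

Passing the raw JSON string through to the stored procedure keeps all parsing and integrity logic inside SQL Server, matching the "SQL Server as the brain" design above.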
VI. Visualizations

The Power BI dashboard provides real-time insights into:
- Revenue Analysis: Sales trends and top-performing categories.
- Customer Demographics: Order distribution by age, gender, and occupation.
- Logistics: Shipping provider performance and usage.
VII. Future Improvements
- Implement indexing for query optimization on large datasets.
- Add window functions for advanced analytics (e.g., Top 3 products per category).
- Integrate Row-Level Security (RLS) for multi-tenant reporting.
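The "Top 3 products per category" idea is a classic ROW_NUMBER() window-function query. The sketch below uses SQLite purely so it runs standalone; the query body (with the subquery alias) works the same way in T-SQL. Table and column names are invented for illustration.

```python
import sqlite3

# In-memory sample data standing in for the real orders/products tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_sales (category TEXT, product TEXT, revenue REAL);
    INSERT INTO product_sales VALUES
        ('Books', 'SQL Primer', 120), ('Books', 'Kafka Guide', 300),
        ('Books', 'BI Basics', 90),   ('Books', 'Python 101', 40),
        ('Toys',  'Robot Kit', 500),  ('Toys',  'Puzzle', 150);
""")

# Rank products within each category by revenue, keep the top 3.
TOP3_PER_CATEGORY = """
    SELECT category, product, revenue
    FROM (
        SELECT category, product, revenue,
               ROW_NUMBER() OVER (
                   PARTITION BY category ORDER BY revenue DESC
               ) AS rn
        FROM product_sales
    ) AS ranked
    WHERE rn <= 3
    ORDER BY category, revenue DESC;
"""

rows = conn.execute(TOP3_PER_CATEGORY).fetchall()
# 'Books' keeps its top 3 of 4 products; 'Toys' has only 2, so both appear.
```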
Author: Đỗ Minh Hoàng
Project: Database Management System Final Project