Hung4865/H-CSDL

Real-time E-commerce Data Pipeline: Kafka to SQL Server

A robust end-to-end data engineering project that simulates a real-time e-commerce order processing system. It demonstrates relational database management (RDBMS) techniques in SQL Server, message queuing with Apache Kafka, and data visualization with Power BI.

I. Key Features

  • Real-time Data Streaming: Utilizes Apache Kafka (KRaft mode) as a message broker to handle high-velocity order streams.
  • Advanced SQL Server Logic:
    • Stored Procedures: Centralized business logic for parsing JSON data and managing order flow.
    • ACID Transactions: Implements BEGIN TRANSACTION, COMMIT, and ROLLBACK to ensure data integrity (no overselling, no orphan records).
    • Find-or-Create Logic: Automated user normalization during the order process.
    • Triggers: Automatic audit logging of stock changes into inventory_logs.
  • 3NF Database Design: A fully normalized schema with 7 tables that eliminates redundant data.
  • Interactive Analytics: A comprehensive Power BI Dashboard connected via SQL Views for real-time business insights.
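The transactional order flow described above (find-or-create user, guarded stock decrement, all-or-nothing commit) can be sketched in portable form. This is a simplified illustration, not the project's actual T-SQL: the stdlib sqlite3 module stands in for SQL Server, and the users/products/orders columns are pared down from the real 7-table schema.

```python
import sqlite3

def place_order(conn: sqlite3.Connection, email: str, product_id: int, qty: int) -> int:
    """Find-or-create the user, decrement stock, and record the order in one
    atomic transaction. Assumes the connection was opened with
    isolation_level=None so BEGIN/commit/rollback are fully explicit."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        # Find-or-create: normalize the user by email before ordering.
        row = cur.execute("SELECT id FROM users WHERE email = ?", (email,)).fetchone()
        if row is None:
            cur.execute("INSERT INTO users (email) VALUES (?)", (email,))
            user_id = cur.lastrowid
        else:
            user_id = row[0]
        # Guarded stock decrement: the WHERE clause refuses to oversell.
        cur.execute(
            "UPDATE products SET stock = stock - ? WHERE id = ? AND stock >= ?",
            (qty, product_id, qty),
        )
        if cur.rowcount == 0:
            raise ValueError("insufficient stock")
        cur.execute(
            "INSERT INTO orders (user_id, product_id, quantity) VALUES (?, ?, ?)",
            (user_id, product_id, qty),
        )
        order_id = cur.lastrowid
        conn.commit()    # all-or-nothing: order row and stock change land together
        return order_id
    except Exception:
        conn.rollback()  # no orphan order, no negative stock
        raise
```

The same pattern maps directly onto the project's stored procedure, with T-SQL's BEGIN TRANSACTION / COMMIT / ROLLBACK in place of the sqlite3 calls.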

II. System Architecture

The pipeline follows a modern data stack architecture:

  1. Producer (Python): Simulates a storefront generating random orders (JSON format) every 4 seconds.
  2. Kafka Broker: Buffers and queues orders in the ecommerce_orders topic.
  3. Consumer (Python): Listens to Kafka and invokes SQL Stored Procedures via pyodbc.
  4. SQL Server (The Brain): Executes business logic, maintains integrity, and stores data.
  5. Power BI: Visualizes sales performance, customer demographics, and shipping efficiency.
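The producer side of the pipeline above can be sketched as follows. This is a minimal sketch, not the actual python/producer.py: the order fields (order_id, user_email, product_id, quantity, ordered_at) are assumed for illustration and may differ from the project's real JSON schema.

```python
import json
import random
import time
import uuid
from datetime import datetime, timezone

# Hypothetical catalogue -- the real producer draws products from the database.
PRODUCT_IDS = list(range(1, 21))

def generate_order() -> dict:
    """Build one random order in the JSON shape the consumer expects."""
    return {
        "order_id": str(uuid.uuid4()),
        "user_email": f"user{random.randint(1, 500)}@example.com",
        "product_id": random.choice(PRODUCT_IDS),
        "quantity": random.randint(1, 5),
        "ordered_at": datetime.now(timezone.utc).isoformat(),
    }

def run_producer() -> None:
    """Publish a new order to the ecommerce_orders topic every 4 seconds."""
    from kafka import KafkaProducer  # kafka-python; imported lazily

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    while True:
        producer.send("ecommerce_orders", generate_order())
        time.sleep(4)

if __name__ == "__main__":
    run_producer()
```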

III. Database Schema (3NF)

The database is designed following Third Normal Form (3NF) principles to ensure scalability and integrity. Core tables:

  • orders & order_details: Fact tables storing transaction data.
  • products, categories, users, shippers: Dimension tables for detailed attributes.
  • inventory_logs: Audit table populated by SQL Triggers.

IV. Tech Stack

  • Database: Microsoft SQL Server (T-SQL)
  • Streaming: Apache Kafka (KRaft mode)
  • Language: Python 3.x (kafka-python, pyodbc, faker)
  • BI Tool: Microsoft Power BI

V. Getting Started

  1. Database Setup: Execute /sql/init_database.sql using SQL Server Management Studio (SSMS). This creates the database, tables, stored procedures, and triggers.
  2. Kafka Configuration: Ensure Kafka is running in KRaft mode and create the required topic:
     .\bin\windows\kafka-topics.bat --create --topic ecommerce_orders --bootstrap-server localhost:9092
  3. Run the Pipeline: Start the consumer first so it is listening for data:
     python python/consumer.py
     Then start the producer to generate orders:
     python python/producer.py
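The consumer in step 3 can be sketched as below. This is a minimal sketch, not the actual python/consumer.py: the connection string and the dbo.sp_process_order procedure name are hypothetical placeholders for the objects created by /sql/init_database.sql.

```python
import json

# Hypothetical connection string -- match your SQL Server instance.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=ecommerce;Trusted_Connection=yes;"
)

def decode_order(raw: bytes) -> dict:
    """Deserialize one Kafka message value back into an order dict."""
    return json.loads(raw.decode("utf-8"))

def run_consumer() -> None:
    # Imported lazily so decode_order stays usable without Kafka/SQL Server.
    from kafka import KafkaConsumer  # kafka-python
    import pyodbc

    consumer = KafkaConsumer(
        "ecommerce_orders",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
    )
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        for message in consumer:
            order = decode_order(message.value)
            # sp_process_order is a hypothetical name -- use the procedure
            # actually created by /sql/init_database.sql. It owns the
            # BEGIN TRANSACTION / COMMIT / ROLLBACK logic server-side.
            conn.execute(
                "EXEC dbo.sp_process_order @order_json = ?", json.dumps(order)
            )

if __name__ == "__main__":
    run_consumer()
```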

VI. Visualizations

The Power BI dashboard provides real-time insights into:

  • Revenue Analysis: Sales trends and top-performing categories.
  • Customer Demographics: Order distribution by age, gender, and occupation.
  • Logistics: Shipping provider performance and usage.

VII. Future Improvements

  • Implement indexing for query optimization on large datasets.
  • Add window functions for advanced analytics (e.g., Top 3 products per category).
  • Integrate Row-Level Security (RLS) for multi-tenant reporting.
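The planned Top-3-per-category analytic can be previewed with ROW_NUMBER(). As a sketch only, sqlite3 stands in for SQL Server and the simplified sales table is assumed, but the ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) pattern is identical in T-SQL.

```python
import sqlite3

# Window-function query: rank products inside each category by revenue,
# then keep only the top 3 rows per partition.
TOP3_SQL = """
SELECT category, product, revenue
FROM (
    SELECT category,
           product,
           revenue,
           ROW_NUMBER() OVER (
               PARTITION BY category ORDER BY revenue DESC
           ) AS rn
    FROM sales
)
WHERE rn <= 3
"""

def top3_per_category(conn: sqlite3.Connection) -> list:
    """Return (category, product, revenue) rows for the top 3 per category."""
    return conn.execute(TOP3_SQL).fetchall()
```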

Author: Đỗ Minh Hoàng Project: Database Management System Final Project
