KaggleMind: An Autonomous Agentic SQL Analyst

Overview

KaggleMind is an advanced AI-powered system designed to analyze the Meta Kaggle dataset. It leverages a multi-agent architecture to autonomously plan, generate, execute, and correct SQL queries, allowing users to ask complex natural language questions and receive accurate data visualizations.

Architecture

Data Tier

  • Ingestion:
    • src/ingestion/kaggle_downloader.py: Downloads specific Meta Kaggle tables (Users, Competitions, UserAchievements, etc.) via the Kaggle API.
    • Orchestration: Dagster (orchestration/dagster/repository.py) manages the daily sync pipeline.
  • Transformation:
    • src/processing/data_cleaner.py: Uses PySpark to clean raw CSVs and convert them into optimized Parquet files.
  • Warehouse:
    • DuckDB: Acts as a serverless OLAP engine to query Parquet files directly (supports local disk or remote S3/Supabase storage).
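Because DuckDB reads Parquet in place, queries run without a load step. A sketch of what such a query might look like (the file paths and column names are illustrative assumptions based on the Meta Kaggle tables, not taken from this repo):

```sql
-- Count achievements per user directly from the processed Parquet files.
SELECT u.UserName, COUNT(*) AS achievements
FROM read_parquet('data/processed/UserAchievements.parquet') a
JOIN read_parquet('data/processed/Users.parquet') u ON u.Id = a.UserId
GROUP BY u.UserName
ORDER BY achievements DESC
LIMIT 10;
```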

AI Tier (Agentic System)

  • Orchestration: LangGraph (src/agent/graph.py) manages the stateful workflow of the agent.
  • RAG (Retrieval-Augmented Generation):
    • src/agent/rag_retriever.py: Uses ChromaDB to store semantic descriptions of the schema. It retrieves only the relevant table schemas for a given user query, reducing context window usage and improving accuracy.
  • Self-Correction Loop: The agent executes the generated SQL against DuckDB. If an error occurs (e.g., syntax error, missing column), the error is fed back into the LLM to autonomously fix the query.
  • LLM: Powered by DeepSeek-V3 (via OpenAI-compatible API) for high-performance code generation.
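The self-correction loop above can be sketched in a few lines. This is an illustrative stand-in, not the project's code: `sqlite3` substitutes for DuckDB so the snippet is self-contained, and `fix_query` is a hypothetical placeholder for the LLM repair call.

```python
import sqlite3

def fix_query(bad_sql: str, error: str) -> str:
    # A real agent would send bad_sql plus the error message back to the LLM;
    # here we just patch a known typo to illustrate the loop.
    return bad_sql.replace("usrname", "username")

def run_with_correction(conn, sql: str, max_retries: int = 3):
    # Execute the query; on failure, feed the error back and retry.
    for _ in range(max_retries):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            sql = fix_query(sql, str(exc))
    raise RuntimeError("query could not be repaired")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT)")
conn.execute("INSERT INTO users VALUES ('grandmaster_1')")
# The deliberate 'usrname' typo fails once, gets repaired, then succeeds.
rows = run_with_correction(conn, "SELECT usrname FROM users")
print(rows)  # → [('grandmaster_1',)]
```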

Frontend

  • Streamlit: Provides an interactive chat interface (src/app.py).
  • Plotly: Automatically visualizes query results based on data types.
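Choosing a chart from the result set's data types can be done with a simple heuristic. `pick_chart` below is a hypothetical illustration of that idea, not the actual logic in src/app.py:

```python
def pick_chart(dtypes: dict) -> str:
    # Map column name -> dtype string into a Plotly chart type.
    numeric = [c for c, t in dtypes.items() if t in ("int", "float")]
    temporal = [c for c, t in dtypes.items() if t == "datetime"]
    if temporal and numeric:
        return "line"      # time axis + measure -> time series
    if len(numeric) >= 2:
        return "scatter"   # two numeric columns -> scatter plot
    if numeric:
        return "bar"       # one measure, categorical rest -> bar chart
    return "table"         # nothing numeric -> show raw rows

print(pick_chart({"date": "datetime", "posts": "int"}))  # → line
print(pick_chart({"tier": "str", "medals": "int"}))      # → bar
```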

Project Structure

data-platform/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ agent/              # AI Agent Logic
β”‚   β”‚   β”œβ”€β”€ graph.py        # LangGraph workflow definition
β”‚   β”‚   └── rag_retriever.py # ChromaDB schema retrieval
β”‚   β”œβ”€β”€ ingestion/          # Data Ingestion
β”‚   β”‚   └── kaggle_downloader.py
β”‚   β”œβ”€β”€ processing/         # Data Transformation
β”‚   β”‚   └── data_cleaner.py # PySpark ETL
β”‚   └── app.py              # Streamlit Frontend
β”œβ”€β”€ orchestration/
β”‚   └── dagster/            # Dagster Repository
β”‚       └── repository.py
β”œβ”€β”€ data/                   # Local data storage (raw/processed)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ docker-compose.yml
β”œβ”€β”€ Makefile
└── .pre-commit-config.yaml

Getting Started

Prerequisites

  • Python 3.9+
  • Java 17 (for PySpark)
  • Kaggle API credentials
  • DeepSeek API Key

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd data-platform
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Pre-commit Hooks:

    pre-commit install
  4. Set up environment variables:

    cp configs/.env.example .env

    Ensure KAGGLE_USERNAME, KAGGLE_KEY, and DEEPSEEK_API_KEY are set.

  5. Run the Data Pipeline: You can use the Makefile shortcuts:

    # Download specific tables
    make download
    
    # Process CSV to Parquet
    make process

    Or run the orchestrated pipeline with Dagster:

    make dagster

    Access the UI at http://localhost:3000.

  6. Run the Application:

    make app

    Enter your DeepSeek API Key in the sidebar to start analyzing.
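The .env file from step 4 only needs the three keys mentioned there. A minimal sketch, with placeholder values:

```shell
# Kaggle API credentials (from kaggle.com -> Settings -> Create New Token)
KAGGLE_USERNAME=your-kaggle-username
KAGGLE_KEY=xxxxxxxxxxxxxxxx
# DeepSeek key used by the agent's LLM calls
DEEPSEEK_API_KEY=your-deepseek-key
```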

Docker

Run the full stack including Dagster and Streamlit:

make docker-up

Development

  • Linting & Formatting: This project uses black, isort, and flake8.
    pre-commit run --all-files
  • Testing:
    make test
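The hooks behind `pre-commit run --all-files` would typically be declared in .pre-commit-config.yaml roughly as follows (the `rev` pins are illustrative, not the project's actual versions):

```yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0
    hooks:
      - id: black
  - repo: https://github.com/PyCQA/isort
    rev: 5.13.2
    hooks:
      - id: isort
  - repo: https://github.com/PyCQA/flake8
    rev: 7.0.0
    hooks:
      - id: flake8
```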

Roadmap

  • Data Ingestion & Processing
  • Basic Streamlit UI
  • LangGraph Multi-Agent System
  • Semantic Layer & Vector DB
  • FastAPI Backend

Features

  • Natural Language to SQL: Ask questions like "Which Grandmasters have the highest conversion rate from forum posts to gold medals?"
  • Schema-Aware RAG: The agent understands the specific schema of the Meta Kaggle dataset.
  • Auto-Correction: If the agent writes bad SQL, it fixes it automatically.
  • Visualizations: Dynamic charts generated from query results.

License

MIT
