KaggleMind is an advanced AI-powered system designed to analyze the Meta Kaggle dataset. It leverages a multi-agent architecture to autonomously plan, generate, execute, and correct SQL queries, allowing users to ask complex natural language questions and receive accurate data visualizations.
- Ingestion: `src/ingestion/kaggle_downloader.py` downloads specific Meta Kaggle tables (`Users`, `Competitions`, `UserAchievements`, etc.) via the Kaggle API.
- Orchestration: Dagster (`orchestration/dagster/repository.py`) manages the daily sync pipeline.
- Transformation: `src/processing/data_cleaner.py` uses PySpark to clean raw CSVs and convert them into optimized Parquet files.
- Warehouse: DuckDB acts as a serverless OLAP engine that queries the Parquet files directly (supports local disk or remote S3/Supabase storage).
- Agent Orchestration: LangGraph (`src/agent/graph.py`) manages the stateful workflow of the agent.
- RAG (Retrieval-Augmented Generation): `src/agent/rag_retriever.py` uses ChromaDB to store semantic descriptions of the schema and retrieves only the table schemas relevant to a given user query, reducing context-window usage and improving accuracy.
- Self-Correction Loop: The agent executes the generated SQL against DuckDB. If an error occurs (e.g., a syntax error or missing column), the error message is fed back to the LLM, which autonomously fixes the query.
- LLM: Powered by DeepSeek-V3 (via an OpenAI-compatible API) for high-performance code generation.
- Streamlit: Provides an interactive chat interface (`src/app.py`).
- Plotly: Automatically visualizes query results based on their data types.
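The schema-retrieval idea can be illustrated without a vector store. The sketch below uses simple keyword overlap in place of ChromaDB's embedding search; the table descriptions are invented examples and `retrieve_schemas` is not the actual API of `src/agent/rag_retriever.py` — it only shows why retrieving a few relevant schemas beats stuffing every table into the prompt.

```python
import re

# Toy "semantic" descriptions of a few Meta Kaggle tables (invented text).
SCHEMA_DOCS = {
    "Users": "kaggle users: Id, UserName, PerformanceTier, RegisterDate",
    "Competitions": "competitions: Id, Title, DeadlineDate, RewardQuantity",
    "ForumMessages": "forum posts: Id, PostUserId, Message, PostDate",
}

def tokens(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve_schemas(question: str, k: int = 2) -> list[str]:
    """Return the k table names whose description shares the most words
    with the question -- a stand-in for a ChromaDB similarity query."""
    q = tokens(question)
    ranked = sorted(
        SCHEMA_DOCS,
        key=lambda t: len(q & tokens(SCHEMA_DOCS[t])),
        reverse=True,
    )
    return ranked[:k]

print(retrieve_schemas("forum posts per users tier"))
# ['ForumMessages', 'Users'] -- only these schemas go into the prompt
```

Only the retrieved descriptions are injected into the LLM prompt, which is what keeps context-window usage low as the number of tables grows.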
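The self-correction loop can be sketched as follows. This is a simplified, dependency-free illustration: sqlite3 stands in for DuckDB, `fix_sql` is a stub for the LLM repair call, and the function names and retry limit are assumptions, not the project's actual implementation in `src/agent/graph.py`.

```python
import sqlite3

def fix_sql(bad_sql: str, error: str) -> str:
    """Stub for the LLM repair step: in KaggleMind the failing query and
    its error message are sent back to the model for a rewrite."""
    # Hypothetical fix: the "model" notices the misspelled column name.
    return bad_sql.replace("user_name", "UserName")

def run_with_self_correction(conn, sql: str, max_retries: int = 3):
    """Execute SQL; on failure, feed the error back and retry."""
    for _ in range(max_retries):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.OperationalError as exc:
            sql = fix_sql(sql, str(exc))  # error text drives the repair
    raise RuntimeError("query could not be repaired")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER, UserName TEXT)")
conn.execute("INSERT INTO Users VALUES (1, 'grandmaster')")

# The first attempt references a non-existent column and fails;
# the loop repairs the query and succeeds on the second try.
rows = run_with_self_correction(conn, "SELECT user_name FROM Users")
print(rows)  # [('grandmaster',)]
```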
```
data-platform/
├── src/
│   ├── agent/                  # AI Agent Logic
│   │   ├── graph.py            # LangGraph workflow definition
│   │   └── rag_retriever.py    # ChromaDB schema retrieval
│   ├── ingestion/              # Data Ingestion
│   │   └── kaggle_downloader.py
│   ├── processing/             # Data Transformation
│   │   └── data_cleaner.py     # PySpark ETL
│   └── app.py                  # Streamlit Frontend
├── orchestration/
│   └── dagster/                # Dagster Repository
│       └── repository.py
├── data/                       # Local data storage (raw/processed)
├── requirements.txt
├── docker-compose.yml
├── Makefile
└── .pre-commit-config.yaml
```
- Python 3.9+
- Java 17 (for PySpark)
- Kaggle API credentials
- DeepSeek API Key
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd data-platform
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install pre-commit hooks:

  ```bash
  pre-commit install
  ```

- Set up environment variables:

  ```bash
  cp configs/.env.example .env
  ```

  Ensure `KAGGLE_USERNAME`, `KAGGLE_KEY`, and `DEEPSEEK_API_KEY` are set.

- Run the data pipeline. You can use the Makefile shortcuts:

  ```bash
  # Download specific tables
  make download
  # Process CSV to Parquet
  make process
  ```

  or orchestrate with Dagster:

  ```bash
  make dagster
  ```

  Access the UI at http://localhost:3000.

- Run the application:

  ```bash
  make app
  ```

  Enter your DeepSeek API Key in the sidebar to start analyzing.
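For reference, the environment file created during setup might look like the following. The values are placeholders only; the real file is derived from `configs/.env.example`.

```
KAGGLE_USERNAME=your-kaggle-username
KAGGLE_KEY=your-kaggle-api-key
DEEPSEEK_API_KEY=your-deepseek-api-key
```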
Run the full stack, including Dagster and Streamlit:

```bash
make docker-up
```

- Linting & Formatting: This project uses `black`, `isort`, and `flake8`:

  ```bash
  pre-commit run --all-files
  ```

- Testing:

  ```bash
  make test
  ```
- Data Ingestion & Processing
- Basic Streamlit UI
- LangGraph Multi-Agent System
- Semantic Layer & Vector DB
- FastAPI Backend
- Natural Language to SQL: Ask questions like "Which Grandmasters have the highest conversion rate from forum posts to gold medals?"
- Schema-Aware RAG: The agent understands the specific schema of the Meta Kaggle dataset.
- Auto-Correction: If the agent writes bad SQL, it fixes it automatically.
- Visualizations: Dynamic charts generated from query results.
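The dtype-driven chart choice can be sketched as a small dispatch function. This is illustrative only — the heuristic, the function name, and the column-kind labels are assumptions, not the app's actual logic — but it shows the idea of mapping a result set's column types to a Plotly Express chart.

```python
def choose_chart(col_types: dict[str, str]) -> str:
    """Pick a chart type from column kinds ('numeric', 'categorical',
    'datetime'). A toy version of dtype-driven chart selection."""
    kinds = list(col_types.values())
    if "datetime" in kinds and "numeric" in kinds:
        return "line"       # time series -> px.line
    if kinds.count("numeric") >= 2:
        return "scatter"    # two measures -> px.scatter
    if "categorical" in kinds and "numeric" in kinds:
        return "bar"        # category vs measure -> px.bar
    return "table"          # fall back to a plain table

print(choose_chart({"RegisterDate": "datetime", "Signups": "numeric"}))  # line
print(choose_chart({"Tier": "categorical", "Medals": "numeric"}))        # bar
```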
MIT