KaggleMind is an advanced AI-powered system designed to analyze the Meta Kaggle dataset. It leverages a multi-agent architecture to autonomously plan, generate, execute, and correct SQL queries, allowing users to ask complex natural language questions and receive accurate data visualizations.
- Ingestion: `src/ingestion/kaggle_downloader.py` downloads specific Meta Kaggle tables (`Users`, `Competitions`, `UserAchievements`, etc.) via the Kaggle API.
- Orchestration: Dagster (`orchestration/dagster/repository.py`) manages the daily sync pipeline.
- Transformation: `src/processing/data_cleaner.py` uses PySpark to clean raw CSVs and convert them into optimized Parquet files.
- Warehouse: DuckDB acts as a serverless OLAP engine that queries the Parquet files directly (supports local disk or remote S3/Supabase storage).
- Agent Orchestration: LangGraph (`src/agent/graph.py`) manages the stateful workflow of the agent.
- RAG (Retrieval-Augmented Generation): `src/agent/rag_retriever.py` uses ChromaDB to store semantic descriptions of the schema and retrieves only the table schemas relevant to a given user query, reducing context-window usage and improving accuracy.
- Self-Correction Loop: The agent executes the generated SQL against DuckDB. If an error occurs (e.g., a syntax error or missing column), the error message is fed back to the LLM, which autonomously fixes the query.
- LLM: Powered by DeepSeek-V3 (via an OpenAI-compatible API) for high-performance code generation.
- Streamlit: Provides an interactive chat interface (`src/app.py`).
- Plotly: Automatically visualizes query results based on their data types.
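The schema-retrieval idea can be illustrated without a vector store. The sketch below uses simple keyword overlap in place of ChromaDB's embedding search; the table descriptions are invented examples and `retrieve_schemas` is not the actual API of `src/agent/rag_retriever.py` — it only shows why retrieving a few relevant schemas beats stuffing every table into the prompt.

```python
import re

# Toy "semantic" descriptions of a few Meta Kaggle tables (invented text).
SCHEMA_DOCS = {
    "Users": "kaggle users: Id, UserName, PerformanceTier, RegisterDate",
    "Competitions": "competitions: Id, Title, DeadlineDate, RewardQuantity",
    "ForumMessages": "forum posts: Id, PostUserId, Message, PostDate",
}

def tokens(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve_schemas(question: str, k: int = 2) -> list[str]:
    """Return the k table names whose description shares the most words
    with the question -- a stand-in for a ChromaDB similarity query."""
    q = tokens(question)
    ranked = sorted(
        SCHEMA_DOCS,
        key=lambda t: len(q & tokens(SCHEMA_DOCS[t])),
        reverse=True,
    )
    return ranked[:k]

print(retrieve_schemas("forum posts per users tier"))
# ['ForumMessages', 'Users'] -- only these schemas go into the prompt
```

Only the retrieved descriptions are injected into the LLM prompt, which is what keeps context-window usage low as the number of tables grows.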
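The self-correction loop can be sketched as follows. This is a simplified, dependency-free illustration: sqlite3 stands in for DuckDB, `fix_sql` is a stub for the LLM repair call, and the function names and retry limit are assumptions, not the project's actual implementation in `src/agent/graph.py`.

```python
import sqlite3

def fix_sql(bad_sql: str, error: str) -> str:
    """Stub for the LLM repair step: in KaggleMind the failing query and
    its error message are sent back to the model for a rewrite."""
    # Hypothetical fix: the "model" notices the misspelled column name.
    return bad_sql.replace("user_name", "UserName")

def run_with_self_correction(conn, sql: str, max_retries: int = 3):
    """Execute SQL; on failure, feed the error back and retry."""
    for _ in range(max_retries):
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.OperationalError as exc:
            sql = fix_sql(sql, str(exc))  # error text drives the repair
    raise RuntimeError("query could not be repaired")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Users (Id INTEGER, UserName TEXT)")
conn.execute("INSERT INTO Users VALUES (1, 'grandmaster')")

# The first attempt references a non-existent column and fails;
# the loop repairs the query and succeeds on the second try.
rows = run_with_self_correction(conn, "SELECT user_name FROM Users")
print(rows)  # [('grandmaster',)]
```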
```
data-platform/
├── src/
│   ├── agent/                  # AI Agent Logic
│   │   ├── graph.py            # LangGraph workflow definition
│   │   └── rag_retriever.py    # ChromaDB schema retrieval
│   ├── ingestion/              # Data Ingestion
│   │   └── kaggle_downloader.py
│   ├── processing/             # Data Transformation
│   │   └── data_cleaner.py     # PySpark ETL
│   └── app.py                  # Streamlit Frontend
├── orchestration/
│   └── dagster/                # Dagster Repository
│       └── repository.py
├── data/                       # Local data storage (raw/processed)
├── requirements.txt
├── docker-compose.yml
├── Makefile
└── .pre-commit-config.yaml
```
- Python 3.9+
- Java 17 (for PySpark)
- Kaggle API credentials
- DeepSeek API Key
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd data-platform
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install pre-commit hooks:

  ```bash
  pre-commit install
  ```

- Set up environment variables:

  ```bash
  cp configs/.env.example .env
  ```

  Ensure `KAGGLE_USERNAME`, `KAGGLE_KEY`, and `DEEPSEEK_API_KEY` are set.

- Run the data pipeline. You can use the Makefile shortcuts:

  ```bash
  # Download specific tables
  make download
  # Process CSV to Parquet
  make process
  ```

  or orchestrate with Dagster:

  ```bash
  make dagster
  ```

  Access the UI at http://localhost:3000.

- Run the application:

  ```bash
  make app
  ```

  Enter your DeepSeek API Key in the sidebar to start analyzing.
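For reference, the environment file created during setup might look like the following. The values are placeholders only; the real file is derived from `configs/.env.example`.

```
KAGGLE_USERNAME=your-kaggle-username
KAGGLE_KEY=your-kaggle-api-key
DEEPSEEK_API_KEY=your-deepseek-api-key
```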
Run the full stack, including Dagster and Streamlit:

```bash
make docker-up
```

- Linting & Formatting: This project uses `black`, `isort`, and `flake8`:

  ```bash
  pre-commit run --all-files
  ```

- Testing:

  ```bash
  make test
  ```
- Data Ingestion & Processing
- Basic Streamlit UI
- LangGraph Multi-Agent System
- Semantic Layer & Vector DB
- FastAPI Backend
- Natural Language to SQL: Ask questions like "Which Grandmasters have the highest conversion rate from forum posts to gold medals?"
- Schema-Aware RAG: The agent understands the specific schema of the Meta Kaggle dataset.
- Auto-Correction: If the agent writes bad SQL, it fixes it automatically.
- Visualizations: Dynamic charts generated from query results.
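The dtype-driven chart choice can be sketched as a small dispatch function. This is illustrative only — the heuristic, the function name, and the column-kind labels are assumptions, not the app's actual logic — but it shows the idea of mapping a result set's column types to a Plotly Express chart.

```python
def choose_chart(col_types: dict[str, str]) -> str:
    """Pick a chart type from column kinds ('numeric', 'categorical',
    'datetime'). A toy version of dtype-driven chart selection."""
    kinds = list(col_types.values())
    if "datetime" in kinds and "numeric" in kinds:
        return "line"       # time series -> px.line
    if kinds.count("numeric") >= 2:
        return "scatter"    # two measures -> px.scatter
    if "categorical" in kinds and "numeric" in kinds:
        return "bar"        # category vs measure -> px.bar
    return "table"          # fall back to a plain table

print(choose_chart({"RegisterDate": "datetime", "Signups": "numeric"}))  # line
print(choose_chart({"Tier": "categorical", "Medals": "numeric"}))        # bar
```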
MIT