GitHub - firefly-operationOS/flyquery: Tabular Intelligence — multi-tenant Text-to-SQL over user-uploaded structured files (CSV/XLSX/JSON/Parquet/Avro/...). Materialises uploads to Parquet on object storage, queried by DuckDB through a multi-agent pipeline (Grounding → Generation → Critic → Explainer). Part of Firefly OperationOS.

Tabular Intelligence

Conversational SQL over the files your business actually has.

Drop a CSV, XLSX, JSON or Parquet into a workspace. flyquery infers the schema, materialises the file as columnar Parquet on object storage and indexes it in a long-lived knowledge base of tables, columns, AI-proposed joins, business glossary and semantic-layer metrics — then answers natural-language questions through a multi-agent pipeline (Grounding → Generation → Critic → Explainer) that produces governed DuckDB SQL with row-level audit, tenant isolation and per-token scopes. One HTTP service, multi-tenant from the first call. Format coverage: CSV, TSV, XLSX, XLS, ODS, JSON, JSONL, Parquet, Avro, ORC, Arrow and Feather, plus their .gz, .zip and .bz2 variants.

In a hurry? Jump to the Quickstart → · SDK paths: Python · Java · Wire payloads: Payload reference →

Why this service exists

Analysts answering ad-hoc questions live in two worlds:

Spreadsheets — fast to start, impossible to govern. Excel and CSV tabs accumulate in inboxes; nobody knows which version is current, which column means what, or whether the numbers can be trusted.
Warehouses + BI tools — governed but expensive. Onboarding a new dataset means an ETL ticket, a dbt model, a schema review, and weeks before the first dashboard ships.

flyquery is the middle path. Drop a CSV (or XLSX, or JSON, or Parquet) into a workspace; flyquery materialises it as Parquet on object storage, derives a typed schema, samples + profiles each column, runs a PII pass, proposes joins to your other tables, and fits it to a long-lived knowledge base. Then you ask in English. A multi-agent pipeline (Grounding → Generation → Critic → Explainer) produces governed DuckDB SQL, runs it in a sandboxed connection under per-token scopes, returns rows + a chart hint, and keeps a conversation thread for drill-down.

You get the governance of a warehouse (RLS, audit, scopes, versioned schemas, semantic-layer metrics) and the time-to-first- answer of a spreadsheet (upload, ask, done).

What makes this non-trivial:

Structured data is messy. Real-world files have inconsistent headers, mixed date formats, merged cells, compressed archives, and multi-sheet XLSX with completely different schemas per sheet.
Re-uploads break queries. Column renames across versions silently break historical analytics. flyquery detects drifts, preserves annotations, and pins historical queries to specific Parquet snapshots.
SQL generation is unreliable. A single-pass LLM call mis-identifies tables, invents columns, and silently drops GROUP BY. flyquery uses a four-agent pipeline with an AST firewall and self-correcting Critic to ensure only valid, scoped SQL executes.

Why not part of flycanon?

flycanon and flyquery are deliberately separate services that solve fundamentally different problems. Combining them would create a coherent-looking mess.

flycanon is the operational knowledge repository for unstructured content (documents, HTML, PDFs, audio transcripts). Its model is: docs → chunks → embeddings → hybrid retrieval → grounded answers with citations. The query language is natural-language-to-RAG. The storage engine is pgvector + BM25 over text chunks. The retrieval primitive is semantic chunk retrieval with candidate scoring. The user-facing concepts are knowledge items, candidates, and supersession.

flyquery handles structured tabular data. Its model is: file → typed schema → Parquet on object storage → DuckDB SQL via multi-agent pipeline. The query language is natural-language-to-SQL. The storage engine is Parquet on object storage queried in-process by DuckDB. The retrieval primitive is schema-element retrieval + table resolution + AST firewall. The user-facing concepts are datasets, tables, snapshots, examples, and metrics.

The query languages, storage engines, retrieval primitives, and user-facing concepts conflict on every layer. Keeping them separate lets each service be sharp at its one job. Cross-service handoff happens via the agent tier with idempotency; see docs/integration-with-firefly-os.md.

The shape of the system

                        User / Agent caller
                               │
                  REST API  /api/v1/*
                               │
         ┌─────────────────────┼──────────────────────┐
         │                     │                      │
   Upload surface         Query surface         Schema annotation
  POST /datasets/{id}/files  POST /query         PUT /schema-objects/{id}
  PUT  /tables/{id}:upload   POST /conversations  POST /examples/{id}:approve
         │                     │
         ▼                     ▼
  [IngestWorker]       [QueryService]
  10-stage async         Hybrid Retrieval
  pipeline               GroundingAgent
  (EDA topic)            GenerationAgent
         │               DuckDB Executor
         ▼               CriticAgent
  Parquet on             ExplainerAgent
  ObjectStore                  │
  (LocalFs/S3/GCS/Azure)       ▼
         │               AnswerResponse + SSE
         ▼               + presigned Parquet URL
  Postgres + pgvector
  (schema KB, relations,
   examples, audit)

Three tiers: upload surface (async ingest into object storage + schema KB), query surface (multi-agent NL→SQL over Parquet), annotation surface (human governance of descriptions, PII tags, relations, examples).

Quickstart

The commands below use a local stack (task dev starts Postgres + Redis + MinIO). See QUICKSTART.md for the full walkthrough including credentials, MinIO bucket setup, and environment variables.

# 1. Boot
task dev && task migrate && task serve

# 2. Create a workspace and dataset
TENANT=acme
WS_ID=$(curl -s -X POST http://localhost:8520/api/v1/workspaces \
  -H "Content-Type: application/json" \
  -H "X-Tenant-Id: $TENANT" \
  -d '{"name": "analytics", "description": "Q1 analytics"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

DS_ID=$(curl -s -X POST http://localhost:8520/api/v1/datasets \
  -H "Content-Type: application/json" \
  -H "X-Tenant-Id: $TENANT" -H "X-Workspace-Id: $WS_ID" \
  -d '{"name": "q1_sales", "description": "Q1 sales export"}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# 3. Upload a CSV
JOB_ID=$(curl -s -X POST \
  "http://localhost:8520/api/v1/datasets/$DS_ID/files" \
  -H "X-Tenant-Id: $TENANT" -H "X-Workspace-Id: $WS_ID" \
  -F "file=@sales_q1.csv" \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['ingest_job_id'])")

# 4. Stream ingest progress (SSE — Ctrl-C when you see "SUCCEEDED")
curl -N "http://localhost:8520/api/v1/ingest-jobs/$JOB_ID/stream" \
  -H "X-Tenant-Id: $TENANT" -H "X-Workspace-Id: $WS_ID"

# 5. Ask a question
curl -s -X POST http://localhost:8520/api/v1/query \
  -H "Content-Type: application/json" \
  -H "X-Tenant-Id: $TENANT" -H "X-Workspace-Id: $WS_ID" \
  -d "{\"question\": \"Total revenue by region for Q1\", \"dataset_id\": \"$DS_ID\"}" \
  | python3 -m json.tool
# → {
#     "answer": "Total Q1 revenue is $4.2M. Northeast led at $1.8M (43%).",
#     "executed_sql": "SELECT region, SUM(revenue) FROM sales_q1 GROUP BY region",
#     "chart_hint": "bar",
#     "result_url": "https://minio.local/flyquery/results/..."
#   }

Capabilities

Capability	Detail
12 file formats	CSV, TSV, XLSX, XLS, ODS, JSON, JSONL, Parquet, Avro, ORC, Arrow, Feather
Compression	.gz, .zip, .bz2 transparently unwrapped
10-stage pipeline	Receive → Parse → Reconcile → Sample → Profile → Relations → Describe → PII → Embed → Publish
4-agent query pipeline	GroundingAgent → GenerationAgent → CriticAgent → ExplainerAgent
AST firewall	sqlglot + DuckDB double-check; single-statement SELECT only on uploaded tables
Schema knowledge base	pgvector HNSW + BM25 + cross-encoder reranker
Semantic layer	OSI/MetricFlow YAML metrics, deterministic SQL compilation
Conversation memory	Multi-turn drill-down with snapshot pinning
Multi-tenancy	Postgres RLS on all tables; two-role split (admin/app)
Auto-learning	First-shot OK queries auto-propose `(question, SQL)` examples
PII scanning	regex/Presidio; warn/redact/reject policy
Agent surface	Full `/api/v1/agent/*` mirror with `X-Agent-Token` + `Idempotency-Key`
SDKs	Python (`flyquery-sdk`) and Java (`com.firefly:flyquery-sdk`) auto-generated from OpenAPI

Documentation

All docs live in docs/. Start with:

Document	Purpose
docs/README.md	Navigation hub — reading paths by intent
docs/architecture.md	System design, pipelines, hexagonal ports
docs/api-reference.md	REST surface — every endpoint, header, scope
docs/pipeline.md	Upload + query pipelines end-to-end with diagrams
docs/ingestion.md	10-stage ingestion detail
docs/file-formats.md	Per-format reader matrix
docs/semantic-layer.md	MetricFlow YAML + compilation
docs/payload-reference.md	DTO catalog
docs/security-model.md	RLS, tokens, AST firewall, PII policy
docs/deployment.md	Ops reference
docs/conversations.md	Multi-turn query memory + drill-down
docs/eda-events.md	Events published: flyquery.ingest + flyquery.schema
docs/glossary.md	Term definitions
docs/firefly-intelligence-system.md	flycanon + flyradar + flyquery cross-narrative
docs/embeddings.md	Provider-agnostic embedder (Ollama / OpenAI / Cohere / …)
docs/semantic-types.md	49-label semantic-type taxonomy for ingested columns
docs/prompt-engine.md	YAML + Jinja2 prompt files under `src/flyquery/resources/prompts/`
docs/cost-tracking.md	Usage surfacing per agent stage in query / ingest responses
examples/README.md	Synthetic CSV / JSON / JSONL / XLSX / Parquet fixtures + smoke recipe

Full index with reading paths: docs/README.md →

Local dev

task dev          # docker compose up postgres + redis + minio
task migrate      # alembic upgrade head
task serve        # uvicorn flyquery.main:app --reload

Running tests

task lint
task test:unit
task test:integration

Integration tests require Docker (testcontainers spins up pgvector/pgvector:pg16). Tests marked @pytest.mark.s3 / @pytest.mark.gcs / @pytest.mark.azure_blob need local credentials or MinIO/fake-gcs/Azurite — see QUICKSTART.md.

Tests marked @pytest.mark.llm require ANTHROPIC_API_KEY. Exclude them in CI with -m 'not llm'.

Architecture quick-take

flyquery uses a two-tier storage design: Parquet on object storage (LocalFs in dev; S3/GCS/Azure Blob in production) queried by DuckDB in-process per request. Postgres stores all metadata, embeddings, relations, audit, and agent tokens. There is no persistent DuckDB database — every query ATTACHes Parquet snapshots fresh, eliminating shared mutable state between requests.

The hexagonal adapter ring means every external boundary (object store, file reader, vector store, PII scanner, semantic compiler, rate limiter) is a Python Protocol; the domain core never depends on a concrete implementation.

Lock-step modules (web/conventions/*, agent_deps.py, agent_token_service.py, builder.py) are byte-equivalent with flycanon and flyradar. A CI gate (scripts/check_lockstep.py) blocks PRs that drift.

See docs/architecture.md for the full picture.

License

The bundled Python SDK (sdks/python/) and Java SDK (sdks/java/) ship their own Apache License 2.0 files.

Name		Name	Last commit message	Last commit date
Latest commit History 185 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
migrations		migrations
scripts		scripts
sdks		sdks
src/flyquery		src/flyquery
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
QUICKSTART.md		QUICKSTART.md
README.md		README.md
Taskfile.yml		Taskfile.yml
alembic.ini		alembic.ini
docker-compose.yml		docker-compose.yml
docker-entrypoint.sh		docker-entrypoint.sh
openapi.json		openapi.json
openapitools.json		openapitools.json
pyfly.yaml		pyfly.yaml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tabular Intelligence

Why this service exists

Why not part of flycanon?

The shape of the system

Quickstart

Capabilities

Documentation

Local dev

Running tests

Architecture quick-take

License

About

Uh oh!

Releases 7

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Tabular Intelligence

Why this service exists

Why not part of flycanon?

The shape of the system

Quickstart

Capabilities

Documentation

Local dev

Running tests

Architecture quick-take

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 7

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages