GitHub - firefly-operationOS/flydocs: Pure-multimodal IDP service — extract structured fields with bounding boxes from PDFs, images, Office docs and archives, then validate, authenticity-check, judge, and evaluate business rules. Sync + EDA async APIs on Python 3.13 over fireflyframework-pyfly and -agentic.

Pure-multimodal Intelligent Document Processing

Field extraction with bounding boxes, structured validation, LLM cross-checking, and a business-rule engine — exposed as a production HTTP service with both synchronous and queue-backed asynchronous APIs, independent of any particular product or vertical.

In a hurry? Jump to the 5-minute Quickstart → · SDK paths: Python · Java / Spring Boot · Compose payloads: Payload reference →

Coming from v0? Every old key → new key is in docs/migration-v0-to-v1.md, with side-by-side worked examples.

Why this service exists

KYC reviews, contract intake, claims triage, invoice processing — every operations team has the same workflow underneath:

"Take this document, tell me what it says, decide whether it passes our checks, and route it accordingly."

Doing that with traditional OCR pipelines is brittle: layouts change, new document types arrive every quarter, and the team ends up hand-coding extraction rules that don't survive a single redesign.

flydocs collapses the whole workflow into one HTTP call. You ship the document, declare the fields and rules you care about as JSON, and the service returns a structured verdict — every value tagged with a bounding box, a confidence score, a validation outcome, an LLM judge re-check, and the resolved business rules. No layout templates, no OCR coordinates, no model fine-tuning.

It is built to drop into a production back-office pipeline: idempotent APIs, queue-backed async jobs with HMAC-signed webhooks, observability out of the box, and clean failure isolation per pipeline stage.

What you get back

You give the service one HTTP request. The response is a single JSON object containing, for every document you asked about:

Layer	What it tells you
Fields	The extracted value, page numbers, normalised bounding box (with a geometric quality verdict, a `source` discriminator — `llm` / `pdf_text` / `ocr` — and a `refinement_confidence`), model confidence, and free-text notes. Array and object fields nest recursively.
Field validation	Per-field pass / fail plus the verdict of every built-in `ValidatorSpec` (IBAN, BIC, NIF/NIE/CIF, VAT, SSN, Luhn, phone (E.164), country / language / postal code, lat/lon, IPv4/IPv6, URI/URL/email/domain/slug, UUID, JSON, hex color, date / time / datetime / ISO 8601, currency code, amount, passport). Each spec carries a `severity` (`error` flips the field invalid; `warning` records the finding but keeps the field valid).
Visual authenticity	Yes/no verdicts on caller-defined `visual_checks` (signature present, stamp present, photo present, …).
Content authenticity	Document-level integrity audit: date consistency, totals tally, expected boilerplate, tampering signals.
Judge	A second LLM pass re-checks each extracted value against the document and stamps `pass` / `fail` / `uncertain` with evidence and a `flag_for_review` bit.
Judge escalation	Optional re-run of `extract` + `judge` with a stronger `escalation.model` when the judge's first pass fails too many fields. The `pipeline.escalation` block in the response audits the trigger (`primary_fail_rate`, `escalation_fail_rate`, `accepted`).
Post-extraction transforms	Declarative entity resolution (dedupe people across documents by DNI + name variants) and free-form LLM transformations (closed-taxonomy normalisation, role mapping, …). Per-document outputs land in the affected `documents[].field_groups[]`; cross-document outputs land in `request_transformations[]`.
Business rules	Boolean / categorical decisions over fields, validator outcomes, and other rules' results — evaluated as a DAG with `graphlib.TopologicalSorter`, batched per level into one LLM call. "Is this KYC-complete?", "Escalate to manual review?", "Approve / reject".
Multi-file summary	`files[]` carries one entry per input file (filename, MIME type, page count, byte size, final `matched_type`, classifier verdict). Files that don't match any declared `document_type` show up in `discovered_documents[]` instead of being dropped.
Pipeline errors	Non-fatal per-stage failures are surfaced in `pipeline.errors[]` (one entry per failed node with `code`, `message`, `node`) — the request still returns with `status: "partial"` instead of failing the whole call.
Execution trace	`pipeline.trace[]` lists every executed pipeline node in DAG order with `started_at`, `completed_at`, `latency_ms`, and `status` (`success` / `failed` / `skipped`) — drop-in latency breakdown for ops dashboards.
Audit trail	Response `id` (`ext_…`), per-stage latencies, per-doc model used, structured `outbound_call` log lines for every LLM / webhook / queue call, W3C trace context (`traceparent`, `tracestate`, `X-Correlation-Id`, `X-Tenant-Id`) propagated end-to-end.
Cost telemetry	Aggregated `pipeline.usage` block in every response: input/output tokens + estimated USD cost — sourced from `genai-prices`, which is provider-agnostic (Anthropic, OpenAI, Google, Mistral, …). Broken down by agent and by model. Plus a per-call `cost_usd` on every `outbound_call` log line.
Prompt caching	Provider-aware across the full Anthropic + OpenAI + Google + Bedrock + Azure matrix. Anthropic / Bedrock-Anthropic: explicit `cache_control` on the system prompt + last user-message block (5-minute or 1-hour TTL). OpenAI / Azure-OpenAI: automatic caching for prompts ≥1024 tokens + a stable `prompt_cache_key` routing hint so concurrent requests from the same agent share cache-backend affinity. Google Gemini: caller-supplied `CachedContent` resource ids wired through to pydantic-ai. Cache writes / reads surface as `cache_creation_tokens` / `cache_read_tokens` on the response. Toggle the whole middleware with `FLYDOCS_PROMPT_CACHE=off`.

A single request always carries a non-empty files[] list — a single file is just a one-element list. Submit several entries to ship a multi-file pack at once: pin each file's expected_type when you know it, or let the LLM classifier decide. Each extracted document carries a source_file field so callers can map output back to the input file that produced it. The full multi-file shape is documented in docs/api-reference.md § 2c.

The same call works synchronously (POST /api/v1/extract, blocks until done) or as an async submission with a webhook (POST /api/v1/extractions, returns 202 + an ext_… id). Multi-file submission is supported on both surfaces. Both endpoints accept application/json or multipart/form-data.

Quickstart

Want the 5-minute curl tour instead? See QUICKSTART.md — task docker:up:test + one curl call against a mock LLM, no API keys. The section below is the full walk-through (real provider keys, Postgres, worker, multi-file async job + transformation).

A complete walk-through from a fresh clone to your first sync extraction and your first async, multi-file job with a transformation. Pick whichever provider you have a key for — fireflyframework-genai resolves the right credential from the model id prefix.

0. Prerequisites

Python 3.13 (uv will install it on demand; otherwise install via pyenv / asdf / the system package manager).
uv — the package manager (brew install uv on macOS, or curl -LsSf https://astral.sh/uv/install.sh | sh).
Docker — for Postgres in dev, and optionally for the full stack.
task — task runner used by every helper command (brew install go-task/tap/go-task, or see taskfile.dev).
An LLM provider key — ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY, MISTRAL_API_KEY, … any one the provider you pick supports.

1. Clone and install

git clone https://github.com/firefly-operationOS/flydocs.git
cd flydocs
task deps:install        # uv sync --extra dev: pins the venv at .venv/

2. Configure the environment

task env:init            # copies env_template -> .env (gitignored)

Edit .env. The two knobs you actually need to think about:

# Pick any provider + model id that fireflyframework-genai can resolve.
FLYDOCS_MODEL=anthropic:claude-sonnet-4-6
# Optional second provider used on transient errors. Mix providers freely.
FLYDOCS_FALLBACK_MODEL=openai:gpt-4o

# Set the credential matching the prefix you chose above. Set the
# fallback's credential too if it's a different provider.
ANTHROPIC_API_KEY=sk-ant-...
# OPENAI_API_KEY=sk-...
# GOOGLE_API_KEY=...
# MISTRAL_API_KEY=...

Everything else (Postgres URL, EDA adapter, timeouts, webhook secret, …) has sane defaults in env_template.

3. Bring up Postgres and run migrations

task dev:db              # docker compose up Postgres (and Redis if you switch adapter)
task dev:migrate         # alembic upgrade head — creates extractions +
                         # the pyfly_eda_outbox / pyfly_eda_offsets tables

4. Start the API and the worker

Two terminals — the API serves HTTP, the worker drains the EDA bus.

# Terminal A
task dev:serve           # uvicorn on http://localhost:8400
                         # OpenAPI:    /docs
                         # Health:     /actuator/health/readiness
                         # PyFly admin: /admin

# Terminal B
task dev:worker          # subscribes via fireflyframework-pyfly's EventPublisher

A healthy boot prints both the database_health and eda_health indicators as UP. Hit /actuator/health/readiness to confirm before sending traffic.

5. Your first synchronous extraction

curl -s http://localhost:8400/api/v1/extract \
  -H 'content-type: application/json' \
  -d @docs/examples/extract.json | jq '.documents[0].field_groups'

The endpoint blocks until the pipeline finishes (or hits FLYDOCS_SYNC_TIMEOUT_S, default 60 s). The response carries every extracted field with its bounding box, validation outcome, judge verdict, business-rule decisions, and a pipeline.usage block with the USD cost. See docs/api-reference.md for the full shape.

6. Your first async, multi-file extraction with a transformation

Build a payload with two files, a deduper, and a webhook callback, then POST it. The submit returns immediately with a 202 + an ext_… id; the worker drives the same pipeline and posts the result to your callback_url when it finishes:

curl -s http://localhost:8400/api/v1/extractions \
  -H 'content-type: application/json' \
  -H 'idempotency-key: '"$(uuidgen)" \
  -d '{
    "intention": "KYB pack: deed + DNI. Dedupe people across docs.",
    "files": [
      {"filename": "deed.pdf", "content_base64": "JVBERi0xLjQK...",  "content_type": "application/pdf"},
      {"filename": "dni.jpg",  "content_base64": "/9j/4AAQ...",       "content_type": "image/jpeg"}
    ],
    "document_types": [ /* one DocumentTypeSpec per id — see docs/api-reference.md § 5 */ ],
    "rules": [],
    "options": {
      "stages": {"classifier": true, "judge": true, "transform": true},
      "transformations": [
        {"type": "entity_resolution", "target_group": "personas",
         "match_by": ["dni", "nombre"], "scope": "request"}
      ]
    },
    "callback_url": "https://your-workflow.example.com/idp/webhook",
    "metadata": {"tenant_id": "acme"}
  }'

Poll state if you don't want to wait for the webhook:

EXT_ID=ext_01HEM2ZZ7M0Q8...
curl -s http://localhost:8400/api/v1/extractions/$EXT_ID
curl -s http://localhost:8400/api/v1/extractions/$EXT_ID/result | jq

The webhook payload is the unified EventEnvelope (event_id, event_type: "extraction.completed", occurred_at, correlation_id, extraction snapshot, …) and carries the full ExtractionResult under result when extraction.status == "succeeded". Signed with HMAC-SHA256 in X-Flydocs-Signature using FLYDOCS_WEBHOOK_HMAC_SECRET.

7. (Optional) Skip steps 3–4 with the full container stack

task docker:up           # api + worker + Postgres on Docker (and Redis if adapter=redis)
task health              # GET /actuator/health
task docker:logs         # tail every container

This is the closest thing to production locally; the only difference is no TLS termination in front of the service.

8. Run the test suite

task test                # unit suite — ~250 tests, in-memory SQLite + EDA, <2 s
task test:llm            # real-LLM smoke test — needs the provider key from step 2

task lint:check runs ruff + pyright (both gated in CI).

How the request flows

The service runs the request as a DAG inside the fireflyframework-agentic PipelineEngine. Stages are toggled per request through ExtractionOptions.stages; the engine builds a fresh DAG for each call so the audit trail reflects exactly what executed.

                ┌──────────────────────────────────────────────────────────────────┐
   POST  ──────▶│ load → discover? → classify? → plan_tasks → extract →            │──────▶ JSON
 (PDF/PNG/…)    │ bbox_validation → bbox_refine? → field_validation? →             │  (fields + bbox
                │ visual_auth? → content_auth? → judge? → judge_escalation? →      │   + verdicts)
                │ transform? → rules? → assemble                                   │
                └──────────────────────────────────────────────────────────────────┘
                              │
                              │  per-segment concurrency (asyncio.gather)
                              │  per-stage timeouts + error capture
                              ▼
                       structured trace
                       (id, pipeline.latency_ms, pipeline.errors)

The extractor and the geometric bbox check are the only mandatory stages. Everything else is a caller-chosen trade-off between cost, latency, and rigor. With splitter enabled, every file -- even a single uploaded PDF -- is split into its sub-documents and each is independently classified against the declared document_types[], so a pack that bundles a deed + an ID + a utility bill comes out as three separate routed documents without the caller having to know what's inside.

The bbox_refine stage grounds the LLM's bounding boxes against the document's real text. PDF text layers go through PyMuPDF (sub-pixel accurate); image-PDFs and rasters route to a pluggable OcrEngine. Pick tesseract (default), docling (layout-aware, surfaces table-cell + reading-order metadata -- see docs/docling.md), or none. The extractor can also splice a Markdown text-anchor (FLYDOCS_EXTRACTION_TEXT_ANCHOR=docling) into the user prompt for the LLM to cross-reference -- useful for multilingual scans and dense tabular documents.

See docs/pipeline.md for the deep dive.

Built on the Firefly Framework

Every cross-cutting concern is delegated to the framework so the business logic stays small.

Concern	Provided by
Dependency injection	`fireflyframework-pyfly` `@configuration` + `@bean`
CQRS (commands / queries / bus)	`fireflyframework-pyfly` `@command_handler` / `@query_handler`
REST surface	`fireflyframework-pyfly` `@rest_controller` over FastAPI
Async pipeline DAG	`fireflyframework-agentic` `PipelineEngine` / `PipelineBuilder`
Prompt management	`fireflyframework-agentic` `PromptTemplate` + `PromptRegistry` (YAML-backed)
LLM agents (multimodal)	`fireflyframework-agentic` `FireflyAgent` over `pydantic-ai`
EDA / async jobs	`fireflyframework-pyfly` `EventPublisher` — default `postgres` (durable outbox + LISTEN/NOTIFY); flip `FLYDOCS_EDA_ADAPTER` to `memory` / `redis` / `kafka`
W3C trace context	`fireflyframework-pyfly` `CorrelationFilter` (default web filter) + `pyfly.observability.correlation`
K8s probes	`/actuator/health/liveness` + `/actuator/health/readiness` with `database_health` + `eda_health` indicators
Multi-arch container	`ghcr.io/firefly-operationos/flydocs:latest` — linux/amd64 + linux/arm64 manifest
Observability	structlog JSON, OTLP tracing, Prometheus metrics, actuator
Persistence	SQLAlchemy async, Alembic, Postgres (SQLite for tests)
RFC 7807 error responses	`@controller_advice` exception handler

Everything is wired through fireflyframework-pyfly's container — including the prompt catalog, the EDA event publisher, the webhook publisher, and the async worker — so the application has no manually-constructed singletons outside the DI graph.

Project layout

src/flydocs/
├── interfaces/              Public DTOs + enums — the stable HTTP contract
├── models/                  SQLAlchemy entities + async repositories
├── core/
│   ├── configuration.py     @configuration with every @bean
│   └── services/
│       ├── extract/         CQRS: sync extract command + handler
│       ├── extractions/     CQRS: submit / get / list / cancel async extraction
│       ├── extraction/      MultimodalExtractor + PromptCatalog
│       ├── splitting/       LLM document splitter
│       ├── validation/      Pure-Python FieldValidator + built-in validators
│       ├── authenticity/    Visual + content audits
│       ├── judge/           LLM judge / re-evaluator
│       ├── rules/           DAG-based business rule engine
│       ├── pipeline/        PipelineOrchestrator (fireflyframework-agentic PipelineEngine)
│       ├── webhook/         Outbound webhook publisher with HMAC
│       └── workers/         ExtractionWorker (subscribes to fireflyframework-pyfly EDA)
├── resources/
│   └── prompts/             YAML prompt templates (one per LLM stage)
└── web/
    ├── controllers/         @rest_controller beans
    └── advice/              @controller_advice exception mapping

Public API at a glance

Endpoint	Purpose
Sync extraction
`POST /api/v1/extract`	Synchronous extraction. Blocks until the pipeline finishes.
`POST /api/v1/extract:validate`	Dry-run the semantic validator on a payload (no LLM call, no DB write).
Async extractions
`POST /api/v1/extractions`	Submit a queued extraction. Returns `202` + an `ext_…` id.
`GET /api/v1/extractions`	Filtered, paginated listing (`status`, `post_processing_status`, `idempotency_key`, `created_after` / `before`, `limit`, `offset`).
`GET /api/v1/extractions/{id}`	Current state of an `Extraction` (incl. post-processing block).
`GET /api/v1/extractions/{id}/result`	Final `ExtractionResult`. Long-poll for grounded bboxes via `?wait_for_bboxes=true&timeout=…`.
`DELETE /api/v1/extractions/{id}`	Cancel an extraction that is still `queued`.
Service metadata
`GET /api/v1/version`	Build + model + EDA-adapter info.
`GET /openapi.json`	Machine-readable OpenAPI 3.1 spec.
`GET /docs`	Swagger UI (OpenAPI 3.1).
`GET /admin`	PyFly Admin dashboard — beans, mappings, env, CQRS, traces, loggers, health.
Actuator (ops)
`GET /actuator/health`	Composite health (DB + EDA).
`GET /actuator/health/liveness`	Kubernetes liveness probe.
`GET /actuator/health/readiness`	Kubernetes readiness probe — `503` when `database_health` or `eda_health` is `DOWN`.
`GET /actuator/metrics`	Prometheus metrics.

Full request / response shapes in docs/api-reference.md. Errors follow RFC 7807 (application/problem+json) — see the error-code catalogue.

What's bundled

Standard validators — pure-Python checkers you can declare per field. They run after extraction and never call the LLM:

Group	Validators
Network	`email`, `uri`, `url`, `domain`, `slug`, `ipv4`, `ipv6`
Temporal	`date`, `datetime`, `time`, `iso_8601`
Identifiers	`uuid`, `json`, `hex_color`
Finance	`iban` (mod-97), `bic`, `credit_card` (Luhn), `currency_code`, `amount`
Telephony	`phone_e164`
Geographic	`country_code`, `language_code`, `postal_code` (country-aware), `latitude`, `longitude`
National IDs	`nif` (ES, mod-23), `nie`, `cif`, `vat_id`, `ssn`, `passport_number`

Each one accepts optional params (e.g. {"country": "ES"}) and a severity (error flips the field invalid; warning records the finding but keeps the field valid). See docs/validators.md.

Prompt catalog — every LLM stage reads its system + user prompt from a YAML file under src/flydocs/resources/prompts/. The catalog is a normal fireflyframework-pyfly bean; you can swap templates, bump versions, or A/B-test prompts without touching Python. See docs/prompts.md.

Business rule engine — declare predicates that depend on fields, validator outcomes, or other rules. Rules form a DAG; the engine evaluates them level-by-level via graphlib.TopologicalSorter and groups same-level rules into a single LLM call to amortise cost. Cycles are rejected before any LLM call is issued. See docs/rule-engine.md.

{
  "id": "kyc_complete",
  "predicate": "All identity fields are populated AND nif is valid.",
  "parents": [
    {"kind": "field", "document_type": "passport",
     "fields": ["full_name", "nif"]}
  ],
  "output": {"type": "boolean", "valid_outputs": ["true", "false"]}
}

Documentation map

Document	Read it when…
QUICKSTART.md	You want your first extraction in five minutes (HTTP / curl).
docs/payload-reference.md	You're composing the request payload — every field, option, variant, and worked example.
docs/overview.md	You're new and want a guided tour of the system.
docs/architecture.md	You need to know how `fireflyframework-pyfly` + `fireflyframework-agentic` plug together.
docs/pipeline.md	You're touching the orchestrator or adding a new stage.
docs/api-reference.md	You're integrating with the HTTP API.
docs/transformations.md	You want to dedupe, normalise or run free-form LLM transformations on extracted data.
docs/validators.md	You want to know what validators are built-in.
docs/rule-engine.md	You're designing business rules.
docs/prompts.md	You're editing or adding YAML prompt templates.
docs/deployment.md	You're shipping the service to a real environment.
docs/troubleshooting.md	A real-world problem just blew up.
docs/migration-v0-to-v1.md	You're migrating a v0 integration to v1.

Operations & developer workflows

task deps:install        # uv sync --extra dev
task lint:check          # ruff + pyright
task test                # unit suite (~26 tests, <1s)
task test:llm            # real-LLM smoke test (needs the provider key matching FLYDOCS_MODEL)
task dev:serve           # API on :8400
task dev:worker          # async job consumer
task migrate             # alembic upgrade head
task docker:build        # build the production image
task docker:up           # full stack — api + worker + Postgres + Redis
task docker:up:test      # stack with mock-llm for integration tests
task health              # GET /actuator/health
task version             # GET /api/v1/version
task openapi             # dump the OpenAPI spec

Full task surface is in Taskfile.yml.

flydocs is part of Firefly OperationOS — the open back-office plane. flydocs itself is product- and vertical-agnostic: plug it into any platform that needs to turn documents into structured, verifiable data.

Official SDKs: Python (flydocs-sdk — wheel on GitHub Releases, install with uv add <release-url>) · Java / Spring Boot (com.firefly.flydocs:flydocs-sdk on GitHub Packages). See sdks/README.md for install + quickstart.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
.github		.github
docs		docs
migrations		migrations
scripts		scripts
sdks		sdks
src/flydocs		src/flydocs
tests		tests
.dockerignore		.dockerignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
LICENSE		LICENSE
QUICKSTART.md		QUICKSTART.md
README.md		README.md
Taskfile.yml		Taskfile.yml
alembic.ini		alembic.ini
docker-compose.test.yml		docker-compose.test.yml
docker-compose.yml		docker-compose.yml
docker-entrypoint.sh		docker-entrypoint.sh
env_template		env_template
pyfly.yaml		pyfly.yaml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Pure-multimodal Intelligent Document Processing

Why this service exists

What you get back

Quickstart

0. Prerequisites

1. Clone and install

2. Configure the environment

3. Bring up Postgres and run migrations

4. Start the API and the worker

5. Your first synchronous extraction

6. Your first async, multi-file extraction with a transformation

7. (Optional) Skip steps 3–4 with the full container stack

8. Run the test suite

How the request flows

Built on the Firefly Framework

Project layout

Public API at a glance

What's bundled

Documentation map

Operations & developer workflows

About

Uh oh!

Releases 5

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Pure-multimodal Intelligent Document Processing

Why this service exists

What you get back

Quickstart

0. Prerequisites

1. Clone and install

2. Configure the environment

3. Bring up Postgres and run migrations

4. Start the API and the worker

5. Your first synchronous extraction

6. Your first async, multi-file extraction with a transformation

7. (Optional) Skip steps 3–4 with the full container stack

8. Run the test suite

How the request flows

Built on the Firefly Framework

Project layout

Public API at a glance

What's bundled

Documentation map

Operations & developer workflows

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 5

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages