Merged
Commits
34 commits
e9d4798
docs: API contract v1 redesign spec
May 26, 2026
9081162
docs: implementation plan for API v1 contract redesign
May 26, 2026
3beb81b
chore: capture baseline test snapshot for v1 redesign
May 26, 2026
370e3d3
refactor(interfaces): rewrite DTOs and enums for API v1 contract
May 26, 2026
2022b82
refactor(db): Extraction entity + ExtractionRepository + alembic v1 r…
May 26, 2026
a284d61
refactor(core+web): sweep core services + workers + controllers to v1…
May 26, 2026
29335bc
test(v1): migrate enum, validator, field/request validator tests
May 26, 2026
8968e84
test(v1): migrate event envelopes, webhook publisher, extraction repo…
May 26, 2026
70813e1
test(v1): migrate submit/list/get/reaper handler tests
May 26, 2026
c69ad09
test(v1): migrate worker concurrency / retry / bbox refine worker tests
May 26, 2026
549ed91
test(v1): migrate bbox/transformer/integration/llm tests
May 26, 2026
e9c5317
docs: post-server test snapshot for Phase 5 of API v1 redesign
May 26, 2026
a9b8156
docs(api-reference): rewrite for v1
May 26, 2026
f265de5
docs(payload-reference): rewrite for v1
May 26, 2026
8eebab3
docs: migration guide v0 -> v1
May 26, 2026
3b97ac3
docs(validators): rename standard-validators -> validators, rewrite f…
May 26, 2026
f3ad168
refactor(py-sdk): rewrite models around v1 DTOs; consolidate request.…
May 26, 2026
2d4f775
feat(py-sdk): Client + AsyncClient with extractions sub-resource
May 26, 2026
f4fd107
feat(py-sdk): WebhookVerifier returns typed EventEnvelope; errors exp…
May 26, 2026
779e9d5
test(py-sdk): rewrite around v1 surface (156 passing)
May 26, 2026
2bde9be
docs(py-sdk): rewrite examples + README + QUICKSTART + TUTORIAL for v1
May 26, 2026
d133ad3
feat(py-sdk): re-export full v1 surface; bump to 26.6.0
May 26, 2026
2f78151
docs: sweep related docs for v1 vocabulary
May 26, 2026
be29476
docs: rewrite QUICKSTART/README/sdks-README + add CHANGELOG for v1
May 26, 2026
c5f2004
docs: align cross-references and last v1 vocabulary sweeps
May 26, 2026
121950c
feat(java-sdk): rewrite full SDK for v1 contract
May 26, 2026
464a2fe
fix(server): purge remaining v0 field names from LLM-facing schema
May 26, 2026
40cd37a
docs: capture final post-redesign test snapshot for Phase 9 sign-off
May 26, 2026
d57ef98
fix: sync timeout returns 408, bbox-worker topic, alembic + transform…
May 27, 2026
7e16a18
chore: polish + ruff format + bump server to 26.6.0 for v1 release
May 27, 2026
1ea4449
fix(deps): cap fireflyframework-agentic<26.5.21 (pre-release transiti…
May 27, 2026
c714df6
fix(deps): override transitive mistralai<2.0.0 to dodge pre-release-o…
May 27, 2026
72a64a1
fix(ci): exclude scripts/ from ruff + wrap long migration lines
May 27, 2026
a0481bd
docs: replace 2 lingering 'StandardValidator(s)' refs with v1 vocabulary
May 27, 2026
75 changes: 75 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,75 @@
# Changelog

All notable changes to flydocs are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project uses **CalVer `YY.M.PP`** (PEP 440 may normalise patch numbers
for the Python wheel — e.g. `26.06.00` → `26.6.0`).
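The normalisation mentioned above can be reproduced in a few lines. This is a simplified sketch of the PEP 440 rule that release segments are read as integers, so leading zeros disappear; `normalise_calver` is a hypothetical helper (not part of flydocs) and it deliberately ignores pre-, post-, and dev-release suffixes.

```python
def normalise_calver(version: str) -> str:
    """Strip leading zeros from each dot-separated release segment."""
    # int("06") == 6 and int("00") == 0, which is how the wheel version is normalised.
    return ".".join(str(int(part)) for part in version.split("."))


print(normalise_calver("26.06.00"))  # 26.6.0
```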

## [26.6.0] - 2026-05-26

### BREAKING CHANGES — API v1 redesign

This release replaces the public API contract end-to-end. There is no
backwards-compatible shim. See [docs/migration-v0-to-v1.md](docs/migration-v0-to-v1.md)
for the full rename table and worked examples.

**Highlights:**

- snake_case across every JSON key, enum value, and error code.
- Top-level request body: `files[]` + `document_types[]` + `rules[]` (was `documents[]` + `docs[]`).
- One recursive `Field` (was `FieldSpec` + `FieldItem`). Array `items` is a single `Field`; objects use `type: "object"` + `fields: [Field, ...]`.
- `DocumentTypeSpec.id` flattens the v0 `docs[].docType.documentType` triple-stutter.
- `Extraction` lifecycle collapses to `queued → running → succeeded | failed | cancelled`; refining-bbox state lives under `post_processing.bbox_refinement.{status, started_at, finished_at, attempts, error}` and evolves independently. `PARTIAL_SUCCEEDED` and `REFINING_BBOXES` are gone.
- Unified `EventEnvelope` for EDA events and webhook deliveries. Dotted event types (`extraction.submitted`, `extraction.completed`, `extraction.post_processing.requested`, `extraction.post_processing.completed`).
- New error catalogue (`not_found`, `not_ready`, `not_cancellable`, `timeout`, `file_too_large`, `unsupported_file`, `validation_failed`, …).
- `POST /api/v1/extract` and `POST /api/v1/extractions` accept `multipart/form-data` in addition to JSON.
- Validators: `Field.validators[]` (was `standard_validators[]`); dispatch key is `name` (was `type`).
- Visual checks: `DocumentTypeSpec.visual_checks[]` (was `validators.visual[]`).
- Rule parents: discriminator key is `kind` (was `parentType`); members snake_case (`document_type`, `fields`, `validator`, `rule`).
- Response top-level meta (`model`, `latency_ms`, `trace`, `pipeline_errors`, `escalation`, `usage`) nested under a single `pipeline` block.
- Top-level response `id` (was `request_id`) is a prefixed ULID (`ext_…`).
- Bbox: `bbox: null` signals absence; the v0 `quality: "empty"` / `source: "none"` placeholders are removed.
- New `FieldType.OBJECT` lets schemas nest objects natively.
- `escalation_threshold` / `escalation_model` collapse into a single `escalation` sub-object.
- Endpoint moves: `POST /api/v1/jobs` → `POST /api/v1/extractions` (and every related `/jobs/...` path).
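The recursive `Field` shape from the highlights can be sketched as follows. This is a minimal illustration, not the SDK's actual class: the dataclass, its `to_dict` helper, the `"array"` type literal, and the `line_items` example are assumptions; only `items`, `fields`, and `type: "object"` come from the changelog.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Field:
    """Illustrative stand-in for the v1 recursive Field (not the SDK class)."""

    name: str
    type: str
    # Element schema when type == "array" (the "array" literal is assumed).
    items: Optional["Field"] = None
    # Member fields when type == "object".
    fields: List["Field"] = field(default_factory=list)

    def to_dict(self) -> Dict[str, Any]:
        """Serialise recursively, omitting an unset `items` or empty `fields`."""
        out: Dict[str, Any] = {"name": self.name, "type": self.type}
        if self.items is not None:
            out["items"] = self.items.to_dict()
        if self.fields:
            out["fields"] = [member.to_dict() for member in self.fields]
        return out


# Hypothetical schema: an array whose single item Field is an object.
line_items = Field(
    name="line_items",
    type="array",
    items=Field(
        name="line_item",
        type="object",
        fields=[
            Field(name="description", type="string"),
            Field(name="amount", type="number"),
        ],
    ),
)
schema = line_items.to_dict()
```

Because `items` is itself a `Field`, arrays of objects and objects containing arrays need no second type, which appears to be the motivation for merging `FieldSpec` and `FieldItem`.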

### Changed (server-side)

- Database table `extraction_jobs` → `extractions`; column `created_at` → `submitted_at`; per-column `bbox_refine_*` fields collapse into a `post_processing` JSONB column; per-column `error_code` + `error_message` collapse into an `error` JSONB column.
- Repository `ExtractionJobRepository` → `ExtractionRepository`; entity `ExtractionJob` → `Extraction`.
- CQRS rename: `SubmitJobCommand` / `GetJobQuery` / `ListJobsQuery` / `CancelJobCommand` / `GetJobResultQuery` → `SubmitExtractionCommand` / `GetExtractionQuery` / `ListExtractionsQuery` / `CancelExtractionCommand` / `GetExtractionResultQuery`. Handlers renamed in lockstep.
- Directory `core/services/jobs/` → `core/services/extractions/`.
- Worker `JobWorker` → `ExtractionWorker`; `JobReaper` → `ExtractionReaper` (`BboxReaper` keeps its name).

### Changed (SDKs)

- Python SDK: `DocumentInput` → `FileInput`; `DocSpec` / `DocType` → `DocumentTypeSpec`; `FieldSpec` + `FieldItem` → `Field`; `StandardValidatorSpec` → `ValidatorSpec`; `JobStatus` → `ExtractionStatus`; `BboxRefineStatus` → `PostProcessingStatus`; `SubmitJobRequest` / `SubmitJobResponse` / `JobStatusResponse` / `JobResult` / `JobListResponse` → `SubmitExtractionRequest` / `Extraction` / `ExtractionResultEnvelope` / `ExtractionList`; `JobWebhookPayload` → `WebhookEnvelope`. New methods: `client.extractions.{create, get, get_result, cancel, list}`.
- Java SDK: every record renamed in lockstep with Python. New `client.extractions()` sub-resource handle. `@FlydocsWebhook` resolver now takes `WebhookEnvelope`.

### Migration

Every existing integration (curl, SDK, webhook receiver, EDA consumer) needs to be ported. The migration guide ([docs/migration-v0-to-v1.md](docs/migration-v0-to-v1.md)) has:

- A glossary fixing `file` vs `document_type` vs `document`.
- Side-by-side before/after request body, response body, async submit/poll, webhook envelope, and error problem-details.
- An SDK upgrade quick-reference (Python + Java) covering imports, sync extraction, async submit + result, and webhook handlers.
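As a rough idea of what porting a request body involves, the sketch below translates only the v0 keys that appear in the QUICKSTART diff of this pull request (`documents`, `docs`, `docType.documentType`, `fieldGroups`, `fieldGroupName`, `fieldGroupFields`). The function name is hypothetical, and field-level renames such as validators are not handled; the migration guide remains the authoritative rename table.

```python
from typing import Any, Dict, List


def migrate_request_v0_to_v1(v0: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the v0 request keys shown in the quickstart diff to v1 names."""
    document_types: List[Dict[str, Any]] = []
    for doc in v0.get("docs", []):
        groups = [
            {
                "name": group["fieldGroupName"],
                "fields": group.get("fieldGroupFields", []),
            }
            for group in doc.get("fieldGroups", [])
        ]
        # v1 flattens docs[].docType.documentType into DocumentTypeSpec.id.
        document_types.append(
            {"id": doc["docType"]["documentType"], "field_groups": groups}
        )
    return {"files": v0.get("documents", []), "document_types": document_types}


v0_body = {
    "documents": [{"filename": "invoice.pdf", "content_base64": "..."}],
    "docs": [
        {
            "docType": {"documentType": "invoice"},
            "fieldGroups": [
                {
                    "fieldGroupName": "totals",
                    "fieldGroupFields": [
                        {"name": "total_amount", "type": "number", "required": True}
                    ],
                }
            ],
        }
    ],
}
v1_body = migrate_request_v0_to_v1(v0_body)
```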

### Documentation

- Full rewrites: `docs/api-reference.md`, `docs/payload-reference.md`.
- Renamed: `docs/standard-validators.md` → `docs/validators.md` (content rewritten).
- New: `docs/migration-v0-to-v1.md`.
- Sweep-updated: `docs/pipeline.md`, `docs/rule-engine.md`, `docs/transformations.md`, `docs/concurrency.md`, `docs/overview.md`, `docs/architecture.md`, `docs/deployment.md`, `docs/troubleshooting.md`, `docs/cicd.md`, `docs/docling.md`, `QUICKSTART.md`, `README.md`, `CLAUDE.md`, `sdks/README.md`.

### Fixed (post-merge polish from the live KYB smoke run)

These five fixes were committed to the v1 branch after the live end-to-end
test against the real Anthropic API (`claude-sonnet-4-6`) on two Spanish
notarial PDFs (incorporation deed + shareholders agreement):

- **`/api/v1/extract` sync timeout returns HTTP 408 instead of 400.** A new `ExtractionTimedOut(RuntimeError)` is raised by the handler so it propagates through the pyfly CQRS bus to the controller (which previously wrapped `asyncio.TimeoutError`, an `OSError` subclass, as a generic `COMMAND_PROCESSING_ERROR` at HTTP 400). The new `@exception_handler(ExtractionTimedOut)` advice emits the canonical 408 `timeout` problem-detail with `extensions.timeout_s`.
- **`bbox-worker` EDA destination realigned.** docker-compose still pinned the bbox subscriber to the v0 topic `flydocs.bbox.refine`, while the v1 main worker publishes to `flydocs.extractions.post_processing`, following the renamed event type. Without this fix, async extractions with `bbox_refine=true` would hang at `post_processing.bbox_refinement.status=pending` indefinitely.
- **Alembic `migrations/env.py`** still imported `from flydocs.models.entities.extraction_job` after the v1 entity rename, which was fatal at API container startup when `RUN_MIGRATIONS=true` (the default).
- **`src/flydocs/resources/prompts/transform.yaml`** used legacy `id:` / `system:` / `user:` keys; the catalogue's loader expects `name:` / `system_template:` / `user_template:` (with a `required_variables` declaration). The mismatch crashed `PromptCatalog.from_resources()` at startup.
- **`scripts/kyb_real_test.py`** committed as the canonical live smoke runner. Run it against the docker stack (`docker compose up -d` with `ANTHROPIC_API_KEY` set in `.env`) to validate sync (`POST /api/v1/extract`) and async (`POST /api/v1/extractions`) end-to-end with multi-file, multi-document-type, recursive `Field`, judge, `bbox_refine` post-processing, six cross-document rules, and `validators` / `visual_checks` declarations. Verified live: sync 72s/175k tokens/$0.60; async 271s/772k tokens/$2.59; all six KYB rules resolve correctly (including a `partial` shareholders-reconciliation verdict, which reflects that the deed and the shareholders agreement do not share the same party set).
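The timeout fix described in the first bullet can be illustrated with plain `asyncio`. This is a sketch under assumptions, not the server's code: `run_with_budget` and `to_problem_detail` are hypothetical helpers, the pyfly `@exception_handler` wiring is left out, and the `code` / `detail` keys of the problem-detail are guesses; only the `ExtractionTimedOut(RuntimeError)` name, the 408 status, the `timeout` code, and `extensions.timeout_s` come from the changelog.

```python
import asyncio
from typing import Any, Awaitable, Dict


class ExtractionTimedOut(RuntimeError):
    """Domain exception for a sync extraction that exceeded its budget."""

    def __init__(self, timeout_s: float) -> None:
        super().__init__(f"extraction exceeded {timeout_s}s")
        self.timeout_s = timeout_s


async def run_with_budget(work: Awaitable[Any], timeout_s: float) -> Any:
    """Re-raise the asyncio timeout as the domain exception."""
    try:
        return await asyncio.wait_for(work, timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise ExtractionTimedOut(timeout_s) from exc


def to_problem_detail(exc: ExtractionTimedOut) -> Dict[str, Any]:
    """Assumed layout of the 408 problem-detail body."""
    return {
        "status": 408,
        "code": "timeout",
        "detail": str(exc),
        "extensions": {"timeout_s": exc.timeout_s},
    }


async def _slow_extraction() -> str:
    # Stand-in for a pipeline run that takes longer than the budget.
    await asyncio.sleep(1.0)
    return "done"


try:
    asyncio.run(run_with_budget(_slow_extraction(), timeout_s=0.01))
except ExtractionTimedOut as exc:
    problem = to_problem_detail(exc)
```

Raising a dedicated `RuntimeError` subclass keeps the timeout distinguishable from other failures once it crosses the command bus, which is the behaviour the fix relies on.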
57 changes: 32 additions & 25 deletions QUICKSTART.md
@@ -2,6 +2,8 @@

Zero to your first extracted invoice in **five minutes**. HTTP-only — no SDK required, no API keys.

> **Migrating from v0?** See [`docs/migration-v0-to-v1.md`](docs/migration-v0-to-v1.md) — every old key, every renamed endpoint, every enum value, side by side with the v1 equivalent.

> **Already on Python or Java?** Skip the curl tour and jump straight to the SDK quickstart:
> - **Python**: [`sdks/python/QUICKSTART.md`](sdks/python/QUICKSTART.md)
> - **Java / Spring Boot**: [`sdks/java/QUICKSTART.md`](sdks/java/QUICKSTART.md)
@@ -10,7 +12,7 @@ Zero to your first extracted invoice in **five minutes**. HTTP-only — no SDK r

## 1. Run flydocs locally (1 min)

-The repo ships with a docker-compose stack that brings up the service, a Postgres for jobs, and a mock LLM so you don't need any provider credentials:
+The repo ships with a docker-compose stack that brings up the service, a Postgres for the `extractions` table, and a mock LLM so you don't need any provider credentials:

```bash
git clone https://github.com/firefly-operationOS/flydocs.git
@@ -31,22 +33,22 @@ curl http://localhost:8400/actuator/health/readiness
 # 1. Base64-encode any PDF / PNG / DOCX you have at hand.
 B64=$(base64 < invoice.pdf | tr -d '\n')

-# 2. POST a minimal ExtractionRequest. ``docs[]`` declares what to extract;
-#    ``documents[]`` carries the file. Everything else has sensible defaults.
+# 2. POST a minimal ExtractionRequest. ``document_types[]`` declares what to extract;
+#    ``files[]`` carries the binary. Everything else has sensible defaults.
 curl -sS http://localhost:8400/api/v1/extract \
   -H 'Content-Type: application/json' \
   -d @- <<JSON | jq
 {
-  "documents": [
+  "files": [
     { "filename": "invoice.pdf", "content_base64": "$B64" }
   ],
-  "docs": [
+  "document_types": [
     {
-      "docType": { "documentType": "invoice" },
-      "fieldGroups": [
+      "id": "invoice",
+      "field_groups": [
         {
-          "fieldGroupName": "totals",
-          "fieldGroupFields": [
+          "name": "totals",
+          "fields": [
             { "name": "total_amount", "type": "number", "required": true },
             { "name": "currency", "type": "string", "required": true }
           ]
Expand All @@ -58,26 +60,30 @@ curl -sS http://localhost:8400/api/v1/extract \
JSON
```

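The same v1 request can be sent from Python with only the standard library. This is a minimal sketch, assuming the docker-compose stack from step 1 is listening on `localhost:8400`; `build_payload`, `extract`, and `EXTRACT_URL` are hypothetical names, not part of any flydocs SDK, and nothing is sent until `extract` is called.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

EXTRACT_URL = "http://localhost:8400/api/v1/extract"  # same URL as the curl call


def build_payload(filename: str, content: bytes) -> Dict[str, Any]:
    """Build the minimal v1 body: one file, one document type, one field group."""
    return {
        "files": [
            {
                "filename": filename,
                # b64encode emits no newlines, so no `tr -d '\n'` step is needed.
                "content_base64": base64.b64encode(content).decode("ascii"),
            }
        ],
        "document_types": [
            {
                "id": "invoice",
                "field_groups": [
                    {
                        "name": "totals",
                        "fields": [
                            {"name": "total_amount", "type": "number", "required": True},
                            {"name": "currency", "type": "string", "required": True},
                        ],
                    }
                ],
            }
        ],
    }


def extract(filename: str, content: bytes) -> Dict[str, Any]:
    """POST the payload to the local stack and return the decoded JSON body."""
    request = urllib.request.Request(
        EXTRACT_URL,
        data=json.dumps(build_payload(filename, content)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the stack running, `extract("invoice.pdf", open("invoice.pdf", "rb").read())` would perform the same call as the curl command.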
-You'll get back an `ExtractionResult` whose `documents[*].fields[*].fieldGroupFields[*]` carries `value`, `confidence`, and a normalised `bbox`:
+You'll get back an `ExtractionResult` whose `documents[*].field_groups[*].fields[*]` carries `name`, `value`, `confidence`, and a normalised `bbox`:

 ```jsonc
 {
-  "request_id": "…",
-  "model": "openai:flydocs-mock",
-  "latency_ms": 412,
+  "id": "ext_01HEM...",
+  "status": "success",
+  "pipeline": {
+    "model": "openai:flydocs-mock",
+    "latency_ms": 412
+  },
   "documents": [
     {
-      "document_type": "invoice",
-      "fields": [
+      "type": "invoice",
+      "field_groups": [
         {
-          "fieldGroupName": "totals",
-          "fieldGroupFields": [
+          "name": "totals",
+          "fields": [
             {
               "name": "total_amount", "value": 1234.56, "confidence": 0.97,
-              "bbox": { "page": 1, "x_min": 0.61, "y_min": 0.83, "x_max": 0.79, "y_max": 0.86,
-                        "source": "llm" }
+              "bbox": { "xmin": 0.61, "ymin": 0.83, "xmax": 0.79, "ymax": 0.86,
+                        "source": "llm", "quality": "good", "quality_score": 0.92,
+                        "refinement_confidence": null }
             },
-            { "name": "currency", "value": "EUR", "confidence": 0.99, "bbox": { } }
+            { "name": "currency", "value": "EUR", "confidence": 0.99, "bbox": { /* ... */ } }
           ]
         }
       ]
@@ -95,9 +101,10 @@ That's the **mandatory pipeline** — multimodal extract + bbox. Everything else
| **Compose a realistic schema** (field types, formats, validators, arrays) | [`docs/payload-reference.md`](docs/payload-reference.md) §§ 4–6 |
| **Tune the pipeline** (which stages to enable, model selection, escalation) | [`docs/payload-reference.md`](docs/payload-reference.md) § 7 |
| **Add business rules** over extracted fields + validator outcomes | [`docs/payload-reference.md`](docs/payload-reference.md) § 8 |
-| **Run as an async job** with callbacks (`Idempotency-Key`, `callback_url`) | [`docs/payload-reference.md`](docs/payload-reference.md) § 10 |
+| **Run as an async extraction** with callbacks (`Idempotency-Key`, `callback_url`) | [`docs/payload-reference.md`](docs/payload-reference.md) § 10 |
| **Verify webhook signatures** on the receiver | [`docs/payload-reference.md`](docs/payload-reference.md) § 11 |
-| **Branch on the RFC 7807 error catalogue** | [`docs/payload-reference.md`](docs/payload-reference.md) § 12 |
+| **Branch on the RFC 7807 error catalogue** | [`docs/payload-reference.md`](docs/payload-reference.md) § 13 |
+| **Migrate from v0** | [`docs/migration-v0-to-v1.md`](docs/migration-v0-to-v1.md) |
| **Deploy** to your cluster (multi-arch image, Postgres, Redis, env knobs) | [`docs/deployment.md`](docs/deployment.md) |
| **Understand the pipeline DAG** internals (timeouts, concurrency, cost) | [`docs/pipeline.md`](docs/pipeline.md) |
| **See the full HTTP wire contract** (every endpoint, every DTO) | [`docs/api-reference.md`](docs/api-reference.md) |
@@ -110,8 +117,8 @@ That's the **mandatory pipeline** — multimodal extract + bbox. Everything else
|-------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| `curl: (7) Failed to connect to localhost port 8400` | Service not up yet. `docker compose ps` and check `task docker:logs`. |
| `400 Bad Request` / `422 invalid_base64` | `content_base64` is not strict base64 (e.g. it contains literal newlines). Use `base64 \| tr -d '\n'`. |
-| `413 document_too_large` | File over `FLYDOCS_MAX_BYTES`. Split or compress. |
-| `408 extraction_timeout` | Pipeline exceeded the sync ceiling. Retry through `POST /api/v1/jobs`. |
-| `422 invalid_request` with a list of `errors` | Semantic validator caught an issue (rule references unknown field, etc.). Each error has `path` and `message`. |
+| `413 file_too_large` | File over `FLYDOCS_MAX_BYTES`. Split or compress. |
+| `408 timeout` | Pipeline exceeded the sync ceiling. Retry through `POST /api/v1/extractions`. |
+| `422 validation_failed` with a list of `errors` | Semantic validator caught an issue (rule references unknown field, etc.). Each error has `path` and `message`. |

More: [`docs/troubleshooting.md`](docs/troubleshooting.md).
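A client can branch on the v1 rows of the troubleshooting table with a small lookup. This is only a sketch: `REMEDIATION` and `remediation_hint` are hypothetical names, and how the error code is read out of the problem-detail body is left to the caller because the table does not specify it.

```python
from typing import Dict, Tuple

# (HTTP status, error code) -> hint, paraphrased from the troubleshooting table.
REMEDIATION: Dict[Tuple[int, str], str] = {
    (422, "invalid_base64"): "Strip newlines so `content_base64` is strict base64.",
    (413, "file_too_large"): "File exceeds FLYDOCS_MAX_BYTES; split or compress it.",
    (408, "timeout"): "Sync ceiling exceeded; retry through POST /api/v1/extractions.",
    (422, "validation_failed"): "Inspect `path` and `message` on each entry in `errors`.",
}


def remediation_hint(status: int, code: str) -> str:
    """Return the matching hint, or point at the troubleshooting guide."""
    return REMEDIATION.get((status, code), "See docs/troubleshooting.md.")


print(remediation_hint(408, "timeout"))
```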