feat(admin-api): support inter-table dependencies in derived dataset validation #1912

Open
mitchhs12 wants to merge 4 commits into main from mitchhs12/inter-table-deps
@mitchhs12 mitchhs12 commented Mar 5, 2026

Summary

Adds full inter-table dependency support for derived datasets — both validation (admin API) and runtime (dump engine). Tables within a derived dataset can now reference sibling tables using self.<table_name> syntax (e.g., SELECT * FROM self.blocks_base), consistent with the existing self. UDF convention.

Part 1: Validation (admin API)

  • Add SelfSchemaProvider.add_table() for progressive schema registration during topological processing
  • Add cycle detection via topological_sort(), returning CYCLIC_DEPENDENCY 400 error
  • Add explicit SELF_REF_TABLE_NOT_FOUND error when self. references target non-existent sibling tables
  • Process tables in dependency order in /schema and /manifests endpoints
  • 5 integration tests: basic self-ref, 3-table chain, cycle rejection, self-referencing table, mixed deps
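
The ordering and cycle-detection step above can be sketched with Kahn's algorithm. This is a minimal illustration, not the PR's code: the signature, the use of `String` for table names, and the `unresolved` field on `CyclicDepError` are assumptions. It also assumes every referenced sibling is a key in the map, since missing siblings are rejected separately with SELF_REF_TABLE_NOT_FOUND.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Hypothetical error shape; the PR's real `CyclicDepError` may carry different data.
#[derive(Debug, PartialEq, Eq)]
pub struct CyclicDepError {
    /// Tables that could not be ordered because they sit on, or depend on, a cycle.
    pub unresolved: Vec<String>,
}

/// Orders tables so that each one comes after the siblings it references.
/// `deps` maps a table name to the sibling tables it reads via `self.<name>`.
/// Assumption: every referenced sibling is also a key of `deps`.
pub fn topological_sort(
    deps: &BTreeMap<String, BTreeSet<String>>,
) -> Result<Vec<String>, CyclicDepError> {
    // Count unresolved dependencies per table and record reverse edges.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (table, table_deps) in deps {
        pending.insert(table.as_str(), table_deps.len());
        for dep in table_deps {
            dependents.entry(dep.as_str()).or_default().push(table.as_str());
        }
    }

    // Start with the tables that reference no siblings.
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(t, _)| *t)
        .collect();

    let mut order = Vec::with_capacity(deps.len());
    while let Some(table) = ready.pop_front() {
        order.push(table.to_string());
        // Emitting `table` resolves one dependency of each table that reads it.
        for dependent in dependents.get(table).into_iter().flatten() {
            let n = pending.get_mut(dependent).expect("dependent is a key of deps");
            *n -= 1;
            if *n == 0 {
                ready.push_back(*dependent);
            }
        }
    }

    if order.len() == deps.len() {
        Ok(order)
    } else {
        // Anything still pending is on a cycle or downstream of one.
        let unresolved = pending
            .into_iter()
            .filter(|(_, n)| *n > 0)
            .map(|(t, _)| t.to_string())
            .collect();
        Err(CyclicDepError { unresolved })
    }
}
```

In this sketch a table that references itself never reaches a zero count, so it is reported as unresolved; the PR reports that case under its own SELF_REFERENCING_TABLE code instead.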

Part 2: Runtime (dump engine)

  • Split physical_for_dump::create() into resolve_external_deps() + build_catalog() so callers can inject self-ref entries alongside external deps
  • Extract partition_table_refs() to separate self. refs from external deps
  • Register sibling tables in both planning (SelfSchemaProvider for column types) and execution (ResolvedTableEntry for physical data) phases
  • Notification-driven polling loop — self-ref tables wait for sibling data via notification pipeline
  • Remove unreachable dataset_start_block fallback (validation now guarantees every table has at least one reference, via the NO_TABLE_REFERENCES check added in #1944, "fix(admin-api): reject derived tables with no source table references")
  • Pass sibling PhysicalTable map from orchestrator to each table task
  • Tables dump in parallel (no topological ordering at runtime) — the existing streaming/notification system handles dependency ordering naturally
  • Un-ignore intra_deps_test E2E test
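
The split between self. refs and external deps described above might look roughly like the following. The `TableRef` struct is a hypothetical stand-in for the parsed reference type, and the return shape is an assumption; the real partition_table_refs() in table.rs may differ.

```rust
/// Hypothetical stand-in for a parsed two-part table reference (`schema.table`).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TableRef {
    pub schema: String,
    pub table: String,
}

/// Separates `self.<table>` references from references to external dependencies.
/// Returns the deduplicated sibling table names and the untouched external refs.
pub fn partition_table_refs(refs: Vec<TableRef>) -> (Vec<String>, Vec<TableRef>) {
    // `self` as the schema part marks a sibling table in the same dataset.
    let (self_refs, external): (Vec<TableRef>, Vec<TableRef>) =
        refs.into_iter().partition(|r| r.schema == "self");

    // A query may mention the same sibling more than once; keep each name once.
    let mut siblings: Vec<String> = self_refs.into_iter().map(|r| r.table).collect();
    siblings.sort();
    siblings.dedup();

    (siblings, external)
}
```

Keeping the external refs unchanged matches the stated goal that for_dump.rs resolves external deps without any self-ref knowledge.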

Error codes

New error codes added by this PR:

| Code | Status | Description |
|------|--------|-------------|
| SELF_REFERENCING_TABLE | 400 | A table references itself via self.<own_name> |
| SELF_REF_TABLE_NOT_FOUND | 400 | A self.-qualified reference targets a non-existent sibling table |
| CYCLIC_DEPENDENCY | 400 | Inter-table references form a cycle |
| CATALOG_QUALIFIED_TABLE | 400 | 3-part catalog-qualified table reference (renamed from TABLE_REFERENCE_RESOLUTION) |
| INVALID_TABLE_NAME | 400 | Table name does not conform to identifier rules |

Breaking change: TABLE_REFERENCE_RESOLUTION error code renamed to CATALOG_QUALIFIED_TABLE for catalog-qualified table references. Acceptable given status: unstable.
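
One way to keep the codes listed above stable is a single enum that owns the code string and HTTP status. The sketch below is illustrative only: `ValidationErrorCode` and its methods are hypothetical names, not the PR's types.

```rust
/// Hypothetical enum mirroring the error codes listed in this PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValidationErrorCode {
    SelfReferencingTable,
    SelfRefTableNotFound,
    CyclicDependency,
    CatalogQualifiedTable,
    InvalidTableName,
}

impl ValidationErrorCode {
    /// The machine-readable code returned in the API error body.
    pub fn error_code(self) -> &'static str {
        match self {
            Self::SelfReferencingTable => "SELF_REFERENCING_TABLE",
            Self::SelfRefTableNotFound => "SELF_REF_TABLE_NOT_FOUND",
            Self::CyclicDependency => "CYCLIC_DEPENDENCY",
            Self::CatalogQualifiedTable => "CATALOG_QUALIFIED_TABLE",
            Self::InvalidTableName => "INVALID_TABLE_NAME",
        }
    }

    /// Every code in the list above maps to HTTP 400.
    pub fn status(self) -> u16 {
        400
    }
}
```

Because the match is exhaustive, adding a variant without assigning it a code fails to compile.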

Key design decisions

  • self. convention: Aligns with UDF convention (self.functionName()). Parsed by DataFusion as TableReference::Partial { schema: "self", table: "..." }
  • Resolve + build split: for_dump.rs has zero self-ref knowledge — it resolves external deps and builds catalogs from generic entries. Self-ref resolution lives in table.rs where it belongs
  • Parallel, not sequential: Per Leo's feedback, tables dump in parallel. The streaming query notification pipeline handles ordering — same mechanism as external deps
  • Notification-driven start block: Self-ref tables subscribe to sibling notifications and wait until data appears, protected by FailFastJoinSet cancellation if a sibling fails
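
The notification-driven start block can be pictured as a table task blocking on a channel until a sibling reports data or fails. The sketch below uses a blocking `std::sync::mpsc` channel purely for illustration; the worker's actual notification pipeline and FailFastJoinSet are not shown here, and `SiblingEvent`, `WaitError`, and `wait_for_start_block` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical notification a sibling table task might emit.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum SiblingEvent {
    /// The sibling has written data starting at this block.
    DataAvailable { start_block: u64 },
    /// The sibling task failed; dependants should stop waiting.
    Failed,
}

/// Reasons the wait can end without a start block.
#[derive(Debug, PartialEq, Eq)]
pub enum WaitError {
    SiblingFailed,
    Disconnected,
    TimedOut,
}

/// Blocks until the sibling announces data, fails, or the timeout elapses.
pub fn wait_for_start_block(
    events: &Receiver<SiblingEvent>,
    timeout: Duration,
) -> Result<u64, WaitError> {
    match events.recv_timeout(timeout) {
        Ok(SiblingEvent::DataAvailable { start_block }) => Ok(start_block),
        // Fail fast instead of waiting for data that will never arrive.
        Ok(SiblingEvent::Failed) => Err(WaitError::SiblingFailed),
        Err(RecvTimeoutError::Timeout) => Err(WaitError::TimedOut),
        // All senders dropped: the sibling task is gone.
        Err(RecvTimeoutError::Disconnected) => Err(WaitError::Disconnected),
    }
}
```

The explicit `Failed` event plays the role that cancellation plays in the PR: a dependant never waits indefinitely on a sibling that has already failed.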

Files changed

| File | Changes |
|------|---------|
| common/src/self_schema_provider.rs | add_table() for progressive schema registration |
| common/src/catalog/physical/for_dump.rs | Split create() into resolve_external_deps() + build_catalog(), add ResolvedTableEntry |
| worker-datasets-derived/src/job_impl.rs | Build siblings map, pass to each materialize_table() call; 1 unit test |
| worker-datasets-derived/src/job_impl/table.rs | partition_table_refs(), self-ref resolution in both phases, notification-driven polling; 5 unit tests |
| datasets-derived/src/sorting.rs | topological_sort() and CyclicDepError |
| admin-api/src/handlers/schema.rs | Topological ordering, cycle detection, self-ref validation in /schema |
| admin-api/src/handlers/common.rs | Topological ordering, cycle detection, self-ref validation in /manifests |
| tests/src/tests/it_dependencies.rs | Remove #[ignore] from intra_deps_test |
| docs/feat/data-inter-table-dependencies.md | Feature documentation with all error codes |

Related

  • Prior runtime implementation: Table self reference #1524 (closed — codebase has since been refactored)
  • Feature doc: docs/feat/data-inter-table-dependencies.md

Test plan

  • 5 integration tests for validation (self-ref, chain, cycle, self-cycle, mixed deps)
  • 6 unit tests for runtime (partition logic, error fatality)
  • E2E intra_deps_test passes (dump + query with inter-table deps)
  • Manual testing: verified SELF_REFERENCING_TABLE, CYCLIC_DEPENDENCY, NO_TABLE_REFERENCES errors via curl
  • Format, check, clippy — zero warnings

@mitchhs12 mitchhs12 force-pushed the mitchhs12/inter-table-deps branch 3 times, most recently from 48380af to ded30db on March 10, 2026 15:10
@mitchhs12 mitchhs12 marked this pull request as ready for review March 10, 2026 15:49
@mitchhs12 mitchhs12 force-pushed the mitchhs12/inter-table-deps branch 5 times, most recently from bcc4bbd to b32ca21 on March 10, 2026 18:37
@mitchhs12 mitchhs12 self-assigned this Mar 10, 2026
@mitchhs12 mitchhs12 requested a review from LNSD March 10, 2026 18:58
Add self-qualified table references (self.table_name) enabling tables
within a derived dataset to reference sibling tables. Includes
topological ordering, cycle detection, and self-reference rejection.

- Add `DepAliasOrSelfRef` type for parsing `self.`-qualified refs
- Implement topological sort with `CyclicDepError` in `datasets-derived`
- Register sibling schemas progressively via `SelfSchemaProvider`
- Add `CYCLIC_DEPENDENCY`, `SELF_REFERENCING_TABLE`, `CATALOG_QUALIFIED_TABLE`, `INVALID_TABLE_NAME` error codes
- Add runtime inter-table dependency support in worker-datasets-derived

Signed-off-by: Mitchell Spencer <mitchellhspencer@gmail.com>
@mitchhs12 mitchhs12 force-pushed the mitchhs12/inter-table-deps branch from b32ca21 to 1b2d9e7 on March 10, 2026 19:04
Deduplicate inter-table dep logic that was copy-pasted between manifest validation and schema inference handlers.

- Add `resolve_inter_table_order` shared function in `common.rs`
- Add `InterTableDepError` enum with `error_code()` to preserve API error codes
- Replace inline dep extraction in both `common.rs` and `schema.rs` with shared function
- Consolidate 6 duplicated error variants into 3 defined once

Signed-off-by: Mitchell Spencer <mitchellhspencer@gmail.com>
Replace generic iterator parameter with concrete BTreeMap reference for clarity and caller ergonomics.

- Change `impl IntoIterator<Item = (&TableName, &[TableReference<...>])>` to `&BTreeMap<TableName, Vec<TableReference<...>>>`
- Simplify schema.rs call site from `.iter().map()` chain to direct `&parsed_refs`
- Build explicit `table_refs_only` map in common.rs where tuple destructuring is needed

Signed-off-by: Mitchell Spencer <mitchellhspencer@gmail.com>
Update generated spec to reflect renamed resolve_inter_table_order function
in schema handler panics documentation.