diff --git a/src/explanation/relational-workflow-model.md b/src/explanation/relational-workflow-model.md
index a7b94da7..d82501e7 100644
--- a/src/explanation/relational-workflow-model.md
+++ b/src/explanation/relational-workflow-model.md
@@ -1,70 +1,38 @@
 # The Relational Workflow Model
 
-The relational model has historically admitted two interpretations. Codd's
-mathematical foundation (1970) views tables as logical predicates and rows
-as true propositions — rigorous but abstract. Chen's Entity-Relationship
-Model (1976) views tables as entity types or relationships — intuitive for
-domain modeling, but silent on how entities come into being. The
-**Relational Workflow Model** introduces a third interpretation: tables
-represent workflow steps, rows represent workflow artifacts, and foreign
-keys prescribe execution order. The schema specifies not only *what* data
-exists but *how* it is derived — a single formal system in which data
-structure, computational dependencies, and integrity constraints are all
-queryable, enforceable, and machine-readable.
-
-This unification is what makes DataJoint a *computational substrate* rather
-than a database in the conventional sense. Each surrounding category of
-tools is good at part of the problem and silent on the rest. File-based
-workflow systems (CWL, Snakemake, Nextflow) offer flexibility but fragment
-provenance across the filesystem and configuration. Task-centric
-orchestrators (Airflow, Argo, Prefect) manage execution but remain agnostic
-to data structure. Data catalogs (DataHub, Atlan, Marquez) describe data
-after it lands. Lakehouses (Delta, Iceberg, Hudi) optimize analytical
-queries but treat computation as external. The Relational Workflow Model
-is the deliberate trade-off: framework commitment in exchange for one
-formal system that addresses all four concerns at once.
-
-## Three interpretations of the relational model
-
-| Aspect | Mathematical (Codd) | Entity-Relationship (Chen) | **Relational Workflow (DataJoint)** |
-|--------|---------------------|----------------------------|-------------------------------------|
-| **Core question** | What functional dependencies exist? | What entity types exist? | **When and how are entities created?** |
-| **Table semantics** | Logical predicate | Entity or relationship | **Workflow step** |
-| **Row semantics** | True proposition | Entity instance | **Workflow artifact** |
-| **Foreign keys** | Referential integrity | Relationship | **Execution order** |
-| **Computation** | Not addressed | Not addressed | **Declared in schema** |
-| **Provenance** | Not addressed | Not addressed | **Structural** |
-| **Implementation gap** | High | High | **None** |
-
-## Four shifts from the classical relational model
-
-- **Tables represent workflow steps**, not merely categories of records.
-- **Rows represent workflow artifacts**, each with provenance to its inputs.
-- **Foreign keys prescribe execution order**, not only referential integrity — the dependency graph *is* the pipeline DAG, enforced by the database.
-- **Computed and Imported tables carry their own `make()` methods**, declaring derivation logic in the schema itself, not in an external workflow file.
-
-The schema is therefore *active*, not passive. A row exists in a Computed
-table if and only if its upstream key exists, its `make()` has run, and its
-result satisfies the declared constraints. The schema is the executable
-specification of the work.
+The **Relational Workflow Model** interprets tables as workflow steps,
+rows as workflow artifacts, and foreign keys as execution order. The
+schema specifies not only *what* data exists but *how* it is derived —
+a single formal system in which data structure, computational
+dependencies, and integrity constraints are all queryable, enforceable,
+and machine-readable. This unification is what makes DataJoint a
+*computational substrate* rather than a database in the conventional
+sense. The worked example below shows the model in action; its place in
+the lineage of relational modeling follows.
 
 ## A worked example
 
+Diagrams in this documentation use the same notation as `dj.Diagram` in
+`datajoint-python`: **Manual** tables are green rectangles, **Lookup**
+tables are plain text, **Imported** tables are blue ovals, and **Computed**
+tables are red ovals. Tier is conveyed by shape and color — the node
+itself carries only the table name.
+
 ```mermaid
 graph TD
-    Mouse["Mouse<br/>Manual"]:::manual
-    Session["Session<br/>Manual"]:::manual
-    Scan["Scan<br/>Manual"]:::manual
-    SegParam["SegmentationParam<br/>Lookup"]:::lookup
-    AvgFrame["AverageFrame<br/>Imported — make()"]:::imported
-    Segmentation["Segmentation<br/>Computed — make()"]:::computed
-    Fluorescence["Fluorescence<br/>Imported — make()"]:::imported
+    Mouse["Mouse"]:::manual
+    Session["Session"]:::manual
+    Scan["Scan"]:::manual
+    SegParam["SegmentationParam"]:::lookup
+    AvgFrame(["AverageFrame"]):::imported
+    Segmentation(["Segmentation"]):::computed
+    Fluorescence(["Fluorescence"]):::imported
 
     Mouse --> Session --> Scan --> AvgFrame --> Segmentation --> Fluorescence
     SegParam --> Segmentation
 
     classDef manual fill:#c8e6c9,stroke:#2e7d32,color:#1b5e20;
-    classDef lookup fill:#e0e0e0,stroke:#616161,color:#212121;
+    classDef lookup fill:none,stroke:none,color:#212121;
     classDef imported fill:#bbdefb,stroke:#1565c0,color:#0d47a1;
     classDef computed fill:#ffcdd2,stroke:#c62828,color:#b71c1c;
 ```
@@ -81,24 +49,52 @@ scheduler is consulted: the foreign-key graph dictates what may run, what
 must run first, and what already exists. The pipeline DAG and the database
 schema are the same object.
 
+## Three interpretations of the relational model
+
+The relational model has historically admitted two interpretations. Codd's
+mathematical foundation (1970) views tables as logical predicates and rows
+as true propositions — rigorous but abstract. Chen's Entity-Relationship
+Model (1976) views tables as entity types or relationships — intuitive
+for domain modeling, but silent on how entities come into being. The
+Relational Workflow Model adds a third, the one the worked example
+above illustrates.
+
+| Aspect | Mathematical (Codd) | Entity-Relationship (Chen) | **Relational Workflow (DataJoint)** |
+|--------|---------------------|----------------------------|-------------------------------------|
+| **Core question** | What functional dependencies exist? | What entity types exist? | **When and how are entities created?** |
+| **Table semantics** | Logical predicate | Entity or relationship | **Workflow step** |
+| **Row semantics** | True proposition | Entity instance | **Workflow artifact** |
+| **Foreign keys** | Referential integrity | Relationship | **Execution order** |
+| **Computation** | Not addressed | Not addressed | **Declared in schema** |
+| **Provenance** | Not addressed | Not addressed | **Structural** |
+| **Implementation gap** | High | High | **None** |
+
+## A semantic interpretation, not a departure
+
+The Relational Workflow Model layers a semantic interpretation on the
+classical relational model; it does not replace any of it. Tables, rows,
+primary and foreign keys, normalization, and the query algebra keep
+their classical meaning. The model adds four readings on top:
+
+- Tables also represent **workflow steps**.
+- Rows also represent **workflow artifacts**, carrying provenance to their inputs.
+- Foreign keys also prescribe **execution order** — the dependency graph *is* the pipeline DAG, enforced by the database.
+- **Computed and Imported tables carry their own `make()` methods**, declaring derivation logic in the schema itself rather than in an external workflow file.
+
+Under this interpretation the schema becomes *active*. A row exists in a
+Computed table if and only if its upstream key exists, its `make()` has
+run, and its result satisfies the declared constraints. The schema is the
+executable specification of the work.
+
 ## The deliberate trade-off
 
-Decoupled architectures have legitimate advantages. File-based workflow
-systems optimize for portability — any tool that reads files works.
-Orchestrators evolve independently of the data model. Lakehouses give
-analytics teams a layer that doesn't bind them to upstream pipeline
-choices. These are the right trade-offs for many use cases.
-
-DataJoint accepts tighter coupling deliberately. The cost is framework
-commitment. The benefit is one system that knows the data structure, the
-data, the computation that produced it, the dependencies between
-computations, and the integrity constraints that govern all of it.
-Everything an analyst, an engineer, or an AI agent might ask about the
-work — *what is this, where did it come from, what depends on it, what
-must hold for it to be valid, what would change if I touched the input* —
-is answerable by query against a single formal model. For scientific
-workflows where the data and the computation cannot be cleanly separated
-without losing the science, this is the right trade-off.
+DataJoint accepts tighter coupling deliberately, in exchange for one
+formal system that spans data structure, computation, dependencies, and
+integrity. See
+[Comparison to Workflow Languages](comparison-to-workflow-languages.md)
+for the structural treatment — what file-based workflows and task
+orchestrators each offer, what each omits, and when to use them
+alongside DataJoint.
 
 ## Substrate consequences
 
@@ -200,10 +196,14 @@ proper entity set with clear identity — distinguishes DataJoint's algebra
 from SQL, where query results lack both a well-defined primary key and a
 clear entity type.
 
-## From transactions to transformations
+## Two readings of the same schema
+
+The classical relational reading and the workflow reading hold
+simultaneously — they are interpretive lenses on the same schema, not
+incompatible designs.
 
-| Traditional view | Workflow view |
-|------------------|---------------|
+| Classical reading | Workflow reading |
+|-------------------|------------------|
 | Tables store data | Tables represent workflow steps |
 | Rows are records | Rows are workflow artifacts |
 | Foreign keys enforce consistency | Foreign keys prescribe execution order |