2 changes: 1 addition & 1 deletion .flowr/flows/planning-flow.yaml
@@ -143,7 +143,7 @@ states:
out: []
conditions:
  feature-baselined:
-    feature-status: ==BASELINED
+    baseline-confirmed: ==verified
  committed-to-main-locally:
    committed-to-main-locally: ==verified
next:
10 changes: 5 additions & 5 deletions .opencode/knowledge/requirements/feature-discovery.md
@@ -13,7 +13,7 @@ last-updated: 2026-05-08
- Feature boundaries respect bounded context borders, aggregate transactional boundaries, and module dependency order per [[requirements/feature-boundaries]]. Features that span boundaries are flagged for splitting.
- Rules are derived systematically from three sources: domain events, aggregate invariants, and commands per [[requirements/rule-derivation]]. Every rule traces to at least one domain model artifact.
- Gaps discovered during feature discovery (a bounded context with no feature, a quality attribute with no enforcing feature, a domain event with no corresponding rule) are flagged, not silently filled.
-- Features have a lifecycle of increasing specificity: `Status: ELICITING` through discovery and breakdown, advancing to `BASELINED` after baseline confirmation.
+- Features progress through a lifecycle of increasing specificity: an empty file with a description → coarse Rules (Business) and Constraints → full Rule blocks with @id-tagged Examples.

## Concepts

@@ -26,17 +26,17 @@ last-updated: 2026-05-08
**Gap Analysis**: Systematically verify coverage across three dimensions: (1) every bounded context from the domain model is covered by at least one feature, (2) every quality attribute from the product definition is enforced by at least one feature's constraints, (3) every critical domain event is traceable to at least one business rule. Uncovered areas indicate missing features or gaps in the domain model itself. Flag both.
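
A minimal sketch of the three coverage checks, assuming simple in-memory set/dict shapes; the argument names and sample values are illustrative, not this repo's actual planning data:

```python
def find_gaps(contexts, quality_attributes, critical_events,
              feature_contexts, feature_constraints, rule_events):
    """Return uncovered items per dimension: flagged, never silently filled."""
    covered = {c for ctxs in feature_contexts.values() for c in ctxs}
    enforced = {a for attrs in feature_constraints.values() for a in attrs}
    traced = {e for evs in rule_events.values() for e in evs}
    return {
        "contexts_without_feature": set(contexts) - covered,
        "attributes_without_constraint": set(quality_attributes) - enforced,
        "events_without_rule": set(critical_events) - traced,
    }

gaps = find_gaps(
    contexts={"billing", "catalog"},
    quality_attributes={"auditability"},
    critical_events={"InvoiceIssued"},
    feature_contexts={"issue-invoice": {"billing"}},
    feature_constraints={"issue-invoice": set()},
    rule_events={"invoice-rules": {"InvoiceIssued"}},
)
assert gaps["contexts_without_feature"] == {"catalog"}  # missing feature
assert gaps["attributes_without_constraint"] == {"auditability"}
assert gaps["events_without_rule"] == set()
```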

**Feature Lifecycle**: Features follow a lifecycle of increasing specificity across phases:
-1. **Discovery**: Feature boundaries identified, coarse business rules written, constraints scoped. Status: ELICITING.
-2. **Breakdown**: Coarse rules expanded into full Rule blocks with As a/I want/So that format. INVEST validation applied. Status remains ELICITING.
-3. **Example Writing and Baseline**: Given/When/Then Examples written, pre-mortems applied, baseline confirmed. Status advances to BASELINED.
+1. **Discovery**: Feature boundaries identified, coarse business rules written, constraints scoped.
+2. **Breakdown**: Coarse rules expanded into full Rule blocks with As a/I want/So that format. INVEST validation applied.
+3. **Example Writing and Baseline**: Given/When/Then Examples written, pre-mortems applied, baseline confirmed (feature now has @id-tagged Examples).

## Content

### Discovery Sequence

Feature discovery is two sequential activities:

-1. **Boundary identification** (discover-features skill): Use the delivery order as backbone. Map each step to bounded contexts and aggregates from the domain model. Split candidates that span contexts or aggregates. Name features and write descriptions per [[requirements/feature-boundaries]]. Create .feature files with title, description, Status: ELICITING, and an empty Questions table.
+1. **Boundary identification** (discover-features skill): Use the delivery order as backbone. Map each step to bounded contexts and aggregates from the domain model. Split candidates that span contexts or aggregates. Name features and write descriptions per [[requirements/feature-boundaries]]. Create .feature files with title, description, and an empty Questions table.

2. **Rule derivation** (discover-rules skill): For each feature, assign domain model artifacts (entities, events, invariants, commands) based on bounded context membership. Derive behavioral rules from events, structural rules from invariants, and action rules from commands per [[requirements/rule-derivation]]. Map quality attributes to constraints. Write coarse Rules (Business) bullets and Constraints into each .feature file.

7 changes: 6 additions & 1 deletion .opencode/knowledge/requirements/gherkin.md
@@ -13,6 +13,7 @@ last-updated: 2026-04-29
- `Then` must be a single, observable, measurable outcome; no "and" combining multiple behaviours in one `Then`.
- Bug Examples use `@bug` and require both a specific feature test and a Hypothesis property test.
- After criteria commit, Examples are frozen; changes require `@deprecated` on the old Example and a new Example with a new `@id`.
+- Two Examples with the same `Then` outcome but different input values test the same behaviour; partition by behaviour outcome, not by input value (Wynne, 2015; Adzic, 2011).

## Concepts

@@ -26,6 +27,8 @@ last-updated: 2026-04-29

**Bug Examples**: When a defect is reported, add an `@bug` Example. Implement both a specific `@id` test and a Hypothesis property test covering the whole class of inputs. Both are required.

+**Behavioral Distinctness**: Two Examples are behavior-distinct only when they produce different `Then` outcomes (Wynne, 2015; Adzic, 2011). Partitioning by behaviour outcome rather than by input value avoids the combinatorial explosion of value-distinct testing. Two Examples with the same `Then` but different input values test the same behaviour — keep one, discard the duplicates. For action and behavioural rules, each distinct outcome gets one representative Example. For structural (invariant) rules, one representative Example suffices because the invariant holds across all inputs; full coverage is deferred to a Hypothesis property test per [[software-craft/test-design#concepts]].
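
A minimal pytest sketch of outcome partitioning; the shipping rule, its threshold, and the values are invented for illustration:

```python
import pytest

def shipping_fee(order_total: int) -> int:
    """Illustrative rule (assumed): orders of 50 or more ship free."""
    return 0 if order_total >= 50 else 5

# 60, 75, and 99 would be value-distinct but behaviour-identical: all hit the
# same `Then` (free shipping). Keep one representative case per outcome.
@pytest.mark.parametrize(
    ("total", "expected_fee"),
    [
        (60, 0),  # representative of the "ships free" outcome
        (49, 5),  # representative of the "standard fee" outcome
    ],
)
def test_shipping_fee_partitioned_by_outcome(total, expected_fee):
    assert shipping_fee(total) == expected_fee
```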

## Content

### Declarative vs Imperative
@@ -84,6 +87,7 @@ Implement both:
- Multiple behaviours in one Example: split them
- Examples that test implementation details ("Then: the Strategy pattern is used")
- Imperative UI steps instead of declarative behaviour descriptions
+- Two examples with the same `Then` but different input values: duplicate behaviour coverage per [[requirements/gherkin#concepts]]

### Feature File Path Convention

@@ -98,4 +102,5 @@ Test path conventions (`tests/features/<feature_slug>/`), the feature-test vs un
- [[requirements/invest]]: story quality criteria for rules
- [[requirements/moscow]]: prioritizing Examples as Must/Should/Could
- [[requirements/decomposition]]: splitting Rules with too many Examples
-- [[requirements/pre-mortem]]: finding hidden failure modes in rules
+- [[requirements/pre-mortem]]: finding hidden failure modes in rules
+- [[software-craft/test-design]]: property-based testing for invariant rules
22 changes: 20 additions & 2 deletions .opencode/knowledge/requirements/pre-mortem.md
@@ -9,8 +9,9 @@ last-updated: 2026-04-29
## Key Takeaways

- Prospective hindsight catches approximately 30% more issues than forward-looking review (Klein, 1998); frame the question as "it already failed: why?" to activate explanation mode.
-- Apply the pre-mortem at three levels of granularity: specification (missing observable behaviours), architecture (design principle violations), and implementation (design self-declaration).
+- Apply the pre-mortem at four levels of granularity: specification (missing observable behaviours), behavior (failure modes per distinct outcome), architecture (design principle violations), and implementation (design self-declaration).
- At specification: "Imagine this feature was built exactly as described, all tests pass, but it doesn't work for the user. What would be missing?"
+- At behavior: "Imagine this specific behaviour went wrong in production — how?" Run per distinct `Then` outcome after grouping Examples per [[requirements/gherkin#concepts]]; add Examples for surfaced failure modes.
- At architecture: for each candidate class check [[software-craft/object-calisthenics#key-takeaways]] and [[software-craft/solid#key-takeaways]]; for each external dependency check [[architecture/hexagonal#key-takeaways]]; for each noun check if it serves double duty across modules.
- All pre-mortems are enforced by condition gates in the flow: they are not optional exercises.

@@ -20,6 +21,12 @@ last-updated: 2026-04-29

**Specification Pre-Mortem**: Ask "What observable behaviours must we prove for this Rule to be complete?" This surfaces hidden requirements that forward-looking analysis misses.

+**Behavior Pre-Mortem**: Ask "Imagine this specific behaviour went wrong in production — how would it fail?" Once Examples are grouped by distinct `Then` outcome per [[requirements/gherkin#concepts]], run this pre-mortem for each outcome independently. The framing varies by rule type:
+- **Action rules**: "A user performs this action. What subtle real-world conditions would cause it to produce the wrong result?" (e.g., concurrent writes, stale reads, rounding, timezone shifts)
+- **Behavioural rules**: "The system applies this business rule. What edge-case inputs would expose a gap in the logic?" (e.g., boundary crossing, empty/zero/null, ordering dependency)
+- **Structural/invariant rules**: "This invariant must always hold. What counterexamples would break it?" — surface candidate counterexamples, then capture them in a Hypothesis property test per [[software-craft/test-design#concepts]] rather than as additional BDD Examples.
+Add Examples for the failure modes surfaced. This is a distinct level from specification pre-mortem: specification asks "what behaviours are missing from the rule?"; behavior asks "how could this specific outcome fail in production?" per the prospective hindsight mechanism (Klein, 1998).

**Architecture Pre-Mortem**: Ask "In 6 months this design is a mess. What mistakes did we make?" Check each candidate class per [[software-craft/object-calisthenics]] and [[software-craft/solid]]. Check each external dependency per [[architecture/hexagonal]]. Check each noun for cross-module double duty.

**Flow Condition Gates**: Pre-mortem completion is enforced by condition gates in the flow YAML. Self-declaration uses explicit AGREE/DISAGREE commitments (a commitment device (Cialdini, 2001) that makes the declaration psychologically binding). Adversarial framing during pre-mortem analysis ("find what's wrong" rather than "confirm it's right") uses adversarial collaboration (Mellers et al., 2001) to produce stronger reasoning.
@@ -34,6 +41,16 @@ Ask:

Record the findings in the feature's Questions section or as additional Rules.

+### Behavior Pre-Mortem
+
+Once Examples are grouped by distinct `Then` outcome per [[requirements/gherkin#concepts]], run for each outcome:
+
+- **Action rules**: "A user performs this action. What subtle real-world conditions would cause it to produce the wrong result?" (e.g., concurrent writes, stale reads, rounding, timezone shifts)
+- **Behavioural rules**: "The system applies this business rule. What edge-case inputs would expose a gap in the logic?" (e.g., boundary crossing, empty/zero/null, ordering dependency)
+- **Structural/invariant rules**: "This invariant must always hold. What counterexamples would break it?" — surface candidate counterexamples, then capture them in a Hypothesis property test per [[software-craft/test-design#concepts]] rather than as additional BDD Examples.
+
+Add Examples for the failure modes surfaced. This is a distinct level from specification pre-mortem: specification asks "what behaviours are missing from the rule?"; behavior asks "how could this specific outcome fail in production?"
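
For the action-rule framing, a hedged pytest sketch of capturing one surfaced failure mode as a test; the billing-day rule and the timezone values are invented for illustration:

```python
from datetime import datetime, timedelta, timezone

def is_same_billing_day(a: datetime, b: datetime) -> bool:
    """Illustrative action under test: billing days are compared in UTC."""
    return a.astimezone(timezone.utc).date() == b.astimezone(timezone.utc).date()

# Failure mode surfaced by the pre-mortem ("timezone shifts"): two timestamps
# on different local days can still fall on the same UTC billing day. Pin the
# intended behaviour with an added Example-backed test.
def test_local_midnight_straddle_stays_on_one_utc_day():
    tz = timezone(timedelta(hours=-5))
    late = datetime(2026, 1, 1, 23, 30, tzinfo=tz)   # 04:30 UTC on Jan 2
    early = datetime(2026, 1, 2, 0, 30, tzinfo=tz)   # 05:30 UTC on Jan 2
    assert is_same_billing_day(late, early)
```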

### Architecture Pre-Mortem

Ask:
@@ -58,4 +75,5 @@ The design self-declaration covers YAGNI, KISS, DRY, Object Calisthenics per [[s
- [[software-craft/tdd]]: design self-declaration subsumes the implementation pre-mortem
- [[software-craft/object-calisthenics]]: ObjCal-7 (two instance variables) checked in architecture pre-mortem
- [[software-craft/smell-catalogue]]: pattern smells checked in implementation pre-mortem
-- [[software-craft/solid]]: SOLID checks in implementation pre-mortem
+- [[software-craft/solid]]: SOLID checks in implementation pre-mortem
+- [[software-craft/test-design]]: property-based testing for structural/invariant rules
4 changes: 2 additions & 2 deletions .opencode/knowledge/requirements/wsjf.md
@@ -12,7 +12,7 @@ last-updated: 2026-05-04
- Value (1-5) maps to Kano categories: 5=Must-have (core workflow blocked), 4=High, 3=Medium (performance), 2=Low (delighter), 1=Minimal (cosmetic).
- Effort (1-5) maps to complexity: 1=Trivial (no new domain concepts), 2=Small (one new entity), 3=Medium (cross-cutting), 4=Large (multiple entities), 5=Very large (spans modules).
- Dependency=1 features are ineligible regardless of WSJF score; ties broken by Value; if all features have Dependency=1, resolve the blocking dependency first.
-- Only features with `Status: BASELINED` are eligible for WSJF scoring; WIP limit is 1.
+- Only features with @id-tagged Examples (confirmed by baseline) are eligible for WSJF scoring; WIP limit is 1.

## Concepts

@@ -76,7 +76,7 @@ Estimate implementation complexity:

### Prerequisites

-- Only features with `Status: BASELINED` are eligible for WSJF scoring
+- Only features with @id-tagged Examples (confirmed by baseline) are eligible for WSJF scoring
- WIP limit of 1: only one feature in progress at a time
- The PO selects and moves the feature; no other agent moves feature files
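
The scoring and eligibility rules are small enough to sketch together. This assumes the conventional WSJF ratio (Value divided by Effort), which the scales above feed; the `Feature` shape and the sample data are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    value: int       # 1-5, Kano-mapped
    effort: int      # 1-5, complexity-mapped
    dependency: int  # 1 = blocked by an unresolved dependency
    baselined: bool  # Examples are @id-tagged and baseline is confirmed

def next_feature(features: list[Feature]) -> Feature | None:
    """Select the single next feature (WIP limit 1), or None if all blocked."""
    eligible = [f for f in features if f.baselined and f.dependency != 1]
    if not eligible:
        return None  # resolve the blocking dependency first
    # Highest WSJF score wins; ties are broken by Value.
    return max(eligible, key=lambda f: (f.value / f.effort, f.value))

picked = next_feature([
    Feature("invoicing", value=5, effort=2, dependency=3, baselined=True),
    Feature("reporting", value=4, effort=1, dependency=1, baselined=True),  # blocked
])
assert picked is not None and picked.name == "invoicing"
```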

10 changes: 7 additions & 3 deletions .opencode/knowledge/software-craft/test-design.md
@@ -13,6 +13,7 @@ last-updated: 2026-04-29
- Test coupling exists on a spectrum: feature tests (most resilient) > unit contract tests > property-based tests > white-box tests (most brittle, avoid).
- One observable behaviour per test: each test should fail for exactly one reason and pass for exactly one reason.
- Hard-coded values are acceptable when the test only requires that value; parameterising prematurely couples the test to assumptions about future needs.
+- Property tests cover all invariant/structural rules, not just @bug Examples; Examples alone cannot prove an invariant (MacIver, 2016).

## Concepts

@@ -26,6 +27,8 @@ last-updated: 2026-04-29

**Semantic Depth**. A test that exists for an @id tag but exercises domain logic directly instead of through the entry point described in the acceptance criterion has correct structural traceability but wrong semantic depth. Every @id test must exercise the entry point the AC describes: if the AC specifies a command-line invocation, the test must invoke the command handler; if the AC specifies an API call, the test must call the API endpoint. Structural traceability (every @id has a test function) without semantic depth (every @id test exercises the right entry point) creates a false sense of coverage. Tests exist for every example but don't verify the actual user-facing behavior.
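
A hedged sketch of the distinction; the CLI handler, the report builder, and the `@export-01` tag are hypothetical, not this repo's code:

```python
import csv
import io

def build_report(rows: list[dict]) -> str:  # domain logic
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def handle_cli(argv: list[str]) -> int:  # the entry point the AC describes
    if argv[:1] == ["export"]:
        print(build_report([{"id": 1}]), end="")
        return 0
    return 2

# Structurally traceable to the hypothetical @export-01, but semantically
# shallow: it never passes through the command-line entry point.
def test_export_01_shallow():
    assert "id" in build_report([{"id": 1}])

# Semantically deep: exercises the command handler the criterion names.
def test_export_01(capsys):
    assert handle_cli(["export"]) == 0
    assert "id" in capsys.readouterr().out
```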

+**Invariant Property Tests**. Structural (invariant) rules describe properties that must hold across all inputs, not specific behaviours. Examples alone cannot prove an invariant — they only confirm it holds for the selected cases (MacIver, 2016). When a Rule asserts an invariant (e.g., "total must equal sum of parts," "output must be sorted," "balance must never go negative"), the specification pre-mortem and behavior pre-mortem surface candidate counterexamples. These counterexamples become assertions in a Hypothesis property test (`tests/unit/`) that verifies the invariant across a generated range of inputs, catching failure modes that no finite set of hand-picked Examples could have found.
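
A minimal Hypothesis sketch for one such invariant ("total must equal sum of parts"); the splitting function is invented for illustration:

```python
from hypothesis import given, strategies as st

def split_evenly(total_cents: int, ways: int) -> list[int]:
    """Illustrative unit under test: distribute cents, spreading the remainder."""
    base, remainder = divmod(total_cents, ways)
    return [base + (1 if i < remainder else 0) for i in range(ways)]

@given(
    total=st.integers(min_value=0, max_value=10**9),
    ways=st.integers(min_value=1, max_value=100),
)
def test_split_preserves_total(total: int, ways: int):
    parts = split_evenly(total, ways)
    assert sum(parts) == total           # the invariant, across generated inputs
    assert max(parts) - min(parts) <= 1  # counterexample class from the pre-mortem
```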

## Content

### Test Coupling Spectrum
@@ -34,7 +37,7 @@ last-updated: 2026-04-29
|---|---|---|---|
| Feature test | Observable behaviour through public interface | Highest | Every @id acceptance criterion |
| Unit contract test | Module protocol (inputs, outputs, invariants) | High | Complex domain logic with clear contracts |
-| Property test | Invariants across input ranges | Moderate | Bug @id requirements; edge-case classes |
+| Property test | Invariants across input ranges | Moderate | Bug @id requirements; all structural/invariant rules |
| White-box test | Internal state or private methods | Lowest | Legacy characterization only |

### Semantic Alignment Examples
@@ -58,7 +61,7 @@ last-updated: 2026-04-29
|-----------|----------|-------------|
| `tests/features/<feature_slug>/` | BDD scenario tests: one test per `@id` tag in the feature file | `@id` tag required |
| `tests/unit/` | Unit contract tests: coverage-boosting tests for implementation branches not covered by BDD examples | No `@id` tag |
-| `tests/unit/` | Property tests: invariant verification across input ranges | No `@id` tag (except `@bug` examples) |
+| `tests/unit/` | Property tests: invariant verification across input ranges | No `@id` tag (except `@bug` examples); every structural/invariant rule must have a property test |

**Rule:** `tests/features/` is exclusively for BDD scenario tests that trace back to `@id` tags in the feature file. Coverage-boosting tests that exercise implementation branches not covered by any `@id` example are unit contract tests and belong in `tests/unit/`, not `tests/features/`. A test without an `@id` tag in `tests/features/` violates the traceability contract.

@@ -67,4 +70,5 @@ last-updated: 2026-04-29
- [[software-craft/tdd]]: the RED-GREEN-REFACTOR cycle that produces these tests
- [[software-craft/code-review]]: reviewing whether tests meet these quality criteria
- [[requirements/gherkin]]: the specification format that drives test design
-- [[software-craft/stub-design]]: creating typed stubs that maintain semantic alignment
+- [[software-craft/stub-design]]: creating typed stubs that maintain semantic alignment
+- [[requirements/pre-mortem]]: behavior pre-mortem surfaces counterexamples for property tests
1 change: 0 additions & 1 deletion .opencode/skills/confirm-baseline/SKILL.md
@@ -10,4 +10,3 @@ Available knowledge: [[requirements/decomposition#key-takeaways]]. `in` artifact
1. Verify all Examples have `@id` tags. If any are missing, the feature is not ready for baseline.
2. Verify the feature passes decomposition checks per [[requirements/decomposition#key-takeaways]]: no more than 2 concerns, no more than 8 Must Examples.
3. Verify all planning artifacts are present and consistent.
-4. Verify feature status is BASELINED.
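
The first two checks are mechanical enough to sketch. This assumes Gherkin-style text where Examples carry `@id` tags and Must Examples carry `@must`, both assumptions about this repo's conventions; the parsing is deliberately naive:

```python
import re

def baseline_problems(feature_text: str) -> list[str]:
    """Hedged sketch of checks 1-2; real tag conventions may differ."""
    problems = []
    examples = re.findall(r"(?m)^\s*(?:Example|Scenario):", feature_text)
    id_tags = re.findall(r"@id\S*", feature_text)
    if len(id_tags) < len(examples):
        problems.append("some Examples are missing @id tags: not ready for baseline")
    if feature_text.count("@must") > 8:
        problems.append("more than 8 Must Examples: split per decomposition")
    return problems
```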
2 changes: 1 addition & 1 deletion .opencode/skills/discover-features/SKILL.md
@@ -14,5 +14,5 @@ Available knowledge: [[requirements/feature-boundaries]], [[requirements/feature
5. Name each feature per [[requirements/feature-boundaries#content]]: use the delivery step name, validated for clarity and specificity.
6. Write a description for each feature per [[requirements/feature-boundaries#content]]: what it provides, which context it serves, why it exists, key entities.
7. Identify cross-cutting quality attributes from product_definition.md that will become Constraints — note which features they distribute to per [[requirements/feature-boundaries#content]] — but do NOT write Constraints yet; discover-rules will write them.
-8. Create a `.feature` file from the template at `.templates/docs/features/feature.feature.template` for each feature with title, description, Status: ELICITING, and an empty Questions table. Do NOT write Rules (Business) or Constraints — those come from the discover-rules skill.
+8. Create a `.feature` file from the template at `.templates/docs/features/<feature_name>.feature.template` for each feature with title, description, and an empty Questions table. Do NOT write Rules (Business) or Constraints — those come from the discover-rules skill.
9. Run context coverage gap analysis per [[requirements/feature-discovery#content]]: every bounded context covered by at least one feature? IF any gap → add a Questions entry flagging it.