Skip to content

test(integration): make the suite honest under pipefail; pin post-#129/Bug-397 contracts#137

Merged
Andrei Kvapil (kvaps) merged 11 commits into
mainfrom
fix/integration-suite-honest-green
Jun 12, 2026
Merged

test(integration): make the suite honest under pipefail; pin post-#129/Bug-397 contracts#137
Andrei Kvapil (kvaps) merged 11 commits into
mainfrom
fix/integration-suite-honest-green

Conversation

@kvaps

@kvaps Andrei Kvapil (kvaps) commented Jun 12, 2026

Copy link
Copy Markdown
Member

What

Makes the Tier-2 integration suite honest after the pipefail fix (#135 exposes the real exit code): triages all 16 tests that are red on main at 10fdc2b, fixes every test/harness-side failure, and pins the new post-#129 / post-Bug-397 contracts the old tests contradicted.

With this PR the suite is green except for 5 tests that are red because of genuine product bugs (details below). Those are deliberately left untouched — fixing the product or weakening the tests is out of scope for a test-honesty PR; they need their own fixes and will keep the integration job red until then.

Why the suite was red (per-test triage)

Stale pins of changed product contracts (inverted, same way the L1 specs were inverted):

  • TestGroupFRToggleDiskful2DisklessPreservesTieBreaker → renamed ...ReapsTieBreaker: fix(controller): reap auto-tiebreaker below 2 diskful (1 diskful + 1 diskless) #129 collapsed the auto-witness invariant to "witness lives iff exactly 2 diskful replicas"; the toggle down to 1 diskful now reaps the witness. Asserts the exact 2-replica composition.
  • TestGroupG/SnapShipCrossNode → renamed SnapRestoreSnapshotHolderOnly: the Bug-397 data-integrity guard now refuses node_names restores onto snapshot-less nodes (an empty replica could latch UpToDate and be promoted). The test now pins both the 400 refusal and the successful holder-node restore with the dispatcher routing prop.
  • TestGroupKWFLateVD, TestGroupKWFReplicasOnSame (+OnDifferent), TestGroupKWFPoolDestroyedDropsFromPlacer: post-fix(controller): reap auto-tiebreaker below 2 diskful (1 diskful + 1 diskless) #129 a 2-diskful RD always carries a diskless TIE_BREAKER witness; the diskful-placement asserts now exclude it (it consumes no pool and sits outside placement constraints).

Tests speaking a CLI dialect the pinned linstor-client 1.27.1 (asserted by Group H) does not have:

  • node evacuate --force / node evict don't exist in 1.27.1 → the force override and the evict alias are now driven directly over the PUT routes (?force=true, /evict), asserting the []ApiCallRc envelope shape that pins the Bug-78 wire contract.
  • rd create --storage-pool, encryption create-passphrase --new-passphrase, and trailing-positional r c argument order → rewritten in 1.27.1 syntax.

Test/harness bugs:

  • linstor n lost against an ONLINE satellite is correctly refused (Bug 111 gate), and the machine-readable CLI exits 0 on refusal envelopes, so both node-lost tests silently no-op'd into timeouts. The satellite mock gained a SimulateNodeOffline knob and the tests now model the documented precondition (dead satellite) first. (With that fixed, TestGroupKWFNodeLostCascade reaches a real product bug — see below.)
  • Fixture storage pools were seeded without the blockstor.io/node-name label the k8s store's ListByNode selector requires, making them invisible to the n lost SP cascade.
  • TestGroupKWFHappyPath autoplaced without a pool pin and could land on the thick file pool, where the snapshot step is correctly refused by the thick-provider gate; it now pins lvm-thin the way linstor-csi always pins a pool.
  • TestGroupG/AutoSnapshotPeriodicTick raced the live RD reconciler with a single-shot Update; now retries on conflict.
  • Group G's seedRDWithVolume created the RD before its per-node Resources, so the auto-tiebreaker pass raced the fixture loop to worker-3 and the seed flaked with AlreadyExists; the helper now seeds all three diskful rows first.
  • TestGroupI/NodeConnectionSetProperty asserted a properties envelope key; the upstream wire key (golinstor NodeConnection) is props — the product was right.

Note on scope: docs/agent-playbook.md reserves harness edits for coordinated work — this PR makes the two minimal harness changes above (satellite knob, fixture label) because the suite cannot be made honest without them; flagged here for explicit review.

Why the local gate run looked green

go test -tags=integration ./tests/integration/... piped into anything (| tee, | tail) reports the pipe's exit status, not go test's — the same masking class #135 fixes in CI. An unpiped run exits 1 locally as well (and without KUBEBUILDER_ASSETS/the linstor CLI the suite fails fast rather than skipping).

Known product bugs (kept red on purpose)

  1. TestGroupJ/CSICreateVolumeFromClone — the clone endpoint rejects golinstor's legitimate use_zfs_clone field (rdCloneRequest + DisallowUnknownFields), breaking CSI clone-from-source. Same wire-compat class as the earlier override_props/src_snap_name fix.
  2. TestGroupKWFSpawnAndDependentReAutoplacestripRebalanceAnnotation nils the empty annotations map, and the k8s store's mergeUserAnnotationsInto treats nil as "don't touch", so the rebalance-pending annotation can never be removed in production and the rebalance pass churns forever.
  3. TestGroupKWFBalanceResourcesTicklinstor controller set-property persists to ControllerConfig.Spec.ExtraProps, but reconcilers and the placer read Store.ControllerProps(), which in the k8s store is a process-local map never synced from the CRD: every controller-scope tuning prop (BalanceResources*, autoplacer weights, AllowMixingStoragePoolDriver) is silently ignored.
  4. TestGroupKWFLUKSStackEndToEndencryption create-passphrase succeeds (master passphrase stored in a Secret) but the LUKS RD-create gate only consults the DrbdOptions/EncryptPassphrase controller prop, so the upstream CLI unlock flow can never enable LUKS provisioning (and the remediation hint suggests storing a plaintext passphrase in a prop).
  5. TestGroupKWFNodeLostCascade — controller-side TOCTOU: after n lost cascades a 3-diskful RD down to 2 diskful, ensureTiebreaker picks the witness node from the manager's (lagging) cache, re-creates a DISKLESS TIE_BREAKER Resource on the just-deleted node, and nothing ever reaps it — the (rd, node) slot the cascade freed is re-occupied by a ghost row pinned to a nonexistent node. The REST resource-create path has the Bug-174 node-deleted-race guard; the controller-side witness create does not. Deterministic whenever the controller process is warm (any prior test in the same binary); passes only when the test runs first.

TestGroupI/ResourceConnectionPathCreate failed once in the run-27381332316 CI job but passes locally in full-suite runs; treated as a CI-timing flake to observe on this PR's CI.

Validation

  • go test -tags=integration -count=1 -timeout=15m ./tests/integration/... (CI invocation, set -o pipefail + tee): all tests pass except the 5 product-bug pins above.
  • go test ./...: green.
  • golangci-lint run: 0 issues (also no new issues under --build-tags integration vs the main baseline).

Summary by CodeRabbit

Tests

  • Enhanced integration test coverage for node operations, snapshot restore workflows, and replica placement validation
  • Improved test harness with offline satellite simulation capabilities and storage pool labeling
  • Updated test assertions to validate correct API response formats and stabilized replica compositions

Andrei Kvapil (kvaps) and others added 10 commits June 12, 2026 02:50
The mock stamped every Node ONLINE unconditionally, making it
impossible to exercise `linstor n lost` honestly: the Bug 111 gate
correctly refuses node-lost against an ONLINE satellite, and the
machine-readable CLI exits 0 on refusal envelopes, so the lost-cascade
tests silently no-op'd and timed out. The knob models the documented
n-lost precondition (a satellite that stopped heartbeating).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The pinned linstor-client 1.27.1 (asserted by Group H) has no
`node evacuate --force` flag and no `node evict` verb at all, so
both tests crashed on CLI argument parsing before reaching the wire.
Exercise the force override via PUT /v1/nodes/{node}/evacuate?force=true
and the evict alias via PUT /v1/nodes/{node}/evict directly, asserting
the []ApiCallRc envelope shape both routes must answer (Bug 78 wire).

Node-lost cascade now takes the satellite offline first — the
documented n-lost precondition — instead of silently hitting the
Bug 111 online-refusal and timing out.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…tract

TestGroupFRToggleDiskful2DisklessPreservesTieBreaker pinned the
pre-#129 invariant (witness preserved at 1 diskful + 1 user-diskless).
PR #129 collapsed the auto-witness contract to upstream parity —
"witness lives iff exactly 2 diskful replicas" — so the toggle down
to 1 diskful now reaps the witness. Rename to ...ReapsTieBreaker and
assert the exact 2-replica composition (1 diskful + 1 user-diskless,
0 witnesses), mirroring the L1 spec inversion in
internal/controller/ensure_tiebreaker_test.go.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… on conflict

SnapShipCrossNode pinned the pre-Bug-397 behaviour where an explicit
node_names restore onto a snapshot-less node succeeded and acted as a
cross-node ship trigger. Bug 397 (P0 data integrity) now refuses that
restore at the API edge — an empty diskful replica could latch
UpToDate and be promoted on failover. Rename the subtest to
SnapRestoreSnapshotHolderOnly and pin both halves of the new
contract: 400 + no target RD for the snapshot-less node, 201 + the
BlockstorRestoreFromSnapshot routing prop for a holder node.

AutoSnapshotPeriodicTick stamped AutoSnapshot/RunEvery with a
single-shot Update that races the live RD reconciler; wrap it in
retry.RetryOnConflict.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…'props' key

The GET envelope assertion looked for a 'properties' map, but the
upstream wire key is 'props' — golinstor's NodeConnection struct
(client/connection.go) and blockstor's handler both spell it that
way. The test (not the product) diverged from the wire contract.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
 witness

- HappyPath: pin autoplace to the thin lvm-thin pool the way
  linstor-csi always does; unpinned placement could land on the thick
  'file' pool where the snapshot step is correctly refused by the G5
  thick-provider gate (the CLI exits 0 on refusal envelopes, so the
  failure surfaced as a misleading wait timeout).
- NodeLostCascade: take the satellite offline before 'n lost' (Bug
  111 gate refuses lost against an ONLINE satellite).
- PoolDestroyedDropsFromPlacer: 'rd create' in linstor-client 1.27.1
  has no --storage-pool flag; the pool constraint rides on autoplace.
- LUKSStackEndToEnd: 1.27.1 spells the flag --passphrase, not
  --new-passphrase.
- LateVD / ReplicasOnSame / ReplicasOnDifferent: exclude the
  auto-tiebreaker witness (expected on 2-diskful RDs post-#129) from
  diskful placement asserts it was never part of.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…age pools

The k8s store's StoragePools().ListByNode runs a label selector on
blockstor.io/node-name — the label pkg/store/k8s stamps on every
store-created pool. Fixture pools seeded directly via the CRD client
lacked it, so per-node store reads saw zero pools and the `n lost`
storage-pool cascade silently deleted nothing, failing
TestGroupKWFNodeLostCascade's SP assert.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ug 83 assert

linstor-client 1.27.1's argparse rejects the trailing positional
after two valued options ('too few arguments'); use the
positional-first form Group F already uses. The Bug 83 dead-pool
assert now excludes the auto-tiebreaker witness — a network-only
DISKLESS attachment that consumes no pool and legitimately lands on
the dead-pool node for a 2-diskful RD post-#129.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Fold the PUT + envelope assertion into one helper so bodyclose sees
the close in the same scope, wrap client errors inside the
RetryOnConflict closure, and reword two comments that tripped godox.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ess race

seedRDWithVolume created the RD first, so the RD reconciler's
auto-tiebreaker pass (witness iff exactly 2 diskful, post-#129)
raced the fixture loop to worker-3 and the seeded third Resource
flaked with AlreadyExists. Seeding all three diskful rows before the
RD lands means the first reconcile already sees 3 diskful and wants
no witness.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a244471b-aa8f-4091-87bb-5e0fab1e95a4

📥 Commits

Reviewing files that changed from the base of the PR and between 10fdc2b and 24a787c.

📒 Files selected for processing (7)
  • tests/integration/group_a_test.go
  • tests/integration/group_f_test.go
  • tests/integration/group_g_test.go
  • tests/integration/group_i_test.go
  • tests/integration/group_k_test.go
  • tests/integration/harness/fixtures.go
  • tests/integration/harness/satellite.go

📝 Walkthrough

Walkthrough

This PR updates integration tests across six test groups to enforce new behaviors: offline satellite simulation for lifecycle testing, witness reaping on replica toggles, snapshot restore contract enforcement, REST API node operations, and placement assertion filtering that excludes auto-tiebreaker witnesses.

Changes

Integration test infrastructure and behavior updates

Layer / File(s) Summary
Offline satellite simulation capability
tests/integration/harness/satellite.go
Mock satellite adds nodeOffline map, SimulateNodeOffline() method, and internal isNodeOffline() helper to track and report offline nodes; reconcileNodes returns OFFLINE status when node is marked offline.
Storage pool fixture labeling
tests/integration/harness/fixtures.go
seedStoragePool now sets blockstor.io/node-name label on created StoragePool objects for per-node store lookups.
Node lifecycle REST API testing
tests/integration/group_a_test.go
Group A adds encoding/json, io, net/http imports; TestGroupANodeEvacuatePUT and TestGroupANodeEvictPUT replace CLI calls with direct HTTP PUT requests; new helpers putRequireAPICallRcEnvelope and waitForNodeConnectionStatus validate HTTP status and []ApiCallRc envelope decoding.
Node lost cascade with offline precondition
tests/integration/group_a_test.go, tests/integration/group_k_test.go
TestGroupANodeLostCascadesOrphans and the Group K node-lost test now simulate satellite offline before issuing node lost to exercise cascade-delete under intended preconditions.
Witness reaping behavior and assertion helpers
tests/integration/group_f_test.go
TestGroupFRToggleDiskful2DisklessPreservesTieBreaker replaced with TestGroupFRToggleDiskful2DisklessReapsTieBreaker; new assertWitnessReapedComposition helper enforces exact replica counts and zero TIE_BREAKER entries after toggle.
Placement filtering helper and multi-test updates
tests/integration/group_k_test.go
New resourceHasFlag helper checks resource flags; happy-path, late-Vd, pool-destroyed, and replica-distribution tests now exclude TIE_BREAKER and DISKLESS resources from zone spread and placement assertions; LUKS test updated to use --passphrase flag for linstor-client 1.27.1.
Snapshot restore contract and fixture ordering
tests/integration/group_g_test.go
seedRDWithVolume now creates per-node Resource objects before parent ResourceDefinition; new testGroupGSnapRestoreSnapshotHolderOnly enforces Bug-397 contract (HTTP 400 on non-holder restore, HTTP 201 on holder restore); testGroupGAutoSnapshotPeriodicTick wraps RD updates in retry.RetryOnConflict for conflict-safe stamps.
API envelope alignment
tests/integration/group_i_test.go
NodeConnectionSetProperty test updated to validate response under props key (not properties); comments adjusted to describe correct wire envelope shape.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • cozystack/blockstor#123: Main PR's integration coverage for n lost lifecycle (including simulating satellite offline and sequencing via ConnectionStatus) is directly tied to PR #123's new/updated REST node-lost tests that validate POST /v1/nodes/{node}/lost behavior and absence of dangling references.

Poem

🐰 From offline nodes to witness reaping done,
snapshots now restore where light can run,
REST APIs dance in tubes of HTTP green—
the finest integration tests we've seen!
With helpers stacked like carrots, tall and spun,
the cascade logic blooms when day is done. 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title precisely captures the main objective: making the integration test suite 'honest' (passing) after pipefail was enabled, while pinning contract changes from post-#129 and Bug-397 fixes.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/integration-suite-honest-green

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the integration tests to align with upstream LINSTOR client 1.27.1 behavior and address several bug fixes (including Bugs 78, 79, 83, 111, 129, and 397). Key updates include simulating offline nodes for node lost tests, asserting that tie-breakers are reaped when dropping below two diskful replicas, ensuring snapshot restores are refused on non-holder nodes, and updating CLI flags and wire keys (such as using props instead of properties). Additionally, retry-on-conflict logic was added for resource definition updates, and storage pools were labeled to ensure visibility. As there are no review comments, I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@kvaps Andrei Kvapil (kvaps) merged commit 1728baf into main Jun 12, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant