test(e2e): L6/L7 release-gate coverage for rd clone data plane and Secret-only LUKS provisioning#145
Conversation
Bug-020 release-gate coverage for the clone fix: the L6 cell drives both wire variants (plain python-CLI clone body without use_zfs_clone, and the raw-REST use_zfs_clone=true body linstor-csi sends), then proves the data plane really copied by reading a deterministic marker back on every clone replica, alongside clone-status COMPLETE and the internal clone-<dst> snapshot visibility. The L7 replay pins the verb + convergence shape the runner can express (create, autoplace, clone, all_uptodate on the clone RD); the byte-level assert stays in the L6 cell because replay steps are linstor-argv only. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Release-gate coverage for the encryption create-passphrase fix: the L6 cell drives the upstream-parity flow (create-passphrase, then rd c -l drbd,luks,storage + autoplace) while asserting the legacy DrbdOptions/EncryptPassphrase controller prop stays absent end to end, and proves at the kernel level that the Secret-backed passphrase formatted and opens the LUKS header on each replica. The L7 replay mirrors the flow with prop_value absence asserts before and after provisioning; the cryptsetup-level checks stay in the L6 cell because replay steps are linstor-argv only. Written against the post-fix contract: both tests stay red until the Bug-023 fix lands and must be gated on it. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces new end-to-end and operator-harness replay tests to cover Bug 023 (verifying Secret-only LUKS provisioning without legacy controller properties) and Bug 020 (verifying data-plane integrity for volume-definition-bearing resource-definition clones). The review feedback correctly identifies that process substitutions in the clone test script mask command failures under set -e and recommends using mktemp instead of a hardcoded /tmp file path to prevent conflicts and security risks.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| rm -f "$err_file" | ||
|
|
||
| echo ">> [Bug 020 / A] clone replicas materialise + converge UpToDate" | ||
| mapfile -t cli_nodes < <(wait_clone_replicas "$DST_CLI" 120) |
There was a problem hiding this comment.
Process substitution <(...) runs asynchronously in a subshell, and its exit status is not monitored or caught by set -e. If wait_clone_replicas fails and returns 1, the script will silently ignore the failure and proceed to wait_uptodate with empty or unbound variables, leading to hard-to-debug test failures.
To ensure that failures in wait_clone_replicas are caught immediately by set -e, assign the output to a variable first and then parse it with mapfile.
| mapfile -t cli_nodes < <(wait_clone_replicas "$DST_CLI" 120) | |
| cli_nodes_str=$(wait_clone_replicas "$DST_CLI" 120) | |
| mapfile -t cli_nodes <<< "$cli_nodes_str" |
| fi | ||
|
|
||
| echo ">> [Bug 020 / B] clone replicas materialise + converge UpToDate" | ||
| mapfile -t zfs_nodes < <(wait_clone_replicas "$DST_ZFS" 120) |
There was a problem hiding this comment.
Process substitution <(...) runs asynchronously in a subshell, and its exit status is not monitored or caught by set -e. If wait_clone_replicas fails and returns 1, the script will silently ignore the failure and proceed to wait_uptodate with empty or unbound variables, leading to hard-to-debug test failures.
To ensure that failures in wait_clone_replicas are caught immediately by set -e, assign the output to a variable first and then parse it with mapfile.
| mapfile -t zfs_nodes < <(wait_clone_replicas "$DST_ZFS" 120) | |
| zfs_nodes_str=$(wait_clone_replicas "$DST_ZFS" 120) | |
| mapfile -t zfs_nodes <<< "$zfs_nodes_str" |
| http_code=$(curl -sS -m 30 -o /tmp/cli-matrix-020-clone.json -w '%{http_code}' \ | ||
| -X POST -H 'Content-Type: application/json' \ | ||
| -d "{\"name\":\"${DST_ZFS}\",\"use_zfs_clone\":true}" \ | ||
| "http://127.0.0.1:${LCTL_PORT}/v1/resource-definitions/${SRC}/clone" \ | ||
| 2>/dev/null || echo "000") | ||
| if [[ "$http_code" != "201" ]]; then | ||
| echo "FAIL (Bug 020): use_zfs_clone=true POST answered HTTP $http_code, want 201" >&2 | ||
| echo " (pre-fix: 400 DisallowUnknownFields on use_zfs_clone)" >&2 | ||
| cat /tmp/cli-matrix-020-clone.json >&2 2>/dev/null || true | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
Using a hardcoded file path in /tmp (like /tmp/cli-matrix-020-clone.json) can lead to conflicts, permission issues, or security vulnerabilities (such as symlink attacks) if multiple tests run concurrently or if leftovers exist from previous runs.
Instead, use a secure temporary file created via mktemp and ensure it is cleaned up properly.
| http_code=$(curl -sS -m 30 -o /tmp/cli-matrix-020-clone.json -w '%{http_code}' \ | |
| -X POST -H 'Content-Type: application/json' \ | |
| -d "{\"name\":\"${DST_ZFS}\",\"use_zfs_clone\":true}" \ | |
| "http://127.0.0.1:${LCTL_PORT}/v1/resource-definitions/${SRC}/clone" \ | |
| 2>/dev/null || echo "000") | |
| if [[ "$http_code" != "201" ]]; then | |
| echo "FAIL (Bug 020): use_zfs_clone=true POST answered HTTP $http_code, want 201" >&2 | |
| echo " (pre-fix: 400 DisallowUnknownFields on use_zfs_clone)" >&2 | |
| cat /tmp/cli-matrix-020-clone.json >&2 2>/dev/null || true | |
| exit 1 | |
| fi | |
| clone_json=$(mktemp) | |
| http_code=$(curl -sS -m 30 -o "$clone_json" -w '%{http_code}' \ | |
| -X POST -H 'Content-Type: application/json' \ | |
| -d "{\"name\":\"${DST_ZFS}\",\"use_zfs_clone\":true}" \ | |
| "http://127.0.0.1:${LCTL_PORT}/v1/resource-definitions/${SRC}/clone" \ | |
| 2>/dev/null || echo "000") | |
| if [[ "$http_code" != "201" ]]; then | |
| echo "FAIL (Bug 020): use_zfs_clone=true POST answered HTTP $http_code, want 201" >&2 | |
| echo " (pre-fix: 400 DisallowUnknownFields on use_zfs_clone)" >&2 | |
| cat "$clone_json" >&2 2>/dev/null || true | |
| rm -f "$clone_json" | |
| exit 1 | |
| fi | |
| rm -f "$clone_json" |
📝 WalkthroughWalkthroughThis PR adds comprehensive e2e test coverage for two independent bug fixes: Bug 023 (LUKS Secret-only encryption provisioning) and Bug 020 (RD clone with VD-bearing source data-plane preservation). Each bug receives both CLI-matrix e2e tests and operator-harness replay scenarios with full validation of the expected behavior and failure-mode fixes. ChangesBug 023 – LUKS Secret-only Encryption Verification
Bug 020 – RD Clone with VD Data-plane Preservation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml (1)
85-89:⚠️ Potential issue | 🟠 Major | ⚡ Quick win
await.kind: prop_valueis not in the allowed list per coding guidelines.Same issue as lines 44-48:
prop_valueis not in the coding guidelines' allowed list forawait.kind, though it is implemented and necessary for this Bug-023 validation (verifying the legacy property remains absent post-provisioning).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml` around lines 85 - 89, The test uses an unsupported await.kind value "prop_value"; replace it with the allowed kind that checks absence (e.g., "prop_missing" or the project's canonical "prop_absent") so the step still verifies that DrbdOptions/EncryptPassphrase is not present after provisioning; update both occurrences (the block referencing key DrbdOptions/EncryptPassphrase and the earlier identical block around lines 44-48) to use the permitted await.kind name so the validation remains equivalent to the original intent.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/e2e/cli-matrix/encryption-passphrase-luks-rd.sh`:
- Around line 168-173: Update the comment above the wait_uptodate call to
explain that wait_uptodate (from tests/e2e/lib.sh) primarily polls each
replica’s observer-stamped .status.volumes[].diskState via status_disk_state and
returns success when both are "UpToDate", but it will also succeed if
kernel_pair_uptodate reports ok (to tolerate CRD/status lag), so the helper can
pass via kernel fallback rather than only Resource.Status convergence; reference
wait_uptodate, status_disk_state and kernel_pair_uptodate by name in the
comment.
In `@tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh`:
- Line 189: The mapfile invocation using process substitution drops the exit
status of wait_clone_replicas, so if wait_clone_replicas (called with DST_CLI)
times out the script continues with an empty cli_nodes and later fails
confusingly; fix by capturing the helper's exit code and output instead of using
direct process substitution: run wait_clone_replicas into a temporary variable,
record its exit status, then populate cli_nodes from that output (mapfile -t
cli_nodes <<<"$output") and validate the array length (or exit status)
immediately and emit the clear "never materialised X diskful replicas" error and
exit non‑zero if it failed; apply the same pattern for the other occurrence
(variant B around the second mapfile usage) and adjust references to cli_nodes,
wait_clone_replicas, wait_uptodate and DST_CLI accordingly.
In `@tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml`:
- Around line 44-48: Await.kind "prop_value" is used by replay YAMLs (e.g., the
encryption-passphrase-luks-rd case) but is missing from the allowed await.kind
list in the guidelines; update the guidelines to either add "prop_value" to the
allowed await.kind enumeration or add an explicit exception for the harness's
prop_value behavior (which is implemented by the harness's prop_value handler)
so policy checks accept await.kind: prop_value references such as the one
setting DrbdOptions/EncryptPassphrase.
---
Duplicate comments:
In `@tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml`:
- Around line 85-89: The test uses an unsupported await.kind value "prop_value";
replace it with the allowed kind that checks absence (e.g., "prop_missing" or
the project's canonical "prop_absent") so the step still verifies that
DrbdOptions/EncryptPassphrase is not present after provisioning; update both
occurrences (the block referencing key DrbdOptions/EncryptPassphrase and the
earlier identical block around lines 44-48) to use the permitted await.kind name
so the validation remains equivalent to the original intent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d5d7c2a6-40c0-427c-9a57-851a88135fad
📒 Files selected for processing (5)
tests/e2e/cli-matrix/README.mdtests/e2e/cli-matrix/encryption-passphrase-luks-rd.shtests/e2e/cli-matrix/rd-clone-vd-data-plane.shtests/operator-harness/replay/encryption-passphrase-luks-rd.yamltests/operator-harness/replay/rd-clone-vd-data-plane.yaml
| echo ">> [Bug 023] wait both replicas UpToDate (Secret-fed luksFormat ran)" | ||
| # Pre-fix failure mode for a gate-only patch: rd-create passes but the | ||
| # satellite loops on "LUKS in layer stack but Props.LuksPassphrase | ||
| # empty" and the replicas never converge. UpToDate within the bound is | ||
| # the proof the Secret reached the satellite channel. | ||
| wait_uptodate "$RD" "$N1" "$N2" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Verify wait_uptodate exists and uses observer-stamped Status
# Find wait_uptodate definition
ast-grep --pattern $'wait_uptodate() {
$$$
}'
# Also check if it calls wait_status_state or similar Status-checking functions
rg -nP -A 10 '^\s*(function\s+)?wait_uptodate\s*\(' tests/e2e/cli-matrix/lib.sh tests/e2e/lib.shRepository: cozystack/blockstor
Length of output: 2837
Update the comment: wait_uptodate checks .status.diskState, but it can also succeed via kernel fallback.
wait_uptodate (tests/e2e/lib.sh) polls both replicas’ observer-stamped .status.volumes[].diskState (via status_disk_state) and returns success only when both are "UpToDate". If that isn’t true yet, it may still return success when kernel_pair_uptodate ... reports ok (to tolerate CRD/status lag), so the helper isn’t purely a “Resource.Status convergence only” assertion.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/cli-matrix/encryption-passphrase-luks-rd.sh` around lines 168 -
173, Update the comment above the wait_uptodate call to explain that
wait_uptodate (from tests/e2e/lib.sh) primarily polls each replica’s
observer-stamped .status.volumes[].diskState via status_disk_state and returns
success when both are "UpToDate", but it will also succeed if
kernel_pair_uptodate reports ok (to tolerate CRD/status lag), so the helper can
pass via kernel fallback rather than only Resource.Status convergence; reference
wait_uptodate, status_disk_state and kernel_pair_uptodate by name in the
comment.
| rm -f "$err_file" | ||
|
|
||
| echo ">> [Bug 020 / A] clone replicas materialise + converge UpToDate" | ||
| mapfile -t cli_nodes < <(wait_clone_replicas "$DST_CLI" 120) |
There was a problem hiding this comment.
Process substitution exit code is lost; test continues with empty array on timeout.
mapfile -t cli_nodes < <(wait_clone_replicas ...) runs the helper in a subshell whose exit status is not captured by set -e. If wait_clone_replicas times out and returns 1, the script continues with an empty cli_nodes array, then fails at wait_uptodate with empty node arguments—producing a confusing error instead of the clear "never materialised 2 diskful replicas" message.
Proposed fix: validate array length after mapfile
mapfile -t cli_nodes < <(wait_clone_replicas "$DST_CLI" 120)
+if (( ${`#cli_nodes`[@]} != 2 )); then
+ echo "FAIL (Bug 020 / A): wait_clone_replicas returned ${`#cli_nodes`[@]} nodes, expected 2" >&2
+ exit 1
+fi
wait_uptodate "$DST_CLI" "${cli_nodes[0]}" "${cli_nodes[1]}"Apply the same pattern after line 229 for variant B.
Also applies to: 229-229
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh` at line 189, The mapfile
invocation using process substitution drops the exit status of
wait_clone_replicas, so if wait_clone_replicas (called with DST_CLI) times out
the script continues with an empty cli_nodes and later fails confusingly; fix by
capturing the helper's exit code and output instead of using direct process
substitution: run wait_clone_replicas into a temporary variable, record its exit
status, then populate cli_nodes from that output (mapfile -t cli_nodes
<<<"$output") and validate the array length (or exit status) immediately and
emit the clear "never materialised X diskful replicas" error and exit non‑zero
if it failed; apply the same pattern for the other occurrence (variant B around
the second mapfile usage) and adjust references to cli_nodes,
wait_clone_replicas, wait_uptodate and DST_CLI accordingly.
| await: | ||
| kind: prop_value | ||
| obj: controller | ||
| key: DrbdOptions/EncryptPassphrase | ||
| timeout_s: 30 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify that prop_value is implemented in lib.sh and check if other replay workflows use it
echo "=== Checking lib.sh for prop_value implementation ==="
rg -A 20 'prop_value\)' tests/operator-harness/lib.sh
echo ""
echo "=== Checking existing replay workflows that use prop_value ==="
rg -l 'kind: prop_value' tests/operator-harness/replay/Repository: cozystack/blockstor
Length of output: 2022
Update await.kind guidelines to include prop_value (or document an explicit exception)
tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml uses await.kind: prop_value (lines 44-48), but the coding guidelines’ allowed await.kind list doesn’t include prop_value. This isn’t a new/unsupported pattern: prop_value is implemented in tests/operator-harness/lib.sh and already used by other replay YAMLs. Update the guidelines to include prop_value (or add an explicit carve-out) to avoid false policy violations.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml` around
lines 44 - 48, Await.kind "prop_value" is used by replay YAMLs (e.g., the
encryption-passphrase-luks-rd case) but is missing from the allowed await.kind
list in the guidelines; update the guidelines to either add "prop_value" to the
allowed await.kind enumeration or add an explicit exception for the harness's
prop_value behavior (which is implemented by the harness's prop_value handler)
so policy checks accept await.kind: prop_value references such as the one
setting DrbdOptions/EncryptPassphrase.
Source: Coding guidelines
What
Release-gate L6 cli-matrix cells + L7 replay workflows for two recently-fixed CLI-reachable flows, per the CLI-bug-fix protocol (CLAUDE.md):
1.
rd clonedata plane (Bug 020, fix merged in #142)tests/e2e/cli-matrix/rd-clone-vd-data-plane.sh— seeds a deterministic marker on a 2-replica source (viaresolve_drbd_device), then drives both wire variants: the plain python-CLIresource-definition clone(linstor-client 1.27.1 declares--use-zfs-cloneasstore_true, default=None, so the bare verb POSTs a body withoutuse_zfs_clone) and the raw-RESTPOST /v1/resource-definitions/{src}/clonewithuse_zfs_clone: true— the exact body linstor-csi sends. Asserts per variant: clone accepted (no 400/501),GET .../clone/{dst}reachesCOMPLETE, 2 replicas converge UpToDate, the marker bytes are present on every clone replica (promoting each in turn — catches the Bug-114 silently-empty-shell mode), and the internalclone-<dst>snapshot is visible on the source (delta row 82).tests/operator-harness/replay/rd-clone-vd-data-plane.yaml— create → autoplace →all_uptodateon the source →rd clone→replica_count+all_uptodateon the clone RD (awaitrd:fields go through the same substitution as cmd argv, so asserting on the second RD via{{rd}}-clworks). Teardown deletes the clone, then the internalclone-{{rd}}-clsnapshot, then the source (RD delete is blocked while snapshots exist).2. Secret-only LUKS provisioning (Bug 023, fix in #143 — not yet merged)
tests/e2e/cli-matrix/encryption-passphrase-luks-rd.sh—linstor encryption create-passphrasealone, with the legacyDrbdOptions/EncryptPassphrasecontroller property asserted absent before and after provisioning, thenrd c -l drbd,luks,storage+r c --auto-place=2to UpToDate, and kernel-level proof on each replica (cryptsetup luksDump+--test-passphraseagainst the backing device via the existingluks_backing_device/wait_luks_header_present/assert_luks_passphrase_openshelpers). Cleanup viacleanup_encryption_state.tests/operator-harness/replay/encryption-passphrase-luks-rd.yaml— clears the legacy prop (empty-value = delete) with aprop_valueabsence await,create-passphrase(expect_exit: [0, 10]for the shared stand), LUKS rd create, autoplace,all_uptodate, and a finalprop_valueabsence re-check.The sibling
luks-*cells andreplay/luks-encrypted-rd.yamlkeep covering the deprecated controller-prop path; these new tests pin that the prop is no longer required.Merge ordering
This PR must merge after #143. The Bug-023 cell and replay are written against the post-#143 contract: on current
maintherd c -l drbd,luks,storagestep is rejected with "LUKS layer requires DrbdOptions/EncryptPassphrase to be set first", so both stay red until #143 lands. The branch is based onmainand already includes the clone fix (#142), so the Bug-020 pair is green-eligible immediately.Validation status
Authoring-only at this point — neither stand has run these yet (both stands are currently busy). Stand validation is deferred to the sweep phase, where the cells must be run via
make e2e NAME=<cluster> SCENARIO=cli-matrix/...and the replays viatests/operator-harness/replay-runner.shbefore the covered bugs count as closed per the protocol.Local checks done:
bash -n+shellcheckon both cells (only the SC1091 info shared with every sibling cell),yaml.safe_loadon both replays, and everyawait.kind/field used (replica_count,all_uptodate,prop_valuewithobj: controllerand omitted-expectedabsence semantics) verified against the currentreplay-runner.sh/lib.shcontract.Known limitation
The replay runner executes
linstorargv only — no dd/curl/cryptsetup steps — so the byte-level clone assert and the LUKS-header probes live exclusively in the L6 cells; the replays pin the verb + convergence shape the runner can express.Summary by CodeRabbit