test: harden v0.8.3 plan-mode trust-boundary tests + fix nested marker tracking

Follow-up from the v0.8.3 multi-model code review (GPT 5.4 + Gemini 3.1 Pro + Kimi K2.5 + MiniMax M2.7 + Claude). All reviewers verdict: ship — these are non-blocking hardening items found post-merge.

## Findings to address

1. **Nested `// altimate_change` markers break `findMarkers` tracking** (Gemini, verified against `script/upstream/analyze.ts:499-520`). The v0.8.3 marker-fix wrapped the reworded `processor.ts` warning in an *inner* marker block nested inside the existing outer plan-refusal block. `findMarkers` uses a single `openBlock` with no nesting stack, so the inner `start` clobbers the outer block and the outer `end` is dropped — the whole outer block falls out of marker tracking. Passes the strict CI gate (hunk-based) and count balance, so it didn't fail the release, but it degrades upstream-bridge accuracy. **Fix:** remove the nested markers (the warning text is already inside the outer block, so the strict gate still passes on pure deletion).

2. **Behavioral trust-boundary tests use `session: {} as any` + depend on the default plan-mode path** (consensus: all 5 reviewers; Gemini rated MAJOR). If `OPENCODE_EXPERIMENTAL_PLAN_MODE` ever defaults true, `Session.plan(session)` throws on `{}` and the exact-string reminder assertion wouldn't match the experimental prompt — tests fail confusingly or stop covering the intended path. **Fix:** structurally valid session (`{ slug, time: { created } }`) + an explicit `expect(Flag.OPENCODE_EXPERIMENTAL_PLAN_MODE).toBe(false)` precondition that fails loudly on a flag flip.

3. **Tests stop one layer short of the model-input sink** (GPT). They prove the intermediate `trustedReminderParts`, not the final `system` array + `MessageV2.toModelMessages` user-role payload. **Fix:** add an end-to-end sink test asserting attacker text never reaches `system` and the hoisted reminder is not duplicated into the user role.

4. **Exact-phrase wording assertions are fragile** (MiniMax). `"too thin to act on"` negative guard is good, but the positive checks pin exact copy that a legitimate improvement would break. **Fix:** concept/synonym-tolerant assertions.

5. **Mislabeled source-regex test** (MiniMax). The `insertReminders return shape` test actually asserts the `InsertRemindersResult` type alias exists. **Fix:** rename for honesty.

6. **plan.txt marker description was edited** (Gemini #5, minor). Reverted the 4-word addition since plan.txt is imported raw into the LLM prompt; the prose already documents the escape hatch.

## Not in scope (separate tracking)
- Pre-existing: `// altimate_change` markers in `plan.txt` are LLM-visible (imported raw). Flagged by GPT/Gemini/Kimi as a NIT — needs a build-time strip or lint rule. Will file separately.
- Experimental-plan-mode behavioral coverage — already tracked in #890.
- Classifier unification — #891.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: harden v0.8.3 plan-mode trust-boundary tests + fix nested marker tracking #898

Findings to address

Not in scope (separate tracking)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

test: harden v0.8.3 plan-mode trust-boundary tests + fix nested marker tracking #898

Description

Findings to address

Not in scope (separate tracking)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions