ci: Change host tests to run in environment mode#5008

Open
backspace wants to merge 56 commits into main from environment-mode-host-tests-cs-11275


@backspace backspace commented May 28, 2026

Contributor

This lets us catch environment-mode regressions in CI. Summary of changes:

  • live-test also switches to environment mode so that test-web-assets can still be shared with the host shards
  • test-web-assets is configurable for environment mode (via the ci environment name)
  • tests map the hardcoded https://localhost:4202/test URL to the running test realm URL
  • assertions that hardcoded *.localhost:4201 (published), localhost:4202 (test), and localhost:4206 (icons) are now derived dynamically
  • realm permissions are cloned for environment-mode domains

This will tighten feedback cycles for host tests in parallel environments:

[Screenshot, 2026-06-12 at 15:55:46]

backspace and others added 2 commits May 27, 2026 17:41
Dev realms that are public-readable in standard mode (skills, catalog,
experiments) returned 401 to unauthenticated readers in environment
mode. Their public-read grants are seeded by static-URL migrations keyed
on localhost:4201, but environment mode mounts the same realms at
realm-server.<slug>.localhost URLs that no migration row matches. The
base realm is unaffected because its public grant is keyed on the
canonical https://cardstack.com/base/ URL.

Add an environment-mode-only step to the boot-time registry backfill
that mirrors the already-public set (matched by URL path) onto each
bootstrap realm's actual URL. It is idempotent and only grants read on
paths already declared public by policy, so it cannot expose a realm
that is meant to stay private.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
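
The path-keyed mirroring described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code in realm-registry-backfill.ts: the `PermissionRow` shape, the function name, and the example URLs are assumptions made for the sketch.

```typescript
// Hypothetical row shape; the real realm_user_permissions schema may differ.
interface PermissionRow {
  realmURL: string;
  username: string; // '*' stands for any reader, including unauthenticated ones
  read: boolean;
}

// Returns the rows that would need inserting so that each env-mode realm URL
// gets a public-read grant when some existing realm with the same URL path
// already has one. URLs that already carry the grant are skipped, so running
// this a second time yields nothing to insert.
function publicReadRowsToMirror(
  existing: PermissionRow[],
  envRealmURLs: string[],
): PermissionRow[] {
  const publicRows = existing.filter((row) => row.username === '*' && row.read);
  const publicPaths = new Set(
    publicRows.map((row) => new URL(row.realmURL).pathname),
  );
  const alreadyGranted = new Set(publicRows.map((row) => row.realmURL));
  return envRealmURLs
    .filter((url) => publicPaths.has(new URL(url).pathname))
    .filter((url) => !alreadyGranted.has(url))
    .map((url) => ({ realmURL: url, username: '*', read: true }));
}
```

Because only paths that already have a `*` read grant are considered, a realm with no public grant in standard mode never gains one here, which is the safety property the commit message relies on.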
Boots the realm-server stack under BOXEL_ENVIRONMENT (Traefik, *.localhost
hostnames, per-slug database, dynamic ports) and asserts that the skills
realm — public in standard mode — also returns 200 to an unauthenticated
reader here, the parity the host integration tests depend on.

This is a deliberately small first step: it de-risks the env-mode CI
infrastructure (Traefik in CI, *.localhost resolution, env-mode boot)
before converting the full sharded host suite, and it guards the
public-read parity restored in the previous commit. A fixed "ci" slug is
safe because each job runs on an isolated VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions Bot commented May 28, 2026

Contributor

Preview deployments

Host Test Results

    1 files  ±0      1 suites  ±0   1h 54m 55s ⏱️ +49s
3 078 tests ±0  3 062 ✅ +1  15 💤 ±0  0 ❌ ±0  1 🔥  - 1 
3 097 runs  ±0  3 080 ✅ +2  15 💤 ±0  1 ❌  - 1  1 🔥  - 1 

Results for commit 3562962. ± Comparison against earlier commit 50fa6bb.

For more details on these errors, see this check.

Realm Server Test Results

    1 files  ±0      1 suites  ±0   10m 15s ⏱️ +4s
1 722 tests ±0  1 722 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 815 runs  ±0  1 815 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 3562962. ± Comparison against earlier commit 50fa6bb.

backspace and others added 13 commits May 27, 2026 19:40
Extends the environment-mode proof-of-concept to run the realm-touching
suite that the ticket reported failing in env mode, against the env-mode
dist. Validates the fix beyond the public-read curl check and is the
first step toward running the full host suite in environment mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bare 'ember' invocation failed with 'No such file or directory'
because node_modules/.bin was not on PATH; route through pnpm exec like
the other host test steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The public-read parity assertion is the job's gate; running the
realm-touching suite in env mode still surfaces harness issues unrelated
to that fix, so keep the slice visible but non-blocking.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probes the prerender's failing cross-origin module fetch
(host.ci.localhost -> realm-server.ci.localhost/base/card-api) to
determine whether the realm-server behind Traefik emits an
Access-Control-Allow-Origin header for a cross-subdomain origin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Boots the realm-server test stack under BOXEL_ENVIRONMENT (Traefik,
*.localhost hostnames, per-slug paths) and asserts the same public-read
parity gate the host PoC validates. Runs one shard of the realm-server
suite under the env-mode stack as a non-blocking slice while integration
fallout against env-mode services is matured. Reuses the env-mode CI
infrastructure proven by the host PoC (no new infra required).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Shard 1/6 ran cleanly under env mode (244/0/0), so expand to a 6-shard
matrix mirroring the existing standard-mode realm-server-test job. Kept
non-blocking (continue-on-error on the test step) so the per-shard
public-read parity gate remains the required signal while any
integration-level fallout is matured. Promote to required once stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wait for /base/_readiness-check (the same gate test-services:host uses
to start its second stage, which covers full indexing including
prerender) before declaring readiness. Then pause 60s to let the
prerender's standby pool fully populate before launching the test
runner's own chrome instance, since concurrent chrome lifecycle events
trigger NetworkChangeNotifier and abort still-loading standby pages
with ERR_NETWORK_CHANGED. Visible in the previous run as 167
ERR_NETWORK_CHANGED events and FilterRefersToNonexistentType errors
from cards being indexed against a base whose modules table never
fully populated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A botched edit left the prior parity step's bash tail merged into the
Settle step's run value, so the executed command became
'sleep 60 cat /tmp/skills_info.json ...' and sleep exited with code 2.
Replace with a clean run: sleep 60 and drop the now-redundant CORS
diagnostic step (CORS through Traefik is already confirmed working).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
card-catalog ran 10/10 cleanly under the env-mode stack with the
strengthened readiness gate + standby settle, so expand the host PoC
into the full 20-shard partitioned suite mirroring the standard
host-test job (HOST_TEST_PARTITION/COUNT consumed by
ember-test-pre-built). Kept non-blocking (continue-on-error on the test
step) so the per-shard public-read parity gate remains the required
signal while any env-mode-specific fallout across the full suite is
matured; promote to required once stable across all shards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Path B: host test fixtures and helpers hardcode the live test realm
URL as https://localhost:4202/test/, which doesn't exist in environment
mode (the test realm-server runs at a per-environment Traefik hostname
instead). Expose the running URL as config.resolvedTestRealmURL (derived
from REALM_TEST_URL, which env-vars.sh sets in both modes) and have the
host's NetworkService rewrite the hardcoded URL to it at fetch time
when the two differ. Mirrors the existing addURLMapping pattern used
for the canonical base realm; no-op in standard mode and production.
Closes ~80 env-mode test failures concentrated in shards 4, 10, 11, 12
(realm-querying, realm-indexing, realm tests).

Cluster C: mirror the standard host-test job's chunk-fetch retry block
in the env-mode host job so a transient ChunkLoadError / Failed to fetch
dynamically imported module / NetworkError aborting a whole shard before
any tests run gets one retry. Previously cost shard 13 its entire run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
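
As a rough illustration of the Path B rewrite, the prefix swap can be written as a standalone function. This sketch does not use the real `addURLMapping` or NetworkService APIs; `rewriteTestRealmURL` is a hypothetical helper, and only the two URL shapes are taken from the commit messages.

```typescript
// The URL that host test fixtures hardcode for the live test realm.
const HARDCODED_TEST_REALM = 'https://localhost:4202/test/';

// Rewrites a URL under the hardcoded test realm prefix so that it points at
// the realm actually being served. When the resolved URL equals the hardcoded
// one (standard mode), or the input is outside that realm, the input comes
// back unchanged.
function rewriteTestRealmURL(url: string, resolvedTestRealmURL: string): string {
  if (resolvedTestRealmURL === HARDCODED_TEST_REALM) {
    return url;
  }
  if (!url.startsWith(HARDCODED_TEST_REALM)) {
    return url;
  }
  return resolvedTestRealmURL + url.slice(HARDCODED_TEST_REALM.length);
}
```

For example, with a resolved URL of `https://realm-test.ci.localhost/test/`, a fixture reference to `https://localhost:4202/test/hassan` would be fetched from `https://realm-test.ci.localhost/test/hassan`.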
The appendFileSync call in env-mode-lock.js (added in #5021, merged
into this branch via 4d9c587) was multilined past prettier's print
width, breaking the Lint job's prettier check. Format it on a single
line so CI lint goes green on this branch; the fix is in main territory
and can be cherry-picked there independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolvedTestRealmURL was added to environment.js in the previous commit
but missing from environment.ts's exported config type, so network.ts
saw it as unknown and ember-tsc failed. Declare it as string.

local-testing.md picked up a trailing-whitespace prettier warning
in the merge from main; reformat to keep the Lint job green here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
backspace and others added 13 commits June 1, 2026 09:05
Last run showed Path B closing ~66 env-mode failures (shards 4/10/11/12)
but introducing 24 new failures concentrated in shard 9
(Integration | operator-mode | links: 'waitFor timed out waiting for
[data-test-stack-card=http://test-realm/test/BlogPost/1]'). The same
suite passes 103/0 in standard mode on the same SHA, so the regression
is env-mode-specific and either Path B or a commit pulled in via the
merge from main caused it.

Comment out the addURLMapping call (keep the config + type plumbing) so
the next CI run isolates which: if shard 9 returns to 0 and shards
10/11/12 regress, Path B was the source; if shard 9 stays at 24, the
merge introduced it independently.

Not a final state — this commit is meant to be reverted/replaced once
the source is identified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When Path B was first added (commit 746e9bd) it caused a 24-test
regression in shard 9's Integration | operator-mode | links: half the
host-app code paths went through VirtualNetwork (with the new mapping)
while the other half still went through the deprecated global
prefixMappings (which didn't), producing asymmetric URL resolution that
broke card-instance lookups.

CS-10752 has since landed in full — the global prefixMappings table and
the deprecated card-reference-resolver module are gone from main, so
every resolution site now goes through VirtualNetwork uniformly. The
asymmetry that broke shard 9 no longer has anywhere to live, so the
mapping is safe to restore. With it back, ~66 hardcoded-URL test
failures across shards 4/10/11/12/15 should close again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The env-mode parity gate was waiting on base/_readiness-check, which
returns 200 once the base realm finishes its from-scratch index but
leaves skills indexing still in flight (the realm-server indexes
base and skills sequentially). The AI-assistant tests then fetch
Skill/boxel-environment from the skills realm and get 404 because
skills hasn't been indexed yet, accounting for the 11+7=18 ai-assistant
failures clustered in shards 7 and 17.

Add a parallel wait on skills/_readiness-check so the test step does
not start until both base and skills have completed indexing. Standard
mode does not need this because the host-test job's wait orchestration
naturally delays past skills indexing via its longer asset-restore
path; env-mode CI builds the dist inline and gets to the gate earlier.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
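
A parallel wait on several readiness endpoints might look like the sketch below. The actual gate lives in the workflow, so everything here is illustrative: the `Probe` type, the option names, and both function names are assumptions, and the probe is injected so the sketch does not depend on a particular HTTP client.

```typescript
// Resolves to an HTTP status code; a real probe would wrap an HTTP client.
type Probe = (url: string) => Promise<number>;

interface WaitOptions {
  attempts: number; // maximum number of probes per URL
  delayMs: number; // pause between probes
  // Injectable so callers (and tests) can avoid real waiting.
  sleep?: (ms: number) => Promise<void>;
}

const realSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Polls one URL until it answers 200 or the attempts run out.
async function waitForReady(
  url: string,
  probe: Probe,
  options: WaitOptions,
): Promise<void> {
  const sleep = options.sleep ?? realSleep;
  for (let attempt = 1; attempt <= options.attempts; attempt++) {
    try {
      if ((await probe(url)) === 200) return;
    } catch {
      // Connection errors are treated the same as "not ready yet".
    }
    if (attempt < options.attempts) await sleep(options.delayMs);
  }
  throw new Error(`${url} not ready after ${options.attempts} attempts`);
}

// Resolves only once every URL is ready; rejects if any of them gives up.
async function waitForRealms(
  urls: string[],
  probe: Probe,
  options: WaitOptions,
): Promise<void> {
  await Promise.all(urls.map((url) => waitForReady(url, probe, options)));
}
```

With a runtime that provides `fetch`, a probe such as `(url) => fetch(url).then((response) => response.status)` could be passed in, along with the `base/_readiness-check` and `skills/_readiness-check` URLs.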
Env-mode CI runs `pnpm build` (in packages/host) before launching
test-services, which materializes the pnpm workspace and creates an
empty packages/skills-realm/contents directory. services/realm-server
later invokes `pnpm skills:setup`, whose `[ -d contents ] || git clone`
heuristic sees the directory already exists and skips the clone — so
the skills realm boots with no content. The realm's `#startup` finds
no files to index, completes immediately, and `_readiness-check`
returns 200 promptly. Tests then fetch `@cardstack/skills/Skill/
boxel-environment` and 404, breaking every ai-assistant-panel | skills
and | commands test in env mode (~18 failures total).

Standard mode doesn't hit this because it downloads a pre-built dist
artifact rather than running `pnpm build`, so contents/ doesn't exist
when services/realm-server's skills:setup check runs and the clone
fires normally.

Add an explicit skills:setup step in both env-mode jobs (host and
realm-server) before any pnpm workspace op runs, so contents/ has real
boxel-skills content on disk when test-services starts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The skills:setup script's SSH-then-HTTPS fallback chain has a silent
failure mode in the env-mode CI: SSH clone fails with 'Permission
denied (publickey)' as expected (no SSH key in CI) but leaves a
non-empty contents/ directory, which the HTTPS clone fallback then
refuses to overwrite. The script exits 0 because git clone's failure
output is swallowed in the chain, so the step succeeds but
packages/skills-realm/contents has no real content. The skills realm
boots empty, _readiness-check returns 200 fast (no work), and tests
404 on Skill/boxel-environment.

Replace the pnpm skills:setup call in both env-mode CI jobs with a
direct, verifiable HTTPS clone: bypass SSH, skip if an actual Skill/
directory is already present (idempotent), and ls the result so any
future regression is visible in the log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arts

Two prior fix attempts populated packages/skills-realm/contents and ls confirmed Skill/boxel-environment.json was present, but the skills realm still indexed zero files in env mode. Either something between the populate step and the realm-server's NodeAdapter read is clobbering the content, or the realm-server is reading from a different path. Add an explicit ls at three vantage points (workspace-relative, Skill subdir, realm-server-cwd-relative) right before test-services start to nail down which case applies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scoped-fromUrl bootstrap realms (skills, openrouter) take their
realm.url from the env-mode served URL, not the canonical, so
the static-URL migration rows in realm_user_permissions never
match. getRealmOwnerUserId then throws "Cannot determine realm
owner" inside from-scratch-index on boot, which fullIndex catches
and swallows -- the realm mounts but indexes zero files. Every
ai-assistant-panel | skills test then 404s.

Extend the env-mode parity step to mirror realm-owner, write, and
named-user permissions by URL pathname (was: only *: read). Keys
off existing standard-mode rows so it stays in lockstep with the
migrations; per-(username, env-url) check preserves any custom
admin permission across reruns.
The realm-server tests run against a template DB whose migrations
already seed standard-mode permission rows for /skills/, /catalog/,
etc. Tests that deepEqual the full row set at the env-mode URL pick
those up alongside their own inserts and produce misleading "actual"
values: the coalesces and custom-row tests saw three rows, including
* and @skills_writer, instead of the one row they set up.

Switch the three deepEqual tests to a synthetic /probe-realm/
pathname that no migration touches. The publicReadGranted tests
stay on /skills/ since they assert one property and aren't
affected by extra rows.
backspace added 10 commits June 11, 2026 12:25
Host: remove the create-realm-users workflow step. In env mode
`mise run test-services:host` calls `start:matrix`, which both
boots Synapse and auto-registers users on first run; the workflow
backgrounds that and the extra register step raced the Synapse
boot, failing with "No such container: boxel-synapse-ci".

Realm-server: skip tests/index.ts's TLS env-var deletion when
BOXEL_ENVIRONMENT is set. The deletion was added to defeat the
first-byte dispatcher's plain-HTTP→https redirect in standard mode,
but env-mode createListener uses a pure h2 server (no dispatcher)
and treats the cert as a system invariant — wiping it throws
"HTTP/2 requires a TLS cert/key but ... are not set" on every
fixture spawn.
`env-vars.sh` exports REALM_SERVER_TLS_CERT_FILE / KEY_FILE only when
the cert files are already on disk at the moment it's sourced. In CI,
mise-action sources env-vars.sh at workflow init — which runs before
`actions/init`'s `mise run infra:ensure-dev-cert` provisions the cert —
so the conditional silently skips and the env vars never reach the
test step. The result was every in-process fixture realm-server in env
mode throwing "HTTP/2 requires a TLS cert/key but ... are not set"
(92 failures per realm-server shard, previously hidden by
continue-on-error: true on the standalone env-mode PoC job).

Set the paths explicitly at the job level to match where
infra:ensure-dev-cert writes them on ubuntu-latest.
`tests/helpers/index.ts`'s `stripTlsEnvVars()` runs inside `runTestRealmServer`
and the other fixture helpers, so even with the tests/index.ts gate in
place the helpers still re-stripped the env vars on every fixture
spawn — leaving in-process realm-servers throwing "HTTP/2 requires a
TLS cert/key but ... are not set" in env mode.

Same gate as tests/index.ts: standard-mode rationale (defeat the
dispatcher's plain-HTTP→https redirect) doesn't apply in env mode
where createListener uses a pure h2 server with the cert as a system
invariant.
The realm-server tests use supertest with plain HTTP/1.1 against in-
process fixture realm-servers on random `127.0.0.1:444X` ports.
Env-mode `createListener` requires HTTPS+HTTP/2 with no plain-HTTP
fallback, so fixtures spawned under BOXEL_ENVIRONMENT=ci either throw
"HTTP/2 requires a TLS cert/key" or the h2 server rejects supertest's
plain bytes as garbage. The cost of running this job in env mode is a
test-infrastructure refactor (h2-aware fixture client, or a test-only
bypass in createListener) that is out of scope here.

This restores the standard-mode job (test-web-assets dependency,
register-realm-users, pnpm test:wait-for-servers) and removes the
env-mode-aware gates from tests/index.ts and tests/helpers/index.ts
that were paired with the now-reverted env-mode job env.
`testModuleRealm` is the URL of the live test realm-server's /test/
realm — used by ~57 integration tests as the base for module
references (e.g. `${testModuleRealm}some-card`). It was hardcoded to
the standard-mode `https://localhost:4202/test/`; in env mode the
test realm-server serves it at `https://realm-test.<slug>.localhost/test/`.

Derive it from ENV.resolvedTestRealmURL, which `environment.js`
already populates from REALM_TEST_URL.
`baseTestMatrix.url` in the host test helpers was hardcoded to
`http://localhost:8008` — standard-mode Synapse. In env mode Synapse
lives at `https://matrix.<slug>.localhost`, so every test setup that
spun up MatrixService against this URL failed `POST .../login` with
"Failed to fetch", cascading into "Error loading env default system
card" and the 11 `ai-assistant-panel | skills` failures with
"Timed out waiting for env skill to be active".

ENV.matrixURL is already resolved by environment.js from
BOXEL_ENVIRONMENT (or MATRIX_URL) — use that.
Replace literal `https://localhost:4202/test/...` (live test realm)
and `http://localhost:4206/...` (icon server) string usages with
`testModuleRealm` and an `iconsBase` const sourced from
`ENV.iconsURL`. Both already resolve correctly per mode — env mode
gets the `realm-test.<slug>.localhost` / `icons.<slug>.localhost`
hostnames, standard mode keeps the original ports — so the same test
bundle works against either stack.

Adds the testModuleRealm import to files that referenced it via the
URL but didn't yet import the helper.
The two `indexing identifies an instance's [card|polymorphic] references`
tests deepEqual `refs!.sort()` against a hand-written expected list that
had `http://localhost:4206/...` (standard-mode icons) at the top —
lexically first because `http:` sorts before `https:`. In env mode
icons resolve to `https://icons.<slug>.localhost/...`, which sorts
between `https://cardstack.com/...` and `https://packages/...`, so the
hand-written order no longer matches the sort output.

Add `.sort()` to the expected array as well so the assertion is robust
to the lexical position of `iconsBase` URLs in either mode.
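
The ordering shift is easy to reproduce in isolation. In the sketch below, the icon file path and the `https://packages/...` entry are made-up placeholders; only the URL prefixes come from the commit message.

```typescript
// Builds a small reference list around a given icons base URL and sorts it
// with the default (UTF-16 code unit) string ordering.
function sortedRefs(iconsBase: string): string[] {
  return [
    'https://packages/example-module', // placeholder entry
    `${iconsBase}folder-pen.js`, // placeholder icon path
    'https://cardstack.com/base/card-api',
  ].sort();
}

// Standard mode: 'http:' sorts before 'https:', so the icon comes first.
const standardOrder = sortedRefs('http://localhost:4206/');

// Env mode: 'https://icons...' lands between 'https://cardstack.com/...'
// and 'https://packages/...'.
const envOrder = sortedRefs('https://icons.ci.localhost/');

console.log(standardOrder.indexOf('http://localhost:4206/folder-pen.js'));
console.log(envOrder.indexOf('https://icons.ci.localhost/folder-pen.js'));
```

Sorting the expected array with the same comparator as the actual one removes the dependence on where the icons URL happens to fall.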
The 6 `host submode > with a realm that is publishable > publish and
unpublish realm` tests asserted against literal strings of the form
`https://testuser.localhost:4201/test/` and friends. The publishing UI
builds those URLs from `ENV.publishedRealmBoxelSpaceDomain` /
`publishedRealmBoxelSiteDomain` (both fall back to
`defaults.realmHost`: `localhost:4201` in standard mode,
`realm-server.<slug>.localhost` in env mode), so the assertions need
to derive the host the same way.

Source the host from `ENV.publishedRealmBoxelSpaceDomain` and rebuild
every `<username>|<custom-subdomain>.localhost:4201` literal as a
template literal interpolating that host. Object literal keys are
wrapped in computed-property brackets.
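
A hedged sketch of the resulting assertion shape: `publishedRealmURL` is a hypothetical helper, not something in the host tests, and the `ci` slug and the `'published'` value are example data.

```typescript
// Builds a published-realm URL from a subdomain and the published-realm host,
// the same way the literals were previously spelled out by hand.
function publishedRealmURL(
  subdomain: string,
  publishedDomain: string,
  realmPath = 'test/',
): string {
  return `https://${subdomain}.${publishedDomain}/${realmPath}`;
}

// In the tests the host would come from ENV.publishedRealmBoxelSpaceDomain;
// here it is passed in directly.
const standardURL = publishedRealmURL('testuser', 'localhost:4201');
const envURL = publishedRealmURL('testuser', 'realm-server.ci.localhost');

// Object keys built from the URL need computed-property brackets.
const expectedByURL = {
  [envURL]: 'published',
};
```

The same helper produces `https://testuser.localhost:4201/test/` in standard mode, so one set of assertions can serve both stacks.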
@backspace backspace changed the title from "Change host and realm server tests to run in environment mode" to "ci: Change host tests to run in environment mode" Jun 12, 2026
- ci.yaml: drop the comment block above `realm-server-test` — the
  job is identical to main, the comment described a hypothetical
  alternative that doesn't exist in the code.
- ci-host.yaml: rewrite the `Verify *.localhost resolves to loopback`
  preamble to state the invariant the step checks instead of framing
  it as the "riskiest unknown"; rewrite the `Wait for realms` preamble
  to describe what the step blocks on rather than which fix it
  validates.
- network.ts: tighten the URL-mapping comment to name the surviving
  hardcoded surface (fixture JSON / embedded card ids) instead of
  claiming all "test fixtures and helpers" hardcode.
- realm-registry-backfill.ts: drop the "Following the same path-keyed
  design as the prior public-read-only parity" cross-reference to a
  prior version of this function.
`test-web-assets` takes an optional `boxel_environment` input that
gets baked into the cache key, artifact name, and the host build
step. Empty input keeps the existing standard-mode output (still
consumed by live-test / matrix-client-test / vscode-boxel-tools-package);
`boxel_environment: ci` produces a sibling artifact whose host bundle
has `https://realm-server.ci.localhost`-shaped URLs baked in.

ci-host.yaml adds a second `test-web-assets` call —
`test-web-assets-env-mode` — that passes `boxel_environment: ci`.
`host-test` depends on it, downloads the artifact, and skips the
per-shard `build:ui` + `pnpm build` (~60–120 s × 20 shards reclaimed).
Both consumers of test-web-assets in ci-host.yaml — live-test and
host-test — now run their chrome against the env-mode service stack.
Single `test-web-assets` call with `boxel_environment: ci` produces
one artifact both jobs share. (The standard-mode test-web-assets in
ci.yaml is untouched; matrix-client-test and vscode-boxel-tools-package
still get their standard-mode artifact.)

live-test gains the env-mode setup steps already on host-test —
mkcert root install, Traefik start, `*.localhost` loopback verify,
skills-realm clone, parity gate, prerender settle. The
`pnpm register-realm-users` step is dropped; env-mode `start:matrix`
auto-registers users on first boot via the synapse-data marker file.
…eration

Fast-feedback loop on live-test env mode:
- live-test-wait-for-servers.sh: pick the realm-server + matrix
  hostnames from REALM_BASE_URL / MATRIX_URL_VAL when set (env mode);
  fall back to the standard-mode localhost ports.
- ci-host.yaml live-test step: set REALM_URL to the env-mode skills
  URL so testem's default `test_page` points at it.
- host-test (ci-host.yaml), matrix-client-test + realm-server-test
  (ci.yaml): `if: false` with a TEMPORARY comment. Revert before
  merging.
Live-test env-mode iteration is settled; restore the original
gating on the three jobs that were temporarily off.
The parity gate already blocks on each realm's `_readiness-check`,
which by construction means a standby page completed at least one
render. The fixed 60s sleep that followed it didn't probe anything
— it guessed at how long the rest of the standby pool takes to
finish loading host.ci.localhost asset chunks. If the pool-fill
race actually bites we'll see a concrete failure signature and
design a real probe; meanwhile, dropping it removes ~60s × 20
shards of unconditional wait.
The env-mode host-test job invokes Percy inline as
`percy exec --parallel -- pnpm ember-test-pre-built`, bypassing the
`test:wait-for-servers` wrapper this script used to fan into.
Nothing else in the repo calls it.
The URL-equality guard already short-circuited in prod (resolvedTestRealmURL
defaults to the same hardcoded localhost:4202/test/ value the mapping
points away from), so this is a documentation + defense-in-depth move.
Matches the existing isTesting() pattern at monaco.ts / import.ts /
auth-service-worker-registration.ts.
@backspace backspace marked this pull request as ready for review June 12, 2026 19:56

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bd0aff5dc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/host/config/environment.js Outdated
The env-mode host bundle is built by test-web-assets.yaml with
BOXEL_ENVIRONMENT=ci set but no REALM_TEST_URL — environment.js
was falling back to the hardcoded standard-mode default
`https://localhost:4202/test/`, so the bundle baked the wrong
value for testModuleRealm and the NetworkService URL rewrite saw
hardcoded==resolved and short-circuited. Card-data references to
testModuleRealm that escape the in-process realm-server mock would
hit a localhost:4202 listener that doesn't exist in env-mode CI.

Add testRealmURL to environmentDefaults() — standard mode keeps
the original `https://localhost:4202/test/`, env mode derives
`https://realm-test.<slug>.localhost/test/` from the slug the
same way the other URLs already do. resolvedTestRealmURL falls
back to defaults.testRealmURL; explicit REALM_TEST_URL still wins
for callers that want a custom endpoint.
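
The fallback order described above can be sketched as below. This is not the code in environment.js: the function names are hypothetical, and the sketch assumes the slug is the BOXEL_ENVIRONMENT value used verbatim, whereas the real code may normalize it.

```typescript
// The subset of environment variables this sketch cares about.
interface EnvLike {
  BOXEL_ENVIRONMENT?: string;
  REALM_TEST_URL?: string;
}

// Per-mode default: env mode derives a hostname from the slug, standard mode
// keeps the original port-based URL.
function defaultTestRealmURL(slug?: string): string {
  return slug
    ? `https://realm-test.${slug}.localhost/test/`
    : 'https://localhost:4202/test/';
}

// An explicit REALM_TEST_URL wins; otherwise fall back to the per-mode default.
function resolveTestRealmURL(env: EnvLike): string {
  return env.REALM_TEST_URL || defaultTestRealmURL(env.BOXEL_ENVIRONMENT);
}
```

With this shape, a bundle built with only `BOXEL_ENVIRONMENT=ci` set resolves to `https://realm-test.ci.localhost/test/` rather than the standard-mode default.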
When the workflow ran `percy` directly via `dbus-run-session -- $TEST_CMD`,
the percy binary in `node_modules/.bin` wasn't on PATH and the shard
exited with "failed to exec 'percy': No such file or directory". Going
through pnpm puts the local bin dir on PATH.

The script body differs from what was on main (env-mode tests don't
need the `test:wait-for-servers` wrapper; the workflow's parity gate
already waits for the live realm-server / matrix to come up).
…-tests-cs-11275

# Conflicts:
#	packages/host/tests/integration/realm-indexing-test.gts
The parity gate waits on base + skills _readiness-check before
running tests, but `icons.<slug>.localhost` has no readiness probe.
A run where the Traefik route for icons hasn't registered yet (or
where the http-server hasn't bound its port) lets the test step
start anyway, and the first card that imports an icon fails with
`TypeError: Failed to fetch` instead of a meaningful 4xx.

Add a probe for a stable icon file (`folder-pen.js`) to both parity
gates (live-test and host-test). 30 attempts at 2s = 60s headroom,
which is comfortably more than the icons server's observed startup
time but won't add noticeable wall-clock when it's healthy.

This rules out the boot-time race as a cause of the recurring
mid-run `Failed to fetch` against icons.ci.localhost. If failures
persist after this lands, the cause is mid-run service drop, not
readiness.
…real card

Two related fixes for env-mode host CI:

1. Run `pnpm register-realm-users` in both ci-host.yaml jobs
   (live-test and host-test), matching what standard-mode CI already
   does. Without this, the realm-server's worker fetches `_mtimes`
   from the dev realm-server unauthenticated, the response is 404,
   and from-scratch indexing finishes with zero files. The base
   realm then 404s every card the host bundle loads
   (welcome-to-boxel.json, ai-app-generator.json,
   join-the-community.json, cards/skill, Skill/catalog-listing,
   …), which cascades into the AI Assistant, create-file, and
   highlight-cards tests failing on missing UI elements.

2. card-delete's "can delete a card that is a selected item" set
   stack 1 to the bare realm URL `${testModuleRealm}`. The live
   realm-server doesn't resolve a bare realm URL as a card, so it
   404s. Point stack 1 at `${testModuleRealm}index` instead — the
   index card exists in test-realm-cards/contents/, so the load
   succeeds in both standard and env mode without changing the
   test's intent (two distinct stacks for the selection assertion).
Env-mode CI sometimes sees a momentary `Failed to fetch` against
`icons.<slug>.localhost` mid-run — a brief Traefik route loss or
service-side hiccup. The realm-server hostnames see the same blip,
but the realm fetch path has its own `withRetries` and recovers
silently; the icons path doesn't, so the failure surfaces in
whichever test imports an icon at the wrong instant.

Treat this as the same retry-category as the existing chunk-fetch
transient: broaden the shard retry regex to match
`unable to fetch https://icons\.<host>: fetch failed`, which is
specific to the env-mode wire URL and won't accidentally retry on
real test failures.
Two integration tests
(`realm: realm can serve GET card requests with linksTo relationships to
other realms` and `realm can serve search requests whose results have
linksTo fields`) fetch `${testModuleRealm}hassan` from the live test
realm. In env-mode CI, the test realm-server occasionally finishes its
from-scratch index with zero files when the dev realm-server is
heavy-indexing on the same runner (shared matrix + prerender pool),
producing a 404 from a realm that should serve hassan.

Inverse-correlated across shards — shard A: dev indexes 0 (matrix
race), test indexes 75 (passes); shard B: dev indexes 200, test
indexes 0 (fails). Re-running the shard normally lands after the
race resolves.

Broaden the existing shard-retry regex (already covers chunk-fetch
and icons transients) to also match
`cross-realm fetch failed for https://realm-test.<host>` so this
class of env-mode boot race gets one automatic retry. The regex is
specific to the env-mode wire URL and the cross-realm-fetch log
prefix, so it won't mask real linksTo bugs in tests using
`http://test-realm/...` in-process URLs.
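
To make the retry classification concrete, here is a minimal sketch of a transient-failure matcher. The workflow's real regex is not shown in this PR text, so the patterns below are approximations reconstructed from the commit messages, and `shouldRetryShard` is a hypothetical name.

```typescript
// Log fragments treated as transient, per the commit messages. The exact
// expressions used in the workflow may differ.
const TRANSIENT_PATTERNS: RegExp[] = [
  /ChunkLoadError/,
  /Failed to fetch dynamically imported module/,
  /NetworkError/,
  // Env-mode icons host only.
  /unable to fetch https:\/\/icons\.\S+: fetch failed/,
  // Env-mode live test realm only; in-process http://test-realm/ URLs do not match.
  /cross-realm fetch failed for https:\/\/realm-test\./,
];

// A shard gets at most one automatic retry, and only for a transient match.
function shouldRetryShard(log: string, alreadyRetried: boolean): boolean {
  if (alreadyRetried) return false;
  return TRANSIENT_PATTERNS.some((pattern) => pattern.test(log));
}
```

Anchoring the last two patterns on the `https://icons.` and `https://realm-test.` hostnames is what keeps failures against `http://test-realm/...` URLs from being retried.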