Experiment: use Static Site Importer for generated local sites #3309
Testing it with lots of different prompts using automated benchmarks. Site build prompts being tested: https://github.com/chubes4/homeboy-rigs/tree/main/Automattic/studio/bench/prompts/site-build Here is a screenshot from the first passing (all valid blocks matching the frontend) generation from last night:
I'm giving this PR a try. Quick questions though:
My initial approach was to mirror the client-side rawHandler in Gutenberg, using the core HTML API. Yes, only core blocks for now. Unsupported blocks fall back to core/html. There is some hardcoding, but I am trying my best to make it more about generic pattern recognition. My approach has been to run many different prompts, fix the gaps, and run again. The quality has improved significantly over time. Even with hardcoding, there are certain patterns the AI is likely to use. I think we could account for most of them and handle edge cases as they come up.
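To illustrate the fallback behaviour described above, here is a minimal sketch. It is not the Static Site Importer's actual mapping (that transformer is written in PHP and is not shown in this thread); the tag table and both function names are hypothetical. Only the core block names and the serialized `core/html` comment delimiters are standard Gutenberg conventions.

```typescript
// Illustrative only: a hypothetical tag-to-block table, not SSI's real one.
const TAG_TO_BLOCK = new Map<string, string>([
  ["p", "core/paragraph"],
  ["h1", "core/heading"],
  ["h2", "core/heading"],
  ["h3", "core/heading"],
  ["ul", "core/list"],
  ["ol", "core/list"],
  ["img", "core/image"],
  ["blockquote", "core/quote"],
]);

// Resolve a tag to a core block name, falling back to core/html
// for anything the table does not recognise.
function blockNameForTag(tagName: string): string {
  return TAG_TO_BLOCK.get(tagName.toLowerCase()) ?? "core/html";
}

// Wrap unsupported markup in a serialized core/html block so the
// original HTML survives the import untouched.
function asHtmlBlock(markup: string): string {
  return `<!-- wp:html -->\n${markup}\n<!-- /wp:html -->`;
}
```

A `Map` is used rather than a plain object so that tag names such as `constructor` cannot accidentally match inherited properties.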
Honestly, I did not think of that. I had started this experiment a while back for a plugin that creates content at scale. The site editor angle is new. Eventually, this could be the responsibility of Gutenberg, instead of Studio.
You did nothing wrong, it was me who accidentally left a local path in the original draft 🥇. Sorry about that!
## Summary

Extends the eval-runner result JSON with two new fields that any consumer of `EVAL_RUNNER_RESULT_FILE` can read:

- **`error: string | null`**: set when an exception is thrown inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.).
- **`timedOut: boolean`**: flipped by the timeout callback before it calls `query.interrupt()`.

Together these address the `finalError` ask in #3262 section 1 ("First actionable failure") and complete the failure-visibility story started by #3273 (`firstToolError`, `toolEvents`, `phaseTimingsMs`).

## Why

Three failure modes today, three different visibility outcomes:

| Failure shape | Today | After this PR |
|---|---|---|
| Model returns `success: false` cleanly | ✅ Captured fully | Same |
| Timeout fires (`query.interrupt()` after `timeoutMs`) | ⚠️ Captured as `success: false`, **indistinguishable** from a clean model failure | ✅ `timedOut: true` distinguishes it |
| Mid-loop exception (auth, MCP, network, SDK throw) | ⚠️ Caught at `main()`'s top-level → emits stripped-down `{ success: false, error }` JSON, **losing all timings/tools/turns** | ✅ Caught inside `runEval()` → full structured result with `error` set alongside everything else |

The third row is the most consequential: today, a run that completes meaningful work and then hits a late exception loses all the diagnostic state that was already captured. The new `try { … } catch` keeps the structured result intact.

## What this enables

A consumer can now answer:

- Did the eval **time out** vs. the model **gave up cleanly**? (Today: indistinguishable. After: check `result.timedOut`.)
- Was there a **runtime exception** during the run, even if `success` was `true`? (Today: not representable, because `success: true` and the exception path are mutually exclusive. After: both fields can coexist.)
- When an exception fires, **what tools ran before it, how long did each phase take**? (Today: lost in the `main()`-level fallback JSON. After: preserved in the structured result.)

This is the same shape as #3273's contribution: producer-side observability improvements that any downstream consumer benefits from. Studio's own `npm run eval`, future internal CI, promptfoo configurations, and external benchmark harnesses all read the same JSON contract.

## Diff

8 lines:

- `let error: string | null = null` and `let timedOut = false` declarations
- Timeout callback rewritten to set `timedOut = true` before calling `query.interrupt()`
- `try { … } catch ( caught ) { error = … }` wrapper around the message loop
- Two new fields in the return shape

The catch only changes behavior on exception paths; the existing successful and clean-failure paths are unchanged.

## Origin

Originally drafted on the experimental Static Site Importer branch (#3309) where it was being used by an out-of-tree benchmark harness. Split out for upstream review because:

1. The change is generic: useful to any consumer of the eval-runner's structured output, not just the SSI experiment.
2. It shouldn't ship gated on the SSI experiment merging.
3. Reviewing it on its own keeps the surface area small (8 lines vs. ~200).

The SSI draft PR will rebase on top of this once it lands.

## Tests

- `npm run typecheck` (all workspaces): clean
- `npx eslint apps/cli/ai/eval-runner.ts`: clean

The change is small enough that it's covered by the existing eval-runner exercise paths (`npm run eval`, etc.). If reviewers want explicit unit tests around the new fields I'm happy to add them.

## Refs

- Issue #3262: Eval runner should expose structured diagnostics and benchmark metadata
- PR #3273: CLI: add eval-runner diagnostics (the previous installment of this story)
- PR #3309: Experiment: use Static Site Importer for generated local sites (where this code originally lived)

## AI assistance

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the catch wrapper, the `timedOut` flag, and the result-shape extension under Chris's direction. Chris reviewed the diff, the issue framing, and the split rationale.
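The shape of the change described in the Diff section can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `apps/cli/ai/eval-runner.ts`: the `EvalMessage`, `QueryLike`, and `EvalResult` types and the `runEvalSketch` name are hypothetical, and the real result carries more fields (timings, tool events, turns).

```typescript
// Hypothetical message and result shapes; the real eval-runner types differ.
type EvalMessage =
  | { type: "tool"; name: string }
  | { type: "result"; success: boolean };

interface QueryLike extends AsyncIterable<EvalMessage> {
  interrupt(): void;
}

interface EvalResult {
  success: boolean;
  toolsRun: string[];
  error: string | null;
  timedOut: boolean;
}

async function runEvalSketch(query: QueryLike, timeoutMs: number): Promise<EvalResult> {
  let success = false;
  let error: string | null = null;
  let timedOut = false;
  const toolsRun: string[] = [];

  // Flip the flag before interrupting, so a timeout can be told apart
  // from a clean `success: false`.
  const timer = setTimeout(() => {
    timedOut = true;
    query.interrupt();
  }, timeoutMs);

  try {
    for await (const message of query) {
      if (message.type === "tool") {
        toolsRun.push(message.name);
      } else {
        success = message.success;
      }
    }
  } catch (caught) {
    // Keep everything collected so far; only record the failure.
    error = caught instanceof Error ? caught.message : String(caught);
  } finally {
    clearTimeout(timer);
  }

  return { success, toolsRun, error, timedOut };
}
```

Because the catch sits inside the function that owns the accumulated state, a late exception still returns the tools that ran before it, which is the property the third table row relies on.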
I think this PR is trying to do too much: it changes how we transform blocks but also changes other things in the system prompt. I've also fallen into this trap on previous PRs, and it's very hard to gain confidence about whether the change you're making is the right one or not. I would like to suggest smaller steps. For the static site import (or block transformation), which is the main thing this PR is trying to achieve, I'm wondering if there's a simple solution that avoids building a PHP transformer entirely and relies fully on the raw handler. The advantage is that it will work consistently and support any block that is available on the site. We could achieve it by using an eval in Playwright like we do today for block validation; instead of validating blocks, we would be transforming them directly.
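A rough sketch of that suggestion, under stated assumptions: the page has the block editor loaded so the global `wp.blocks` package (with its `rawHandler` and `serialize` functions) is available, and `PageLike` is a hypothetical structural stand-in for a Playwright page, which exposes a similar `evaluate(pageFunction, arg)` method. The function name `transformHtmlToBlocks` is illustrative and does not exist in Studio.

```typescript
// Minimal structural type so the sketch does not require Playwright;
// a real Playwright page offers a comparable evaluate(pageFunction, arg).
interface PageLike {
  evaluate<R, A>(pageFunction: (arg: A) => R, arg: A): Promise<R>;
}

// Convert raw HTML to serialized block markup by running Gutenberg's
// own raw handler inside the browser page.
async function transformHtmlToBlocks(page: PageLike, html: string): Promise<string> {
  return page.evaluate((rawHtml: string) => {
    // Runs in the browser context: assumes `wp.blocks` is loaded there.
    const { rawHandler, serialize } = (globalThis as any).wp.blocks;
    const blocks = rawHandler({ HTML: rawHtml });
    return serialize(blocks);
  }, html);
}
```

Since the conversion runs against whatever blocks are registered on the site, this approach would pick up non-core blocks without a separate transformer, at the cost of needing a browser session for every conversion.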
Thanks for the feedback. I will see if I can come up with a simpler approach. I do want to keep experimenting with this, because I already built the PHP transformer, and have been getting good results with it. For the system prompt, I think we should gradually be moving design and workflow constraints out of the system prompt and into a skill. As part of this experiment, I tweaked the prompt to try and speed up generation by reducing the number of steps. I am running some benchmarks that track the agent's font choices, color choices, and design decisions in a database over the course of many eval runs with a given system prompt. This can help us measure the impact of prompt changes on output with hard data and make sound decisions. The idea is that eventually, Studio Code can become a general-purpose WordPress agent that is powerful enough to be the daily driver for the average WordPress developer.
I think that makes sense, yeah :) Looking forward to seeing where these experiments lead.
Drops the STUDIO_STATIC_SITE_IMPORTER_PLUGIN_PATH env-var contract that required reviewers to clone chubes4/static-site-importer separately and point Studio at it. The PR is now self-contained: a fresh `npm install` fetches SSI's main zipball and extracts it into wp-files/, the CLI bundle ships it, and the mu-plugin loader symlinks from the bundle path.

- scripts/download-wp-server-files.ts: add a `static-site-importer` entry that pulls https://github.com/chubes4/static-site-importer/archive/refs/heads/main.zip and renames the extracted root into wp-files/static-site-importer/. Iteration-mode contract: re-running `npm install` refreshes SSI to the current main HEAD. Before the PR un-drafts, swap the URL for a pinned tag/sha to lock the substrate.
- apps/cli/lib/dependency-management/paths.ts: expose `getStaticSiteImporterPluginPath()` resolving against `getWpFilesPath()`.
- tools/common/lib/mu-plugins.ts: drop the env-var gate. The SSI loader mu-plugin and the symlink branch now key on `options.staticSiteImporterPluginPath`, passed in from each CLI runtime entry point.
- apps/cli/{php-server-child,playground-server-child}.ts and apps/cli/lib/run-wp-cli-command.ts: pass the bundled SSI path through to mu-plugin generation.
- apps/cli/scripts/studio-bfb-smoke.mjs: drop the env-var requirement; the bundled CLI already knows where SSI lives.
- tools/common/lib/tests/mu-plugins.test.ts: cover loader + symlink presence when an SSI path is provided, and absence when it isn't.
- Rename mu-plugin filename from 1-static-site-importer-experiment.php to 1-static-site-importer.php, and add the old name to the legacy cleanup list so existing installs scrub it.

Also bundles two unrelated prompt/UI tweaks from the same iteration:

- system-prompt.ts: drop two outdated guidance lines now superseded by the SSI workflow itself.
- ui.ts: welcome text "block themes" -> "WordPress sites" for accuracy.

## AI assistance

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the wp-files download entry, mu-plugin contract refactor, smoke-script clean-up, and tests under Chris's direction and review.
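The "key on `options.staticSiteImporterPluginPath`" change can be pictured with a small sketch. Everything except the option name and the `1-static-site-importer.php` filename is an assumption: `MuPluginOptions`, `SsiMuPluginPlan`, and `planSsiMuPlugin` are hypothetical names, and the real `tools/common/lib/mu-plugins.ts` writes files and symlinks rather than returning a plan.

```typescript
// Hypothetical option and plan shapes for illustration only.
interface MuPluginOptions {
  staticSiteImporterPluginPath?: string;
}

interface SsiMuPluginPlan {
  loaderFilename: string; // mu-plugin loader file to generate
  symlinkTarget: string; // bundled SSI directory the symlink points at
}

// Decide whether the SSI loader and symlink should exist, based only
// on the option passed in by the caller (no environment variable).
function planSsiMuPlugin(options: MuPluginOptions): SsiMuPluginPlan | null {
  const pluginPath = options.staticSiteImporterPluginPath;
  if (!pluginPath) {
    return null; // no path provided: neither loader nor symlink
  }
  return {
    loaderFilename: "1-static-site-importer.php",
    symlinkTarget: pluginPath,
  };
}
```

Passing the path explicitly from each CLI entry point keeps the decision testable: a test can cover presence and absence by changing one argument instead of mutating process state.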
The 'do not run visual screenshot loops unless the user explicitly asks' guidance reads as anti-instruction for a behavior that is no longer part of the workflow. Negative framing in prompts trains the model on the behavior we're trying to avoid; it's better to omit the topic entirely.

Removes the trailing clause from the two 'Finish promptly' workflow steps and the parenthetical from the take_screenshot tool description. The neutral tool description (what it does) remains in both the WordPress.com and local tool listings.

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Mechanical removal under Chris's direction.
Studio doesn't depend on homeboy, so the error path shouldn't either. Replace the homeboy command suggestion with the Studio-native build sequence (`npm install` at the root, `npm run build` in apps/cli) so a reviewer running the smoke without a homeboy rig sees instructions they can actually follow.

The script filename and the studio_bfb_unsupported_fallback_count option key are unchanged; "BFB" there refers to Block Format Bridge (the upstream conversion substrate), not the homeboy rig name.

## AI assistance

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Mechanical message swap under Chris's direction.
Adding draft mu-plugin filenames to LEGACY_MU_PLUGIN_FILENAMES is overengineering for a draft PR: there is no installed base of the old experiment-suffixed filename to scrub, and adding the current filename to the legacy list is conceptually wrong (it's not legacy, it's the current shipping name).

The list is for cleaning up files older Studio versions wrote on disk; the SSI mu-plugin has never shipped in a Studio release. If/when this PR lands and a future rename is needed, that's the right time to add the previous name to the legacy list.

## AI assistance

- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Mechanical removal under Chris's direction.
Summary
- `wp-files/` download chain. A fresh `npm install` fetches SSI's `main` zipball, extracts to `wp-files/static-site-importer/`, and the CLI bundle ships it. No environment variables, no separate clone, no setup steps required.
- `tmp/static-site/index.html`, then imports it with `static-site-importer import-theme`.
- `apps/cli/scripts/studio-bfb-smoke.mjs`) to validate the bundled flow end-to-end: SSI generates and activates a block theme with templates, parts, a front-page pattern, and zero fallback events.

Why
Studio's local site-generation path should not require agents to hand-author Gutenberg block markup or maintain brittle block-format prompt skills. Static Site Importer gives Studio a reusable pipeline:
This also supports customer-provided static HTML, migrated templates, and AI-generated prototypes as inputs instead of forcing Studio to discard or rewrite them.
Related upstream projects:
How SSI gets into the site
Path resolution lives in `apps/cli/lib/dependency-management/paths.ts::getStaticSiteImporterPluginPath()`, which mirrors `getPhpMyAdminPath()` / `getSqliteCommandPath()` / `getBlueprintsPharPath()`: all bundled assets resolved against `getWpFilesPath()`.

Iteration model (draft phase)

While the SSI substrate is still rapidly evolving, the downloader pulls `main` rather than a tagged release. Reviewers automatically pick up upstream fixes by re-running `npm install`. There's no per-machine setup and no env-var contract: a fresh clone of this PR plus `npm install` is everything a reviewer needs.

Before this PR un-drafts, swap the URL in `scripts/download-wp-server-files.ts` for a pinned tag, e.g.:

That locks the substrate to a known-good release for merge.
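For orientation, the path resolution and the branch-versus-tag download URL might look something like the sketch below. The body of `getWpFilesPath()` here is a placeholder stub (the real one lives in Studio's CLI), `ssiArchiveUrl` and `SsiRef` are hypothetical helpers, and the tag name used in the usage note is a made-up placeholder rather than an actual SSI release. Only the `main` zipball URL comes from this PR; the tag form follows GitHub's standard `archive/refs/tags/<tag>.zip` pattern.

```typescript
import * as path from "node:path";

// Placeholder stub: the real getWpFilesPath() resolves Studio's bundled wp-files directory.
function getWpFilesPath(): string {
  return path.join("studio-bundle", "wp-files");
}

// Mirrors the pattern described above: a bundled asset resolved against wp-files.
function getStaticSiteImporterPluginPath(): string {
  return path.join(getWpFilesPath(), "static-site-importer");
}

// Hypothetical helper: build the GitHub archive URL for a branch or a tag.
type SsiRef = { kind: "branch"; name: string } | { kind: "tag"; name: string };

function ssiArchiveUrl(ref: SsiRef): string {
  const base = "https://github.com/chubes4/static-site-importer/archive/refs";
  const segment = ref.kind === "branch" ? "heads" : "tags";
  return `${base}/${segment}/${ref.name}.zip`;
}
```

With this helper, `ssiArchiveUrl({ kind: "branch", name: "main" })` yields the URL the draft currently downloads, and switching to `{ kind: "tag", name: "<placeholder-tag>" }` is the one-line change that pinning would amount to.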
Verification (Riad-equivalent reviewer simulation)
I created a fresh worktree off this branch in a directory that has no relationship to my local SSI development setup, with no environment variables set, and walked it through the reviewer flow:
Tests
- `npm run typecheck` (all workspaces): clean
- `npx eslint <touched files>`: clean
- `npx vitest run`: 1483 passed across 157 test files, including 9 in `tools/common/lib/tests/mu-plugins.test.ts` (3 new SSI tests cover loader + symlink presence when the plugin path is provided, and absence when it isn't)
- `apps/cli/scripts/studio-bfb-smoke.mjs`: passes end-to-end against a freshly-bundled site

Notes
- `apps/cli/ai/system-prompt.ts` (now superseded by the SSI workflow itself), and a welcome-text wording change in `apps/cli/ai/ui.ts` ("block themes" → "WordPress sites"). Both are minor, both ride along.
- `take_screenshot`/`share_screenshot` URL resolution for local sites). The two PRs touch separate concerns; #3309 (Experiment: use Static Site Importer for generated local sites) no longer rides on top of #3307 (Fix AI screenshot local site resolution) and can land independently.

AI assistance
2f336e86) was authored separately by OpenCode/GPT-5.5; the iteration on top is Claude Code's.