diff --git a/drafts/2026-05-08T193033Z.md b/drafts/2026-05-08T193033Z.md
new file mode 100644
index 0000000..b9d64fb
--- /dev/null
+++ b/drafts/2026-05-08T193033Z.md
@@ -0,0 +1,75 @@
+# Reply draft: Veris Show HN, mock-vs-live divergence and the runtime hook seam
+
+- **HN:** https://news.ycombinator.com/item?id=48054313
+- **Story:** "Show HN: Veris - Agent sandboxes with simulated external services" (id=48054313, posted by jrm-veris, 9 points / 23 hours / 0 comments at draft time, links to https://veris.ai/sandbox)
+- **Status:** draft (pending manual post)
+
+## Discovery
+
+Browser sweep (no memorized links):
+
+1. `https://news.ycombinator.com/ask` - scanned top 24 Ask HN; mostly meta-topics (career, AI cost, MCP-process count, "is Claude Code getting worse", LLM comments) which the thread-fit gate filters out.
+2. `https://news.ycombinator.com/show` - scanned top 30 Show HN; spotted Tilde.run (id=48037724) at 196 points / 129 comments (saturated, mid-thread visibility near zero per gate), and a cluster of fresh adjacent Show HNs.
+3. `https://hn.algolia.com/?q=agent+deleted&type=story&dateRange=pastWeek&sort=byDate` - 1 result (Crit, id=48062402; review-tool space already heavily covered in our open PRs).
+4. `https://hn.algolia.com/?q=claude+code&type=story&dateRange=pastWeek&sort=byPopularity` - surfaced the Claude Code symlink-sandbox-escape CVE (id=48057842, 42 pts, 5 comments). Inspected the FailProof `block-read-outside-cwd` source in `src/hooks/builtin-policies.ts` lines 763-820: it uses `path.resolve()` with no `realpath`/symlink resolution, so it shares the same bypass shape as the Claude Code CVE. Cannot honestly pitch it as a fix; thread fails the gate. Skipped.
+5. `https://hn.algolia.com/?q=agent+sandbox&type=story&dateRange=pastWeek&sort=byDate` - found Veris (id=48054313): fresh Show HN of an eval-time agent-sandbox tool with stateful LLM-powered mocks, 9 points / 0 comments / 23 hours. Clean adjacent-product Show HN, gate-passing.
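A minimal sketch of the bypass shape noted in item 4 of the sweep, assuming only Node's standard `fs`, `os`, and `path` modules. The helper names `isInsideLexical`, `isInsideReal`, and `demo` are hypothetical and are not taken from FailProof's `block-read-outside-cwd` source; the point is only that `path.resolve()` is purely lexical while `fs.realpathSync()` follows symlinks.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// True when `rel` (a path.relative() result) stays at or below its base directory.
function staysInside(rel: string): boolean {
  return rel === "" || !(rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel));
}

// Lexical check: path.resolve() normalizes "." and ".." but never touches the
// filesystem, so a symlink inside cwd that points elsewhere still looks "inside".
function isInsideLexical(cwd: string, target: string): boolean {
  return staysInside(path.relative(path.resolve(cwd), path.resolve(cwd, target)));
}

// Symlink-aware check: canonicalize both sides with realpath before comparing.
// realpathSync throws on a missing path; this sketch treats that as "not inside"
// (an assumption, not a documented FailProof behavior).
function isInsideReal(cwd: string, target: string): boolean {
  try {
    const realCwd = fs.realpathSync(cwd);
    const realTarget = fs.realpathSync(path.resolve(cwd, target));
    return staysInside(path.relative(realCwd, realTarget));
  } catch {
    return false;
  }
}

// Builds a throwaway workspace containing a symlink to a sibling directory,
// runs both checks against a file reached through that symlink, then cleans up.
function demo(): { lexical: boolean; real: boolean } {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "symlink-demo-"));
  const workspace = path.join(root, "workspace");
  const outside = path.join(root, "outside");
  fs.mkdirSync(workspace);
  fs.mkdirSync(outside);
  fs.writeFileSync(path.join(outside, "secret.txt"), "secret");
  fs.symlinkSync(outside, path.join(workspace, "link"), "dir");
  try {
    const target = path.join("link", "secret.txt");
    return {
      lexical: isInsideLexical(workspace, target),
      real: isInsideReal(workspace, target),
    };
  } finally {
    fs.rmSync(root, { recursive: true, force: true });
  }
}

console.log(demo());
```

In this sketch the lexical check reports the symlinked file as inside the workspace while the realpath check reports it as outside, which is the gap described in item 4. Creating symlinks may need extra privileges on Windows, so the demo is best run on Linux or macOS.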
+
+Three-surface duplicate scan confirms id=48054313 is not in `drafts/`, not in `comments/`, and not in any open PR diff on this repo.
+
+## OP
+
+The submission is link-only - no inline `toptext` on the HN page. Per the linked Veris product page (https://veris.ai/sandbox):
+
+- Veris is a pre-prod simulation environment for testing AI agents end-to-end.
+- It ships **stateful, LLM-powered mock services** for 50+ enterprise platforms (SWIFT and OpenSanctions for banking, Salesforce / HubSpot for CRM, Zendesk / Intercom for support, Slack / Jira for productivity, Stripe / Shopify for payments).
+- Fault classes it explicitly catches: hallucinations, incorrect tool usage, policy violations, context retention failures, latency.
+- Architecture: scenario generation from agent code / production logs / past incidents, deterministic Veris Simulation Engine with rewards and replay scoring, multi-layer grading (scripted, LLM-judge, hybrid), training integration for SFT and RL.
+- Positioning: framework-agnostic eval-time platform, no MCP / Claude Code / Agents-SDK specifics in the public material.
+- Use cases highlighted: customer support, fraud detection.
+- The pitch: "ship knowing only the happy path" is the failure mode.
+
+## My reply
+
+```
+(disclosure: I work on FailProof AI: https://github.com/exospherehost/failproofai)
+
+The load-bearing property here is that the mocks are LLM-powered and stateful, so you can run 10k scenarios safely without moving real money or paging a real Salesforce admin. The cost: bugs whose pathology only surfaces against the live service (idempotency-key replay, prod-vs-staging account-ID prefix drift, the rate limiter's actual jitter, partial state on a 502) won't reproduce in the mock. The same agent that passes every Veris scenario can emit a malformed call against real Stripe on the first prod prompt that nudges its plan off-script. A PreToolUse hook fills that gap by denying on call shape rather than scenario coverage:
+
+    customPolicies.add({
+      name: "block-prod-stripe-transfer-over-threshold",
+      match: { events: ["PreToolUse"] },
+      fn: async (ctx) => {
+        const url = String(ctx.toolInput?.url ?? "");
+        const amount = Number(ctx.toolInput?.body?.amount ?? 0);
+        if (url.includes("api.stripe.com/v1/transfers") && amount > 100_000)
+          return deny("Stripe transfer above $1000 (cents) blocked at runtime");
+        return allow();
+      },
+    });
+
+Eval-time mocks gate the scenarios you wrote; runtime hooks gate the calls you didn't see coming.
+```
+
+## Insight for the FailProof team
+
+The Veris-shaped axis is meaningfully distinct from the static-vs-runtime (Snyk, PR #42) and scenario-vs-runtime (TrainForgeTester PR #53, Spec27 PR #41) framings already in the open-PR set: it's specifically about **mock-vs-live divergence**. A scenario-test runner doesn't have to mock the world; Veris (and Veris-likes) explicitly do, and that simulation makes the eval-time guarantees inherently softer than scenario tests against a real staging environment. The honest framing is:
+
+- Static analysis (Snyk-shape): catches what's enumerable from code at rest.
+- Scenario tests against real services (TrainForgeTester / Spec27 shape): catches what's enumerable from prompt / tool-call traces.
+- Stateful simulated services (Veris-shape): catches what's enumerable plus the multi-turn state-machine behaviors, but at the cost of mock fidelity.
+- Runtime PreToolUse hooks: catches the always-wrong call shape regardless of whether anyone enumerated it.
+
+This is a fourth seam that deserves its own one-page doc note alongside the other three. The angle "your simulator is a model of the world; the hook gates the call about to land in the world" is sharp enough to slot into any future Veris / Cygnal / Agnostic / Patronus-style pre-prod simulation Show HN. Customer support and fraud detection are the two domains Veris highlights, and both are exactly where "the agent passed every scenario but the rate limiter / idempotency / partial-state behavior in prod still bit us" is a real story; FailProof should consider an `examples/payments-policy.ts` recipe in the repo for that audience.
+
+The thread is 0 comments at draft time, so a substantive top-level peer comment lands clean without competing against existing discussion. The OP is link-only on HN (no inline `toptext`), so anyone landing here is reading from the Veris site itself and has the simulation context loaded.
+
+## Notes / findings
+
+- Body word count: ~115 words of prose + ~50 words in the snippet = ~165 total; brand-voice band is "under ~150 words" with the working example at ~110. Slightly above the working-example footprint but well below the flagged-shape ~220-word footprint. Reads short on screen because the snippet is dense.
+- ASCII punctuation only: hyphens (`-`), straight quotes (`"`/`'`), three ASCII dots if needed, parentheses for the list of mock-fidelity gaps. No em-dashes, en-dashes, curly quotes, fancy ellipses, or unicode arrows. The snippet uses the ASCII `?.` and `??` operators and plain double-quoted strings only.
+- One disclosure line in plain parens at the top, lowercase `disclosure:`, single repo URL. No second link at the bottom. No install command. No comma-list of policy names. No three-scope / 39-policies / dashboard / `~/.failproofai/` callouts. Custom-policy snippet, not a built-in name (so no over-specific claim that an OOTB policy exists for `block-prod-stripe-transfer-over-threshold` - it's illustrative).
+- Cross-thread duplicate guard: framing axis ("mock-vs-live divergence", "the simulator is a model of the world") is materially distinct from the TrainForgeTester (PR #53) "scenarios catch enumerable behaviors, hooks catch always-wrong shapes" line and from Spec27 (PR #41) "tests validate the contract you wrote, hooks catch shapes the contract didn't list" line. Snippet domain (Stripe transfer URL + amount threshold) is unique to this thread - TrainForgeTester named `block-rm-rf` only, Spec27 used a `DROP TABLE` SQL regex. Closing aphorism is paraphrase-distinct: "scenarios you wrote vs calls you didn't see coming".
+- Reply form on the Veris thread is open: `