
fix(security): KB fileUrl LFI, MCP/Agiloft SSRF pinning, form OTP, KB authz #4639

Merged
waleedlatif1 merged 6 commits into staging from waleedlatif1/fix-kb-fileurl-lfi
May 17, 2026

Conversation

@waleedlatif1

Collaborator

Summary

Bundled security hardening across multiple surfaces:

  • KB fileUrl LFI: POST /api/knowledge/[id]/documents/upsert previously accepted any non-empty string for fileUrl, which the background processor then passed to fs.readFile / parseFile, allowing authenticated arbitrary local file reads (e.g. /app/.env, /etc/passwd). Now gated at the boundary via knowledgeDocumentFileUrlSchema (only data: URIs or http(s):// URLs); defense-in-depth in document-processor.ts throws on any other scheme.
  • KB authorization & transaction hardening: introduces KnowledgeBasePermissionError, tightens permission checks, and wraps multi-step KB mutations in transactions.
  • MCP DNS-rebinding (SSRF): outbound MCP calls now go through pinned-fetch.ts (undici Agent with a pinned lookup) so the host the policy validated is the host actually connected to.
  • Agiloft SSRF/DNS pinning: resolveAgiloftInstance + pinned fetch for attach / retrieve routes.
  • Grafana baseUrl validation: update_alert_rule / update_dashboard route through secureFetchWithValidation.
  • Form email-OTP flow: new POST /api/form/[identifier]/otp route + email-auth.tsx UI; shared OTP module under lib/core/security/otp.ts keyed by DeploymentKind, with the chat OTP route refactored to consume it.
  • Audit script ratchet: bumps the API-validation baseline by one route for the new form OTP endpoint.

Test plan

  • bun run check:api-validation:strict passes
  • Lint passes
  • 194 unit tests pass across affected areas (contracts, route handlers, services, pinned-fetch, OTP)
  • Production data audited via PlanetScale MCP: none of the 10,501 legitimate relative /api/files/serve/... rows has completed status (they were already failing under the old fs.readFile path), so the change only produces a clearer error and is not a regression
  • Backwards-compat verified end-to-end on staging

🤖 Generated with Claude Code

@vercel

vercel Bot commented May 17, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Ready | Preview, Comment | May 17, 2026 5:51am |


@gitguardian

gitguardian Bot commented May 17, 2026


️✅ There are no secrets present in this pull request anymore.

If these secrets were true positives and are still valid, we highly recommend that you revoke them.
While these secrets were previously flagged, we no longer have a reference to the specific commits where they were detected. Once a secret has been leaked into a git repository, you should consider it compromised, even if it was deleted immediately.
Find here more information about risks.


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@cursor

cursor Bot commented May 17, 2026


PR Summary

High Risk
High risk because it changes authentication flows and adds/changes SSRF and file-ingestion validation logic that can block previously accepted inputs or affect outbound integrations (MCP/Agiloft/Grafana). The changes are broad and touch critical request/response paths and network I/O behavior.

Overview
Adds email OTP authentication for deployed forms. Introduces new POST/PUT /api/form/[identifier]/otp endpoints plus client UI (EmailAuth) and query hooks/contracts, and changes form auth validation to require OTP (otp_required) instead of granting access based solely on an allowed email match.

Centralizes OTP storage/attempt tracking. Moves chat OTP logic into a shared lib/core/security/otp module (Redis/DB storage, rate limits, attempt locking), and refactors the chat OTP route to use the shared implementation.

Closes a KB document ingestion LFI vector. Tightens fileUrl contract validation to only allow data: or http(s):// URLs and removes local filesystem reads in the document processor (fails closed on unsupported schemes), with new tests covering allowed/rejected inputs.
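
A minimal sketch of the scheme gate described above, assuming plain regular expressions rather than the actual Zod schema. The names `isAllowedFileUrl` and `assertSupportedScheme` are hypothetical and only illustrate the "allow `data:` or `http(s)://`, reject everything else" rule.

```typescript
// Hypothetical stand-in for the boundary check. The real contract is a Zod
// schema (knowledgeDocumentFileUrlSchema) that is not reproduced here.
const DATA_URI = /^data:/i
const HTTP_URL = /^https?:\/\//i

export function isAllowedFileUrl(fileUrl: string): boolean {
  if (fileUrl.length === 0) return false
  // URI schemes are case-insensitive (RFC 3986 section 3.1),
  // so "HTTPS://" and "DATA:" are accepted as well.
  return DATA_URI.test(fileUrl) || HTTP_URL.test(fileUrl)
}

// Defense-in-depth variant for a processor: fail closed on any other scheme,
// including absolute paths and file:// URLs.
export function assertSupportedScheme(fileUrl: string): void {
  if (!isAllowedFileUrl(fileUrl)) {
    throw new Error('Unsupported fileUrl scheme')
  }
}
```

Under these assumptions, values such as `/etc/passwd`, `file:///app/.env`, and relative `/api/files/serve/...` paths are all rejected before any filesystem call can happen.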

Pins and validates outbound connections to reduce SSRF/TOCTOU risk. MCP SSRF validation now returns a resolved IP and, when available, the MCP client uses an undici-based pinned fetch to prevent DNS rebinding; Agiloft attach/retrieve routes similarly resolve once and pin all subsequent requests; Grafana update tools now validate baseUrl before issuing requests.
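
The resolve-once, pin-always idea can be sketched as a lookup function that ignores the hostname it is asked about and always answers with the IP that was validated earlier. This is an illustration only: `createPinnedLookupSketch` is a hypothetical name, and the callback shape is modeled on Node's `dns.lookup`, which is an assumption about how the PR's helper behaves.

```typescript
import { isIP } from 'node:net'

type LookupAddress = { address: string; family: number }
type LookupCallback = (
  err: Error | null,
  address: string | LookupAddress[],
  family?: number
) => void

// Returns a dns.lookup-style function that always yields the pre-validated IP,
// so a second DNS answer (the rebinding step) is never consulted.
export function createPinnedLookupSketch(pinnedIp: string) {
  const family = isIP(pinnedIp) // 4, 6, or 0 when the string is not an IP
  if (family === 0) {
    throw new Error(`Not an IP address: ${pinnedIp}`)
  }
  return (
    _hostname: string,
    options: { all?: boolean } | LookupCallback,
    callback?: LookupCallback
  ): void => {
    const cb = typeof options === 'function' ? options : callback
    if (!cb) return
    const wantsAll = typeof options === 'object' && options.all === true
    if (wantsAll) {
      cb(null, [{ address: pinnedIp, family }])
    } else {
      cb(null, pinnedIp, family)
    }
  }
}
```

A function of this shape could be supplied as the `lookup` connect option of an undici `Agent`, which leaves the original hostname in place for TLS SNI and the Host header. Whether pinned-fetch.ts wires it exactly this way is not shown in the PR text.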

Hardens knowledge base workspace authorization. Adds KnowledgeBasePermissionError, enforces actor-aware permission checks for workspace transfers/clearing in updateKnowledgeBase within a transaction/row lock, and maps these failures to 403 in the API routes with added tests.
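
As a rough illustration of the error-to-status mapping mentioned above, a dedicated error class lets a route distinguish authorization failures from unexpected ones. The class name comes from the PR; its constructor and the `toHttpStatus` helper are assumptions made for this sketch.

```typescript
// Sketch of a permission error type. Only the class name is taken from the PR.
export class KnowledgeBasePermissionError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'KnowledgeBasePermissionError'
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
  }
}

// Hypothetical helper: permission failures become 403, anything else 500.
export function toHttpStatus(error: unknown): number {
  return error instanceof KnowledgeBasePermissionError ? 403 : 500
}
```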

Updates API validation baseline and adds extensive unit tests across the new and hardened paths.

Reviewed by Cursor Bugbot for commit 21a93a5. Configure here.

Comment thread apps/sim/lib/api/contracts/knowledge/shared.ts
@greptile-apps

greptile-apps Bot commented May 17, 2026

Contributor

Greptile Summary

This PR delivers a bundled security hardening across five surfaces: LFI prevention on Knowledge Base file upload, KB permission errors and transaction hardening, DNS-rebinding (SSRF) pinning for MCP and Agiloft, basic URL validation for Grafana tools, and a new email-OTP gate for form deployments with a shared otp.ts module reused by both chat and form routes.

  • LFI & KB hardening: knowledgeDocumentFileUrlSchema restricts accepted fileUrl values to data: URIs and http(s):// URLs at the Zod boundary; document-processor.ts removes the fs.readFile fallback. KnowledgeBasePermissionError is introduced and multi-step KB mutations are wrapped in transactions with SELECT … FOR UPDATE.
  • MCP DNS pinning: validateMcpServerSsrf now returns the resolved IP; createMcpPinnedFetch (backed by undici's fetch) forces all subsequent transport connections to that IP, closing the TOCTOU window between validation and connection.
  • Agiloft DNS pinning: resolveAgiloftInstance + secureFetchWithPinnedIP give the Agiloft attach/retrieve routes the same resolve-once-pin-always protection as MCP.
  • Grafana URL validation: validateExternalUrl is applied before each Grafana fetch call, blocking IP-literal SSRF; DNS rebinding is not mitigated (see comment).
  • Form OTP: New POST /api/form/[identifier]/otp + PUT routes mirror the existing chat OTP flow, sharing the centralized lib/core/security/otp.ts module.

Confidence Score: 5/5

Safe to merge; the Grafana URL validation gap does not regress from the pre-PR state and is noted for a follow-up improvement.

The five hardened surfaces (KB LFI, KB authz, MCP DNS pinning, Agiloft DNS pinning, form OTP) are all correctly implemented. The Grafana tools move from no validation to basic format checking — an improvement, not a regression — but do not reach the DNS-pinned level of the Agiloft fix. All previous review comments were addressed in follow-up commits. The OTP module's Redis Lua path, DB optimistic-lock fail-closed path, and rate-limiting are sound.

apps/sim/tools/grafana/update_alert_rule.ts and apps/sim/tools/grafana/update_dashboard.ts — both validate baseUrl with the synchronous format-only check and still use plain fetch; a follow-up to add DNS resolution and IP pinning would bring them in line with the Agiloft pattern.

Security Review

  • Grafana SSRF (DNS rebinding): update_alert_rule.ts / update_dashboard.ts use validateExternalUrl (synchronous, format-only) before a plain fetch. This blocks IP-literal SSRF but not DNS rebinding; a hostname that resolves to a private IP at connection time would bypass the check. The Agiloft routes were upgraded to validateUrlWithDNS + secureFetchWithPinnedIP; the same pattern should be applied to the Grafana tools to close the gap.
  • No new secrets or credentials exposed in this change.
  • KB LFI fix (fs.readFile path removed) and MCP/Agiloft DNS-pinning are correctly implemented.
  • Form OTP uses crypto.randomInt for code generation, rate-limits at both IP and email granularity, and fails closed on DB-path retry exhaustion — all correct.
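
A small sketch of the code generation and the `code:attempts` storage format mentioned in this review, assuming a 6-digit zero-padded code. The function names are hypothetical; only `crypto.randomInt` and the `code:attempts` layout come from the review text.

```typescript
import { randomInt } from 'node:crypto'

// randomInt's upper bound is exclusive, so this yields 000000..999999.
export function generateOtpCode(): string {
  return randomInt(0, 1_000_000).toString().padStart(6, '0')
}

// Stores the code together with the number of failed attempts so far.
export function encodeOtpValue(code: string, attempts: number): string {
  return `${code}:${attempts}`
}

// Tolerates a bare code with no attempt counter by treating it as 0 attempts.
export function decodeOtpValue(value: string): { code: string; attempts: number } {
  const separator = value.lastIndexOf(':')
  if (separator === -1) return { code: value, attempts: 0 }
  const attempts = Number.parseInt(value.slice(separator + 1), 10)
  return {
    code: value.slice(0, separator),
    attempts: Number.isNaN(attempts) ? 0 : attempts,
  }
}
```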

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/lib/core/security/otp.ts | New shared OTP module: correct Redis Lua atomic increment, DB optimistic-lock retry with fail-closed exhaustion, per-kind key namespacing, and clean encode/decode for the code:attempts format. |
| apps/sim/lib/mcp/pinned-fetch.ts | New pinned-fetch implementation using undici's typed fetch export with a createPinnedLookup dispatcher; correctly bridges DOM/undici type gap via documented cast and preserves hostname for TLS SNI. |
| apps/sim/lib/mcp/domain-check.ts | Returns resolved IP from validateMcpServerSsrf for downstream pinning; loopback is now blocked on hosted environments; DNS failure path is now distinct from SSRF-block path. |
| apps/sim/tools/agiloft/utils.server.ts | New server-only Agiloft helpers: resolves DNS once via validateUrlWithDNS, then uses secureFetchWithPinnedIP for login/logout/attach/retrieve, closing the TOCTOU SSRF window. |
| apps/sim/tools/grafana/update_alert_rule.ts | Adds validateExternalUrl before the Grafana API call; blocks IP-literal SSRF but does not resolve DNS, leaving DNS-rebinding attacks open (inconsistent with the Agiloft/MCP pattern). |
| apps/sim/tools/grafana/update_dashboard.ts | Same as update_alert_rule: synchronous format-only validation with plain fetch; DNS rebinding SSRF not mitigated. |
| apps/sim/lib/knowledge/service.ts | KB update wrapped in a SELECT FOR UPDATE transaction; workspace-change permission check added via actorUserId; KnowledgeBasePermissionError distinguishes auth failures from 500s. |
| apps/sim/lib/knowledge/documents/document-processor.ts | Removes fs.readFile fallback; all non-data:/http(s):// paths now throw; case-insensitive regex checks applied consistently across downloadFileForBase64 and parseWithFileParser. |
| apps/sim/app/api/form/[identifier]/otp/route.ts | New form OTP route mirrors chat OTP pattern precisely; IP and email rate limiting, allowlist check before OTP store/compare, generic 500 messages, and auth cookie set on success. |
| apps/sim/lib/api/contracts/knowledge/shared.ts | Adds knowledgeDocumentFileUrlSchema with case-insensitive data: and https?:// guards; correctly applied at the Zod validation boundary. |
| apps/sim/app/form/[identifier]/components/email-auth.tsx | New email-auth UI component with two-step flow (email → OTP), countdown resend timer, 6-digit OTP auto-submit, and clear error messaging. |

Reviews (2): Last reviewed commit: "fix(mcp): annotate undici/DOM type-bridg..." | Re-trigger Greptile

Comment thread apps/sim/app/api/form/[identifier]/otp/route.ts Outdated
Comment thread apps/sim/lib/core/security/otp.ts
Comment thread apps/sim/lib/api/contracts/knowledge/shared.ts
Comment thread apps/sim/app/api/form/utils.ts
Comment thread apps/sim/lib/mcp/pinned-fetch.ts Outdated
waleedlatif1 and others added 6 commits May 16, 2026 22:45
…haust

- Chat/form OTP routes: replace `error.message || fallback` with generic
  `Failed to process request` in 500 responses (logger still captures detail).
- otp.ts incrementOTPAttempts DB path: on MAX_RETRIES exhaustion, delete the
  verification row and return `'locked'` instead of trusting a possibly-
  undercounted final read.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replace `globalThis.fetch` + double-cast with `undici.fetch` so the
`dispatcher` option is part of the real type contract. This guarantees
pinning won't silently break if a future runtime swaps the underlying
fetch implementation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Tool config files are statically reachable from the client bundle (via
tools/registry.ts → tools/{service}/index.ts). Importing
`@/lib/core/security/input-validation.server` from these files pulled
`node:dns/promises` into the Turbopack client bundle and broke the build.

Split agiloft utils into client-safe (`utils.ts`, plain fetch + sync
`validateExternalUrl`) and server-only (`utils.server.ts`, DNS-pinned
variants). Routes that need TOCTOU protection import the pinned helpers;
the executor-side tool path falls back to sync URL validation (matches
the supabase precedent and pre-PR baseline).

Grafana update tools likewise switch from `secureFetchWithValidation`
(server-only) to inline sync `validateExternalUrl` + plain fetch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Boundary schema accepted uppercase schemes (e.g. HTTPS://, DATA:) via the
case-insensitive http regex, but the processor's case-sensitive
startsWith('data:') / startsWith('http') / startsWith('https://') checks
rejected them with a confusing "Unsupported fileUrl scheme" error.
Aligns processor checks to the schema using case-insensitive regex per
RFC 3986 §3.1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Strict audit was failing on two new `as unknown as` casts in pinned-fetch.ts.
They bridge DOM `RequestInit`/`Response` ↔ undici equivalents (structurally
compatible at runtime since Node's global fetch is undici) and are required
to satisfy the FetchLike contract. Annotate so they count as documented
exemptions instead of new violations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@waleedlatif1 waleedlatif1 force-pushed the waleedlatif1/fix-kb-fileurl-lfi branch from abdc919 to 21a93a5 Compare May 17, 2026 05:45
@waleedlatif1

Collaborator Author

@greptile

@waleedlatif1

Collaborator Author

@cursor review

Comment thread apps/sim/tools/grafana/update_alert_rule.ts
@waleedlatif1

Collaborator Author

@cursor review

@waleedlatif1 waleedlatif1 merged commit 08eeecb into staging May 17, 2026
14 checks passed
@waleedlatif1 waleedlatif1 deleted the waleedlatif1/fix-kb-fileurl-lfi branch May 17, 2026 06:05

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 21a93a5. Configure here.

form: {
  redisKey: (email: string, deploymentId: string) => `form-otp:${email}:${deploymentId}`,
  dbIdentifier: (email: string, deploymentId: string) => `form-otp:${deploymentId}:${email}`,
},


Form OTP key component ordering inconsistent across storage backends

Low Severity

The form OTP key format has reversed component ordering between Redis (form-otp:${email}:${deploymentId}) and DB (form-otp:${deploymentId}:${email}). For chat, this inconsistency is preserved for backward compatibility with in-flight OTPs (documented in the comment above). But form is an entirely new key format with no legacy data — there's no reason to replicate the asymmetry. This makes the code harder to reason about and could cause confusion when debugging OTP issues or if a deployment ever switches storage backends mid-flight.


waleedlatif1 added a commit that referenced this pull request Jun 10, 2026
* fix(knowledge): require write access for batch chunk operations

The PATCH /api/knowledge/[id]/documents/[documentId]/chunks handler
performs enable/disable/delete operations but authorized callers with
only read-level access (checkDocumentAccess). This let read-only
workspace members destroy or disable indexed chunks.

Switch to checkDocumentWriteAccess (write/admin required), matching the
sibling POST/PUT/DELETE chunk mutation endpoints.

* fix(env): restrict decrypted workspace env vars to secret admins

GET /api/workspaces/:id/environment returned decrypted workspace
environment variables to any member, including read-only collaborators,
leaking API tokens, database URLs, and other secrets.

Mask workspace variable values for non-admin viewers while preserving
the variable names, so editor autocomplete and conflict detection keep
working. A value is revealed only when the caller is a credential admin
of that key, or — for legacy keys with no per-secret ACL — holds
workspace admin permission. This mirrors the per-key edit gating already
enforced by PUT/DELETE: if you can administer a secret, you can read it.

Personal variables and execution-time resolution are unchanged.
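
The masking rule above can be sketched roughly as follows. The `Viewer` shape, the `keysWithAcl` set, and the empty-string mask are all assumptions for illustration; only the reveal rule itself (credential admin of the key, or workspace admin for legacy keys without a per-secret ACL) comes from the commit message.

```typescript
interface Viewer {
  isWorkspaceAdmin: boolean
  // Keys for which the viewer is a credential admin (hypothetical shape).
  credentialAdminKeys: Set<string>
}

// Returns every variable name, but only reveals values the viewer may administer.
export function maskWorkspaceEnvSketch(
  decrypted: Record<string, string>,
  keysWithAcl: Set<string>,
  viewer: Viewer,
  mask = '' // assumed placeholder; the real masked value is not stated
): Record<string, string> {
  const result: Record<string, string> = {}
  for (const [key, value] of Object.entries(decrypted)) {
    const canReveal = keysWithAcl.has(key)
      ? viewer.credentialAdminKeys.has(key) // key has a per-secret ACL
      : viewer.isWorkspaceAdmin // legacy key with no ACL
    result[key] = canReveal ? value : mask
  }
  return result
}
```

Because names are always returned, features that only need the key list (autocomplete, conflict detection) keep working for read-only members.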

* fix(files): block cross-tenant deletion via client-controlled context

POST /api/files/delete trusted a client-supplied `context`, letting any
authenticated user delete another tenant's file by naming an arbitrary
key with `context: "og-images"`. verifyFileAccess() short-circuited the
three public contexts (profile-pictures, og-images, workspace-logos) to
`true` before any ownership/requireWrite check.

- Derive the storage context strictly from the trusted key prefix in the
  delete route; reject a supplied `context` that disagrees with the key.
- Gate the public-context short-circuit to reads only. Destructive ops
  (requireWrite) now prove ownership via verifyPublicAssetWriteAccess:
  workspace-logos require write/admin on the bound workspace,
  profile-pictures require an exact owner match, og-images always deny.

Reads of public assets are unchanged.
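
A simplified sketch of the two changes above, covering only the three public contexts. It assumes keys are prefixed with `<context>/`, which the commit message implies but does not spell out; all function names and the `Actor` shape are hypothetical.

```typescript
type PublicContext = 'profile-pictures' | 'og-images' | 'workspace-logos'

const PUBLIC_CONTEXTS: readonly PublicContext[] = [
  'profile-pictures',
  'og-images',
  'workspace-logos',
]

// Derives the context from the trusted key prefix, never from client input.
export function deriveContextFromKey(key: string): PublicContext | null {
  for (const context of PUBLIC_CONTEXTS) {
    if (key.startsWith(`${context}/`)) return context
  }
  return null
}

// Rejects a client-supplied context that disagrees with the key.
export function resolveDeleteContext(key: string, supplied?: string): PublicContext | null {
  const derived = deriveContextFromKey(key)
  if (supplied !== undefined && supplied !== derived) {
    throw new Error('Supplied context does not match the file key')
  }
  return derived
}

interface Actor {
  userId: string
  workspacePermission?: 'read' | 'write' | 'admin'
}

// Destructive operations on public assets must prove ownership.
export function canDeletePublicAsset(
  context: PublicContext,
  actor: Actor,
  ownerUserId?: string
): boolean {
  switch (context) {
    case 'og-images':
      return false // always denied
    case 'profile-pictures':
      return ownerUserId !== undefined && ownerUserId === actor.userId
    case 'workspace-logos':
      return actor.workspacePermission === 'write' || actor.workspacePermission === 'admin'
  }
}
```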

* fix(telegram): verify X-Telegram-Bot-Api-Secret-Token on inbound webhooks

Telegram triggers accepted any forged update from anyone who knew the
webhook URL path: verifyAuth was a no-op that always returned null, and
setWebhook registered no secret_token.

Generate a per-webhook secret in createSubscription, register it with
Telegram as secret_token, and persist it to providerConfig. verifyAuth
now fails closed — rejects when no token is configured, when the
X-Telegram-Bot-Api-Secret-Token header is absent, or when it does not
match via constant-time safeCompare.
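
The fail-closed check can be sketched as below. `safeCompareSketch` is a hypothetical stand-in for the codebase's `safeCompare`, whose implementation is not shown; hashing both inputs first is one common way to satisfy the equal-length requirement of `crypto.timingSafeEqual`.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto'

// Hypothetical stand-in for safeCompare: hash both sides so the buffers have
// equal length, then compare in constant time.
export function safeCompareSketch(a: string, b: string): boolean {
  const left = createHash('sha256').update(a).digest()
  const right = createHash('sha256').update(b).digest()
  return timingSafeEqual(left, right)
}

// Fails closed: no configured secret, no header, or a mismatch all reject.
export function verifyTelegramSecret(
  configuredSecret: string | undefined,
  headerValue: string | null
): boolean {
  if (!configuredSecret) return false
  if (!headerValue) return false
  return safeCompareSketch(headerValue, configuredSecret)
}
```

Here `headerValue` would be the `X-Telegram-Bot-Api-Secret-Token` request header, and `configuredSecret` the value persisted when the webhook was registered.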

* fix(security): pin DNS for Agiloft directExecution and Grafana update tools

The Agiloft directExecution tools (read/create/search/update/delete/lock/
saved_search/select/get_choice_line_id/remove_attachment/attachment_info)
and the Grafana update_dashboard/update_alert_rule postProcess hooks issued
outbound HTTP to a fully user-controlled host (instanceUrl/baseUrl) via the
global fetch(), guarded only by the synchronous validateExternalUrl() — which
never resolves DNS, so a hostname resolving to an internal/reserved IP passed
validation (SSRF).

Route all of these through the codebase's standard SSRF-safe path:
- Agiloft: moved executeAgiloftRequest into utils.server.ts where the existing
  pinned helpers live. It now resolves+validates the instance URL once and pins
  every hop (login, operation, logout) to that IP via secureFetchWithPinnedIP.
  The 11 tool configs now import it from utils.server; URL builders stay in the
  client-safe utils.ts.
- Grafana: the postProcess POST/PUT now uses validateUrlWithDNS +
  secureFetchWithPinnedIP, matching the already-pinned initial GET.

This completes the Agiloft SSRF pinning started in #4639 (which covered the
attach/retrieve API routes) by closing the directExecution path, and extends
the same guard to the Grafana update tools.

* fix(api): enforce workspace allowPersonalApiKeys policy on v1 surface

The external v1 API authenticated API keys without evaluating the
per-workspace allowPersonalApiKeys setting, so a personal API key could
read and mutate a workspace's resources (workflows, tables, files,
knowledge, logs) even when the workspace had explicitly disabled personal
keys. The same control is already enforced on the workflow-execution
surface.

Enforce the policy in checkWorkspaceScope (covering validateWorkspaceAccess
too): reject personal keys with 403 when the workspace has
allowPersonalApiKeys=false. checkWorkspaceScope becomes async; all v1
route callsites updated to await it.

* fix(billing): close usage-cap admission race with atomic reservation

The server-side usage-limit gate read already-recorded cost, but cost is
only written when an execution finishes. A burst of concurrent executions
all observed the same pre-burst usage, all passed the cap, and all ran —
collectively spending far past the limit before any cost landed in the
ledger (free-tier abuse / hard-cap defeat). manual/chat triggers also skip
rate limiting, removing the only throttle.

Add an atomic check-then-reserve admission step (Redis Lua) that bounds
in-flight, un-costed executions per billing entity by both a per-plan
concurrency cap and remaining usage headroom, so recordedUsage +
reservedSlots * estimate <= limit always holds. The slot is released at
execution completion via LoggingSession (skipped on pause; TTL self-heals
crashes). Runs for all trigger types, covering the previously-unthrottled
manual/chat paths.

Fails open when billing is disabled or Redis is unavailable, matching the
rate limiter — a Redis blip can't turn into an execution outage, and the
recorded-usage gate still runs.
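
The invariant above can be illustrated with a single-process, in-memory sketch. The commit performs this step atomically in Redis Lua; the version below only shows the arithmetic, and every name in it is hypothetical.

```typescript
interface AdmissionState {
  recordedUsage: number // cost already written to the ledger
  reservedSlots: number // in-flight executions with no recorded cost yet
}

interface AdmissionLimits {
  usageLimit: number
  estimatePerRun: number
  maxConcurrent: number
}

// Check-then-reserve: admit only if the invariant
// recordedUsage + reservedSlots * estimatePerRun <= usageLimit
// still holds after taking one more slot.
export function tryReserve(state: AdmissionState, limits: AdmissionLimits): boolean {
  if (state.reservedSlots >= limits.maxConcurrent) return false
  const projected = state.recordedUsage + (state.reservedSlots + 1) * limits.estimatePerRun
  if (projected > limits.usageLimit) return false
  state.reservedSlots += 1
  return true
}

// Called when an execution completes (or aborts before it starts).
export function releaseSlot(state: AdmissionState): void {
  state.reservedSlots = Math.max(0, state.reservedSlots - 1)
}
```

In a multi-instance deployment the check and the increment must happen as one atomic operation, which is why a plain read-then-write against shared storage would reintroduce the race.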

* fix(workflows): validate folderId belongs to workflow's workspace on create/update/reorder

Reject a folderId that references a folder in a different workspace (or
an archived/non-existent folder) before writing it to workflow.folderId.
Previously create, update, and reorder only checked workspace permission
on the workflow and the folder's lock status, never that the folder lived
in the workflow's own workspace, allowing a dangling cross-workspace
folder reference.

Adds isFolderInWorkspace/assertFolderInWorkspace + FolderNotFoundError to
@sim/workflow-authz (mirroring assertTargetFolderMutable in the duplicate
path), enforced in performCreateWorkflow, performUpdateWorkflow, and the
reorder route. Invalid folders now return 400.

* fix(folders): validate parentId against workspace on create/update/reorder

Folder write endpoints accepted a caller-supplied parentId and persisted it
without verifying the parent existed in the same workspace, and the create and
reorder paths had no cycle guard. A workspace member with write access could
reparent a folder to a foreign-workspace folder, a non-existent id, or (via
reorder) into a cycle, hiding the folder and its workflows from all members.

- performCreateFolder: reject self-parenting and validate the parent exists in
  the workspace and is not archived (mirrors the duplicate route).
- performUpdateFolder: add the same workspace/archived parent check alongside
  the existing circular-reference guard.
- folders/reorder: validate every target parent against the workspace, detect
  cycles in the resulting parent graph (catches batch cycles), and normalize
  falsy parentId to null to prevent orphaning.

Adds tests for cross-workspace parent rejection and batch-cycle rejection.
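
Batch-cycle detection on the resulting parent graph might look like the following sketch: apply all requested moves to a copy of the current parent map, then walk upward from each folder and flag any repeat visit. Function names and the update shape are assumptions for illustration.

```typescript
type ParentMap = Map<string, string | null>

// Applies a batch of moves to a copy of the current graph.
// Falsy parentId values (undefined, '') are normalized to null (root).
export function applyReorder(
  existing: ParentMap,
  updates: Array<{ id: string; parentId?: string | null }>
): ParentMap {
  const next: ParentMap = new Map(existing)
  for (const update of updates) {
    next.set(update.id, update.parentId || null)
  }
  return next
}

// Walks the parent chain from every folder; revisiting a node means a cycle.
export function hasParentCycle(parents: ParentMap): boolean {
  for (const start of Array.from(parents.keys())) {
    const seen = new Set<string>()
    let current: string | null | undefined = start
    while (current != null) {
      if (seen.has(current)) return true
      seen.add(current)
      current = parents.get(current)
    }
  }
  return false
}
```

Checking the graph after the whole batch is applied is what catches cycles that no single update creates on its own, such as moving two folders under each other in one request.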

* chore(knowledge): drop non-TSDoc inline comments from chunks route

* fix(webhooks): fail closed when HMAC signing secret is not configured

Inbound webhook signature verification failed open for HMAC providers
(GitHub, Intercom, Jira, JSM, Confluence, Cal.com, Notion, Greenhouse,
Typeform, Fireflies, Circleback): when no signing secret was stored,
verifyAuth returned null and the workflow executed on a fully
attacker-controlled body. Reject these deliveries with 401 instead,
matching the fail-closed Stripe/WhatsApp/Vercel providers.

Run provider reachability/verification handshakes (Notion
verification_token, Grain/Intercom ping) ahead of auth so the
pre-secret setup handshake still completes — those return a canned 200
without executing the workflow, and real event payloads fall through to
fail-closed verification.

Update the trigger secret-field copy to state the secret is required
for deliveries to be accepted (was misleadingly marked optional).

* style(files): trim verbose inline comments on delete authorization fix

* fix(auth): close account-enumeration oracle on email sign-up

The custom before-hook pre-check threw a distinguishing
422/USER_ALREADY_EXISTS for already-registered emails, letting an
unauthenticated attacker enumerate accounts — defeating better-auth's
own OWASP enumeration protection (active under requireEmailVerification).

Remove the pre-check and rely on better-auth's generic duplicate-sign-up
response, wiring:
- onExistingUserSignUp: notify the real account owner out-of-band,
  mirroring the privacy-preserving forget-password flow.
- customSyntheticUser: include admin (role/banned/banReason/banExpires)
  and Stripe (stripeCustomerId, billing-gated) user fields so the fake
  response shape is byte-identical to a real new-user response.

Adds an ExistingAccountEmail template + 'existing-account' subject.

* style(tools): drop non-TSDoc inline comments from Grafana/Agiloft SSRF tools

* chore(api): trim extraneous inline comments in v1 logs/files routes

Remove a redundant size annotation and two verbose multi-line
materialization comments whose intent is already clear from the code.
Load-bearing comments (race-condition and key-translation notes) kept.

* fix(billing): exclude table-cell dispatch from admission reservation

Table-cell dispatch is row-bounded, async rate-limited, and already
surfaces a graceful usage state. Applying the in-flight concurrency
reservation there turned its 429 into a hard cell error on a normal
>15-concurrent-cell run (only 402 was handled gracefully). Skip the
reservation for that surface via a new skipConcurrencyReservation option
(the usage-cost cap is still enforced), and tidy the reservation comments
to TSDoc.

* fix(chat): rate-limit and constant-time password auth for public chats

Password-protected public chat (POST /api/chat/[identifier]) had no
throttling on the password check and compared with a non-constant-time
!==, allowing unlimited brute-force and per-character timing leaks.

- Add per-IP rate limiting (10 / 15min) to the password branch of
  validateChatAuth, mirroring the OTP/SSO endpoints; return 429 with
  Retry-After. Only explicit unlock attempts consume tokens — message
  sends carry no password and ride the auth cookie.
- Replace password !== decrypted with safeCompare.
- Fails open on rate-limiter storage errors; no availability regression.
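
As a rough illustration of the 10-per-15-minutes limit with a Retry-After value, here is an in-memory fixed-window limiter. This is a deliberate simplification: the codebase uses its own `RateLimiter` with shared storage, and the class below is a hypothetical single-process stand-in.

```typescript
interface WindowEntry {
  count: number
  windowStart: number
}

export class FixedWindowLimiter {
  private readonly entries = new Map<string, WindowEntry>()
  private readonly limit: number
  private readonly windowMs: number

  constructor(limit: number, windowMs: number) {
    this.limit = limit
    this.windowMs = windowMs
  }

  // `now` is injectable so the behavior can be exercised deterministically.
  check(key: string, now: number = Date.now()): { allowed: boolean; retryAfterSeconds: number } {
    const entry = this.entries.get(key)
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.entries.set(key, { count: 1, windowStart: now })
      return { allowed: true, retryAfterSeconds: 0 }
    }
    if (entry.count < this.limit) {
      entry.count += 1
      return { allowed: true, retryAfterSeconds: 0 }
    }
    // Seconds until the current window ends, suitable for a Retry-After header.
    const retryAfterSeconds = Math.ceil((entry.windowStart + this.windowMs - now) / 1000)
    return { allowed: false, retryAfterSeconds }
  }
}
```

A route would call `check` with the client IP only when the request carries a password, and answer 429 with `Retry-After: <retryAfterSeconds>` when `allowed` is false.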

* fix(security): cap JSON request body size and gate public chat endpoint

The shared parseJsonBody helper (behind parseRequest, used by nearly
every contract route) read request bodies with no size limit, buffering
the full body into memory before validation. The unauthenticated public
deployed-chat endpoint reached this sink with no admission gate, enabling
an anonymous memory-exhaustion DoS.

- parseRequest/parseJsonBody now enforce a byte cap via a size-limited
  stream read (content-length precheck + streamed cap), returning 413.
  Default is API_MAX_JSON_BODY_BYTES (50 MB), overridable per route via
  maxBodyBytes. Decoding uses TextDecoder to match request.json() BOM
  handling.
- Public chat POST is wrapped with the admission gate (tryAdmit) and
  passes an explicit CHAT_MAX_REQUEST_BYTES (20 MB) cap.
- Chat body contract gains .max() bounds on input, password,
  conversationId, file data/name/type, and files array length.
- Admin bulk workspace import opts into a higher 100 MB cap to avoid
  regressing large multi-workflow imports.

* fix(chat): rate-limit and constant-time password auth for public chats

Password-protected public chat (POST /api/chat/[identifier]) had no
throttling on the password check and compared with a non-constant-time
!==, allowing unlimited brute-force and per-character timing leaks.

- Add per-IP rate limiting (10 / 15min) to the password branch of
  validateChatAuth, mirroring the OTP/SSO endpoints; return 429 with
  Retry-After. Only explicit unlock attempts consume tokens — message
  sends carry no password and ride the auth cookie.
- Replace password !== decrypted with safeCompare.
- Fails open on rate-limiter storage errors; no availability regression.

Reinstates the fix reverted by an intervening commit.

* fix(billing): never block a lone execution on usage headroom

The admission reservation tapered allowed concurrency by remaining usage
headroom. With under one credit of headroom left (but not yet over the
cap), floor(headroom / estimate) hit zero and rejected even a single,
zero-concurrency execution — stricter than the recorded-usage gate, which
would have allowed that last run, and with a misleading "too many
concurrent executions" message. Floor the headroom term at 1 so a lone
execution is governed only by the cost gate; concurrency above the first
slot still tapers with headroom.
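
The arithmetic of this fix can be shown in a few lines. The function name and parameters are hypothetical; the point is only that flooring the headroom term at 1 keeps a lone execution from being rejected by the concurrency check.

```typescript
// Number of in-flight executions the admission step would allow.
// Being over the cap is still handled separately by the recorded-usage gate.
export function allowedInFlight(
  usageLimit: number,
  recordedUsage: number,
  estimatePerRun: number,
  planConcurrencyCap: number
): number {
  const headroom = Math.max(0, usageLimit - recordedUsage)
  // Without Math.max(1, ...), headroom below one estimate floors to 0
  // and even a single execution is refused.
  const headroomSlots = Math.max(1, Math.floor(headroom / estimatePerRun))
  return Math.min(planConcurrencyCap, headroomSlots)
}
```

For example, with a limit of 10, recorded usage of 9.5, and an estimate of 1 per run, the unfloored term would be 0, while this version allows exactly one execution.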

* refactor(env): document workspace env masking, drop inline comments

Extract the workspace-env value masking into a TSDoc-documented
maskWorkspaceEnvForViewer helper and remove the redundant inline
comments from the GET handler and its test. No behavior change.

* refactor(env): convert PUT/DELETE authz comments to TSDoc

Move the tiered-authorization rationale for the workspace env upsert and
delete handlers into TSDoc blocks and drop the inline comments. No
behavior change.

* fix(telegram): keep legacy webhooks working via Telegram source-IP fallback

The secret-token check rejected every webhook registered before secret_token
support, breaking live triggers until re-saved. Fall back to verifying the
request originates from Telegram's published webhook IP ranges when no secret
is configured, so existing triggers keep firing with no re-save or migration
while forged updates from arbitrary hosts are still rejected. Webhooks with a
registered secret continue to use strict constant-time token verification.

* fix(chat): restore constant-time password auth and IP rate limit

A billing commit (ac56525) reverted the public-chat auth hardening as
collateral, leaving HEAD with a timing-oracle password comparison
(password !== decrypted) and no per-IP brute-force rate limit. Restore
safeCompare and the password-attempt rate limiter, and re-add the 429 test.

* revert(webhooks): undo trigger auth hardening pending compat plan

Reverts the Telegram inbound-token verification (3ed97a4, 41f133a)
and the HMAC fail-closed change (5b6cae9). Production data shows ~79
live webhooks have no signing secret configured (63 GitHub, 9 Fireflies,
3 Jira, 2 Circleback, 1 Confluence, 1 Cal.com), so failing closed would
401 them. Restoring fail-open behavior until a backwards-compatible
rollout (grandfather existing secretless webhooks / migration) is designed.
Other security fixes on this branch are unaffected.

* test(chat): make RateLimiter mock a constructable class

The arrow-function mockImplementation form was not reliably constructable
in the full suite run (`new RateLimiter()` threw "is not a constructor"),
though it passed in isolation. Switch to the class-based mock used by the
sibling OTP/speech route tests.

* fix(billing): release admission slot on pre-execution aborts; cluster-safe release

Addresses PR review on the usage-cap admission reservation:

- Slot leak: the reservation taken at the end of preprocessing was only
  released when the LoggingSession finalized. The execute route's
  pre-execution exits (client cancel, workspace/API-key guards) returned
  without finalizing a session, leaking the slot until its TTL and wrongly
  throttling later runs. Release explicitly on those paths; executions that
  start are still released via session finalization.
- Release is now cluster-safe: replaced the Lua script that rebuilt the
  in-flight key from the pointer value (a key not declared in KEYS, which
  silently breaks Redis Cluster slot routing) with discrete single-key
  GETDEL + ZREM commands.
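
A minimal sketch of the single-key release described above. The key layout (`admission:ptr:<executionId>` pointing at an in-flight sorted set), the `RedisLike` interface, and the function names are assumptions for illustration; only the GETDEL-then-ZREM sequence comes from the commit message. An in-memory fake is included so the sketch runs without a server.

```typescript
// Minimal client surface used by the sketch (method names follow ioredis).
interface RedisLike {
  getdel(key: string): Promise<string | null>
  zrem(key: string, member: string): Promise<number>
}

// Hypothetical key layout: the pointer value is the name of the in-flight sorted set.
const pointerKey = (executionId: string): string => `admission:ptr:${executionId}`

async function releaseAdmissionSlot(redis: RedisLike, executionId: string): Promise<boolean> {
  // GETDEL reads and deletes in one command, so only one caller ever sees the pointer.
  const inflightKey = await redis.getdel(pointerKey(executionId))
  if (inflightKey === null) return false // already released, or never reserved
  // A separate single-key command: the client knows the key up front and can route it.
  await redis.zrem(inflightKey, executionId)
  return true
}

// In-memory stand-in for Redis, for demonstration only.
class FakeRedis implements RedisLike {
  strings = new Map<string, string>()
  zsets = new Map<string, Set<string>>()

  async getdel(key: string): Promise<string | null> {
    const value = this.strings.get(key)
    this.strings.delete(key)
    return value === undefined ? null : value
  }

  async zrem(key: string, member: string): Promise<number> {
    const set = this.zsets.get(key)
    return set !== undefined && set.delete(member) ? 1 : 0
  }
}
```

Because the pointer is consumed by GETDEL, a second release for the same execution finds nothing and returns without touching the set. The two commands are not atomic as a pair, so a crash between them would leave the member in the sorted set; this sketch does not handle that case.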

* improvement(files): log missing owner metadata distinctly on profile-picture delete deny

Per PR review: when a profile-picture delete is denied, distinguish a
missing owner record (no userId metadata) from a genuine ownership
mismatch so the fail-closed denial is diagnosable. Behavior unchanged —
both still deny.

* fix(billing): release admission slot when async enqueue fails

If queueing the background workflow job throws, no job runs and no
LoggingSession finalizes, so the admission slot reserved during
preprocessing would leak until its TTL. Release it before returning 500.

* fix(api): make body-size caps NaN-safe and raise chat input/attachment limits

- DEFAULT_MAX_JSON_BODY_BYTES and CHAT_MAX_REQUEST_BYTES now fall back to
  hardcoded defaults (50 MB / 220 MB) when the env value is missing or
  non-numeric, so a misconfig can't silently produce a NaN cap that never
  rejects.
- Raise CHAT_MAX_REQUEST_BYTES default to 220 MB to cover 15 base64 file
  attachments, and MAX_CHAT_INPUT_CHARS to 1,000,000.
- Minor: tidy use-inline-rename onSave type; drop two redundant test comments.
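
The NaN-safe fallback in the first bullet might look like the following. The helper name `parseByteLimit` and the use of binary megabytes are assumptions; the defaults (50 MB and 220 MB) are the ones stated above.

```typescript
// Hypothetical helper: parse an env-style string into a positive byte limit.
function parseByteLimit(raw: string | undefined, fallback: number): number {
  const parsed = Number(raw) // Number(undefined) and Number('abc') are both NaN
  // Reject NaN, Infinity, zero, and negatives; otherwise use the configured value.
  return Number.isFinite(parsed) && parsed > 0 ? Math.floor(parsed) : fallback
}

const MB = 1024 * 1024 // assumption: "MB" means binary megabytes

const jsonCap = parseByteLimit(undefined, 50 * MB) // env value missing
const chatCap = parseByteLimit('not-a-number', 220 * MB) // env value non-numeric

// Why an unguarded NaN cap never rejects: every comparison against NaN is false.
const nanCap = Number('not-a-number')
const oversizedIsRejected = 10 * 1024 * MB > nanCap
```

Here `oversizedIsRejected` is `false` even for a 10 GB body, which is the silent failure the fallback prevents.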

* fix(hooks): restore void return in useInlineRename onSave type

A prior commit changed onSave's return type from `void | Promise<unknown>`
to `undefined | Promise<unknown>`, which broke the build: callbacks that
return nothing (table-grid column rename, table header rename) infer a
`void` return, which is not assignable to `undefined`. Restore the `void`
union so both fire-and-forget and Promise-returning callbacks type-check.
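
A small sketch of the typing issue, using hypothetical callbacks in place of the real table-grid and header rename handlers:

```typescript
// The restored shape: callers may return nothing or a Promise.
type OnSave = (value: string) => void | Promise<unknown>

let lastSaved = ''

// Fire-and-forget callback: its return type is inferred as `void`.
const renameColumn = (value: string) => {
  lastSaved = value
}

// Promise-returning callback.
const renameRemote = async (value: string) => {
  lastSaved = value
  return { ok: true }
}

// Both handlers are assignable to the `void` union.
const handlers: OnSave[] = [renameColumn, renameRemote]

// With the narrower type, the fire-and-forget handler no longer type-checks:
// type OnSaveStrict = (value: string) => undefined | Promise<unknown>
// const broken: OnSaveStrict = renameColumn // 'void' is not assignable to 'undefined'
```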

* fix(billing,api): release chat reservation slot on early exit; preserve 413 on oversized import

- Chat route: preprocessExecution reserves a billing concurrency slot, but
  the post-preprocess early exits (missing workspaceId, execution-setup
  failure) returned without releasing it, leaking the slot until TTL and
  wrongly throttling later runs. Release explicitly on those paths
  (idempotent), mirroring the workflows execute route.
- Admin import route: an oversized JSON body now returns the real 413 from
  parseJsonBody instead of being remapped to a 400; invalid JSON still 400s.
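
The status pass-through in the second bullet can be illustrated as follows. The result shape and both function names are assumptions; the real `parseJsonBody` signature is not shown in this PR text.

```typescript
// Hypothetical result shape: a status code plus the parsed value on success.
interface ParseOutcome {
  status: number
  value?: unknown
}

// Stand-in for parseJsonBody: size check first, then JSON validity.
function parseJsonBodySketch(raw: string, maxBytes: number): ParseOutcome {
  const size = new TextEncoder().encode(raw).length
  if (size > maxBytes) return { status: 413 } // payload too large
  try {
    return { status: 200, value: JSON.parse(raw) }
  } catch {
    return { status: 400 } // malformed JSON
  }
}

// Route-level handling: forward the parser's status instead of remapping it to 400.
function importRouteStatus(raw: string, maxBytes: number): number {
  const parsed = parseJsonBodySketch(raw, maxBytes)
  if (parsed.status !== 200) return parsed.status
  return 200
}
```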

* fix(icons): make Infisical icon black for contrast; regenerate docs

The Infisical mark rendered near-white on its yellow block background and
was barely visible; switch its fill from currentColor to #000000 (matching
the hardcoded-fill pattern of sibling brand icons). Sync the docs icon copy
and pick up a stale servicenow doc regeneration.

* fix(billing): release reserved slot on execute-route 503 and setup throw

After preprocessExecution reserves a billing concurrency slot, the streaming
path could exit without releasing it: the 503 return when
initializeExecutionStreamMeta fails, and any throw during stream setup (caught
by the outer handler, which only returned 500). Both left the slot held until
TTL, wrongly throttling unrelated runs. Release on the 503 path and in the
outer catch (executionId hoisted so the catch can see it; release is
idempotent and a no-op when no slot was reserved).
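
The control flow described above, reduced to a synchronous sketch. The in-memory `reservedSlots` set and the function names are illustrative assumptions; the real route is asynchronous and reserves the slot through preprocessExecution.

```typescript
// Stand-in for the billing concurrency slots (illustration only).
const reservedSlots = new Set<string>()

// Idempotent release: a no-op when nothing was reserved or it was already released.
function releaseSlot(executionId: string | undefined): boolean {
  if (executionId === undefined) return false
  return reservedSlots.delete(executionId)
}

type SetupOutcome = 'ready' | 'meta-failed'

function handleExecute(id: string, setupStream: () => SetupOutcome): number {
  // Hoisted out of the try block so the outer catch can see it.
  let executionId: string | undefined
  try {
    executionId = id
    reservedSlots.add(executionId) // stands in for the reservation made during preprocessing
    if (setupStream() === 'meta-failed') {
      releaseSlot(executionId) // 503 path: release before returning
      return 503
    }
    return 200 // execution started; session finalization releases the slot later
  } catch {
    releaseSlot(executionId) // any throw during setup: release, then 500
    return 500
  }
}
```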

* fix(icons): make Linkup icon black for contrast

The Linkup mark rendered with currentColor (near-white on its block
background); switch its fill to #000000 for legibility, matching the
Infisical fix. Docs icon copy synced via generate-docs.

* fix(billing): release reserved slot if inline async job never starts

In the inline (single-process) async path, if jobQueue.startJob threw before
executeWorkflowJob ran, no LoggingSession finalized and the reserved billing
slot was held until TTL. Release it in the fire-and-forget catch (idempotent;
a no-op when the job already finalized and released). The queued-worker path
and all in-job outcomes already release via the job's LoggingSession finalize.