
feat(server): New TTL system, enforce max queue length limits, lazy waitpoint creation #2980

Draft

ericallam wants to merge 6 commits into main from ea-branch-117

Conversation


@ericallam (Member) commented Jan 30, 2026

Summary

This PR implements environment-level queue size limits to prevent runaway queues from overwhelming Redis, along with several related improvements.

Environment Queue Limits

Prevents any single environment from overwhelming the shared Redis cluster by enforcing queue size limits at trigger time.

  • Added a MAXIMUM_ENV_QUEUE_SIZE environment variable with sensible defaults
  • Queue size checks happen at trigger time: runs are rejected if the queue would exceed the limit
  • Dashboard UI shows queue limits on both the Queues page and a new Limits page
  • In-memory caching for queue size checks to reduce Redis load (see the sketch below)
[Screenshots: Queues page showing the limit; the new Limits page]
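
A minimal sketch of how a trigger-time guard with an in-memory cache can work (the function and variable names are illustrative assumptions, not this PR's actual implementation; CACHE_TTL_MS mirrors the QUEUE_SIZE_CACHE_TTL_MS default discussed in the review below):

// Hypothetical sketch of a trigger-time queue size guard with an
// in-memory cache in front of Redis. Names are illustrative only.
type QueueSizeGuardResult =
  | { ok: true }
  | { ok: false; maximumSize: number; currentSize: number };

const queueSizeCache = new Map<string, { size: number; fetchedAt: number }>();
const CACHE_TTL_MS = 30_000; // mirrors the QUEUE_SIZE_CACHE_TTL_MS default

async function guardQueueSize(
  envId: string,
  maximumSize: number | undefined,
  fetchSizeFromRedis: (envId: string) => Promise<number>
): Promise<QueueSizeGuardResult> {
  // No configured limit means no cap is enforced.
  if (maximumSize === undefined) return { ok: true };

  const cached = queueSizeCache.get(envId);
  let size: number;
  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    size = cached.size; // fresh enough, skip the Redis round trip
  } else {
    size = await fetchSizeFromRedis(envId);
    queueSizeCache.set(envId, { size, fetchedAt: Date.now() });
  }

  return size >= maximumSize
    ? { ok: false, maximumSize, currentSize: size }
    : { ok: true };
}

Returning a structured result rather than throwing lets the trigger path surface the limit in its own error message.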

Fix: batchTriggerAndWait stuck when items skipped

Fixed a bug where batchTriggerAndWait would get stuck forever when some items were skipped due to queue limits.

Root cause: When items were skipped during batch streaming:

  1. Phase 1 (createBatch) returned the original runCount to the SDK
  2. Phase 2 (streamBatchItems) skipped items due to limits and updated the DB with the new count
  3. The SDK still waited for the original count, but fewer runs existed
  4. The parent run was stuck forever

Fix: the server now returns the actual runCount in the Phase 2 response, and the SDK uses it for waitForBatch (a sketch follows).
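
Conceptually, the SDK-side change looks something like this sketch (the response shape is inferred from the summary and the API schema changes reviewed below; names are illustrative):

// Sketch: prefer the server's authoritative runCount over the count the
// SDK originally submitted. Names are illustrative.
type StreamBatchItemsResponse = {
  runCount?: number; // actual number of runs created in Phase 2
  itemsSkipped?: number; // items dropped due to queue limits
};

function resolveRunCountToAwait(
  submittedCount: number,
  response: StreamBatchItemsResponse
): number {
  return response.runCount ?? submittedCount;
}

// waitForBatch then blocks on resolveRunCountToAwait(items.length, response)
// runs instead of the original item count, so skipped items no longer
// leave the parent waiting forever.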

New TTL System

Improved TTL (time-to-live) handling for run expiration with better test coverage.
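
The walkthrough below mentions batched TTL consumers, Redis TTL scripts, and a ttlSystem callback. A rough sketch of that consumer shape, with all names and signatures assumed (the ExpiredRun shape matches the TtlSystemCallback quoted in the review below):

// Sketch of a batched TTL consumer loop. The real implementation uses
// Redis scripts and sharded TTL queues; this only shows the shape.
type ExpiredRun = { queueKey: string; runId: string; orgId: string };

async function runTtlConsumer(
  popDueRuns: (now: number, batchSize: number) => Promise<ExpiredRun[]>,
  onExpired: (runs: ExpiredRun[]) => Promise<void>,
  intervalMs = 1_000,
  batchSize = 100
): Promise<never> {
  // Runs until the process stops.
  while (true) {
    const due = await popDueRuns(Date.now(), batchSize);
    if (due.length > 0) {
      await onExpired(due); // e.g. mark runs EXPIRED, complete waitpoints
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}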

Lazy Waitpoint Creation

Better idempotency handling through lazy waitpoint creation: waitpoints are now created on demand rather than eagerly (a sketch follows).
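
A hedged sketch of the get-or-create pattern (not the actual waitpoint API; the two data-access callbacks are stand-ins):

// Sketch: lazily create a waitpoint keyed by idempotency key, creating
// it only on first use. All names here are illustrative.
type WaitpointRecord = { id: string; status: string };

async function getOrCreateWaitpoint(
  findWaitpoint: (key: string) => Promise<WaitpointRecord | null>,
  createWaitpoint: (key: string) => Promise<WaitpointRecord>,
  idempotencyKey: string
): Promise<WaitpointRecord> {
  // First try to reuse an existing waitpoint for this idempotency key.
  const existing = await findWaitpoint(idempotencyKey);
  if (existing) return existing;

  // Otherwise create it on demand. A real implementation would make this
  // race-safe, e.g. via a unique constraint plus retry-on-conflict.
  return createWaitpoint(idempotencyKey);
}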

Run Queue Improvements

Various improvements to the run queue system with expanded test coverage.

Test plan

  • Test triggering runs when queue is at/near limit
  • Test batchTriggerAndWait with queue limits causing skipped items
  • Verify dashboard displays limits correctly
  • Run new TTL and waitpoint tests

refs TRI-7167

🤖 Generated with Claude Code

@changeset-bot

changeset-bot bot commented Jan 30, 2026

⚠️ No Changeset found

Latest commit: 814a9a4

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@coderabbitai
Contributor

coderabbitai bot commented Jan 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Centralizes queue-size logic (new v3/queueLimits utility and environment queueSizeLimit exposure) and adds an LRU cache for environment queue lengths. Refactors queue validation to per-queue semantics (resolveQueueNamesForBatchItems, validateMultipleQueueLimits) and surfaces itemsSkipped/runCount through batch streaming APIs. Introduces per-item retry for batch queue processing, batch-run-count updates, and a TriggerFailedTaskService for creating pre-failed runs. Adds a TTL expiration subsystem (batched TTL consumers, Redis TTL scripts, ttlSystem callback) and lazy get-or-create waitpoints with related waitpoint APIs. Numerous RunEngine/RunQueue/BatchQueue public API additions and tests updated; UI presenters and routes updated to use the single queueSize quota.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 warnings

❌ Failed checks (2 warnings)
  • Description check — ⚠️ Warning. The pull request description is comprehensive, covering the objectives, changes, and test plan, but it is missing the required checklist items and structured sections from the template. Resolution: add the required checklist section with checkboxes for the contributing guide, PR title convention, and code testing, and include the Testing and Changelog sections specified in the template.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The pull request title accurately summarizes the main changes: new TTL system, queue size limit enforcement, and lazy waitpoint creation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta):
    • Create PR with unit tests
    • Post copyable unit tests in a comment
    • Commit unit tests in branch ea-branch-117

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@vibe-kanban-cloud

Review Complete

Your review story is ready!

View Story

Comment !reviewfast on this PR to re-generate the story.

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@apps/webapp/app/presenters/v3/LimitsPresenter.server.ts`:
- Around line 170-194: The code silently treats a missing runtime environment as
a zero queue by checking runtimeEnv and continuing; change this to explicitly
handle the case where runtimeEnv is null for the required environmentId by
returning/throwing an error (or returning a clear "not found" result) instead of
falling through to the deployed default logic—locate the block where runtimeEnv
is fetched (the runtimeEnvironment.findFirst call and the subsequent use of
runtimeEnv, currentQueueSize and engine.lengthOfEnvQueue) and add an explicit
branch that raises a descriptive error or returns an appropriate 404-style
response when runtimeEnv is falsy so callers don’t get misleading limits
computed for a non-existent environment.
🧹 Nitpick comments (1)
apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (1)

311-329: Queue size limit logic is duplicated with EnvironmentQueuePresenter.

The limit computation logic (lines 314-317) and source determination (lines 321-328) duplicate the same pattern found in EnvironmentQueuePresenter.server.ts (lines 44-47). Consider extracting this into a shared utility function to ensure consistency and reduce maintenance burden.

♻️ Example shared utility
// In a shared utils file, e.g., ~/utils/queueSizeLimit.ts
export function getQueueSizeLimitForEnvironment(
  environmentType: string,
  organization: { maximumDevQueueSize: number | null; maximumDeployedQueueSize: number | null },
  defaultDevLimit: number | null,
  defaultDeployedLimit: number | null
): { limit: number | null; source: "plan" | "default" } {
  if (environmentType === "DEVELOPMENT") {
    return {
      limit: organization.maximumDevQueueSize ?? defaultDevLimit ?? null,
      source: organization.maximumDevQueueSize ? "plan" : "default",
    };
  }
  return {
    limit: organization.maximumDeployedQueueSize ?? defaultDeployedLimit ?? null,
    source: organization.maximumDeployedQueueSize ? "plan" : "default",
  };
}
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 279102c and a68ef30.

📒 Files selected for processing (4)
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Access environment variables via env export from apps/webapp/app/env.server.ts, never use process.env directly

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
🧠 Learnings (10)
📚 Learning: 2025-12-08T15:19:56.823Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2760
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx:278-281
Timestamp: 2025-12-08T15:19:56.823Z
Learning: In apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx, the tableState search parameter uses intentional double-encoding: the parameter value contains a URL-encoded URLSearchParams string, so decodeURIComponent(value("tableState") ?? "") is required to fully decode it before parsing with new URLSearchParams(). This pattern allows bundling multiple filter/pagination params as a single search parameter.

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Control concurrency using the `queue` property with `concurrencyLimit` option

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/v3/presenters/**/*.server.{ts,tsx} : Organize presenters in the webapp following the pattern `app/v3/presenters/*/*.server.ts` to move complex loader code into classes

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to apps/webapp/**/*.{ts,tsx} : Access environment variables via `env` export from `apps/webapp/app/env.server.ts`, never use `process.env` directly

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to apps/webapp/**/*.test.{ts,tsx} : For testable code in the webapp, never import env.server.ts in test files - pass configuration as options instead

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/**/*.test.{ts,tsx} : Test files should only import classes and functions from `app/**/*.ts` files and should not import `env.server.ts` directly or indirectly; pass configuration through options instead

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
📚 Learning: 2025-11-14T16:03:06.917Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2681
File: apps/webapp/app/services/platform.v3.server.ts:258-302
Timestamp: 2025-11-14T16:03:06.917Z
Learning: In `apps/webapp/app/services/platform.v3.server.ts`, the `getDefaultEnvironmentConcurrencyLimit` function intentionally throws an error (rather than falling back to org.maximumConcurrencyLimit) when the billing client returns undefined plan limits. This fail-fast behavior prevents users from receiving more concurrency than their plan entitles them to. The org.maximumConcurrencyLimit fallback is only for self-hosted deployments where no billing client exists.

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2026-01-12T17:18:09.451Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2870
File: apps/webapp/app/services/redisConcurrencyLimiter.server.ts:56-66
Timestamp: 2026-01-12T17:18:09.451Z
Learning: In `apps/webapp/app/services/redisConcurrencyLimiter.server.ts`, the query concurrency limiter will not be deployed with Redis Cluster mode, so multi-key operations (keyKey and globalKey in different hash slots) are acceptable and will function correctly in standalone Redis mode.

Applied to files:

  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Use the Run Engine 2.0 from `internal/run-engine` for new run lifecycle code in the webapp instead of the legacy run engine

Applied to files:

  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
🧬 Code graph analysis (2)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx (2)
apps/webapp/app/utils/numberFormatter.ts (1)
  • formatNumberCompact (3-5)
apps/webapp/app/components/primitives/Tooltip.tsx (1)
  • InfoIconTooltip (110-139)
apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (1)
apps/webapp/app/v3/runEngine.server.ts (1)
  • engine (11-11)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: typecheck / typecheck
🔇 Additional comments (5)
apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts (1)

1-56: LGTM!

The implementation correctly extends the Environment type with queueSizeLimit and computes it based on environment type with appropriate fallback chain (org override → env var → null). Environment variable access follows coding guidelines via ~/env.server.

apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx (1)

510-511: LGTM!

Clean consolidation from separate dev/deployed queue size quotas to a unified queueSize quota, conditionally rendered only when a limit is set. This aligns well with the LimitsPresenter changes.

apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx (2)

349-391: LGTM!

The queue size limit display with color-coded usage indicators is well-implemented. The conditional rendering shows the limit suffix when queueSizeLimit exists, with appropriate fallback to the paused indicator.

Minor note: The component has env from useEnvironment() (line 313) shadowing the env import from ~/env.server (line 57). This works correctly since the loader uses the imported env and the component uses the hook result, but renaming one variable (e.g., runtimeEnv for the hook result) could improve clarity.


1145-1152: LGTM!

Clean helper function with consistent threshold logic (90% warning, 100% error). Returning undefined when no limit is set allows proper fallback behavior in the calling code.
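
For reference, a helper with that threshold logic could be as small as this sketch (illustrative, not the route's actual code):

// Sketch: map current queue size to a severity with the thresholds
// described above (90% warning, 100% error). Names are illustrative.
function queueSizeSeverity(
  current: number,
  limit?: number
): "warning" | "error" | undefined {
  if (limit === undefined) return undefined; // no limit: callers fall back
  if (current >= limit) return "error"; // at or above 100%
  if (current >= limit * 0.9) return "warning"; // at or above 90%
  return undefined;
}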

apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (1)

15-15: LGTM on the structural changes.

The import of engine and the updated LimitsResult type with consolidated queueSize: QuotaInfo properly support the new unified queue size quota feature.

Also applies to: 70-70

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@apps/webapp/app/env.server.ts`:
- Around line 536-539: The MAXIMUM_DEV_QUEUE_SIZE default of 500 introduces an
unintended hard cap for dev environments because guardQueueSizeLimitsForEnv
reads this value and enforces it; remove the .default(500) so
MAXIMUM_DEV_QUEUE_SIZE remains optional/undefined unless explicitly set in the
environment, leaving the z.coerce.number().int().optional() schema for
MAXIMUM_DEV_QUEUE_SIZE and ensure any code calling guardQueueSizeLimitsForEnv
continues to treat undefined as "no cap" (verify guardQueueSizeLimitsForEnv
behavior and update it only if it currently treats undefined incorrectly).

In `@apps/webapp/app/presenters/v3/LimitsPresenter.server.ts`:
- Around line 313-319: The queue-size quota object (queueSize) in
LimitsPresenter.server.ts is missing the isUpgradable flag so the UI never shows
an Upgrade action; update the queueSize payload returned by the presenter to
include isUpgradable: true for plan-based limits (and add canExceed: true|false
as appropriate), e.g., set isUpgradable to true when
getQueueSizeLimitSource(environmentType, organization) indicates a plan-based
source and ensure the UI-facing fields (queueSize.currentUsage, queueSize.limit,
queueSize.source) remain unchanged.
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a0f94ff and a50a5f5.

📒 Files selected for processing (7)
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Access environment variables via env export from apps/webapp/app/env.server.ts, never use process.env directly

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Control concurrency using the `queue` property with `concurrencyLimit` option
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Control concurrency using the `queue` property with `concurrencyLimit` option

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
📚 Learning: 2025-11-14T16:03:06.917Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2681
File: apps/webapp/app/services/platform.v3.server.ts:258-302
Timestamp: 2025-11-14T16:03:06.917Z
Learning: In `apps/webapp/app/services/platform.v3.server.ts`, the `getDefaultEnvironmentConcurrencyLimit` function intentionally throws an error (rather than falling back to org.maximumConcurrencyLimit) when the billing client returns undefined plan limits. This fail-fast behavior prevents users from receiving more concurrency than their plan entitles them to. The org.maximumConcurrencyLimit fallback is only for self-hosted deployments where no billing client exists.

Applied to files:

  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-12-08T15:19:56.823Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2760
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx:278-281
Timestamp: 2025-12-08T15:19:56.823Z
Learning: In apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx, the tableState search parameter uses intentional double-encoding: the parameter value contains a URL-encoded URLSearchParams string, so decodeURIComponent(value("tableState") ?? "") is required to fully decode it before parsing with new URLSearchParams(). This pattern allows bundling multiple filter/pagination params as a single search parameter.

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use the `task()` function from `trigger.dev/sdk/v3` to define tasks with id and run properties

Applied to files:

  • apps/webapp/app/runEngine/concerns/queues.server.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Specify runtime environment (node or bun) in trigger.config.ts using the `runtime` property

Applied to files:

  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2026-01-12T17:18:09.451Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2870
File: apps/webapp/app/services/redisConcurrencyLimiter.server.ts:56-66
Timestamp: 2026-01-12T17:18:09.451Z
Learning: In `apps/webapp/app/services/redisConcurrencyLimiter.server.ts`, the query concurrency limiter will not be deployed with Redis Cluster mode, so multi-key operations (keyKey and globalKey in different hash slots) are acceptable and will function correctly in standalone Redis mode.

Applied to files:

  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
🧬 Code graph analysis (2)
apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts (1)
apps/webapp/app/v3/utils/queueLimits.server.ts (1)
  • getQueueSizeLimit (24-33)
apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (2)
apps/webapp/app/v3/runEngine.server.ts (1)
  • engine (11-11)
apps/webapp/app/v3/utils/queueLimits.server.ts (2)
  • getQueueSizeLimit (24-33)
  • getQueueSizeLimitSource (42-51)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (26)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: typecheck / typecheck
🔇 Additional comments (7)
apps/webapp/app/runEngine/concerns/queues.server.ts (3)

18-33: LRU cache setup looks solid.

The singleton + bounded LRU store should help reduce Redis churn without unbounded memory growth.
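
For context, a bounded LRU over a Map is enough to get this behavior; a sketch, not the PR's actual cache implementation:

// Sketch: a tiny bounded LRU using Map insertion order. Illustrative only.
class BoundedLru<V> {
  private store = new Map<string, V>();
  constructor(private maxSize: number) {}

  get(key: string): V | undefined {
    const value = this.store.get(key);
    if (value !== undefined) {
      this.store.delete(key);
      this.store.set(key, value); // refresh recency
    }
    return value;
  }

  set(key: string, value: V): void {
    this.store.delete(key);
    this.store.set(key, value);
    if (this.store.size > this.maxSize) {
      // evict the least recently used entry (first in insertion order)
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
  }
}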


298-317: Confirm stale cache is acceptable for enforcement.

guardQueueSizeLimitsForEnv now relies on an SWR cache; with a ~30s TTL, queue size can be undercounted and allow bursts above the limit. If this is a hard/billing limit, consider bypassing the cache for enforcement (or forcing a fresh read when nearing the limit).
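
One way to tighten enforcement, sketched here under assumed names, is to force a fresh read whenever the cached value is near the limit:

// Sketch: bypass the SWR cache when the cached size is within a safety
// margin of the limit, so enforcement reads are fresh near the cap.
async function getQueueSizeForEnforcement(
  cachedSize: number | undefined,
  limit: number,
  fetchFresh: () => Promise<number>,
  marginFraction = 0.1
): Promise<number> {
  const threshold = limit * (1 - marginFraction);
  if (cachedSize === undefined || cachedSize >= threshold) {
    return fetchFresh(); // near the limit: don't trust a stale count
  }
  return cachedSize;
}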


68-68: No issues to flag in these edits.

Also applies to: 94-96, 116-118, 130-131

apps/webapp/app/v3/utils/queueLimits.server.ts (1)

7-51: Queue-limit helpers look good.

Clear resolution order and source labeling; the API surface is minimal and consistent.

apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts (1)

4-52: queueSizeLimit wiring looks consistent.

The new field is derived in one place and keeps the presenter output cohesive.

apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (1)

2-18: API/type expansion for queue limits looks consistent.

The new environmentType parameter and consolidated quota surface align with the updated limit logic.

Also applies to: 58-96

apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx (1)

81-86: Loader and quota table updates look good.

Passing environmentType and gating the queue-size quota on a defined limit is consistent with the new presenter shape.

Also applies to: 511-513

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 536 to 539
MAXIMUM_DEV_QUEUE_SIZE: z.coerce.number().int().optional().default(500),
MAXIMUM_DEPLOYED_QUEUE_SIZE: z.coerce.number().int().optional(),
QUEUE_SIZE_CACHE_TTL_MS: z.coerce.number().int().optional().default(30_000), // 30 seconds
QUEUE_SIZE_CACHE_MAX_SIZE: z.coerce.number().int().optional().default(5_000),

⚠️ Potential issue | 🟠 Major

Defaulting MAXIMUM_DEV_QUEUE_SIZE changes enforcement behavior.

This turns previously-unlimited dev environments into a hard 500-queue cap (via guardQueueSizeLimitsForEnv). If that’s not intentional, remove the default and require an explicit env var to enable the limit.

💡 Suggested change (avoid unintended hard limit)
-    MAXIMUM_DEV_QUEUE_SIZE: z.coerce.number().int().optional().default(500),
+    MAXIMUM_DEV_QUEUE_SIZE: z.coerce.number().int().optional(),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
MAXIMUM_DEV_QUEUE_SIZE: z.coerce.number().int().optional().default(500),
MAXIMUM_DEPLOYED_QUEUE_SIZE: z.coerce.number().int().optional(),
QUEUE_SIZE_CACHE_TTL_MS: z.coerce.number().int().optional().default(30_000), // 30 seconds
QUEUE_SIZE_CACHE_MAX_SIZE: z.coerce.number().int().optional().default(5_000),
MAXIMUM_DEV_QUEUE_SIZE: z.coerce.number().int().optional(),
MAXIMUM_DEPLOYED_QUEUE_SIZE: z.coerce.number().int().optional(),
QUEUE_SIZE_CACHE_TTL_MS: z.coerce.number().int().optional().default(30_000), // 30 seconds
QUEUE_SIZE_CACHE_MAX_SIZE: z.coerce.number().int().optional().default(5_000),
🤖 Prompt for AI Agents
In `@apps/webapp/app/env.server.ts` around lines 536 - 539, The
MAXIMUM_DEV_QUEUE_SIZE default of 500 introduces an unintended hard cap for dev
environments because guardQueueSizeLimitsForEnv reads this value and enforces
it; remove the .default(500) so MAXIMUM_DEV_QUEUE_SIZE remains
optional/undefined unless explicitly set in the environment, leaving the
z.coerce.number().int().optional() schema for MAXIMUM_DEV_QUEUE_SIZE and ensure
any code calling guardQueueSizeLimitsForEnv continues to treat undefined as "no
cap" (verify guardQueueSizeLimitsForEnv behavior and update it only if it
currently treats undefined incorrectly).

Comment on lines 313 to 319
queueSize: {
name: "Max queued runs",
description: "Maximum pending runs across all queues in this environment",
limit: getQueueSizeLimit(environmentType, organization),
currentUsage: currentQueueSize,
source: getQueueSizeLimitSource(environmentType, organization),
},

⚠️ Potential issue | 🟡 Minor

Queue-size quota never shows an upgrade action.

isUpgradable is omitted, so the Upgrade column is empty even for plan-based limits. If upgrades should be offered, set it to true (and optionally canExceed).

💡 Suggested change
         queueSize: {
           name: "Max queued runs",
           description: "Maximum pending runs across all queues in this environment",
           limit: getQueueSizeLimit(environmentType, organization),
           currentUsage: currentQueueSize,
           source: getQueueSizeLimitSource(environmentType, organization),
+          isUpgradable: true,
         },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
queueSize: {
name: "Max queued runs",
description: "Maximum pending runs across all queues in this environment",
limit: getQueueSizeLimit(environmentType, organization),
currentUsage: currentQueueSize,
source: getQueueSizeLimitSource(environmentType, organization),
},
queueSize: {
name: "Max queued runs",
description: "Maximum pending runs across all queues in this environment",
limit: getQueueSizeLimit(environmentType, organization),
currentUsage: currentQueueSize,
source: getQueueSizeLimitSource(environmentType, organization),
isUpgradable: true,
},
🤖 Prompt for AI Agents
In `@apps/webapp/app/presenters/v3/LimitsPresenter.server.ts` around lines 313 -
319, The queue-size quota object (queueSize) in LimitsPresenter.server.ts is
missing the isUpgradable flag so the UI never shows an Upgrade action; update
the queueSize payload returned by the presenter to include isUpgradable: true
for plan-based limits (and add canExceed: true|false as appropriate), e.g., set
isUpgradable to true when getQueueSizeLimitSource(environmentType, organization)
indicates a plan-based source and ensure the UI-facing fields
(queueSize.currentUsage, queueSize.limit, queueSize.source) remain unchanged.

@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal-packages/run-engine/src/engine/tests/ttl.test.ts (1)

1-9: ⚠️ Potential issue | 🟡 Minor

Missing vi import from vitest.

Similar to lazyWaitpoint.test.ts, the vi global is used but not imported.

🔧 Proposed fix
 import { containerTest, assertNonNullable } from "@internal/testcontainers";
 import { trace } from "@internal/tracing";
-import { expect } from "vitest";
+import { expect, vi } from "vitest";
 import { RunEngine } from "../index.js";
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx (1)

5-5: ⚠️ Potential issue | 🟡 Minor

Use subpath export for tryCatch from @trigger.dev/core/utils.

Per coding guidelines, imports from @trigger.dev/core in the webapp must use subpath exports from the package.json. Change import { tryCatch } from "@trigger.dev/core"; to import { tryCatch } from "@trigger.dev/core/utils";.

🤖 Fix all issues with AI agents
In `@internal-packages/run-engine/src/batch-queue/completionTracker.ts`:
- Around line 112-130: The updateRunCount method currently trusts newRunCount;
add input validation in updateRunCount (in completionTracker.ts) to reject
negative, NaN, non-finite or non-integer values before calling
getMeta/storeMeta: use Number.isInteger(newRunCount) && newRunCount >= 0 (and
optionally Number.isFinite) and if the check fails log an error via
this.logger.error with batchId and the invalid value and then throw an Error or
return a rejected Promise so the invalid value is never persisted; keep existing
behavior for meta retrieval (getMeta) and storage (storeMeta) otherwise.

In `@internal-packages/run-engine/src/batch-queue/index.ts`:
- Around line 380-386: The updateRunCount method can lower runCount below the
already-processed count and leave a batch stuck; before calling
this.completionTracker.updateRunCount(batchId, newRunCount) clamp newRunCount to
at least the current processed count (e.g. newRunCount = Math.max(newRunCount,
this.completionTracker.getProcessedCount(batchId))) or, alternatively, after
updating call into the completion tracker to run the completion/finalization
check (e.g. this.completionTracker.checkCompletion/finalizeBatch for batchId) so
that if processedCount >= runCount the batch is finalized immediately; update
updateRunCount to perform one of these fixes using the existing
completionTracker APIs.

In `@internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts`:
- Around line 1-8: The test uses the vitest runner's mock/time API via
vi.setConfig but never imports vi; update the imports to explicitly import vi
from 'vitest' (for example change the existing import of expect to import {
expect, vi } from "vitest") so the vi symbol used by vi.setConfig is defined and
the test runner APIs are available.

In `@internal-packages/run-engine/src/run-queue/index.ts`:
- Around line 1238-1304: The code currently calls `#expireTtlRuns`(shard, now,
batchSize) which deletes/marks runs as expired before invoking
options.ttlSystem!.callback(expiredRuns), so callback failures only get logged
and those runs can be lost; update `#startTtlConsumer` (or `#expireTtlRuns`) so that
expiration is durable: either (A) defer permanent deletion/marking of runs until
after the callback completes successfully (move the delete/mark logic to after
options.ttlSystem!.callback returns) or (B) if deletion must happen first, add a
durable retry/enqueue path when the callback throws — e.g., push failed expired
run IDs to a retry store (Redis list/sorted set) with metadata and backoff, and
spawn a retry worker to reprocess them; ensure the callback and retry path are
idempotent and update logging to include run IDs and shard when queuing retries.
- Around line 2561-2637: expireTtlRuns currently removes run entries from
per-queue structures but never updates the master queue (masterQueueKey),
leaving stale entries; update the logic so that either (A) the Lua script also
computes the corresponding master/shard key from queueKey and ZREM the runId
from masterQueueKey (or adjust scores) for each expired member, or (B) return
the affected queueKey values (not just member) from expireTtlRuns so the
TypeScript caller can call the existing rebalance/cleanup routine for
masterQueueKey per returned queue; locate expireTtlRuns, the queueKey local
variable and the final results/return section to implement one of these fixes.
🧹 Nitpick comments (7)
packages/core/src/v3/schemas/api.ts (1)

412-424: Tighten validation for new count fields.
These are counts, so restricting them to non‑negative integers prevents accidental floats/negatives in API responses.

♻️ Suggested update
-  itemsSkipped: z.number().optional(),
+  itemsSkipped: z.number().int().nonnegative().optional(),
...
-  runCount: z.number().optional(),
+  runCount: z.number().int().nonnegative().optional(),
apps/webapp/app/runEngine/concerns/queues.server.ts (1)

421-432: Consider potential cache stampede scenario.

The SWR pattern handles stale reads well, but when the cache is completely cold (e.g., after deployment), multiple concurrent requests for the same queue could all trigger Redis lookups simultaneously before the first one populates the cache.

For most use cases this is acceptable, but if queue size lookups become a bottleneck, consider adding a coalescing mechanism.
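
If it ever does become a bottleneck, a single-flight map is a common coalescing mechanism; a sketch, hypothetical and not part of this PR:

// Sketch: single-flight coalescing so concurrent cold-cache lookups for
// the same key share one Redis round trip.
const inFlight = new Map<string, Promise<number>>();

function coalescedLookup(
  key: string,
  fetch: () => Promise<number>
): Promise<number> {
  const existing = inFlight.get(key);
  if (existing) return existing;

  const promise = fetch().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}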

apps/webapp/app/runEngine/services/triggerTask.server.ts (1)

258-269: Queue validation correctly uses resolved queue name.

The reordering to validate after resolveQueueProperties is correct since the queue name is needed for per-queue validation.

Consider enhancing the error message to include the queue name for better debuggability:

💡 Optional: Include queue name in error message
         if (!queueSizeGuard.ok) {
           throw new ServiceValidationError(
-            `Cannot trigger ${taskId} as the queue size limit for this environment has been reached. The maximum size is ${queueSizeGuard.maximumSize}`
+            `Cannot trigger ${taskId} as the queue size limit for queue '${queueName}' has been reached. The maximum size is ${queueSizeGuard.maximumSize}`
           );
         }
internal-packages/run-engine/src/run-queue/types.ts (1)

56-128: Convert RunQueueKeyProducer interface to a type alias to align with TypeScript guidelines.

♻️ Suggested refactor
-export interface RunQueueKeyProducer {
+export type RunQueueKeyProducer = {
   //queue
   queueKey(
     orgId: string,
     projId: string,
     envId: string,
     queue: string,
     concurrencyKey?: string
   ): string;
   queueKey(env: RunQueueKeyProducerEnvironment, queue: string, concurrencyKey?: string): string;

   legacyMasterQueueKey(masterQueueName: string): string;

   masterQueueKeyForEnvironment(envId: string, shardCount: number): string;
   masterQueueKeyForShard(shard: number): string;
   masterQueueShardForEnvironment(envId: string, shardCount: number): number;
   workerQueueKey(workerQueue: string): string;

   envQueueKey(env: RunQueueKeyProducerEnvironment): string;
   envQueueKeyFromQueue(queue: string): string;
   queueConcurrencyLimitKey(env: RunQueueKeyProducerEnvironment, queue: string): string;
   queueConcurrencyLimitKeyFromQueue(queue: string): string;
   queueCurrentConcurrencyKeyFromQueue(queue: string): string;
   queueCurrentConcurrencyKey(
     env: RunQueueKeyProducerEnvironment,
     queue: string,
     concurrencyKey?: string
   ): string;
   queueCurrentDequeuedKeyFromQueue(queue: string): string;
   queueCurrentDequeuedKey(
     env: RunQueueKeyProducerEnvironment,
     queue: string,
     concurrencyKey?: string
   ): string;

   //env concurrency
   envCurrentConcurrencyKey(env: EnvDescriptor): string;
   envCurrentConcurrencyKey(env: RunQueueKeyProducerEnvironment): string;

   envConcurrencyLimitKey(env: EnvDescriptor): string;
   envConcurrencyLimitKey(env: RunQueueKeyProducerEnvironment): string;

   envCurrentDequeuedKey(env: EnvDescriptor): string;
   envCurrentDequeuedKey(env: RunQueueKeyProducerEnvironment): string;

   envConcurrencyLimitBurstFactorKey(env: EnvDescriptor): string;
   envConcurrencyLimitBurstFactorKey(env: RunQueueKeyProducerEnvironment): string;
   envConcurrencyLimitBurstFactorKeyFromQueue(queue: string): string;

   envConcurrencyLimitKeyFromQueue(queue: string): string;
   envCurrentConcurrencyKeyFromQueue(queue: string): string;
   envCurrentDequeuedKeyFromQueue(queue: string): string;

   //message payload
   messageKeyPrefixFromQueue(queue: string): string;
   messageKey(orgId: string, messageId: string): string;
   //utils
   orgIdFromQueue(queue: string): string;
   envIdFromQueue(queue: string): string;
   projectIdFromQueue(queue: string): string;
   descriptorFromQueue(queue: string): QueueDescriptor;

   deadLetterQueueKey(env: RunQueueKeyProducerEnvironment): string;
   deadLetterQueueKey(env: EnvDescriptor): string;
   deadLetterQueueKeyFromQueue(queue: string): string;

   // Concurrency sweeper methods
   markedForAckKey(): string;
   currentConcurrencySetKeyScanPattern(): string;

   // TTL system methods
   ttlQueueKeyForShard(shard: number): string;
-}
+};
internal-packages/run-engine/src/engine/systems/ttlSystem.ts (2)

252-260: Unnecessary type cast for TaskRunStatus.

The string literal "EXPIRED" is a valid TaskRunStatus value, so the as TaskRunStatus cast is redundant.

🧹 Minor cleanup
         await this.$.prisma.taskRun.updateMany({
           where: { id: { in: runIdsToExpire } },
           data: {
-            status: "EXPIRED" as TaskRunStatus,
+            status: "EXPIRED",
             completedAt: now,
             expiredAt: now,
           },
         });

263-321: Consider error handling for individual run processing within the batch.

If any single run fails during snapshot creation or waitpoint completion, Promise.all will reject and subsequent runs won't be processed. Since the batch updateMany on line 252 has already succeeded, this could leave some runs in a partially processed state (status=EXPIRED but no snapshot).

For improved resilience, consider using Promise.allSettled and collecting failures:

💡 Suggested approach
-        await Promise.all(
-          runsToExpire.map(async (run) => {
+        const results = await Promise.allSettled(
+          runsToExpire.map(async (run) => {
             // ... existing processing ...
             expired.push(run.id);
+            return run.id;
           })
         );
+
+        // Log any failures but don't throw - the status update already succeeded
+        for (const result of results) {
+          if (result.status === "rejected") {
+            this.$.logger.error("Failed to fully process expired run", {
+              error: result.reason,
+            });
+          }
+        }
internal-packages/run-engine/src/run-queue/index.ts (1)

108-110: Use a type alias for the callback.

This project prefers type over interface in TS; a function-shaped callback is a clean fit for a type alias.

♻️ Suggested change
-export interface TtlSystemCallback {
-  (runs: Array<{ queueKey: string; runId: string; orgId: string }>): Promise<void>;
-}
+export type TtlSystemCallback = (
+  runs: Array<{ queueKey: string; runId: string; orgId: string }>
+) => Promise<void>;

As per coding guidelines: Use types over interfaces for TypeScript.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a50a5f5 and aaea8d6.

⛔ Files ignored due to path filters (1)
  • references/hello-world/src/trigger/batches.ts is excluded by !references/**
📒 Files selected for processing (32)
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • internal-packages/run-engine/src/engine/index.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/types.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • internal-packages/run-engine/src/run-queue/types.ts
  • packages/core/src/v3/schemas/api.ts
  • packages/trigger-sdk/src/v3/shared.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
  • apps/webapp/app/presenters/v3/EnvironmentQueuePresenter.server.ts
🧰 Additional context used
📓 Path-based instructions (12)
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{ts,tsx}: Use types over interfaces for TypeScript
Avoid using enums; prefer string unions or const objects instead

**/*.{ts,tsx}: Always import tasks from @trigger.dev/sdk, never use @trigger.dev/sdk/v3 or deprecated client.defineJob pattern
Every Trigger.dev task must be exported and have a unique id property with no timeouts in the run function

Files:

  • packages/core/src/v3/schemas/api.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • internal-packages/run-engine/src/run-queue/types.ts
  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/types.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • apps/webapp/app/runEngine/types.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • internal-packages/run-engine/src/engine/index.ts
{packages/core,apps/webapp}/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use zod for validation in packages/core and apps/webapp

Files:

  • packages/core/src/v3/schemas/api.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use function declarations instead of default exports

Import from @trigger.dev/core using subpaths only, never import from root

Files:

  • packages/core/src/v3/schemas/api.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • internal-packages/run-engine/src/run-queue/types.ts
  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/types.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • apps/webapp/app/runEngine/types.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • internal-packages/run-engine/src/engine/index.ts
**/*.ts

📄 CodeRabbit inference engine (.cursor/rules/otel-metrics.mdc)

**/*.ts: When creating or editing OTEL metrics (counters, histograms, gauges), ensure metric attributes have low cardinality by using only enums, booleans, bounded error codes, or bounded shard IDs
Do not use high-cardinality attributes in OTEL metrics such as UUIDs/IDs (envId, userId, runId, projectId, organizationId), unbounded integers (itemCount, batchSize, retryCount), timestamps (createdAt, startTime), or free-form strings (errorMessage, taskName, queueName)
When exporting OTEL metrics via OTLP to Prometheus, be aware that the exporter automatically adds unit suffixes to metric names (e.g., 'my_duration_ms' becomes 'my_duration_ms_milliseconds', 'my_counter' becomes 'my_counter_total'). Account for these transformations when writing Grafana dashboards or Prometheus queries

Files:

  • packages/core/src/v3/schemas/api.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • internal-packages/run-engine/src/run-queue/types.ts
  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/types.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • apps/webapp/app/runEngine/types.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • internal-packages/run-engine/src/engine/index.ts
**/*.{js,ts,jsx,tsx,json,md,yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Format code using Prettier before committing

Files:

  • packages/core/src/v3/schemas/api.ts
  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • internal-packages/run-engine/src/run-queue/types.ts
  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/types.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • apps/webapp/app/env.server.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/systems/runAttemptSystem.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • apps/webapp/app/runEngine/types.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
  • internal-packages/run-engine/src/engine/index.ts
{packages,integrations}/**/*

📄 CodeRabbit inference engine (CLAUDE.md)

Add a changeset when modifying any public package in packages/* or integrations/* using pnpm run changeset:add

Files:

  • packages/core/src/v3/schemas/api.ts
  • packages/trigger-sdk/src/v3/shared.ts
apps/webapp/app/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

Access all environment variables through the env export of env.server.ts instead of directly accessing process.env in the Trigger.dev webapp

Files:

  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
apps/webapp/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.cursor/rules/webapp.mdc)

apps/webapp/**/*.{ts,tsx}: When importing from @trigger.dev/core in the webapp, use subpath exports from the package.json instead of importing from the root path
Follow the Remix 2.1.0 and Express server conventions when updating the main trigger.dev webapp

Access environment variables via env export from apps/webapp/app/env.server.ts, never use process.env directly

Files:

  • apps/webapp/app/runEngine/services/createBatch.server.ts
  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
**/*.{test,spec}.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use vitest for all tests in the Trigger.dev repository

Files:

  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
**/*.test.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (AGENTS.md)

**/*.test.{ts,tsx,js,jsx}: Test files should live beside the files under test and use descriptive describe and it blocks
Tests should avoid mocks or stubs and use the helpers from @internal/testcontainers when Redis or Postgres are needed
Use vitest for running unit tests

**/*.test.{ts,tsx,js,jsx}: Use vitest exclusively for testing and never mock anything - use testcontainers instead
Place test files next to source files with naming pattern: source file (e.g., MyService.ts) → MyService.test.ts

Files:

  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (CLAUDE.md)

Use testcontainers helpers (redisTest, postgresTest, containerTest) from @internal/testcontainers for Redis/PostgreSQL testing instead of mocks

Files:

  • internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
packages/trigger-sdk/**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

In the Trigger.dev SDK (packages/trigger-sdk), prefer isomorphic code like fetch and ReadableStream instead of Node.js-specific code

Files:

  • packages/trigger-sdk/src/v3/shared.ts
🧠 Learnings (40)
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `idempotencyKeyTTL` option to define a time window during which duplicate triggers return the original run

Applied to files:

  • internal-packages/run-engine/src/run-queue/types.ts
  • internal-packages/run-engine/src/engine/types.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/run-queue/keyProducer.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • internal-packages/run-engine/src/run-queue/index.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2026-01-12T17:18:09.451Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2870
File: apps/webapp/app/services/redisConcurrencyLimiter.server.ts:56-66
Timestamp: 2026-01-12T17:18:09.451Z
Learning: In `apps/webapp/app/services/redisConcurrencyLimiter.server.ts`, the query concurrency limiter will not be deployed with Redis Cluster mode, so multi-key operations (keyKey and globalKey in different hash slots) are acceptable and will function correctly in standalone Redis mode.

Applied to files:

  • internal-packages/run-engine/src/run-queue/types.ts
  • apps/webapp/app/v3/runEngine.server.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use metadata methods (set, del, replace, append, remove, increment, decrement, stream, flush) to update metadata during task execution

Applied to files:

  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `runs.subscribeToBatch()` to subscribe to changes for all runs in a batch

Applied to files:

  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/batch-queue/index.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Attach metadata to task runs using the metadata option when triggering, and access/update it inside runs using metadata functions

Applied to files:

  • internal-packages/run-engine/src/batch-queue/completionTracker.ts
  • packages/trigger-sdk/src/v3/shared.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `.withStreams()` to subscribe to realtime streams from task metadata in addition to run changes

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Subscribe to run updates using `runs.subscribeToRun()` for realtime monitoring of task execution

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.batchTrigger()` to trigger multiple runs of a task from inside another task

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `tasks.batchTrigger()` to trigger multiple runs of a single task with different payloads

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to packages/trigger-sdk/**/*.{ts,tsx} : In the Trigger.dev SDK (packages/trigger-sdk), prefer isomorphic code like fetch and ReadableStream instead of Node.js-specific code

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.trigger()` to trigger multiple different tasks at once from backend code

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • apps/webapp/app/runEngine/services/streamBatchItems.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerByTaskAndWait()` to batch trigger tasks by passing task instances and wait for results

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerAndWait()` to batch trigger multiple different tasks and wait for results

Applied to files:

  • packages/trigger-sdk/src/v3/shared.ts
  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-14T16:03:06.917Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2681
File: apps/webapp/app/services/platform.v3.server.ts:258-302
Timestamp: 2025-11-14T16:03:06.917Z
Learning: In `apps/webapp/app/services/platform.v3.server.ts`, the `getDefaultEnvironmentConcurrencyLimit` function intentionally throws an error (rather than falling back to org.maximumConcurrencyLimit) when the billing client returns undefined plan limits. This fail-fast behavior prevents users from receiving more concurrency than their plan entitles them to. The org.maximumConcurrencyLimit fallback is only for self-hosted deployments where no billing client exists.

Applied to files:

  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Control concurrency using the `queue` property with `concurrencyLimit` option

Applied to files:

  • apps/webapp/app/v3/utils/queueLimits.server.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
  • internal-packages/run-engine/src/run-queue/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use the `task()` function from `trigger.dev/sdk/v3` to define tasks with id and run properties

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
  • apps/webapp/app/runEngine/concerns/queues.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.triggerAndWait()` to trigger a task and wait for its result from a parent task

Applied to files:

  • internal-packages/run-engine/src/engine/systems/waitpointSystem.ts
  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:26:44.496Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/executing-commands.mdc:0-0
Timestamp: 2025-11-27T16:26:44.496Z
Learning: For running tests, navigate into the package directory and run `pnpm run test --run` to enable single-file test execution (e.g., `pnpm run test ./src/engine/tests/ttl.test.ts --run`)

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • internal-packages/run-engine/src/engine/tests/attemptFailures.test.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:26:37.432Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-27T16:26:37.432Z
Learning: Applies to **/*.{test,spec}.{ts,tsx} : Use vitest for all tests in the Trigger.dev repository

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
📚 Learning: 2026-01-15T11:50:06.067Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-01-15T11:50:06.067Z
Learning: Applies to **/*.{ts,tsx} : Every Trigger.dev task must be exported and have a unique `id` property with no timeouts in the run function

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-10-08T11:48:12.327Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2593
File: packages/core/src/v3/workers/warmStartClient.ts:168-170
Timestamp: 2025-10-08T11:48:12.327Z
Learning: The trigger.dev runners execute only in Node 21 and 22 environments, so modern Node.js APIs like AbortSignal.any (introduced in v20.3.0) are supported.

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
📚 Learning: 2025-07-12T18:06:04.133Z
Learnt from: matt-aitken
Repo: triggerdotdev/trigger.dev PR: 2264
File: apps/webapp/app/services/runsRepository.server.ts:172-174
Timestamp: 2025-07-12T18:06:04.133Z
Learning: In apps/webapp/app/services/runsRepository.server.ts, the in-memory status filtering after fetching runs from Prisma is intentionally used as a workaround for ClickHouse data delays. This approach is acceptable because the result set is limited to a maximum of 100 runs due to pagination, making the performance impact negligible.

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/tests/ttl.test.ts
  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `yourTask.batchTriggerAndWait()` to batch trigger tasks and wait for all results from a parent task

Applied to files:

  • internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts
  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
  • internal-packages/run-engine/src/engine/index.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `tasks.trigger()` with type-only imports to trigger tasks from backend code without importing the task implementation

Applied to files:

  • internal-packages/run-engine/src/engine/tests/trigger.test.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Applies to apps/webapp/app/**/*.{ts,tsx} : Access all environment variables through the `env` export of `env.server.ts` instead of directly accessing `process.env` in the Trigger.dev webapp

Applied to files:

  • apps/webapp/app/env.server.ts
  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-08-14T18:35:44.370Z
Learnt from: nicktrn
Repo: triggerdotdev/trigger.dev PR: 2390
File: apps/webapp/app/env.server.ts:764-765
Timestamp: 2025-08-14T18:35:44.370Z
Learning: The BoolEnv helper in apps/webapp/app/utils/boolEnv.ts uses z.preprocess with inconsistent default value types across the codebase - some usages pass boolean defaults (correct) while others pass string defaults (incorrect), leading to type confusion. The helper should enforce boolean-only defaults or have clearer documentation.

Applied to files:

  • apps/webapp/app/env.server.ts
📚 Learning: 2025-08-19T09:49:07.011Z
Learnt from: julienvanbeveren
Repo: triggerdotdev/trigger.dev PR: 2417
File: apps/webapp/app/routes/api.v1.projects.$projectRef.envvars.$slug.import.ts:56-61
Timestamp: 2025-08-19T09:49:07.011Z
Learning: In the Trigger.dev codebase, environment variables should default to `isSecret: false` when not explicitly marked as secrets in the syncEnvVars functionality. This is the intended behavior for both regular variables and parent variables.

Applied to files:

  • apps/webapp/app/env.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Scope idempotency keys globally or to current run using the scope parameter

Applied to files:

  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `idempotencyKeys.create()` to create idempotency keys for preventing duplicate task executions

Applied to files:

  • apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Export tasks with unique IDs within the project to enable proper task discovery and execution

Applied to files:

  • apps/webapp/app/runEngine/services/triggerTask.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `schedules.task()` for scheduled/cron tasks instead of regular `task()`

Applied to files:

  • apps/webapp/app/runEngine/services/triggerTask.server.ts
  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `schemaTask()` from `trigger.dev/sdk/v3` with Zod schema for payload validation

Applied to files:

  • apps/webapp/app/runEngine/services/triggerTask.server.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger.config.ts : Specify runtime environment (node or bun) in trigger.config.ts using the `runtime` property

Applied to files:

  • apps/webapp/app/presenters/v3/LimitsPresenter.server.ts
📚 Learning: 2025-12-18T14:09:10.154Z
Learnt from: ericallam
Repo: triggerdotdev/trigger.dev PR: 2794
File: internal-packages/run-engine/src/engine/systems/debounceSystem.ts:390-397
Timestamp: 2025-12-18T14:09:10.154Z
Learning: In the debounce system (internal-packages/run-engine/src/engine/systems/debounceSystem.ts), millisecond delays are not supported. The minimum debounce delay is 1 second (1s). The parseNaturalLanguageDuration function supports w/d/hr/h/m/s units only.

Applied to files:

  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Limit task duration using the `maxDuration` property (in seconds)

Applied to files:

  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
  • internal-packages/run-engine/src/engine/systems/ttlSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use logger methods (debug, log, info, warn, error) from `trigger.dev/sdk/v3` for structured logging in tasks

Applied to files:

  • internal-packages/run-engine/src/engine/systems/enqueueSystem.ts
📚 Learning: 2025-11-27T16:27:35.304Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/writing-tasks.mdc:0-0
Timestamp: 2025-11-27T16:27:35.304Z
Learning: Applies to **/trigger/**/*.{ts,tsx,js,jsx} : Use `batch.triggerByTask()` to batch trigger tasks by passing task instances for static task sets

Applied to files:

  • apps/webapp/app/runEngine/types.ts
  • apps/webapp/app/runEngine/services/batchTrigger.server.ts
📚 Learning: 2026-02-03T18:27:40.429Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2994
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.environment-variables/route.tsx:553-555
Timestamp: 2026-02-03T18:27:40.429Z
Learning: In apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.environment-variables/route.tsx, the menu buttons (e.g., Edit with PencilSquareIcon) in the TableCellMenu are intentionally icon-only with no text labels as a compact UI pattern. This is a deliberate design choice for this route; preserve the icon-only behavior for consistency in this file.

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
📚 Learning: 2025-12-08T15:19:56.823Z
Learnt from: 0ski
Repo: triggerdotdev/trigger.dev PR: 2760
File: apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx:278-281
Timestamp: 2025-12-08T15:19:56.823Z
Learning: In apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.runs.$runParam/route.tsx, the tableState search parameter uses intentional double-encoding: the parameter value contains a URL-encoded URLSearchParams string, so decodeURIComponent(value("tableState") ?? "") is required to fully decode it before parsing with new URLSearchParams(). This pattern allows bundling multiple filter/pagination params as a single search parameter.

Applied to files:

  • apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx
📚 Learning: 2025-11-27T16:26:58.661Z
Learnt from: CR
Repo: triggerdotdev/trigger.dev PR: 0
File: .cursor/rules/webapp.mdc:0-0
Timestamp: 2025-11-27T16:26:58.661Z
Learning: Use the Run Engine 2.0 from `internal/run-engine` for new run lifecycle code in the webapp instead of the legacy run engine

Applied to files:

  • internal-packages/run-engine/src/engine/index.ts
🧬 Code graph analysis (15)
apps/webapp/app/v3/utils/queueLimits.server.ts (1)
apps/webapp/app/env.server.ts (1)
  • env (1332-1332)
internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)
internal-packages/run-engine/src/engine/statuses.ts (1)
  • isFinalRunStatus (55-57)
apps/webapp/app/v3/runEngine.server.ts (2)
apps/webapp/app/env.server.ts (1)
  • env (1332-1332)
apps/supervisor/src/env.ts (1)
  • env (144-144)
internal-packages/run-engine/src/run-queue/keyProducer.ts (1)
internal-packages/run-engine/src/run-queue/index.ts (4)
  • shard (1170-1236)
  • shard (1258-1318)
  • shard (1324-1350)
  • shard (1420-1545)
apps/webapp/app/env.server.ts (2)
apps/webapp/app/utils/boolEnv.ts (1)
  • BoolEnv (12-14)
apps/supervisor/src/envUtil.ts (1)
  • BoolEnv (15-17)
apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts (1)
packages/core/src/v3/isomorphic/friendlyId.ts (1)
  • RunId (93-93)
apps/webapp/app/runEngine/services/triggerTask.server.ts (3)
internal-packages/run-engine/src/engine/errors.ts (1)
  • ServiceValidationError (69-78)
apps/webapp/app/v3/services/common.server.ts (1)
  • ServiceValidationError (1-6)
apps/webapp/app/utils/packets.ts (1)
  • handleMetadataPacket (11-37)
apps/webapp/app/presenters/v3/LimitsPresenter.server.ts (2)
apps/webapp/app/v3/runEngine.server.ts (1)
  • engine (11-11)
apps/webapp/app/v3/utils/queueLimits.server.ts (2)
  • getQueueSizeLimit (24-33)
  • getQueueSizeLimitSource (42-51)
internal-packages/run-engine/src/batch-queue/index.ts (3)
apps/webapp/app/runEngine/services/batchTrigger.server.ts (1)
  • batchId (164-350)
packages/core/src/v3/apiClient/index.ts (1)
  • batchId (419-537)
internal-packages/run-engine/src/engine/systems/batchSystem.ts (1)
  • batchId (39-137)
internal-packages/run-engine/src/engine/systems/enqueueSystem.ts (1)
packages/core/src/v3/isomorphic/duration.ts (1)
  • parseNaturalLanguageDuration (76-79)
apps/webapp/app/runEngine/types.ts (2)
internal-packages/run-engine/src/run-queue/fairQueueSelectionStrategy.ts (4)
  • env (501-521)
  • env (560-578)
  • env (580-594)
  • env (596-616)
internal-packages/run-engine/src/shared/index.ts (1)
  • AuthenticatedEnvironment (4-6)
apps/webapp/app/runEngine/concerns/queues.server.ts (2)
apps/webapp/app/v3/models/workerDeployment.server.ts (1)
  • findCurrentWorkerFromEnvironment (198-224)
apps/webapp/app/runEngine/types.ts (1)
  • QueueValidationResult (38-46)
internal-packages/run-engine/src/run-queue/index.ts (1)
internal-packages/redis/src/index.ts (2)
  • Callback (4-4)
  • Result (4-4)
apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.limits/route.tsx (1)
apps/webapp/app/utils/numberFormatter.ts (1)
  • formatNumber (15-22)
internal-packages/run-engine/src/engine/index.ts (2)
internal-packages/run-engine/src/run-queue/index.ts (2)
  • options (1152-1168)
  • options (1240-1256)
internal-packages/run-engine/src/shared/index.ts (1)
  • MinimalAuthenticatedEnvironment (8-19)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (7, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (2, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (5, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (4, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (1, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (6, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (6, 8)
  • GitHub Check: units / webapp / 🧪 Unit Tests: Webapp (3, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (8, 8)
  • GitHub Check: units / internal / 🧪 Unit Tests: Internal (5, 8)
  • GitHub Check: sdk-compat / Cloudflare Workers
  • GitHub Check: typecheck / typecheck
  • GitHub Check: sdk-compat / Deno Runtime
  • GitHub Check: units / packages / 🧪 Unit Tests: Packages (1, 1)
  • GitHub Check: sdk-compat / Node.js 20.20 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
  • GitHub Check: sdk-compat / Bun Runtime
  • GitHub Check: sdk-compat / Node.js 22.12 (ubuntu-latest)
  • GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines +112 to +130
  /**
   * Update the runCount in batch metadata.
   * Used when items are skipped due to queue limits.
   */
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
    const meta = await this.getMeta(batchId);
    if (!meta) {
      this.logger.error("Cannot update runCount: batch metadata not found", { batchId });
      return;
    }

    const updatedMeta: BatchMeta = {
      ...meta,
      runCount: newRunCount,
    };

    await this.storeMeta(batchId, updatedMeta);
    this.logger.debug("Updated batch runCount", { batchId, oldRunCount: meta.runCount, newRunCount });
  }

⚠️ Potential issue | 🟡 Minor

Guard against invalid runCount values.
updateRunCount should reject negative/NaN/non‑integer values to avoid corrupting batch metadata.

🛡️ Suggested guard
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
+    if (!Number.isInteger(newRunCount) || newRunCount < 0) {
+      this.logger.error("Invalid runCount update", { batchId, newRunCount });
+      return;
+    }
     const meta = await this.getMeta(batchId);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  /**
   * Update the runCount in batch metadata.
   * Used when items are skipped due to queue limits.
   */
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
    if (!Number.isInteger(newRunCount) || newRunCount < 0) {
      this.logger.error("Invalid runCount update", { batchId, newRunCount });
      return;
    }

    const meta = await this.getMeta(batchId);
    if (!meta) {
      this.logger.error("Cannot update runCount: batch metadata not found", { batchId });
      return;
    }

    const updatedMeta: BatchMeta = {
      ...meta,
      runCount: newRunCount,
    };

    await this.storeMeta(batchId, updatedMeta);
    this.logger.debug("Updated batch runCount", { batchId, oldRunCount: meta.runCount, newRunCount });
  }
🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/batch-queue/completionTracker.ts` around
lines 112 - 130, The updateRunCount method currently trusts newRunCount; add
input validation in updateRunCount (in completionTracker.ts) to reject negative,
NaN, non-finite or non-integer values before calling getMeta/storeMeta: use
Number.isInteger(newRunCount) && newRunCount >= 0 (and optionally
Number.isFinite) and if the check fails log an error via this.logger.error with
batchId and the invalid value and then throw an Error or return a rejected
Promise so the invalid value is never persisted; keep existing behavior for meta
retrieval (getMeta) and storage (storeMeta) otherwise.

Comment on lines +380 to +386
  /**
   * Update the runCount for a batch.
   * Used when items are skipped due to queue limits.
   */
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
    return this.completionTracker.updateRunCount(batchId, newRunCount);
  }

⚠️ Potential issue | 🟠 Major

Batch can get stuck if runCount drops after processing completes.
If newRunCount is lowered to a value already reached by processedCount, no further items may run to trigger finalization, leaving the batch incomplete.

🛠️ Suggested fix
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
-    return this.completionTracker.updateRunCount(batchId, newRunCount);
+    await this.completionTracker.updateRunCount(batchId, newRunCount);
+    const meta = await this.completionTracker.getMeta(batchId);
+    if (!meta) return;
+    const processedCount = await this.completionTracker.getProcessedCount(batchId);
+    if (processedCount >= meta.runCount) {
+      await this.#finalizeBatch(batchId, meta);
+    }
  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  /**
   * Update the runCount for a batch.
   * Used when items are skipped due to queue limits.
   */
  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
    await this.completionTracker.updateRunCount(batchId, newRunCount);
    const meta = await this.completionTracker.getMeta(batchId);
    if (!meta) return;
    const processedCount = await this.completionTracker.getProcessedCount(batchId);
    if (processedCount >= meta.runCount) {
      await this.#finalizeBatch(batchId, meta);
    }
  }
🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/batch-queue/index.ts` around lines 380 -
386, The updateRunCount method can lower runCount below the already-processed
count and leave a batch stuck; before calling
this.completionTracker.updateRunCount(batchId, newRunCount) clamp newRunCount to
at least the current processed count (e.g. newRunCount = Math.max(newRunCount,
this.completionTracker.getProcessedCount(batchId))) or, alternatively, after
updating call into the completion tracker to run the completion/finalization
check (e.g. this.completionTracker.checkCompletion/finalizeBatch for batchId) so
that if processedCount >= runCount the batch is finalized immediately; update
updateRunCount to perform one of these fixes using the existing
completionTracker APIs.

Comment on lines +1 to +8
import { containerTest, assertNonNullable } from "@internal/testcontainers";
import { trace } from "@internal/tracing";
import { expect } from "vitest";
import { RunEngine } from "../index.js";
import { setTimeout } from "node:timers/promises";
import { setupAuthenticatedEnvironment, setupBackgroundWorker } from "./setup.js";

vi.setConfig({ testTimeout: 60_000 });

⚠️ Potential issue | 🟡 Minor

Missing vi import from vitest.

The vi global is used on line 8 but not imported. While vitest may auto-inject globals, explicit imports are more reliable.

🔧 Proposed fix
 import { containerTest, assertNonNullable } from "@internal/testcontainers";
 import { trace } from "@internal/tracing";
-import { expect } from "vitest";
+import { expect, vi } from "vitest";
 import { RunEngine } from "../index.js";
🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/engine/tests/lazyWaitpoint.test.ts` around
lines 1 - 8, The test configures the vitest runner via vi.setConfig but never
imports vi; update the imports to explicitly import vi from 'vitest'
(for example change the existing import of expect to import { expect, vi } from
"vitest") so the vi symbol used by vi.setConfig is defined and the test runner
APIs are available.

Comment on lines +1238 to +1304
  // TTL System Methods

  #startTtlConsumers() {
    if (!this.options.ttlSystem) {
      this.logger.debug("TTL system disabled (no ttlSystem config)");
      return;
    }

    const shardCount = this.options.ttlSystem.shardCount ?? this.shardCount;

    for (let i = 0; i < shardCount; i++) {
      this.logger.debug(`Starting TTL consumer ${i}`);
      this.#startTtlConsumer(i).catch((err) => {
        this.logger.error(`Failed to start TTL consumer ${i}`, { error: err });
      });
    }

    this.logger.debug(`Started ${shardCount} TTL consumers`);
  }

  async #startTtlConsumer(shard: number) {
    if (!this.options.ttlSystem) {
      return;
    }

    const pollIntervalMs = this.options.ttlSystem.pollIntervalMs ?? 1000;
    const batchSize = this.options.ttlSystem.batchSize ?? 100;
    let processedCount = 0;

    try {
      for await (const _ of setInterval(pollIntervalMs, null, {
        signal: this.abortController.signal,
      })) {
        const now = Date.now();

        const [error, expiredRuns] = await tryCatch(
          this.#expireTtlRuns(shard, now, batchSize)
        );

        if (error) {
          this.logger.error(`Failed to expire TTL runs for shard ${shard}`, {
            error,
            service: this.name,
            shard,
          });
          continue;
        }

        if (expiredRuns.length > 0) {
          this.logger.debug(`Expired ${expiredRuns.length} TTL runs in shard ${shard}`, {
            service: this.name,
            shard,
            count: expiredRuns.length,
          });

          // Call the callback with expired runs
          try {
            await this.options.ttlSystem!.callback(expiredRuns);
            processedCount += expiredRuns.length;
          } catch (callbackError) {
            this.logger.error(`TTL callback failed for shard ${shard}`, {
              error: callbackError,
              service: this.name,
              shard,
              runCount: expiredRuns.length,
            });
          }

⚠️ Potential issue | 🟠 Major

Callback failures can silently drop expired runs.

#expireTtlRuns() removes runs before invoking the callback, and failures are only logged. If the callback is responsible for persisting expiration state, those runs may be lost without retry. Consider a durable retry path (e.g., a Redis list/sorted set of failed expirations, or a dedicated worker with backoff) or make the callback itself handle retry + idempotency explicitly.
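
One possible shape for that durable path, sketched with ioredis (the ttl:failed-expirations key, the client wiring, and the backoff numbers are illustrative assumptions, not part of this PR):

import { Redis } from "ioredis";

type ExpiredRun = { queueKey: string; runId: string; orgId: string };

// Hypothetical key for parking expirations whose callback failed.
const FAILED_KEY = "ttl:failed-expirations";

// On callback failure: park the runs in a sorted set scored by next-retry time.
async function parkFailedExpirations(redis: Redis, runs: ExpiredRun[], attempt: number) {
  const retryAt = Date.now() + Math.min(2 ** attempt * 1_000, 60_000); // capped backoff
  const args = runs.flatMap((run) => [String(retryAt), JSON.stringify({ ...run, attempt })]);
  await redis.zadd(FAILED_KEY, ...args);
}

// In a retry worker: pull due entries and re-invoke the (idempotent) callback.
async function retryFailedExpirations(
  redis: Redis,
  callback: (runs: ExpiredRun[]) => Promise<void>
) {
  const due = await redis.zrangebyscore(FAILED_KEY, 0, Date.now(), "LIMIT", 0, 100);
  if (due.length === 0) return;

  const entries = due.map((raw) => JSON.parse(raw) as ExpiredRun & { attempt: number });
  try {
    await callback(entries);
    await redis.zrem(FAILED_KEY, ...due); // drop entries only after the callback succeeds
  } catch {
    // Reschedule with increased backoff; the runs stay durable in Redis.
    await redis.zrem(FAILED_KEY, ...due);
    await Promise.all(entries.map((e) => parkFailedExpirations(redis, [e], e.attempt + 1)));
  }
}

Either way the callback must be idempotent, since a retry can overlap with a partially applied first attempt.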

🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/run-queue/index.ts` around lines 1238 -
1304, The code currently calls `#expireTtlRuns`(shard, now, batchSize) which
deletes/marks runs as expired before invoking
options.ttlSystem!.callback(expiredRuns), so callback failures only get logged
and those runs can be lost; update `#startTtlConsumer` (or `#expireTtlRuns`) so that
expiration is durable: either (A) defer permanent deletion/marking of runs until
after the callback completes successfully (move the delete/mark logic to after
options.ttlSystem!.callback returns) or (B) if deletion must happen first, add a
durable retry/enqueue path when the callback throws — e.g., push failed expired
run IDs to a retry store (Redis list/sorted set) with metadata and backoff, and
spawn a retry worker to reprocess them; ensure the callback and retry path are
idempotent and update logging to include run IDs and shard when queuing retries.

@ericallam ericallam changed the title feat(dashboard): Display environment queue length limits on queues and limits page feat(server): New TTL system, enforce max queue length limits, lazy waitpoint creation Feb 5, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal-packages/run-engine/src/batch-queue/index.ts (1)

958-976: ⚠️ Potential issue | 🟠 Major

Completion check at Line 969 uses meta.runCount which may be stale.

meta is fetched once per message at Lines 762-764. If updateRunCount was called between the meta fetch and the completion check, the comparison processedCount === meta.runCount could use an outdated value, potentially missing or double-triggering finalization. For correctness, consider re-fetching runCount atomically alongside the processedCount increment, or having updateRunCount trigger a finalization check.
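
A sketch of the atomic re-fetch idea, using an ioredis Lua command (the batch:* key layout and the runCount hash field are hypothetical, not the tracker's actual schema):

import { Redis } from "ioredis";

const redis = new Redis();

// Increment processedCount and read runCount in one atomic step, so the
// completion comparison cannot race with a concurrent updateRunCount.
redis.defineCommand("incrementAndCheck", {
  numberOfKeys: 2,
  lua: `
    local processed = redis.call("INCR", KEYS[1])
    local runCount = tonumber(redis.call("HGET", KEYS[2], "runCount") or "-1")
    return { processed, runCount }
  `,
});

// Returns true when the batch is complete at the moment of the increment.
async function recordItemProcessed(batchId: string): Promise<boolean> {
  // defineCommand attaches the method at runtime, hence the cast.
  const [processed, runCount] = (await (redis as any).incrementAndCheck(
    `batch:${batchId}:processed`,
    `batch:${batchId}:meta`
  )) as [number, number];
  return runCount >= 0 && processed >= runCount;
}

This removes the stale read within a single check; a runCount lowered after the last increment would still need updateRunCount to run its own finalization pass, as the earlier comment on updateRunCount suggests.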

🤖 Fix all issues with AI agents
In `@apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx`:
- Around line 367-370: The call to getQueueUsageColorClass(environment.queued,
environment.queueSizeLimit) is unresolved — add a definition or import for
getQueueUsageColorClass that mirrors the logic of the existing
getUsageColorClass used in the limits route: implement a function
getQueueUsageColorClass(queued: number | undefined, queueSizeLimit: number |
undefined) that returns a CSS class string (or null) based on usage percentage
thresholds and handles undefined values; place it in the same file (or import
it) so valueClassName can call it, and ensure it returns null when no class
should be applied so the existing env.paused fallback ("text-warning
tabular-nums" or "tabular-nums") continues to work.
- Around line 515-518: The per-queue limit check (isAtQueueLimit) is using
queue.queued against environment.queueSizeLimit which is an environment-wide
cap; change the comparison to use environment.queued (total queued across all
queues) versus environment.queueSizeLimit so the badge reflects environment
capacity (or implement a true per-queue limit if intended). Update the
isAtQueueLimit logic to reference environment.queued and
environment.queueSizeLimit (consistent with the BigNumber usage) and remove or
revise any reliance on queue.queued for this environment-wide badge.

In `@apps/webapp/app/v3/runEngineHandlers.server.ts`:
- Around line 722-754: The duplicated pre-failed run creation via
triggerFailedTaskService.call when isFinalAttempt can itself throw and cause a
second attempt in the outer catch — wrap each inner invocation of
triggerFailedTaskService.call (the calls inside the isFinalAttempt branch and
the analogous call in the outer catch) in a small try/catch so any errors from
that call are caught and handled locally (log the error to the existing
span/process logger and avoid rethrowing), ensure span.end() still runs, and
return the same successful pre-failed-run response when a failedRunId is
returned; this prevents a partial success from causing a duplicate pre-failed
run while preserving the current control flow for both the isFinalAttempt path
and the outer catch path.

In `@internal-packages/run-engine/src/batch-queue/completionTracker.ts`:
- Around line 116-130: The updateRunCount function performs a non-atomic
read-modify-write (getMeta → modify → storeMeta) which can lose updates under
concurrent callers; change it to perform the increment atomically in the backing
store (e.g., use a Redis Lua script or the same atomic pattern used by
recordSuccessIdempotent) so runCount is updated with a single atomic operation,
or enforce/document single-writer semantics; specifically replace the current
getMeta/storeMeta sequence in updateRunCount with an atomic increment/update
that targets the BatchMeta.runCount field (or wrap the logic in a Lua script
that reads, increments/sets and returns the new value) and update logging to
reflect the atomic result.

In `@internal-packages/run-engine/src/engine/index.ts`:
- Around line 2275-2300: The method `#resolveMaxTtl` currently returns the
original unparseable ttl string when parseNaturalLanguageDurationInMs(ttl)
yields undefined, which allows invalid TTLs to bypass the max clamp; change the
logic so that if ttl fails to parse but the configured options.defaultMaxTtl
parses successfully, you return the parsed max (i.e., maxTtl) to enforce the
cap; keep the existing fallback that if defaultMaxTtl itself is unparseable
(parseNaturalLanguageDurationInMs(maxTtl) === undefined) you return ttl as
before.
- Around line 2245-2273: The `#ttlExpiredCallback` currently calls
this.ttlSystem.expireRunsBatch after a Lua script has already removed runs from
Redis, so if expireRunsBatch throws and the DB update fails, the runs become
orphaned; to fix, add compensation in `#ttlExpiredCallback` to detect errors from
ttlSystem.expireRunsBatch and, on failure, re-add the affected runIds back into
the TTL structures (sorted set/queue/concurrency tracking) via a new ttlSystem
method (e.g., ttlSystem.requeueRuns or ttlSystem.restoreRuns) or invoke existing
requeue logic, and/or schedule the runIds for a background cleanup retry; ensure
the added code references `#ttlExpiredCallback`, ttlSystem.expireRunsBatch, and
the Lua removal behavior so the requeue/cleanup is called only when
expireRunsBatch fails.

In `@internal-packages/run-engine/src/engine/systems/ttlSystem.ts`:
- Around line 248-291: The batch update (taskRun.updateMany) followed by per-run
updates (taskRun.update and taskRunExecutionSnapshot.create) can leave runs
marked EXPIRED without snapshots/errors if any per-run op fails; change
ttlExpiredCallback to perform each run's status change, error write, and
snapshot creation inside a single per-run prisma.$transaction() (instead of the
global updateMany + Promise.all) so that setting status to EXPIRED,
completedAt/expiredAt, writing the error, and creating taskRunExecutionSnapshot
happen atomically for each run (operate on runsToExpire entries and replace uses
of taskRun.updateMany, taskRun.update, and taskRunExecutionSnapshot.create
accordingly), and ensure transaction failures are surfaced/logged so those runs
remain PENDING for the next TTL poll.

In `@internal-packages/run-engine/src/run-queue/index.ts`:
- Around line 2710-2748: The dequeue Lua branch that handles TTL expiration
(checks using ttlExpiresAt <= currentTime) currently removes the message from
queueKey, envQueueKey, messageKey and ttlQueueKey but never reports the expired
run to the application; to fix this, stop evicting TTL-expired messages in the
dequeue path: modify the ttlExpiresAt <= currentTime branch so it does not
remove the message or its TTL membership (i.e., do not call ZREM on
queueKey/envQueueKey or DEL messageKey or ZREM ttlQueueKey) and simply skip
adding the run to results (leaving the message for the TTL consumer to handle),
ensuring that results, dequeuedCount and SADD calls are untouched for the
non-expired branch; locate and change the code around the symbols ttlExpiresAt,
currentTime, queueKey, envQueueKey, ttlQueueKey, messageKey, results and
dequeuedCount.
🧹 Nitpick comments (10)
apps/webapp/app/env.server.ts (1)

536-540: QUEUE_SIZE_CACHE_ENABLED uses numeric type instead of BoolEnv.

Other boolean-like env vars in this file (e.g., RUN_ENGINE_TTL_SYSTEM_DISABLED, BATCH_TRIGGER_CACHED_RUNS_CHECK_ENABLED) use BoolEnv. Using z.coerce.number().int() for an enable/disable flag is inconsistent and forces consumers to compare against 1 instead of using a proper boolean.

Suggested change
-    QUEUE_SIZE_CACHE_ENABLED: z.coerce.number().int().optional().default(1),
+    QUEUE_SIZE_CACHE_ENABLED: BoolEnv.default(true),
apps/webapp/app/runEngine/concerns/idempotencyKeys.server.ts (1)

120-122: Non-null assertion on associatedWaitpoint! is safe but could be cleaner.

The ! is safe because of the early return on line 98-100, but TypeScript can't narrow let variables inside closures. Consider assigning to a const after the guard to eliminate the assertion.

Suggested approach
         if (!associatedWaitpoint) {
           return { isCached: true, run: existingRun };
         }
+        const waitpoint = associatedWaitpoint;

         await this.traceEventConcern.traceIdempotentRun(
           ...
             await this.engine.blockRunWithWaitpoint({
               runId: RunId.fromFriendlyId(parentRunId),
-              waitpoints: associatedWaitpoint!.id,
+              waitpoints: waitpoint.id,
internal-packages/run-engine/src/engine/tests/getSnapshotsSince.test.ts (1)

214-218: Relaxed assertion removes verification of waitpoint completion in snapshots.

The original assertion that latest.completedWaitpoints.length > 0 was removed and replaced with comments. While timing sensitivity is real, this means the test no longer verifies the core "waitpoints only on latest" behavior — it only checks that earlier snapshots have empty waitpoints.

Consider adding a retry/poll loop (e.g., poll getSnapshotsSince until the latest snapshot has completedWaitpoints.length > 0 or a timeout), so the test still validates the positive case without being flaky.
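
For instance, a small polling helper along these lines (the helper is ours, not the repo's test utilities, and the usage assumes the test's engine/run identifiers):

// Poll an async condition until it yields a value or the timeout elapses.
async function waitUntil<T>(
  fn: () => Promise<T | undefined>,
  { timeoutMs = 5_000, intervalMs = 100 } = {}
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const value = await fn();
    if (value !== undefined) return value;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("waitUntil: timed out");
}

// Usage sketch:
// const latest = await waitUntil(async () => {
//   const snapshots = await engine.getSnapshotsSince({ runId, snapshotId });
//   const last = snapshots.at(-1);
//   return last && last.completedWaitpoints.length > 0 ? last : undefined;
// });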

internal-packages/run-engine/src/engine/systems/waitpointSystem.ts (1)

17-17: Derive/validate env + project from the run to prevent mismatches.
Nice use of isFinalRunStatus to avoid creating waitpoints for terminal runs. Since you already load the run, consider using/validating its projectId and runtimeEnvironmentId instead of trusting caller inputs, to avoid accidental cross‑environment linkage.

🔧 Suggested adjustment
-    const run = await this.$.prisma.taskRun.findFirst({
-      where: { id: runId },
-      include: { associatedWaitpoint: true },
-    });
+    const run = await this.$.prisma.taskRun.findFirst({
+      where: { id: runId },
+      select: {
+        status: true,
+        projectId: true,
+        runtimeEnvironmentId: true,
+        associatedWaitpoint: true,
+      },
+    });

...

-      const runAfterLock = await this.$.prisma.taskRun.findFirst({
-        where: { id: runId },
-        include: { associatedWaitpoint: true },
-      });
+      const runAfterLock = await this.$.prisma.taskRun.findFirst({
+        where: { id: runId },
+        select: {
+          status: true,
+          projectId: true,
+          runtimeEnvironmentId: true,
+          associatedWaitpoint: true,
+        },
+      });

+      if (
+        runAfterLock.projectId !== projectId ||
+        runAfterLock.runtimeEnvironmentId !== environmentId
+      ) {
+        throw new Error(
+          "getOrCreateRunWaitpoint: run does not match provided project/environment"
+        );
+      }

-      const waitpointData = this.buildRunAssociatedWaitpoint({ projectId, environmentId });
+      const waitpointData = this.buildRunAssociatedWaitpoint({
+        projectId: runAfterLock.projectId,
+        environmentId: runAfterLock.runtimeEnvironmentId,
+      });

Also applies to: 819-875

apps/webapp/app/runEngine/types.ts (1)

59-77: Prefer a type alias for QueueManager (repo guideline).
Since this interface is being updated, consider converting it to a type alias to match the TS style guide.

♻️ Suggested refactor
-export interface QueueManager {
+export type QueueManager = {
   resolveQueueProperties(
     request: TriggerTaskRequest,
     lockedBackgroundWorker?: LockedBackgroundWorker
   ): Promise<QueueProperties>;
   getQueueName(request: TriggerTaskRequest): Promise<string>;
   validateQueueLimits(
     env: AuthenticatedEnvironment,
     queueName: string,
     itemsToAdd?: number
   ): Promise<QueueValidationResult>;
   resolveQueueNamesForBatchItems(
     env: AuthenticatedEnvironment,
     items: Array<{ task: string; options?: { queue?: { name?: string } } }>
   ): Promise<Map<string, number>>;
   validateMultipleQueueLimits(
     env: AuthenticatedEnvironment,
     queueCounts: Map<string, number>
   ): Promise<{ ok: true } | { ok: false; queueName: string; maximumSize: number; queueSize: number }>;
   getWorkerQueue(
     env: AuthenticatedEnvironment,
     regionOverride?: string
   ): Promise<string | undefined>;
-}
+};
As per coding guidelines "Use types over interfaces for TypeScript".
apps/webapp/app/v3/runEngineHandlers.server.ts (1)

651-670: TriggerFailedTaskService is instantiated on every batch item invocation.

The service is only used on final-attempt failures, but it's constructed eagerly at the top of every callback invocation. Consider moving the instantiation inside the isFinalAttempt guards to avoid unnecessary allocations on the happy path.
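
A lazy-init sketch, assuming TriggerFailedTaskService takes no constructor arguments (the real signature may differ):

let triggerFailedTaskService: TriggerFailedTaskService | undefined;

// Construct the service only on the failure path that actually uses it.
function getTriggerFailedTaskService(): TriggerFailedTaskService {
  triggerFailedTaskService ??= new TriggerFailedTaskService();
  return triggerFailedTaskService;
}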

apps/webapp/app/runEngine/concerns/queues.server.ts (1)

421-436: getCachedQueueSize fails open on cache errors — verify this is intentional.

Line 435 returns result.val ?? 0, meaning if the cache fetch fails (returning undefined), the queue size is treated as 0. This allows requests through even when queue size can't be determined, which is a permissive-failure strategy. This is likely the right trade-off (availability over strict enforcement), but worth documenting.
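
If the behaviour stays, a comment or a small wrapper would make the intent explicit; in this hedged sketch, fetchQueueSize stands in for the cached lookup:

// Fail open: an unknown queue size is treated as 0 so transient cache/Redis
// errors don't reject triggers (availability over strict enforcement).
async function getQueueSizeOrZero(
  fetchQueueSize: () => Promise<number | undefined>
): Promise<number> {
  const size = await fetchQueueSize().catch(() => undefined);
  return size ?? 0;
}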

internal-packages/run-engine/src/run-queue/index.ts (3)

108-110: Use type instead of interface for TtlSystemCallback.

Per coding guidelines, prefer types over interfaces in TypeScript files.

Proposed fix
-export interface TtlSystemCallback {
-  (runs: Array<{ queueKey: string; runId: string; orgId: string }>): Promise<void>;
-}
+export type TtlSystemCallback = (
+  runs: Array<{ queueKey: string; runId: string; orgId: string }>
+) => Promise<void>;

As per coding guidelines: **/*.{ts,tsx}: Use types over interfaces for TypeScript.


669-677: TTL member uses | as delimiter — consider a safer encoding.

The TTL member format ${queueKey}|${runId}|${orgId} relies on | never appearing in any of these values. While current ID formats make this safe today, a future change to key formats could silently break the parsing in the expireTtlRuns Lua script (Lines 2582–2588). A null byte (\0) or a fixed-width encoding would be more robust.
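
A hedged sketch of a null-byte encoding; the helper names are invented, and the Lua script would need the matching delimiter change:

const TTL_MEMBER_DELIMITER = "\0";

function encodeTtlMember(queueKey: string, runId: string, orgId: string): string {
  return [queueKey, runId, orgId].join(TTL_MEMBER_DELIMITER);
}

function decodeTtlMember(member: string): {
  queueKey: string;
  runId: string;
  orgId: string;
} {
  const [queueKey = "", runId = "", orgId = ""] = member.split(TTL_MEMBER_DELIMITER);
  return { queueKey, runId, orgId };
}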


1713-1721: Minor: TTL shard is computed even when TTL system is disabled.

The TTL shard derivation (Lines 1714–1718) runs unconditionally, though the result is only meaningful when this.options.ttlSystem is set. This is functionally harmless since the empty-string ttlQueueKey disables TTL logic in the Lua script, but wrapping it in the conditional would be slightly cleaner.

Comment on lines +367 to +370
valueClassName={
getQueueUsageColorClass(environment.queued, environment.queueSizeLimit) ??
(env.paused ? "text-warning tabular-nums" : "tabular-nums")
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for where getQueueUsageColorClass is defined or referenced
rg "getQueueUsageColorClass" --type=ts --type=tsx -n

Repository: triggerdotdev/trigger.dev

Length of output: 96


🏁 Script executed:

#!/bin/bash
# Search for getQueueUsageColorClass without file type filter
rg "getQueueUsageColorClass" -n

Repository: triggerdotdev/trigger.dev

Length of output: 268


🏁 Script executed:

#!/bin/bash
# Search for similar function names like getUsageColorClass
rg "getUsageColorClass" -n

Repository: triggerdotdev/trigger.dev

Length of output: 574


🏁 Script executed:

#!/bin/bash
# Check the imports in the queues route file
head -50 "apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx"

Repository: triggerdotdev/trigger.dev

Length of output: 193


🏁 Script executed:

#!/bin/bash
# Find the queues route file
find . -name "route.tsx" -path "*queues*" | head -5

Repository: triggerdotdev/trigger.dev

Length of output: 175


🏁 Script executed:

#!/bin/bash
# Read the first 100 lines of the queues route to check imports
head -100 "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx" | cat -n

Repository: triggerdotdev/trigger.dev

Length of output: 5079


🏁 Script executed:

#!/bin/bash
# Search around line 368 to see the context and also check the end of the file for function definitions
sed -n '360,380p' "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx" | cat -n

Repository: triggerdotdev/trigger.dev

Length of output: 1025


🏁 Script executed:

#!/bin/bash
# Check the end of the file to see if the function is defined locally
tail -100 "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx" | cat -n

Repository: triggerdotdev/trigger.dev

Length of output: 3402


🏁 Script executed:

#!/bin/bash
# Get the total line count of the file
wc -l "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx"

Repository: triggerdotdev/trigger.dev

Length of output: 180


🏁 Script executed:

#!/bin/bash
# Search for getQueueUsageColorClass definition in the entire file
grep -n "function getQueueUsageColorClass\|const getQueueUsageColorClass" "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx"

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

#!/bin/bash
# Search for getQueueUsageColorClass in all TypeScript/JavaScript files
rg "getQueueUsageColorClass" -A 2 -B 2

Repository: triggerdotdev/trigger.dev

Length of output: 823


🏁 Script executed:

#!/bin/bash
# Check the limits route to see the getUsageColorClass function for reference
sed -n '799,850p' "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.limits/route.tsx" | cat -n

Repository: triggerdotdev/trigger.dev

Length of output: 1699


🏁 Script executed:

#!/bin/bash
# Look for any color class functions in the queues route file
grep -n "function.*Color\|const.*Color" "./apps/webapp/app/routes/_app.orgs.\$organizationSlug.projects.\$projectParam.env.\$envParam.queues/route.tsx"

Repository: triggerdotdev/trigger.dev

Length of output: 51


getQueueUsageColorClass function is missing and will cause a runtime error.

Line 368 calls getQueueUsageColorClass(environment.queued, environment.queueSizeLimit) but this function is neither defined in this file nor imported from anywhere. The codebase has no definition of this function.

Define this function locally in this file or import it from the appropriate module. Reference the similar getUsageColorClass function in the limits route for implementation guidance.
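
A hedged sketch of what that local definition could look like; the thresholds and class names are assumptions modelled on the limits route, not the repo's actual values:

function getQueueUsageColorClass(
  queued: number | undefined,
  queueSizeLimit: number | null | undefined
): string | null {
  if (queued === undefined || !queueSizeLimit) return null;
  const ratio = queued / queueSizeLimit;
  if (ratio >= 1) return "text-error tabular-nums"; // at or over the limit
  if (ratio >= 0.9) return "text-warning tabular-nums"; // approaching the limit
  return null; // defer to the env.paused fallback classes
}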

🤖 Prompt for AI Agents
In
`@apps/webapp/app/routes/_app.orgs`.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
around lines 367 - 370, The call to getQueueUsageColorClass(environment.queued,
environment.queueSizeLimit) is unresolved — add a definition or import for
getQueueUsageColorClass that mirrors the logic of the existing
getUsageColorClass used in the limits route: implement a function
getQueueUsageColorClass(queued: number | undefined, queueSizeLimit: number |
undefined) that returns a CSS class string (or null) based on usage percentage
thresholds and handles undefined values; place it in the same file (or import
it) so valueClassName can call it, and ensure it returns null when no class
should be applied so the existing env.paused fallback ("text-warning
tabular-nums" or "tabular-nums") continues to work.

Comment on lines +515 to +518
const isAtConcurrencyLimit = queue.running >= limit;
const isAtQueueLimit =
environment.queueSizeLimit !== null &&
queue.queued >= environment.queueSizeLimit;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n "queueSizeLimit" --type=ts --type=tsx -C 3

Repository: triggerdotdev/trigger.dev

Length of output: 96


🏁 Script executed:

rg -n "queueSizeLimit" -C 3

Repository: triggerdotdev/trigger.dev

Length of output: 6310


🏁 Script executed:

rg -n "function getQueueSizeLimit|const getQueueSizeLimit|export.*getQueueSizeLimit" -C 5

Repository: triggerdotdev/trigger.dev

Length of output: 2121


🏁 Script executed:

cat -n apps/webapp/app/v3/utils/queueLimits.server.ts

Repository: triggerdotdev/trigger.dev

Length of output: 2183


🏁 Script executed:

rg -n "guardQueueSizeLimitsForEnv" -C 10

Repository: triggerdotdev/trigger.dev

Length of output: 10225


🏁 Script executed:

cat -n apps/webapp/app/v3/queueSizeLimits.server.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1577


🏁 Script executed:

sed -n '340,375p' apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx

Repository: triggerdotdev/trigger.dev

Length of output: 177


🏁 Script executed:

find apps/webapp/app/routes -name "route.tsx" | grep queues

Repository: triggerdotdev/trigger.dev

Length of output: 173


🏁 Script executed:

sed -n '340,375p' 'apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx'

Repository: triggerdotdev/trigger.dev

Length of output: 1559


🏁 Script executed:

sed -n '510,525p' 'apps/webapp/app/routes/_app.orgs.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx'

Repository: triggerdotdev/trigger.dev

Length of output: 901


The per-queue limit badge uses incorrect semantics.

isAtQueueLimit at lines 517–518 compares queue.queued (a single queue's count) against environment.queueSizeLimit, which is an environment-wide limit. The queueSizeLimit is calculated once per environment and represents the total allowed items across all queues, not per-queue capacity.

This comparison will almost never trigger correctly. For example, with a 1000-item environment limit and 10 queues, a single queue would need 1000 items to trigger the badge, even though the environment might already be at capacity with items spread across queues.

Use environment.queued (total across all queues) instead of queue.queued for consistency with the BigNumber component at line 368, or implement a per-queue limit if that's the intended behavior.
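
For the environment-level interpretation, the fix is a small change (sketch, assuming the loader already exposes environment.queued):

const isAtQueueLimit =
  environment.queueSizeLimit !== null &&
  environment.queued >= environment.queueSizeLimit;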

🤖 Prompt for AI Agents
In
`@apps/webapp/app/routes/_app.orgs`.$organizationSlug.projects.$projectParam.env.$envParam.queues/route.tsx
around lines 515 - 518, The per-queue limit check (isAtQueueLimit) is using
queue.queued against environment.queueSizeLimit which is an environment-wide
cap; change the comparison to use environment.queued (total queued across all
queues) versus environment.queueSizeLimit so the badge reflects environment
capacity (or implement a true per-queue limit if intended). Update the
isAtQueueLimit logic to reference environment.queued and
environment.queueSizeLimit (consistent with the BigNumber usage) and remove or
revise any reliance on queue.queued for this environment-wide badge.

Comment on lines 722 to 754
span.setAttribute("batch.result.error", "TriggerTaskService returned undefined");
span.end();

// Only create a pre-failed run on the final attempt; otherwise let the retry mechanism handle it
if (isFinalAttempt) {
const failedRunId = await triggerFailedTaskService.call({
taskId: item.task,
environment,
payload: item.payload,
payloadType: item.payloadType as string,
errorMessage: "TriggerTaskService returned undefined",
parentRunId: meta.parentRunId,
resumeParentOnCompletion: meta.resumeParentOnCompletion,
batch: { id: batchId, index: itemIndex },
options: item.options as Record<string, unknown>,
traceContext: meta.traceContext as Record<string, unknown> | undefined,
spanParentAsLink: meta.spanParentAsLink,
errorCode: TaskRunErrorCodes.BATCH_ITEM_COULD_NOT_TRIGGER,
});

span.end();

if (failedRunId) {
return { success: true as const, runId: failedRunId };
}
} else {
span.end();
}

return {
success: false as const,
error: "TriggerTaskService returned undefined",
errorCode: "TRIGGER_FAILED",
};

⚠️ Potential issue | 🟡 Minor

Pre-failed run creation on final attempt — duplicated error-handling blocks.

The two paths (undefined result at Line 725 and caught exception at Line 762) share nearly identical triggerFailedTaskService.call(...) invocations with the same parameters. This is acceptable given the different control-flow contexts but worth noting for future extraction if the parameter list grows.

One thing to verify: if triggerFailedTaskService.call itself throws, the exception in the first block (Line 726) would bubble into the outer catch (Line 756), which would then attempt to call triggerFailedTaskService.call again on the same item. This could create a duplicate pre-failed run if the first call partially succeeded (e.g., DB row created but an error thrown after).

Proposed fix: wrap the inner triggerFailedTaskService.call in try/catch
             if (isFinalAttempt) {
+              try {
               const failedRunId = await triggerFailedTaskService.call({
                 // ...params...
               });

               span.end();

               if (failedRunId) {
                 return { success: true as const, runId: failedRunId };
               }
+              } catch (failedTaskError) {
+                logger.error("Failed to create pre-failed run", {
+                  batchId,
+                  itemIndex,
+                  error: failedTaskError instanceof Error ? failedTaskError.message : String(failedTaskError),
+                });
+                span.end();
+              }
             } else {
               span.end();
             }

Also applies to: 756-791

🤖 Prompt for AI Agents
In `@apps/webapp/app/v3/runEngineHandlers.server.ts` around lines 722 - 754, The
duplicated pre-failed run creation via triggerFailedTaskService.call when
isFinalAttempt can itself throw and cause a second attempt in the outer catch —
wrap each inner invocation of triggerFailedTaskService.call (the calls inside
the isFinalAttempt branch and the analogous call in the outer catch) in a small
try/catch so any errors from that call are caught and handled locally (log the
error to the existing span/process logger and avoid rethrowing), ensure
span.end() still runs, and return the same successful pre-failed-run response
when a failedRunId is returned; this prevents a partial success from causing a
duplicate pre-failed run while preserving the current control flow for both the
isFinalAttempt path and the outer catch path.

Comment on lines +116 to +130
async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
const meta = await this.getMeta(batchId);
if (!meta) {
this.logger.error("Cannot update runCount: batch metadata not found", { batchId });
return;
}

const updatedMeta: BatchMeta = {
...meta,
runCount: newRunCount,
};

await this.storeMeta(batchId, updatedMeta);
this.logger.debug("Updated batch runCount", { batchId, oldRunCount: meta.runCount, newRunCount });
}

⚠️ Potential issue | 🟠 Major

Non-atomic read-modify-write on batch metadata.

updateRunCount performs getMeta → modify → storeMeta without any locking or atomic operation. If called concurrently for the same batch (e.g., two queue-limit rejections racing), one update could be lost, leaving runCount incorrect and potentially causing the batch to never complete.

Consider using a Lua script (similar to recordSuccessIdempotent) to atomically update the runCount field, or document that this method must only be called from a single-writer context.

🔒 Sketch of atomic Lua approach
+ // Register in `#registerCommands`():
+ this.redis.defineCommand("updateRunCountAtomic", {
+   numberOfKeys: 1,
+   lua: `
+     local metaKey = KEYS[1]
+     local newRunCount = tonumber(ARGV[1])
+     local metaJson = redis.call('GET', metaKey)
+     if not metaJson then return nil end
+     local meta = cjson.decode(metaJson)
+     meta.runCount = newRunCount
+     redis.call('SET', metaKey, cjson.encode(meta))
+     return 'OK'
+   `,
+ });

  async updateRunCount(batchId: string, newRunCount: number): Promise<void> {
-   const meta = await this.getMeta(batchId);
-   if (!meta) {
-     this.logger.error("Cannot update runCount: batch metadata not found", { batchId });
-     return;
-   }
-   const updatedMeta: BatchMeta = { ...meta, runCount: newRunCount };
-   await this.storeMeta(batchId, updatedMeta);
+   const result = await this.redis.updateRunCountAtomic(
+     this.metaKey(batchId),
+     newRunCount.toString()
+   );
+   if (!result) {
+     this.logger.error("Cannot update runCount: batch metadata not found", { batchId });
+     return;
+   }
    this.logger.debug("Updated batch runCount", { batchId, newRunCount });
  }
🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/batch-queue/completionTracker.ts` around
lines 116 - 130, The updateRunCount function performs a non-atomic
read-modify-write (getMeta → modify → storeMeta) which can lose updates under
concurrent callers; change it to perform the increment atomically in the backing
store (e.g., use a Redis Lua script or the same atomic pattern used by
recordSuccessIdempotent) so runCount is updated with a single atomic operation,
or enforce/document single-writer semantics; specifically replace the current
getMeta/storeMeta sequence in updateRunCount with an atomic increment/update
that targets the BatchMeta.runCount field (or wrap the logic in a Lua script
that reads, increments/sets and returns the new value) and update logging to
reflect the atomic result.

Comment on lines +2245 to +2273
async #ttlExpiredCallback(
runs: Array<{ queueKey: string; runId: string; orgId: string }>
): Promise<void> {
if (runs.length === 0) return;

try {
const runIds = runs.map((r) => r.runId);
const result = await this.ttlSystem.expireRunsBatch(runIds);

if (result.expired.length > 0) {
this.logger.debug("TTL system expired runs", {
expiredCount: result.expired.length,
expiredRunIds: result.expired,
});
}

if (result.skipped.length > 0) {
this.logger.debug("TTL system skipped runs", {
skippedCount: result.skipped.length,
skipped: result.skipped,
});
}
} catch (error) {
this.logger.error("Failed to expire runs via TTL system", {
runIds: runs.map((r) => r.runId),
error,
});
}
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's understand the overall TTL system structure and where `#ttlExpiredCallback` is used
rg -n "ttlExpiredCallback|ttlSystem|expireRunsBatch" --type=ts -g '!node_modules' internal-packages/run-engine/src/ -B 2 -A 2

Repository: triggerdotdev/trigger.dev

Length of output: 19275


🏁 Script executed:

# Find the TtlSystem class implementation
fd -e ts -g '!node_modules' internal-packages/run-engine/src/ | xargs grep -l "class TtlSystem\|export.*TtlSystem"

Repository: triggerdotdev/trigger.dev

Length of output: 53


🏁 Script executed:

# Check the entire TtlSystem class and its polling mechanism
rg -n "class TtlSystem" --type=ts -g '!node_modules' -A 100 internal-packages/run-engine/src/

Repository: triggerdotdev/trigger.dev

Length of output: 9426


🏁 Script executed:

# Find the Lua script for TTL queue operations
fd -e lua --type f internal-packages/run-engine/src/ | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 506


🏁 Script executed:

# Search for the Lua script or script loading for TTL
rg -n "ttl.*lua\|ZRANGE.*ttl\|expired.*lua" --type=ts -g '!node_modules' internal-packages/run-engine/src/ -B 2 -A 2

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check the full error handling in the TTL consumer, particularly what happens after the catch block
sed -n '1290,1320p' internal-packages/run-engine/src/run-queue/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 865


🏁 Script executed:

# Search for Lua script files in the run-engine
find internal-packages/run-engine/src -type f -name "*.lua"

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Search for script content embedded in TypeScript files related to TTL
rg -n "ZRANGE|ZREVRANGE|ZADD|ZREM" --type=ts internal-packages/run-engine/src/run-queue/ -B 5 -A 5 | head -100

Repository: triggerdotdev/trigger.dev

Length of output: 8490


🏁 Script executed:

# Look at the `#getPendingTtlRuns` method more carefully
sed -n '1320,1365p' internal-packages/run-engine/src/run-queue/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1664


🏁 Script executed:

# Let me see the complete Lua script for expireTtlRuns to understand the full flow
sed -n '2560,2630p' internal-packages/run-engine/src/run-queue/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 2932


🏁 Script executed:

# Check if there's any mechanism to restore runs if callback fails
rg -n "expireTtlRuns\|#expireTtlRuns" --type=ts -A 30 internal-packages/run-engine/src/run-queue/index.ts | head -80

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Get the rest of the Lua script after line 2630
sed -n '2620,2660p' internal-packages/run-engine/src/run-queue/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1563


🏁 Script executed:

# Let me also look at the test to see what happens when callback fails
rg -n "callback.*fail\|error.*ttl\|catch.*ttl" --type=ts internal-packages/run-engine/src/engine/tests/ttl.test.ts -B 2 -A 5

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Let me understand the callback flow better - what does expireRunsBatch do in case of failure?
sed -n '2245,2273p' internal-packages/run-engine/src/engine/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 911


🏁 Script executed:

# Check if there's any other mechanism that reschedules or retries failed TTL expirations
rg -n "ttlSystem.*retry\|ttlSystem.*reschedule\|ttlSystem.*recover" --type=ts -g '!node_modules' internal-packages/run-engine/src/

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Let me check if the runs that fail to expire in the database would somehow get picked up again
rg -n "PENDING.*status\|status.*PENDING" --type=ts internal-packages/run-engine/src/engine/systems/ttlSystem.ts -B 2 -A 2

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check if failed runs could be re-added to the TTL queue by any mechanism
rg -n "scheduleExpireRun\|ZADD.*ttl" --type=ts -g '!node_modules' internal-packages/run-engine/src/engine/ -B 2 -A 5 | head -60

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Look at the full flow of how runs get added to TTL queue initially
sed -n '735,750p' internal-packages/run-engine/src/engine/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 509


🏁 Script executed:

# Check if there's any background job or cron that might re-discover stuck PENDING runs
rg -n "PENDING\|pendingRuns\|stuckRuns\|orphaned" --type=ts -g '!node_modules' internal-packages/run-engine/src/engine/ | grep -i "pending.*run\|discover\|retry\|recovery" | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Verify the exact removal point in the Lua script - is it before callback or built into callback success?
sed -n '2560,2650p' internal-packages/run-engine/src/run-queue/index.ts | grep -A 50 "ZREM.*ttl"

Repository: triggerdotdev/trigger.dev

Length of output: 2062


🏁 Script executed:

# Critical finding: The Lua script removes runs BEFORE callback. Let me verify if there's any cleanup mechanism
rg -n "cleanup\|recover\|orphan\|stuck" --type=ts -g '!node_modules' internal-packages/run-engine/src/engine/ | head -30

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check if there's a mechanism to re-add runs that fail database updates
rg -n "scheduleExpireRun" --type=ts internal-packages/run-engine/src/engine/systems/ttlSystem.ts -A 20

Repository: triggerdotdev/trigger.dev

Length of output: 893


🏁 Script executed:

# The critical insight: runs are ZREM'd from Redis in the Lua script BEFORE callback
# Let me verify if there's any recovery: can orphaned PENDING runs be re-discovered?
rg -n "PENDING.*timeout\|TTL.*expired\|orphan" --type=ts -g '!node_modules' internal-packages/run-engine/src/engine/

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check what happens when expireRunsBatch partially succeeds/fails
sed -n '155,220p' internal-packages/run-engine/src/engine/systems/ttlSystem.ts

Repository: triggerdotdev/trigger.dev

Length of output: 2029


🏁 Script executed:

# Continue looking at expireRunsBatch to see error handling
sed -n '220,280p' internal-packages/run-engine/src/engine/systems/ttlSystem.ts

Repository: triggerdotdev/trigger.dev

Length of output: 2115


🏁 Script executed:

# Check if there's any compensation logic or rollback if expireRunsBatch fails
rg -n "catch\|throw\|error\|rollback" internal-packages/run-engine/src/engine/systems/ttlSystem.ts -B 1 -A 3

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Verify the critical point: does the Lua script execution and callback have transactional guarantees?
# Check if failed callback can somehow trigger re-addition to TTL queue
sed -n '1280,1320p' internal-packages/run-engine/src/run-queue/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1140


🏁 Script executed:

# Check if there's any watchdog or scheduled job that finds orphaned PENDING runs
fd . internal-packages/run-engine/src/engine/systems -name "*.ts" -type f | xargs grep -l "PENDING\|orphan\|stuck\|recovery"

Repository: triggerdotdev/trigger.dev

Length of output: 301


🏁 Script executed:

# Check if there's any background cleanup for orphaned PENDING runs
rg -n "expireRun\|PENDING" --type=ts internal-packages/run-engine/src/engine/index.ts | grep -i "cleanup\|background\|cron\|schedule\|task\|job"

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Look at the test to see if it covers failure scenarios
sed -n '1,50p' internal-packages/run-engine/src/engine/tests/ttl.test.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1517


🏁 Script executed:

# Let me verify this critical finding: check the exact atomicity guarantee
# The Lua script removes runs BEFORE callback - is there any compensation?
rg -n "expireRunsBatch\|#ttlExpiredCallback" --type=ts -A 5 internal-packages/run-engine/src/engine/index.ts | head -60

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check if there's any mechanism in place to recover from partial failures
# Look at the full test for TTL expiration and error scenarios
sed -n '1,100p' internal-packages/run-engine/src/engine/tests/ttl.test.ts | tail -50

Repository: triggerdotdev/trigger.dev

Length of output: 1612


Runs removed from Redis before database update completes—orphaned if callback fails.

The Lua script atomically removes runs from the TTL sorted set, queue, and concurrency tracking before the callback executes. If expireRunsBatch fails after the Lua removal completes, the runs remain PENDING in the database but are no longer in Redis, with no recovery mechanism. They cannot be rediscovered by the polling system, as there is no scheduled job to find orphaned PENDING runs past their TTL.

Consider either: (1) moving the database update into the Lua script to guarantee atomicity, (2) adding compensation logic to re-add failed runs to the TTL queue, or (3) implementing a background cleanup job to discover and expire orphaned PENDING runs.
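
A sketch of option (2): a compensation helper that re-adds the failed batch to the TTL sorted set with a near-now score so the next poll retries it. restoreTtlRuns is a hypothetical name; the member format copies the existing queueKey|runId|orgId encoding:

import Redis from "ioredis";

type ExpiredRun = { queueKey: string; runId: string; orgId: string };

async function restoreTtlRuns(
  redis: Redis,
  ttlQueueKey: string,
  runs: ExpiredRun[]
): Promise<void> {
  if (runs.length === 0) return;
  const score = Date.now();
  const args: (string | number)[] = [];
  for (const run of runs) {
    // Same member encoding the Lua script uses: queueKey|runId|orgId
    args.push(score, `${run.queueKey}|${run.runId}|${run.orgId}`);
  }
  await redis.zadd(ttlQueueKey, ...args);
}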

🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/engine/index.ts` around lines 2245 - 2273,
The `#ttlExpiredCallback` currently calls this.ttlSystem.expireRunsBatch after a
Lua script has already removed runs from Redis, so if expireRunsBatch throws and
the DB update fails, the runs become orphaned; to fix, add compensation in
`#ttlExpiredCallback` to detect errors from ttlSystem.expireRunsBatch and, on
failure, re-add the affected runIds back into the TTL structures (sorted
set/queue/concurrency tracking) via a new ttlSystem method (e.g.,
ttlSystem.requeueRuns or ttlSystem.restoreRuns) or invoke existing requeue
logic, and/or schedule the runIds for a background cleanup retry; ensure the
added code references `#ttlExpiredCallback`, ttlSystem.expireRunsBatch, and the
Lua removal behavior so the requeue/cleanup is called only when expireRunsBatch
fails.

Comment on lines +2275 to +2300
/**
* Applies `defaultMaxTtl` to a run's TTL:
* - No max configured → pass through as-is.
* - No TTL on the run → use the max as the default.
* - Both exist → clamp to the smaller value.
*/
#resolveMaxTtl(ttl: string | undefined): string | undefined {
const maxTtl = this.options.defaultMaxTtl;

if (!maxTtl) {
return ttl;
}

if (!ttl) {
return maxTtl;
}

const ttlMs = parseNaturalLanguageDurationInMs(ttl);
const maxTtlMs = parseNaturalLanguageDurationInMs(maxTtl);

if (ttlMs === undefined || maxTtlMs === undefined) {
return ttl;
}

return ttlMs <= maxTtlMs ? ttl : maxTtl;
}

⚠️ Potential issue | 🟡 Minor

#resolveMaxTtl — edge case when TTL string is unparseable.

If the user-provided ttl doesn't parse (e.g., "invalid"), parseNaturalLanguageDurationInMs returns undefined, and Line 2296 returns the original unparseable ttl string. This means an invalid TTL silently bypasses the max clamp. This is presumably handled downstream (the run queue / expiry system would ignore or reject an invalid TTL), but it's worth noting.
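
A sketch of the tightened logic, shown as a free function; parseNaturalLanguageDurationInMs is the helper already used above:

function resolveMaxTtl(
  ttl: string | undefined,
  defaultMaxTtl: string | undefined
): string | undefined {
  if (!defaultMaxTtl) return ttl;
  if (!ttl) return defaultMaxTtl;

  const ttlMs = parseNaturalLanguageDurationInMs(ttl);
  const maxTtlMs = parseNaturalLanguageDurationInMs(defaultMaxTtl);

  if (maxTtlMs === undefined) return ttl; // max itself unparseable: keep old behaviour
  if (ttlMs === undefined) return defaultMaxTtl; // invalid ttl: enforce the cap
  return ttlMs <= maxTtlMs ? ttl : defaultMaxTtl;
}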

🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/engine/index.ts` around lines 2275 - 2300,
The method `#resolveMaxTtl` currently returns the original unparseable ttl string
when parseNaturalLanguageDurationInMs(ttl) yields undefined, which allows
invalid TTLs to bypass the max clamp; change the logic so that if ttl fails to
parse but the configured options.defaultMaxTtl parses successfully, you return
the parsed max (i.e., maxTtl) to enforce the cap; keep the existing fallback
that if defaultMaxTtl itself is unparseable
(parseNaturalLanguageDurationInMs(maxTtl) === undefined) you return ttl as
before.

Comment on lines +248 to +291
// Update all runs in a single batch
const now = new Date();
const runIdsToExpire = runsToExpire.map((r) => r.id);

await this.$.prisma.taskRun.updateMany({
where: { id: { in: runIdsToExpire } },
data: {
status: "EXPIRED" as TaskRunStatus,
completedAt: now,
expiredAt: now,
// Note: updateMany doesn't support nested writes, so we handle error and snapshots separately
},
});

// Create snapshots and set errors for each run (these require individual updates)
await Promise.all(
runsToExpire.map(async (run) => {
const latestSnapshot = run.executionSnapshots[0]!;
const error: TaskRunError = {
type: "STRING_ERROR",
raw: `Run expired because the TTL (${run.ttl}) was reached`,
};

// Update the error field (updateMany can't do JSON fields properly)
await this.$.prisma.taskRun.update({
where: { id: run.id },
data: { error },
});

// Create the snapshot
await this.$.prisma.taskRunExecutionSnapshot.create({
data: {
runId: run.id,
engine: "V2",
executionStatus: "FINISHED",
description: "Run was expired because the TTL was reached",
runStatus: "EXPIRED",
environmentId: latestSnapshot.environmentId,
environmentType: latestSnapshot.environmentType,
projectId: latestSnapshot.projectId,
organizationId: latestSnapshot.organizationId,
},
});


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Search for TTL batch usage, expiration handling, and any repair/backfill flows
rg -n "expireRunsBatch|runExpired|expiredAt|EXPIRED|ttlSystem" internal-packages/run-engine/src -g '*.ts' -A 2

Repository: triggerdotdev/trigger.dev

Length of output: 20150


🏁 Script executed:

# Look for any error handling or retry logic around the updateMany operation
rg -n "\.updateMany|\.update|\.create" internal-packages/run-engine/src/engine/systems/ttlSystem.ts -B 2 -A 5

Repository: triggerdotdev/trigger.dev

Length of output: 1526


🏁 Script executed:

# Search for TaskRun model definition and any constraints
fd -name "*.prisma" | xargs rg -n "model TaskRun|model TaskRunExecutionSnapshot" -A 15

Repository: triggerdotdev/trigger.dev

Length of output: 12802


🏁 Script executed:

# Check if there are any monitor/repair jobs related to EXPIRED runs
rg -rn "EXPIRED.*repair|repair.*EXPIRED|orphan|partial.*expir|backfill.*ttl" internal-packages/run-engine/src -g '*.ts'

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Understand the full context of the ttlSystem.ts file to see how expireRunsBatch is called
head -n 350 internal-packages/run-engine/src/engine/systems/ttlSystem.ts | tail -n 150

Repository: triggerdotdev/trigger.dev

Length of output: 4969


🏁 Script executed:

# Search for orphan handling, repair jobs, or post-expiration validation
rg -i "orphan|repair|backfill|expired.*without|validate.*expired" internal-packages/run-engine/src -g '*.ts'

Repository: triggerdotdev/trigger.dev

Length of output: 3287


🏁 Script executed:

# Check if there's error handling or retry logic around expireRunsBatch calls
rg -B 5 -A 10 "expireRunsBatch" internal-packages/run-engine/src -g '*.ts'

Repository: triggerdotdev/trigger.dev

Length of output: 14334


🏁 Script executed:

# Check the error handling and logging around the Promise.all
rg -B 5 -A 15 "Promise.all.*runsToExpire.map" internal-packages/run-engine/src/engine/systems/ttlSystem.ts

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check if there are any queries that assume EXPIRED runs have snapshots
rg -i "EXPIRED.*snapshot|snapshot.*EXPIRED" internal-packages/run-engine/src -g '*.ts' -A 2

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Look for any error recovery or monitoring for failed TTL operations
rg -B 2 -A 10 "catch.*callbackError\|catch.*expireRunsBatch" internal-packages/run-engine/src -g '*.ts'

Repository: triggerdotdev/trigger.dev

Length of output: 51


🏁 Script executed:

# Check error handling in ttlSystem more carefully
sed -n '265,330p' internal-packages/run-engine/src/engine/systems/ttlSystem.ts

Repository: triggerdotdev/trigger.dev

Length of output: 2361


🏁 Script executed:

# Check how the TTL callback is invoked and what happens on error
sed -n '2240,2280p' internal-packages/run-engine/src/engine/index.ts

Repository: triggerdotdev/trigger.dev

Length of output: 1315


🏁 Script executed:

# Look for any validation or assumptions about EXPIRED runs having snapshots
grep -r "executionSnapshots" internal-packages/run-engine/src -g --include="*.ts" | grep -i expired | head -20

Repository: triggerdotdev/trigger.dev

Length of output: 179


Risk of partial-expire state if per-run updates fail after batch status update.

The updateMany operation marks all runs as EXPIRED atomically, but the subsequent Promise.all that creates snapshots, sets errors, and emits events can fail partially. If any per-run operation fails (snapshot creation, waitpoint completion, event emission), the Promise.all rejects—but the runs are already marked EXPIRED in the database. Since these runs no longer match the PENDING filter, they won't be retried on the next TTL poll, leaving them stuck in an incomplete state without snapshots or errors.

While the batch size is bounded at 100 runs (default), this doesn't prevent the issue. The error handling in #ttlExpiredCallback only logs failures without recovery. Consider using per-run transactional updates via prisma.$transaction() to atomically update status, error, and snapshot together for each run, ensuring consistency.
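
A hedged sketch of the per-run transaction; runsToExpire, now, and the Prisma client come from the surrounding code, and the field names follow the snippet above:

for (const run of runsToExpire) {
  const latestSnapshot = run.executionSnapshots[0]!;
  const error = {
    type: "STRING_ERROR" as const, // matches the TaskRunError shape used above
    raw: `Run expired because the TTL (${run.ttl}) was reached`,
  };

  // Status, error, and snapshot land together or not at all; a failed
  // transaction leaves the run PENDING for the next TTL poll.
  await prisma.$transaction([
    prisma.taskRun.update({
      where: { id: run.id },
      data: { status: "EXPIRED", completedAt: now, expiredAt: now, error },
    }),
    prisma.taskRunExecutionSnapshot.create({
      data: {
        runId: run.id,
        engine: "V2",
        executionStatus: "FINISHED",
        description: "Run was expired because the TTL was reached",
        runStatus: "EXPIRED",
        environmentId: latestSnapshot.environmentId,
        environmentType: latestSnapshot.environmentType,
        projectId: latestSnapshot.projectId,
        organizationId: latestSnapshot.organizationId,
      },
    }),
  ]);
}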

🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/engine/systems/ttlSystem.ts` around lines
248 - 291, The batch update (taskRun.updateMany) followed by per-run updates
(taskRun.update and taskRunExecutionSnapshot.create) can leave runs marked
EXPIRED without snapshots/errors if any per-run op fails; change
ttlExpiredCallback to perform each run's status change, error write, and
snapshot creation inside a single per-run prisma.$transaction() (instead of the
global updateMany + Promise.all) so that setting status to EXPIRED,
completedAt/expiredAt, writing the error, and creating taskRunExecutionSnapshot
happen atomically for each run (operate on runsToExpire entries and replace uses
of taskRun.updateMany, taskRun.update, and taskRunExecutionSnapshot.create
accordingly), and ensure transaction failures are surfaced/logged so those runs
remain PENDING for the next TTL poll.

Comment on lines 2710 to +2748
if messagePayload then
-- Update concurrency
redis.call('ZREM', queueKey, messageId)
redis.call('ZREM', envQueueKey, messageId)
redis.call('SADD', queueCurrentConcurrencyKey, messageId)
redis.call('SADD', envCurrentConcurrencyKey, messageId)

-- Add to results
table.insert(results, messageId)
table.insert(results, messageScore)
table.insert(results, messagePayload)

dequeuedCount = dequeuedCount + 1
-- Parse the message to check for TTL expiration
local messageData = cjson.decode(messagePayload)
local ttlExpiresAt = messageData and messageData.ttlExpiresAt

-- Check if TTL has expired
if ttlExpiresAt and ttlExpiresAt <= currentTime then
-- TTL expired - remove from queues but don't add to results
redis.call('ZREM', queueKey, messageId)
redis.call('ZREM', envQueueKey, messageId)
redis.call('DEL', messageKey)

-- Remove from TTL set if provided
if ttlQueueKey and ttlQueueKey ~= '' then
-- Construct TTL member: queueKey|runId|orgId
local ttlMember = queueName .. '|' .. messageId .. '|' .. (messageData.orgId or '')
redis.call('ZREM', ttlQueueKey, ttlMember)
end
-- Don't add to results - this run is expired
else
-- Not expired - process normally
redis.call('ZREM', queueKey, messageId)
redis.call('ZREM', envQueueKey, messageId)
redis.call('SADD', queueCurrentConcurrencyKey, messageId)
redis.call('SADD', envCurrentConcurrencyKey, messageId)

-- Remove from TTL set if provided (run is being executed, not expired)
if ttlQueueKey and ttlQueueKey ~= '' and ttlExpiresAt then
local ttlMember = queueName .. '|' .. messageId .. '|' .. (messageData.orgId or '')
redis.call('ZREM', ttlQueueKey, ttlMember)
end

-- Add to results
table.insert(results, messageId)
table.insert(results, messageScore)
table.insert(results, messagePayload)

dequeuedCount = dequeuedCount + 1
end

⚠️ Potential issue | 🔴 Critical

TTL-expired runs silently discarded during dequeue are never reported to the application layer.

When the dequeue Lua script encounters a message with an expired TTL (Lines 2716–2727), it removes the message from the queue, env queue, message key, and the TTL sorted set — but never returns it as an expired run. This means:

  1. The TTL consumer will never pick it up (already removed from the TTL sorted set).
  2. The ttlSystem.callback is never invoked for this run.
  3. The run remains in QUEUED status in the database permanently — orphaned.

The dequeue path needs to either:

  • Return expired run info separately so the caller can invoke the expiration callback, or
  • Leave expired messages in the TTL sorted set (only remove from the normal queue/env queue) so the TTL consumer can handle them, or
  • Not evict TTL-expired messages during dequeue at all, and let the TTL consumer be the sole owner of expiration.

Option 3 is simplest and avoids the race between dequeue and TTL consumer entirely. The TTL consumer already handles cleanup atomically.

🤖 Prompt for AI Agents
In `@internal-packages/run-engine/src/run-queue/index.ts` around lines 2710 -
2748, The dequeue Lua branch that handles TTL expiration (checks using
ttlExpiresAt <= currentTime) currently removes the message from queueKey,
envQueueKey, messageKey and ttlQueueKey but never reports the expired run to the
application; to fix this, stop evicting TTL-expired messages in the dequeue
path: modify the ttlExpiresAt <= currentTime branch so it does not remove the
message or its TTL membership (i.e., do not call ZREM on queueKey/envQueueKey or
DEL messageKey or ZREM ttlQueueKey) and simply skip adding the run to results
(leaving the message for the TTL consumer to handle), ensuring that results,
dequeuedCount and SADD calls are untouched for the non-expired branch; locate
and change the code around the symbols ttlExpiresAt, currentTime, queueKey,
envQueueKey, ttlQueueKey, messageKey, results and dequeuedCount.
