feat: make async in-flight task cap configurable by eric-tramel · Pull Request #699 · NVIDIA-NeMo/DataDesigner

eric-tramel · 2026-05-21T19:04:49Z

📋 Summary

Adds a public RunConfig.max_in_flight_tasks setting for the async scheduler's task-lease capacity and raises the default from 256 to 1024. The configured value is now the singular scheduler work cap: model-stage task admission and adaptive row-group queue backpressure follow it, while model API request concurrency remains controlled separately by max_parallel_requests and request admission.

🔗 Related Issue

N/A

🔄 Changes

Add RunConfig.max_in_flight_tasks with a default of 1024 and validation that it is at least 1.
Wire the async dataset builder to pass the configured in-flight task cap into AsyncTaskScheduler.
Set scheduler model-stage task admission from max_in_flight_tasks directly, without the old default floor or aggregate request headroom.
Collapse the duplicate scheduler default constant so the engine uses DEFAULT_IN_FLIGHT_TASK_CAPACITY for this concept.
Make adaptive row-group queue backpressure derive from max_in_flight_tasks, rather than an inflated model-stage admission value.
Update capacity reporting to mark submission capacity as sourced from run_config.
Add regression coverage for config validation, DataDesigner run config persistence, builder propagation, lowered model-stage admission, and row-group queue guard behavior.

🧪 Testing

make test passes (not run; focused tests below)
Unit tests added/updated
E2E tests added/updated (N/A — no E2E surface)

Focused checks run after rebasing onto latest origin/main:

uv run --group dev pytest packages/data-designer-config/tests/config/test_run_config.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py packages/data-designer-engine/tests/engine/dataset_builders/test_dataset_builder.py::test_run_config_default_non_inference_max_parallel_workers packages/data-designer/tests/interface/test_data_designer.py::test_run_config_setting_persists (109 passed)
uv run ruff format --check ... on touched files
uv run ruff check --output-format=full ... on touched files
git diff --check
Package-source search confirms no legacy submitted-task knob references or duplicate scheduler cap constants

✅ Checklist

Follows commit message conventions
Commits are signed off (DCO)
Architecture docs updated (N/A — runtime config surface only)

github-actions · 2026-05-21T19:07:14Z

Review: PR #699 — feat: make async task cap configurable

Summary

Promotes the previously hard-coded async scheduler submission cap to a public RunConfig.max_submitted_tasks field (default 1024, validated >= 1) and raises the default from 256 to 1024. The async dataset builder now reads the value from run_config and forwards it to AsyncTaskScheduler; max_model_task_admission is computed as max(DEFAULT_TASK_POOL_SIZE, max_submitted_tasks, 2 * aggregate_max_parallel_requests) so model-heavy runs are not capped by the old 256 floor. Capacity reporting now sources submission_capacity from "run_config" rather than "dataset_builder". Total: +48/-8 lines across 7 files.

Findings

Correctness — looks right

DEFAULT_TASK_POOL_SIZE in async_scheduler.py:80 is now sourced from DEFAULT_SUBMISSION_CAPACITY in task_admission.py:39, giving the two constants a single source of truth. Both default to 1024, matching the new RunConfig default — invariant holds.
The max_model_task_admission formula at dataset_builder.py:1028-1032 preserves the existing floor semantics: model task admission is at least DEFAULT_TASK_POOL_SIZE even if the user sets max_submitted_tasks below 1024. This matches the PR description's intent.
Field validation (ge=1) is consistent with neighboring fields like non_inference_max_parallel_workers.
submission_capacity source string changed to "run_config" accurately reflects the new wiring; downstream consumers of AsyncCapacityPlan are unlikely to depend on the literal string, but worth verifying if any dashboards parse it.

Behavior change worth calling out

The default jumps from 256 → 1024 (4×). Users who never set this knob will see significantly more concurrent task leases. The PR notes this as intended (so model-heavy runs don't get throttled by the old floor), but consider:

Memory pressure for users with many columns / large row groups: each in-flight task carries scheduler bookkeeping plus any per-task buffers.
Existing performance baselines or capacity-plan snapshots in tests/recipes may shift; the PR updates the tests that assert submission_capacity directly, but external consumers (logs, telemetry dashboards, runbooks) could be affected.

Not blocking, but a release note / changelog entry beyond the PR body would help downstream users notice.

Naming / minor consistency

DEFAULT_TASK_POOL_SIZE and DEFAULT_SUBMISSION_CAPACITY are now aliases. Keeping both is fine for layering reasons (the engine module re-exports the config-layer constant), but a one-line comment at async_scheduler.py:80 explaining "alias retained for legacy import surface" would prevent a future refactor from quietly dropping it.
The docstring in run_config.py:136-137 reads "Maximum number of scheduler task leases, and therefore async scheduler worker tasks…" — the "and therefore" is a bit awkward. Consider: "Maximum number of in-flight scheduler task leases (i.e., the cap on concurrent async scheduler worker tasks)."

Test coverage — solid

test_run_config.py: default value, custom value, validation rejection — all three cases covered.
test_async_builder_integration.py: confirms the value flows through _setup_async_pipeline to AsyncTaskScheduler.__init__ (1536 → both max_submitted_tasks and max_model_task_admission, since 1536 > 2 * aggregate in the test).
test_data_designer.py::test_run_config_setting_persists: round-trip through DataDesigner.set_run_config.
test_dataset_builder.py: default assertion alongside other run-config defaults.

One gap: there's no test that exercises the floor behavior — i.e., a user setting max_submitted_tasks below DEFAULT_TASK_POOL_SIZE (say 64) and confirming max_model_task_admission stays at 1024. The integration test uses 1536, which exercises the upper path. A small unit test at the builder level locking in the floor would prevent a future refactor from silently dropping it.

Style / project conventions

Absolute imports preserved.
from __future__ import annotations present in changed source files.
Field uses Field(default=…, ge=…, description=…) matching the surrounding style.
No reverse imports introduced; the new constant lives in the engine package and is consumed there + read out of run_config (config → engine direction is preserved).

Security / performance

No security-sensitive surface touched.
Performance: the 4× default raise is the main concern (covered above). Validation is bounded only on the lower side (ge=1); there is no upper bound. A user could set max_submitted_tasks=1_000_000 and OOM. This matches existing fields like buffer_size (also unbounded above), so it's consistent with the project's posture, but a soft sanity ceiling or warning log on unusually large values would be a small defensive improvement.

Verdict

Approve-with-nits. The wiring is clean, tests cover the happy path and validation, and the layering respects the config → engine direction. Suggested non-blocking follow-ups:

Add a unit test that pins the max_model_task_admission floor when max_submitted_tasks is set below DEFAULT_TASK_POOL_SIZE.
Tighten the max_submitted_tasks docstring and add a one-line comment on the DEFAULT_TASK_POOL_SIZE/DEFAULT_SUBMISSION_CAPACITY alias.
Consider a changelog/release-note line flagging the default bump from 256 → 1024 for users with capacity-sensitive deployments.

greptile-apps · 2026-05-21T19:10:14Z

Greptile Summary

This PR exposes the async scheduler's in-flight task cap as a public RunConfig.max_in_flight_tasks setting (default raised from 256 to 1024) and threads it through the builder into AsyncTaskScheduler, removing the previous hard-coded pool-size constant and the aggregate-model-request calculation that formerly determined max_model_task_admission.

RunConfig.max_in_flight_tasks (default 1024, ge=1) replaces the hard-coded DEFAULT_TASK_POOL_SIZE = 256, with the value now sourced from run_config in capacity reporting.
max_model_task_admission is set equal to max_in_flight_tasks instead of max(256, 2 × aggregate_parallel_requests), decoupling model-stage admission sizing from the model registry; the previously-included \"local\" resource limit entry is also removed (it was never consumed by any task's resource request).
The queue_guard expression simplifies from max(max_submitted_tasks × 4, max_model_task_admission × 2) to max_in_flight_tasks × 4, with a new regression test validating this behaviour.

Confidence Score: 5/5

All changed paths are additive or cleanly rename an existing knob; no logic paths were silently altered.

The new max_in_flight_tasks field is validated at construction time, the old local resource limit entry was provably dead (no task ever requested it), and the old aggregate-based max_model_task_admission formula produced values that could never exceed submission_capacity in practice. The queue_guard simplification keeps the effective threshold within a few percent of the old default. Tests cover config validation, builder wiring, and scheduler admission behaviour end-to-end.

No files require special attention.

Important Files Changed

Filename	Overview
packages/data-designer-config/src/data_designer/config/run_config.py	Adds max_in_flight_tasks field (default 1024, ge=1) to RunConfig; straightforward pydantic field addition with validation.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/scheduling/task_admission.py	Introduces DEFAULT_IN_FLIGHT_TASK_CAPACITY = 1024 and updates the default submission_capacity accordingly; no logic changes to admission control.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py	Renames max_submitted_tasks to max_in_flight_tasks, removes the never-used local resource limit entry, and simplifies queue_guard to max_in_flight_tasks * 4; changes are coherent and behaviorally safe.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py	Wires run_config.max_in_flight_tasks into the scheduler; removes the aggregate-based max_model_task_admission calculation in favour of setting it equal to max_in_flight_tasks.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py	Renames all max_submitted_tasks kwargs to max_in_flight_tasks throughout tests; adds new regression test for the simplified queue_guard formula.
packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py	Asserts that the model registry aggregate query is no longer called and that max_in_flight_tasks and max_model_task_admission are both set to the run-config value.
packages/data-designer/tests/interface/test_data_designer.py	Extends the persistence test to verify max_in_flight_tasks round-trips correctly through set_run_config.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[RunConfig\nmax_in_flight_tasks=1024] -->|read by| B[DatasetBuilder\n_prepare_async_run]
    B -->|max_in_flight_tasks| C[AsyncTaskScheduler.__init__]
    B -->|max_model_task_admission = max_in_flight_tasks| C
    C -->|submission_capacity = max_in_flight_tasks| D[TaskAdmissionConfig]
    C -->|resource_limits llm_wait = max_model_task_admission| D
    C -->|model_group_limit_cap = max_model_task_admission| E[TaskSchedulingResolver]
    C -->|queue_guard = max_in_flight_tasks x 4| F[_adaptive_row_group_block_reason]
    D -->|controls| G[TaskAdmissionController\nleases up to submission_capacity tasks]

_{Reviews (4): Last reviewed commit: "feat: make async in-flight task cap conf..." | Re-trigger Greptile}

nabinchha

Thanks for putting this together, @eric-tramel — clean, well-scoped change.

Summary

Adds RunConfig.max_in_flight_tasks (default 1024, ge=1) so users can tune the async scheduler's task-lease capacity, raises the default from 256 to 1024, and wires the value end-to-end through DatasetBuilder into AsyncTaskScheduler (including the model-stage admission floor and capacity reporting). The implementation matches the stated intent in the PR description and keeps the engine/config layering intact — config grows a new field, engine consumes it.

Findings

Warnings — Worth addressing

packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py:1064-1068 — max_model_task_admission floor can surprise users tuning down

What: When a user explicitly lowers max_in_flight_tasks (e.g. 64), the floor still pins max_model_task_admission to max(DEFAULT_TASK_POOL_SIZE=1024, 64, 2 * aggregate) = 1024. So the llm_wait resource keeps a 1024-slot ceiling even though the user asked for a much smaller cap.
Why: This is consistent with the pre-PR behavior (the floor was 256 before), so it isn't a regression — but combined with the new public knob, it could read as "I set the cap to 64, why is the LLM stage still letting through 1024?" The local and submission resources do drop with the user's value; only llm_wait keeps the floor. The new docstring on max_in_flight_tasks doesn't hint at that asymmetry.
Suggestion: Either (a) drop the DEFAULT_TASK_POOL_SIZE term from the max(...) (use max(max_in_flight_tasks, 2 * aggregate)) so the user's setting always governs, or (b) add a one-line note to the max_in_flight_tasks docstring that the model-stage admission floor stays at the engine default. (a) is cleaner if there's no specific reason to keep the floor when the user has opted in to a smaller cap.

Suggestions — Take it or leave it

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:80 — Two names for the same constant

What: DEFAULT_TASK_POOL_SIZE is now defined as DEFAULT_IN_FLIGHT_TASK_CAPACITY. Both names refer to the same 1024 value but live in two modules.
Why: It's preserved as an alias (likely for downstream callers and tests that import it), but two names for one concept costs a bit of reading time when navigating the scheduler.
Suggestion: As a follow-up, consider collapsing to a single name (DEFAULT_IN_FLIGHT_TASK_CAPACITY reads well in context) once you're comfortable touching the in-tree import sites. Not blocking — and worth doing separately so this PR stays focused.

packages/data-designer-engine/tests/engine/dataset_builders/test_async_builder_integration.py:194 — Test name vs. what it now asserts

What: test_prepare_async_run_enables_request_pressure_advisory now also asserts max_in_flight_tasks / max_model_task_admission propagation.
Why: Searching for "what tests cover the new knob" won't surface this test by name. Both assertions are valid checks of the same code path, but the test name only describes the first.
Suggestion: Either split into a focused test_prepare_async_run_propagates_max_in_flight_tasks or rename to something like test_prepare_async_run_propagates_scheduler_kwargs. Tiny ergonomics nit.

packages/data-designer-config/src/data_designer/config/run_config.py:175-182 — Field(description=...) duplicates the Attributes: docstring

What: The class docstring already has a full max_in_flight_tasks: entry, and Field(description=...) repeats a shorter version of the same text.
Why: Other simple numeric fields in RunConfig (e.g. buffer_size, non_inference_max_parallel_workers) don't carry a description=. The redundancy is harmless but slightly inconsistent with the surrounding pattern.
Suggestion: Either drop the description= (keep just Field(default=1024, ge=1)) or, if you'd rather keep schema-level descriptions, leave it as-is. Pure style call.

What Looks Good

The new knob's docstring is the standout — it explicitly disambiguates max_in_flight_tasks from max_parallel_requests, which is exactly the confusion this change could otherwise introduce.
Coverage is layered well: config-level validation (test_run_config_*), interface persistence (test_run_config_setting_persists), and builder propagation (the integration test). Each layer asserts the right thing without overlap.
The capacity-source flip from dataset_builder → run_config on submission_capacity is a small but accurate provenance change, and it's still type-safe under CapacityValueSource.
Renaming max_submitted_tasks → max_in_flight_tasks is a real readability win — "submission" was just one phase, and the new name captures what the cap actually governs.

Verdict

Ship it (with nits) — the one Warning is non-blocking (a behavior nuance worth either tweaking or documenting), and the rest are pure style. Nothing critical here.

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>

eric-tramel requested a review from a team as a code owner May 21, 2026 19:04

eric-tramel temporarily deployed to agentic-ci May 21, 2026 19:05 — with GitHub Actions Inactive

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch from 1b75986 to 5981281 Compare May 21, 2026 19:17

eric-tramel changed the title ~~feat: make async task cap configurable~~ feat: make async in-flight task cap configurable May 21, 2026

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch 2 times, most recently from 1bcb049 to 5ae93df Compare May 21, 2026 19:21

nabinchha previously approved these changes May 21, 2026

View reviewed changes

eric-tramel self-assigned this May 22, 2026

feat: make async in-flight task cap configurable

eccce53

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>

eric-tramel dismissed nabinchha’s stale review via eccce53 May 22, 2026 01:46

eric-tramel force-pushed the codex/configurable-scheduler-task-cap branch from 9c7d4d2 to eccce53 Compare May 22, 2026 01:46

eric-tramel requested a review from nabinchha May 22, 2026 02:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: make async in-flight task cap configurable#699

feat: make async in-flight task cap configurable#699
eric-tramel wants to merge 1 commit into
mainfrom
codex/configurable-scheduler-task-cap

eric-tramel commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

greptile-apps Bot commented May 21, 2026 •

edited

Loading

Confidence Score: 5/5

Flowchart

Uh oh!

nabinchha left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

eric-tramel commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📋 Summary

🔗 Related Issue

🔄 Changes

🧪 Testing

✅ Checklist

Uh oh!

github-actions Bot commented May 21, 2026

Review: PR #699 — feat: make async task cap configurable

Summary

Findings

Correctness — looks right

Behavior change worth calling out

Naming / minor consistency

Test coverage — solid

Style / project conventions

Security / performance

Verdict

Uh oh!

greptile-apps Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

nabinchha left a comment

Choose a reason for hiding this comment

Summary

Findings

Warnings — Worth addressing

Suggestions — Take it or leave it

What Looks Good

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

eric-tramel commented May 21, 2026 •

edited

Loading

greptile-apps Bot commented May 21, 2026 •

edited

Loading