docs: add skip_when for conditional column generation (#479) by nabinchha · Pull Request #480 · NVIDIA-NeMo/DataDesigner

nabinchha · 2026-03-30T20:26:42Z

📋 Summary

Implementation plan for skip_when — a new field on SingleColumnConfig that enables conditional column generation. When the Jinja2 expression evaluates truthy for a row, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns.

Closes #479

🔄 Changes

✨ Added

plans/479/skip-when-conditional-generation.md — detailed implementation plan covering config, DAG, sync engine, async engine, validation, and test strategy

Plan Highlights

Config: skip_when field + Jinja2 validator + skip_when_columns property on SingleColumnConfig
DAG: skip_when referenced columns become edges ensuring correct execution order
Engine: Skip checks inserted before cell dispatch in both sync and async paths
Auto-propagation: Downstream columns whose required_columns include a skipped cell auto-skip
Open questions: Sentinel value (None vs dedicated type) and auto-removal of skipped rows

🤖 Generated with AI

Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

greptile-apps · 2026-03-30T20:31:19Z

Greptile Summary

This PR adds a detailed implementation plan (plans/479/skip-when-conditional-generation.md) for skip_when — a new conditional column generation feature that lets users gate expensive downstream generation (LLM calls, etc.) on a Jinja2 expression. The plan covers config changes (SkipConfig + SingleColumnConfig fields), DAG/execution-graph edge wiring, a new skip_evaluator.py utility, sync and async engine modifications, serialization boundary handling, and validation.

All five concerns raised in prior review rounds have been addressed:

Native Python truthiness via NativeSandboxedEnvironment (no string-falsy set needed)
Consistent {{ }} delimiter convention documented in the Design Decisions table
FULL_COLUMN generators now pre-filter skipped rows before calling generate() / agenerate()
SkipConfig.columns is a @cached_property (parses once)
Unknown skip.when columns in execution_graph.py now raise ValueError (matching required_columns behavior)

Two new issues remain:

get_required_columns() missing from ExecutionGraph: _should_skip_cell() calls self._graph.get_required_columns(column_name), a method that does not exist on the current ExecutionGraph and is absent from the Files Modified section. Substituting get_upstream_columns() would be semantically incorrect — skip.when column references are also stored as DAG edges, so get_upstream_columns() includes them, which would cause spurious propagation-skips for columns with a skip.when gate but no data dependency on the gated column.
DropSkippedRowsProcessorConfig / stripping boundary ambiguity: The plan describes the processor reading record.get("__skipped__") from raw dicts, while Section 8 strips __skipped__ at every DataFrame conversion point. If processors run on DataFrames the key is gone; the plan needs to clarify which side of the boundary the processor sits on.

Confidence Score: 4/5

Safe to merge after addressing the missing get_required_columns() accessor in the execution-graph plan — the omission would cause a concrete propagation correctness bug in the implementation phase.
All five prior review concerns are cleanly resolved. Two new issues remain: one P1 (missing get_required_columns() method that has a concrete wrong-behavior scenario) and one P2 (processor/stripping boundary ambiguity). The P1 finding prevents a 4→5 score — the implementation as written would either fail to compile or use get_upstream_columns() incorrectly, causing spurious propagation skips.
plans/479/skip-when-conditional-generation.md — specifically Section 2b (ExecutionGraph modifications) and the Files Modified table need the get_required_columns() accessor added; Section "Open Questions / Resolved Questions" needs to clarify the processor stripping-boundary ordering.

Important Files Changed

Filename	Overview
plans/479/skip-when-conditional-generation.md	Implementation plan for `skip_when` conditional column generation; addresses all prior review comments but contains a concrete omission in the `ExecutionGraph` modifications and an architectural ambiguity around `DropSkippedRowsProcessorConfig`.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["evaluate cell (col, row)"] --> B{dropped?}
    B -->|yes| C["return None"]
    B -->|no| D["read row_data from buffer"]
    D --> E{"propagate_skip=True?\ncheck graph.get_required_columns()"}
    E -->|yes| E2{"any required_col\nin __skipped__?"}
    E2 -->|yes| H
    E2 -->|no| F
    E -->|no| F{"SkipConfig\npresent?"}
    F -->|no| G["generate(row_data)\nLLM / expression / sampler"]
    F -->|yes| I["evaluate_skip_when()\nNativeSandboxedEnvironment\nbool(native result)"]
    I --> J{truthy?}
    J -->|no| G
    J -->|yes| H["SKIP:\n__skipped__.add(col)\nrecord[col] = skip.value\nside-effect cols = None"]
    H --> K["return None\n(counts as success)"]
    G --> L["write result to buffer"]

    subgraph serialization["Serialization boundary (Step 8)"]
        S1["get_dataframe() / write()\nstrip __skipped__ key"] --> S2["Parquet output\n(no __skipped__ column)"]
    end

    L --> serialization
    K --> serialization

Prompt To Fix All With AI

This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 327-348

Comment:
**`get_required_columns()` does not exist on `ExecutionGraph` and is semantically distinct from `get_upstream_columns()`**

`_should_skip_cell()` calls `self._graph.get_required_columns(column_name)` (line 338), but the current `ExecutionGraph` has no such method (verified against the live source). The Files Modified section for `execution_graph.py` lists `get_skip_config()` and `should_propagate_skip()` as new accessors but omits `get_required_columns()`.

This omission creates a correctness problem, not just a missing method. Using `get_upstream_columns()` as a drop-in replacement would be **semantically wrong** for propagation: the plan adds `skip.when` column references as DAG edges (Section 2a/2b), so `get_upstream_columns()` includes both `required_columns` edges *and* `skip.when` edges. For propagation purposes only `required_columns` relationships should matter.

**Concrete failure case:** Column A has `skip.when="{{ gating_col == 0 }}"` and no `required_columns`. Column B has `required_columns=["A"]` and no `SkipConfig`. If `gating_col` itself gets skipped for a row:
- `gating_col` is in the row's `__skipped__` set.
- `get_upstream_columns("A")` returns `{gating_col}` (the `skip.when` edge).
- Propagation check sees `gating_col ∈ __skipped__` → Column A is propagation-skipped.
- But the intent is that A should *evaluate* its expression (which would get `gating_col=None`, making `None == 0` → `False`, so A should NOT be skipped).

The fix is to add `_required_columns: dict[str, list[str]]` to `ExecutionGraph.__init__`, populate it in the second pass (parallel to `_skip_configs`), and expose it via `get_required_columns(column: str) -> list[str]`. This should be added to both the class description in Section 2b and the Files Modified table.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 566-572

Comment:
**`DropSkippedRowsProcessorConfig` cannot read `__skipped__` if processors consume DataFrames**

The Resolved Questions section says the processor "can read `record.get("__skipped__")` directly" from the record dict, but Section 8 strips `__skipped__` at *every* point where records are converted to DataFrames. The plan does not specify at which stage this processor runs relative to the stripping boundary.

If processors (consistent with the existing codebase pattern) receive DataFrames, the `__skipped__` key will already have been stripped and the processor would have no way to identify which rows were skipped — defeating the described mechanism.

The plan needs to clarify one of:
1. The processor runs on raw record dicts before DataFrame conversion (and the pipeline is designed to support this), **or**
2. `__skipped__` is kept as a proper DataFrame column through the processor stage and stripped only in the final parquet `write()`, **or**
3. The processor uses an alternative mechanism (e.g., checking whether all skippable columns are `None`/`NaN`) rather than reading `__skipped__` from the raw record.

Without this clarification the `DropSkippedRowsProcessorConfig` is underdefined and the implementation could silently produce a processor that can never identify skipped rows.

How can I resolve this? If you propose a fix, please make it concise.

_{Reviews (2): Last reviewed commit: "plan: major revision — SkipConfig model,..." | Re-trigger Greptile}

plans/479/skip-when-conditional-generation.md

andreatgretel · 2026-03-31T16:30:30Z

plans/479/skip-when-conditional-generation.md

+#### 4d. Modify `_fan_out_with_async()` (line 621)
+Convert list comprehension to explicit loop with same skip check.
+
+#### 4e. Modify `_run_full_column_generator()` (line 503)


for FULL_COLUMN generators, the plan runs generate() first then overwrites skipped rows with None. That means batch LLM calls still happen for rows that should be skipped - which kind of defeats the main value prop of skip_when. Maybe pre-filter the DataFrame to exclude skipped rows before calling generate(), then merge back with Nones? That's closer to how _records_to_drop filtering works.

Addressed in 351bc29. Both sync (Step 4d _run_full_column_generator) and async (Step 5c _run_batch) now pre-filter the DataFrame to exclude skipped rows before calling the generator, then merge results back at original indices. Generator never sees skipped rows.

plans/479/skip-when-conditional-generation.md

andreatgretel · 2026-03-31T16:30:35Z

plans/479/skip-when-conditional-generation.md

+def evaluate_skip_when(expression: str, record: dict) -> bool:
+    """Render expression against record; return True if truthy."""
+
+def should_skip_by_propagation(


I think the propagation rule is too aggressive as written. required_columns here just means "variable name appears in the template", not "the generator can't handle nulls". Templates like {{ 'unknown' if country is none else country|upper }} or {{ 0 if score is none else score + 1 }} handle missing data fine but would get auto-skipped. Same issue shows up in step 6 where the expression generator blanket-returns None when any input is None - that overrides the user's own fallback logic. Maybe make propagation opt-out per column, or only auto-propagate for LLM generators where null inputs are genuinely unsafe?

Addressed in 351bc29. Propagation is now opt-out per column via propagate_skip=False on SingleColumnConfig (independent of SkipConfig). The design decisions table documents this: null-tolerant columns like expressions with fallback logic can opt out. The expression generator no longer has a defensive None guard — skip handling is done at the engine dispatch layer via pre-filtering.

plans/479/skip-when-conditional-generation.md

andreatgretel · 2026-03-31T16:30:39Z

plans/479/skip-when-conditional-generation.md

+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Where does `skip_when` live? | `SingleColumnConfig` base class | Cross-cutting; applies to all column types |


this says skip_when applies to all column types via SingleColumnConfig, but sampler/seed columns get collapsed into shared multi-column configs before execution - the sampler generator emits all sibling columns together, so there's no place to skip only one of them. Maybe scope v1 to generated single-column configs and add validation that rejects skip_when on sampler/seed types?

Addressed in 351bc29. skip is now scoped to generated single-column configs only. A @model_validator on SingleColumnConfig rejects skip on sampler/seed column types, and a belt-and-suspenders SKIP_ON_SAMPLER_SEED validation rule catches it at the engine level.

andreatgretel · 2026-03-31T16:30:41Z

plans/479/skip-when-conditional-generation.md:172-195 (async engine sections)

worth noting the overlap with #362 (keep failed-parse fields as null). Both features need a "cell became unavailable, propagate through the DAG" mechanism - skip_when is intentional/config-driven, #362 is runtime failure. Might be worth designing one shared propagation path (especially on the async side where CompletionTracker already has dependency-aware state) instead of building two separate cascades. Not blocking for the plan, just something to keep in mind before the implementation PRs land.

johnnygreco

looks great! only question to sort out is the skip value

…e propagation - Introduce SkipConfig(when, value) as nested model on SingleColumnConfig - Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async) - Simplify architecture diagrams for readability - Update all references, design decisions, verification plan Made-with: Cursor

greptile-apps · 2026-04-01T21:10:18Z

plans/479/skip-when-conditional-generation.md

+Add a private method on `DatasetBuilder` that centralizes the skip decision for one cell. Propagation and expression gating are evaluated independently:
+
+```python
+def _should_skip_cell(
+    self, column_name: str, record: dict
+) -> bool:
+    skipped_cols: set[str] = record.get("__skipped__", set())
+
+    # 1. Propagation — independent of SkipConfig
+    propagate = self._graph.should_propagate_skip(column_name)
+    if propagate:
+        required = self._graph.get_required_columns(column_name)
+        if should_skip_by_propagation(required, skipped_cols, propagate):
+            return True
+
+    # 2. Expression gate — only if SkipConfig exists
+    skip_config = self._graph.get_skip_config(column_name)
+    if skip_config is not None:
+        return evaluate_skip_when(skip_config.when, record)
+
+    return False
+```


get_required_columns() does not exist on ExecutionGraph and is semantically distinct from get_upstream_columns()

_should_skip_cell() calls self._graph.get_required_columns(column_name) (line 338), but the current ExecutionGraph has no such method (verified against the live source). The Files Modified section for execution_graph.py lists get_skip_config() and should_propagate_skip() as new accessors but omits get_required_columns().

This omission creates a correctness problem, not just a missing method. Using get_upstream_columns() as a drop-in replacement would be semantically wrong for propagation: the plan adds skip.when column references as DAG edges (Section 2a/2b), so get_upstream_columns() includes both required_columns edges and skip.when edges. For propagation purposes only required_columns relationships should matter.

Concrete failure case: Column A has skip.when="{{ gating_col == 0 }}" and no required_columns. Column B has required_columns=["A"] and no SkipConfig. If gating_col itself gets skipped for a row:

gating_col is in the row's __skipped__ set.

get_upstream_columns("A") returns {gating_col} (the skip.when edge).

Propagation check sees gating_col ∈ __skipped__ → Column A is propagation-skipped.

But the intent is that A should evaluate its expression (which would get gating_col=None, making None == 0 → False, so A should NOT be skipped).

The fix is to add _required_columns: dict[str, list[str]] to ExecutionGraph.__init__, populate it in the second pass (parallel to _skip_configs), and expose it via get_required_columns(column: str) -> list[str]. This should be added to both the class description in Section 2b and the Files Modified table.

Prompt To Fix With AI

This is a comment left during a code review. Path: plans/479/skip-when-conditional-generation.md Line: 327-348 Comment: **`get_required_columns()` does not exist on `ExecutionGraph` and is semantically distinct from `get_upstream_columns()`** `_should_skip_cell()` calls `self._graph.get_required_columns(column_name)` (line 338), but the current `ExecutionGraph` has no such method (verified against the live source). The Files Modified section for `execution_graph.py` lists `get_skip_config()` and `should_propagate_skip()` as new accessors but omits `get_required_columns()`. This omission creates a correctness problem, not just a missing method. Using `get_upstream_columns()` as a drop-in replacement would be **semantically wrong** for propagation: the plan adds `skip.when` column references as DAG edges (Section 2a/2b), so `get_upstream_columns()` includes both `required_columns` edges *and* `skip.when` edges. For propagation purposes only `required_columns` relationships should matter. **Concrete failure case:** Column A has `skip.when="{{ gating_col == 0 }}"` and no `required_columns`. Column B has `required_columns=["A"]` and no `SkipConfig`. If `gating_col` itself gets skipped for a row: - `gating_col` is in the row's `__skipped__` set. - `get_upstream_columns("A")` returns `{gating_col}` (the `skip.when` edge). - Propagation check sees `gating_col ∈ __skipped__` → Column A is propagation-skipped. - But the intent is that A should *evaluate* its expression (which would get `gating_col=None`, making `None == 0` → `False`, so A should NOT be skipped). The fix is to add `_required_columns: dict[str, list[str]]` to `ExecutionGraph.__init__`, populate it in the second pass (parallel to `_skip_configs`), and expose it via `get_required_columns(column: str) -> list[str]`. This should be added to both the class description in Section 2b and the Files Modified table. How can I resolve this? If you propose a fix, please make it concise.

nabinchha and others added 4 commits March 30, 2026 14:19

plan: remove HopChain example from skip_when plan

3f415c7

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

plan: replace HopChain example with generic product review example

6ee4c70

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

plan: add open questions on skip sentinel value and row filtering

14ab39b

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nabinchha requested a review from a team as a code owner March 30, 2026 20:26

nabinchha changed the title ~~plan: add skip_when for conditional column generation (#479)~~ docs: add skip_when for conditional column generation (#479) Mar 30, 2026

greptile-apps bot reviewed Mar 30, 2026

View reviewed changes

nabinchha added the plan Agent-assisted development plan label Mar 31, 2026

johnnygreco reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Outdated Show resolved Hide resolved

johnnygreco reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Outdated Show resolved Hide resolved

johnnygreco reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Show resolved Hide resolved

johnnygreco reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Outdated Show resolved Hide resolved

andreatgretel reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Outdated Show resolved Hide resolved

andreatgretel reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Outdated Show resolved Hide resolved

andreatgretel reviewed Mar 31, 2026

View reviewed changes

plans/479/skip-when-conditional-generation.md Show resolved Hide resolved

andreatgretel reviewed Mar 31, 2026

View reviewed changes

johnnygreco requested changes Mar 31, 2026

View reviewed changes

greptile-apps bot reviewed Apr 1, 2026

View reviewed changes

Conversation

nabinchha commented Mar 30, 2026

📋 Summary

🔄 Changes

✨ Added

Plan Highlights

Uh oh!

greptile-apps bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

andreatgretel Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

nabinchha Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andreatgretel Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

nabinchha Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andreatgretel Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

nabinchha Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

andreatgretel commented Mar 31, 2026

Uh oh!

johnnygreco left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Apr 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps bot commented Mar 30, 2026 •

edited

Loading