docs: add skip_when for conditional column generation (#479)#480
docs: add skip_when for conditional column generation (#479)#480
Conversation
Adds implementation plan for a `skip_when` field on `SingleColumnConfig` that enables conditional column generation. When the Jinja2 expression evaluates truthy, the cell is set to None and the generator is skipped. Skips auto-propagate through the DAG to downstream columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR adds a detailed implementation plan ( All five concerns raised in prior review rounds have been addressed:
Two new issues remain:
|
| Filename | Overview |
|---|---|
| plans/479/skip-when-conditional-generation.md | Implementation plan for skip_when conditional column generation; addresses all prior review comments but contains a concrete omission in the ExecutionGraph modifications and an architectural ambiguity around DropSkippedRowsProcessorConfig. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["evaluate cell (col, row)"] --> B{dropped?}
B -->|yes| C["return None"]
B -->|no| D["read row_data from buffer"]
D --> E{"propagate_skip=True?\ncheck graph.get_required_columns()"}
E -->|yes| E2{"any required_col\nin __skipped__?"}
E2 -->|yes| H
E2 -->|no| F
E -->|no| F{"SkipConfig\npresent?"}
F -->|no| G["generate(row_data)\nLLM / expression / sampler"]
F -->|yes| I["evaluate_skip_when()\nNativeSandboxedEnvironment\nbool(native result)"]
I --> J{truthy?}
J -->|no| G
J -->|yes| H["SKIP:\n__skipped__.add(col)\nrecord[col] = skip.value\nside-effect cols = None"]
H --> K["return None\n(counts as success)"]
G --> L["write result to buffer"]
subgraph serialization["Serialization boundary (Step 8)"]
S1["get_dataframe() / write()\nstrip __skipped__ key"] --> S2["Parquet output\n(no __skipped__ column)"]
end
L --> serialization
K --> serialization
Prompt To Fix All With AI
This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 327-348
Comment:
**`get_required_columns()` does not exist on `ExecutionGraph` and is semantically distinct from `get_upstream_columns()`**
`_should_skip_cell()` calls `self._graph.get_required_columns(column_name)` (line 338), but the current `ExecutionGraph` has no such method (verified against the live source). The Files Modified section for `execution_graph.py` lists `get_skip_config()` and `should_propagate_skip()` as new accessors but omits `get_required_columns()`.
This omission creates a correctness problem, not just a missing method. Using `get_upstream_columns()` as a drop-in replacement would be **semantically wrong** for propagation: the plan adds `skip.when` column references as DAG edges (Section 2a/2b), so `get_upstream_columns()` includes both `required_columns` edges *and* `skip.when` edges. For propagation purposes only `required_columns` relationships should matter.
**Concrete failure case:** Column A has `skip.when="{{ gating_col == 0 }}"` and no `required_columns`. Column B has `required_columns=["A"]` and no `SkipConfig`. If `gating_col` itself gets skipped for a row:
- `gating_col` is in the row's `__skipped__` set.
- `get_upstream_columns("A")` returns `{gating_col}` (the `skip.when` edge).
- Propagation check sees `gating_col ∈ __skipped__` → Column A is propagation-skipped.
- But the intent is that A should *evaluate* its expression (which would get `gating_col=None`, making `None == 0` → `False`, so A should NOT be skipped).
The fix is to add `_required_columns: dict[str, list[str]]` to `ExecutionGraph.__init__`, populate it in the second pass (parallel to `_skip_configs`), and expose it via `get_required_columns(column: str) -> list[str]`. This should be added to both the class description in Section 2b and the Files Modified table.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 566-572
Comment:
**`DropSkippedRowsProcessorConfig` cannot read `__skipped__` if processors consume DataFrames**
The Resolved Questions section says the processor "can read `record.get("__skipped__")` directly" from the record dict, but Section 8 strips `__skipped__` at *every* point where records are converted to DataFrames. The plan does not specify at which stage this processor runs relative to the stripping boundary.
If processors (consistent with the existing codebase pattern) receive DataFrames, the `__skipped__` key will already have been stripped and the processor would have no way to identify which rows were skipped — defeating the described mechanism.
The plan needs to clarify one of:
1. The processor runs on raw record dicts before DataFrame conversion (and the pipeline is designed to support this), **or**
2. `__skipped__` is kept as a proper DataFrame column through the processor stage and stripped only in the final parquet `write()`, **or**
3. The processor uses an alternative mechanism (e.g., checking whether all skippable columns are `None`/`NaN`) rather than reading `__skipped__` from the raw record.
Without this clarification the `DropSkippedRowsProcessorConfig` is underdefined and the implementation could silently produce a processor that can never identify skipped rows.
How can I resolve this? If you propose a fix, please make it concise.Reviews (2): Last reviewed commit: "plan: major revision — SkipConfig model,..." | Re-trigger Greptile
| #### 4d. Modify `_fan_out_with_async()` (line 621) | ||
| Convert list comprehension to explicit loop with same skip check. | ||
|
|
||
| #### 4e. Modify `_run_full_column_generator()` (line 503) |
There was a problem hiding this comment.
for FULL_COLUMN generators, the plan runs generate() first then overwrites skipped rows with None. That means batch LLM calls still happen for rows that should be skipped - which kind of defeats the main value prop of skip_when. Maybe pre-filter the DataFrame to exclude skipped rows before calling generate(), then merge back with Nones? That's closer to how _records_to_drop filtering works.
There was a problem hiding this comment.
Addressed in 351bc29. Both sync (Step 4d _run_full_column_generator) and async (Step 5c _run_batch) now pre-filter the DataFrame to exclude skipped rows before calling the generator, then merge results back at original indices. Generator never sees skipped rows.
| def evaluate_skip_when(expression: str, record: dict) -> bool: | ||
| """Render expression against record; return True if truthy.""" | ||
|
|
||
| def should_skip_by_propagation( |
There was a problem hiding this comment.
I think the propagation rule is too aggressive as written. required_columns here just means "variable name appears in the template", not "the generator can't handle nulls". Templates like {{ 'unknown' if country is none else country|upper }} or {{ 0 if score is none else score + 1 }} handle missing data fine but would get auto-skipped. Same issue shows up in step 6 where the expression generator blanket-returns None when any input is None - that overrides the user's own fallback logic. Maybe make propagation opt-out per column, or only auto-propagate for LLM generators where null inputs are genuinely unsafe?
There was a problem hiding this comment.
Addressed in 351bc29. Propagation is now opt-out per column via propagate_skip=False on SingleColumnConfig (independent of SkipConfig). The design decisions table documents this: null-tolerant columns like expressions with fallback logic can opt out. The expression generator no longer has a defensive None guard — skip handling is done at the engine dispatch layer via pre-filtering.
|
|
||
| | Decision | Choice | Rationale | | ||
| |---|---|---| | ||
| | Where does `skip_when` live? | `SingleColumnConfig` base class | Cross-cutting; applies to all column types | |
There was a problem hiding this comment.
this says skip_when applies to all column types via SingleColumnConfig, but sampler/seed columns get collapsed into shared multi-column configs before execution - the sampler generator emits all sibling columns together, so there's no place to skip only one of them. Maybe scope v1 to generated single-column configs and add validation that rejects skip_when on sampler/seed types?
There was a problem hiding this comment.
Addressed in 351bc29. skip is now scoped to generated single-column configs only. A @model_validator on SingleColumnConfig rejects skip on sampler/seed column types, and a belt-and-suspenders SKIP_ON_SAMPLER_SEED validation rule catches it at the engine level.
worth noting the overlap with #362 (keep failed-parse fields as null). Both features need a "cell became unavailable, propagate through the DAG" mechanism - |
johnnygreco
left a comment
There was a problem hiding this comment.
looks great! only question to sort out is the skip value
…e propagation - Introduce SkipConfig(when, value) as nested model on SingleColumnConfig - Move propagate_skip to SingleColumnConfig as independent field, fixing bug where columns with no SkipConfig couldn't participate in propagation - Add full sync engine implementation (Steps 4a-4d) covering both _fan_out_with_threads and _run_full_column_generator dispatch paths - Add serialization boundary stripping for both DatasetBatchManager (sync) and RowGroupBufferManager (async) - Simplify architecture diagrams for readability - Update all references, design decisions, verification plan Made-with: Cursor
| Add a private method on `DatasetBuilder` that centralizes the skip decision for one cell. Propagation and expression gating are evaluated independently: | ||
|
|
||
| ```python | ||
| def _should_skip_cell( | ||
| self, column_name: str, record: dict | ||
| ) -> bool: | ||
| skipped_cols: set[str] = record.get("__skipped__", set()) | ||
|
|
||
| # 1. Propagation — independent of SkipConfig | ||
| propagate = self._graph.should_propagate_skip(column_name) | ||
| if propagate: | ||
| required = self._graph.get_required_columns(column_name) | ||
| if should_skip_by_propagation(required, skipped_cols, propagate): | ||
| return True | ||
|
|
||
| # 2. Expression gate — only if SkipConfig exists | ||
| skip_config = self._graph.get_skip_config(column_name) | ||
| if skip_config is not None: | ||
| return evaluate_skip_when(skip_config.when, record) | ||
|
|
||
| return False | ||
| ``` |
There was a problem hiding this comment.
get_required_columns() does not exist on ExecutionGraph and is semantically distinct from get_upstream_columns()
_should_skip_cell() calls self._graph.get_required_columns(column_name) (line 338), but the current ExecutionGraph has no such method (verified against the live source). The Files Modified section for execution_graph.py lists get_skip_config() and should_propagate_skip() as new accessors but omits get_required_columns().
This omission creates a correctness problem, not just a missing method. Using get_upstream_columns() as a drop-in replacement would be semantically wrong for propagation: the plan adds skip.when column references as DAG edges (Section 2a/2b), so get_upstream_columns() includes both required_columns edges and skip.when edges. For propagation purposes only required_columns relationships should matter.
Concrete failure case: Column A has skip.when="{{ gating_col == 0 }}" and no required_columns. Column B has required_columns=["A"] and no SkipConfig. If gating_col itself gets skipped for a row:
gating_colis in the row's__skipped__set.get_upstream_columns("A")returns{gating_col}(theskip.whenedge).- Propagation check sees
gating_col ∈ __skipped__→ Column A is propagation-skipped. - But the intent is that A should evaluate its expression (which would get
gating_col=None, makingNone == 0→False, so A should NOT be skipped).
The fix is to add _required_columns: dict[str, list[str]] to ExecutionGraph.__init__, populate it in the second pass (parallel to _skip_configs), and expose it via get_required_columns(column: str) -> list[str]. This should be added to both the class description in Section 2b and the Files Modified table.
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/479/skip-when-conditional-generation.md
Line: 327-348
Comment:
**`get_required_columns()` does not exist on `ExecutionGraph` and is semantically distinct from `get_upstream_columns()`**
`_should_skip_cell()` calls `self._graph.get_required_columns(column_name)` (line 338), but the current `ExecutionGraph` has no such method (verified against the live source). The Files Modified section for `execution_graph.py` lists `get_skip_config()` and `should_propagate_skip()` as new accessors but omits `get_required_columns()`.
This omission creates a correctness problem, not just a missing method. Using `get_upstream_columns()` as a drop-in replacement would be **semantically wrong** for propagation: the plan adds `skip.when` column references as DAG edges (Section 2a/2b), so `get_upstream_columns()` includes both `required_columns` edges *and* `skip.when` edges. For propagation purposes only `required_columns` relationships should matter.
**Concrete failure case:** Column A has `skip.when="{{ gating_col == 0 }}"` and no `required_columns`. Column B has `required_columns=["A"]` and no `SkipConfig`. If `gating_col` itself gets skipped for a row:
- `gating_col` is in the row's `__skipped__` set.
- `get_upstream_columns("A")` returns `{gating_col}` (the `skip.when` edge).
- Propagation check sees `gating_col ∈ __skipped__` → Column A is propagation-skipped.
- But the intent is that A should *evaluate* its expression (which would get `gating_col=None`, making `None == 0` → `False`, so A should NOT be skipped).
The fix is to add `_required_columns: dict[str, list[str]]` to `ExecutionGraph.__init__`, populate it in the second pass (parallel to `_skip_configs`), and expose it via `get_required_columns(column: str) -> list[str]`. This should be added to both the class description in Section 2b and the Files Modified table.
How can I resolve this? If you propose a fix, please make it concise.
📋 Summary
Implementation plan for
skip_when— a new field onSingleColumnConfigthat enables conditional column generation. When the Jinja2 expression evaluates truthy for a row, the cell is set toNoneand the generator is skipped. Skips auto-propagate through the DAG to downstream columns.Closes #479
🔄 Changes
✨ Added
plans/479/skip-when-conditional-generation.md— detailed implementation plan covering config, DAG, sync engine, async engine, validation, and test strategyPlan Highlights
skip_whenfield + Jinja2 validator +skip_when_columnsproperty onSingleColumnConfigskip_whenreferenced columns become edges ensuring correct execution orderrequired_columnsinclude a skipped cell auto-skipNonevs dedicated type) and auto-removal of skipped rows🤖 Generated with AI