Skip to content

use full column set on empty post-filter projection list in duckdb#6831

Open
myrrc wants to merge 1 commit intodevelopfrom
myrrc/fix-empty-projection-list-duckdb
Open

use full column set on empty post-filter projection list in duckdb#6831
myrrc wants to merge 1 commit intodevelopfrom
myrrc/fix-empty-projection-list-duckdb

Conversation

@myrrc
Copy link
Contributor

@myrrc myrrc commented Mar 6, 2026

  • Use full column list if input post filter projection list is empty as
    requested by DuckDB. Before we used the empty list which resulted in
    uninitialized read (columns requested but not populated).
  • Reset DataChunk in ArrayExporter as done by Parquet reader.
    Not needed right now, but must be done once we start estimating cardinalities
  • In Duckdb VTable input, rename column_ids -> projection_ids,
    projection_ids -> post_filter_projection_ids to better reflect what they are
  • generate comments for Rust bindings in bindgen
  • In duckdb_vx_node_statistics, reorder fields to reduce size 32->24.
  • Explain why and how we use EMPTY_COLUMN_ID in comments
  • Add tests for projection pushdown, including zero projections

@myrrc myrrc requested review from 0ax1 and joseph-isaacs March 6, 2026 17:43
@myrrc myrrc added action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/fix A bug fix ext/duckdb Relates to the DuckDB integration labels Mar 6, 2026
Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
@myrrc myrrc force-pushed the myrrc/fix-empty-projection-list-duckdb branch from bcd00d7 to d764aed Compare March 6, 2026 17:43
@myrrc myrrc enabled auto-merge (squash) March 6, 2026 17:44
@codspeed-hq
Copy link

codspeed-hq bot commented Mar 6, 2026

Merging this PR will improve performance by 23.63%

⚡ 3 improved benchmarks
✅ 977 untouched benchmarks
🆕 2 new benchmarks
⏩ 1466 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation take_map[(0.1, 0.5)] 2.6 ms 2.1 ms +23.63%
Simulation take_map[(0.1, 0.1)] 1,007.5 µs 908.9 µs +10.84%
Simulation take_map[(0.1, 1.0)] 4.2 ms 3.5 ms +20.81%
🆕 Simulation sequence_compress_u32 N/A 5.2 ms N/A
🆕 Simulation sequence_decompress_u32 N/A 4.1 ms N/A

Comparing myrrc/fix-empty-projection-list-duckdb (d764aed) with develop (5d6a3c8)2

Open in CodSpeed

Footnotes

  1. 1466 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

  2. No successful run was found on develop (a8b340a) during the generation of this report, so 5d6a3c8 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

@myrrc myrrc mentioned this pull request Mar 6, 2026
Copy link
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we (where to) always remove all filter column then we never need to differentiate between column and projection? right

I am still a little unclear about the email of columns_ids and projection_ids

* Example usage:
* https://github.com/duckdb/duckdb/blob/dc11eadd8f0a7c600f0034810706605ebe10d5b9/src/include/duckdb/function/table_function.hpp#L147
*/
const idx_t *post_filter_projection_ids;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i really think we should keep the naming here and explain the semantics.

It nice to keep the duckdb name

let filter_expr_str = filter_expr
.as_ref()
.map_or_else(|| "true".to_string(), |f| f.to_string());
debug!("Global init Vortex scan SELECT {projection_expr} WHERE {filter_expr_str}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this not alloc the str on even if debug ignores it?

table_filter_set: Option<&TableFilterSetRef>,
column_ids: &[u64],
column_names: &[String],
projection_ids: &[u64],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i find this naming more confusing, column seems to make more sense here. i think

///
/// Returns `true` if a chunk was exported, `false` if all rows have been exported.
pub fn export(&mut self, chunk: &mut DataChunkRef) -> VortexResult<bool> {
chunk.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you sure we need this?

why if we estimate card

ctx: ExecutionCtx,
/// Columns DuckDB requested to read from file. If empty, it's a zero-column
/// projection and should be handled accordingly, see ArrayExporter::export.
fields: Vec<Box<dyn ColumnExporter>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could Model this with an enum? It might be easier to understand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've thought about this, but then we need something like NonzeroVec, and I believe it's an overkill.

// Post filter projection ids may be empty, in which case
// you need to use projection_ids
// https://github.com/duckdb/duckdb/blob/6e211da91657a94803c465fd0ce585f4c6754b54/src/planner/operator/logical_get.cpp#L168
let (post_filter_projection_ids, has_projection_ids) = match post_filter_projection_ids {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is a projection_id then?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/fix A bug fix ext/duckdb Relates to the DuckDB integration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants