Support columns_sorted in row_filters#20497
Conversation
|
@alamb @Ted-Jiang please take a look when you are available. |
|
I think @adriangb also looked at using the sorted mentadata recently One big question I have is why are we proposing this change? Do we have any evidence this help performance (like benchmarks?) |
|
Do you know if any of the benchmark queries evaluate filters on sorted cols? If not, I'll make a new one. |
|
I think there are no optimizations for evaluating predicates on filtered arrays. It would be cool to be able to support binary search for comparison predicates. Locate the relevant rows once, then operate only on that range: |
If the columns is sorted, won't row group / page pruning basically be doing this already? |
You're right. Then, if a column is sorted it's a reason enough to use load page indices and prune. Due how data is distributed this could prune a lot right away. |
|
I'll work on adding a query to the existing bench to capture the benefits and send the results later. |
|
@sdf-jkl could you help me with some napkin math on how this optimization works? Is the idea that applying a row selection when a page index is present is more efficient? I'm not sure if that means we should filter columns that have a page index first or last, and how that would weigh against e.g. the size of the column or the selectivity of the filter 🤔 |
|
Sorry, I think I got things mixed up while working on this. We consider a column Given that, this column is usually a strong candidate for row group/page pruning. So we prune. After pruning, the remaining work goes to This should make the incremental benefit of using a predicate on this column early in Late Materialization likely marginal in many workloads, given most of the pruning value was already captured earlier. |
So the point is that these columns (sorted columns) were likely well pruned by row group / page min/max stats -> they're unlikely to be selective for row pruning -> they should be evaluated last? |
|
They are unlikely to be highly selective for row pruning, but we can't reliably assume they are always less selective than other predicates. My implementation here did the opposite and prioritized them in the evaluation order, which is a mistake. At this point, I think the rule itself might be unnecessary, and we could consider closing the issue. I can clean up the docs and the function placeholder in This is the current doc saying we should prioritized sorted columns: datafusion/datafusion/datasource-parquet/src/row_filter.rs Lines 45 to 60 in 9b7d092 |
|
I agree updating the docs and removing this un-implemented heuristic makes sense. |
| The sorted dataset is automatically generated from the ClickBench partitioned dataset. You can configure the memory used during the sorting process with the `DATAFUSION_MEMORY_GB` environment variable. The default memory limit is 12GB. | ||
| ```bash | ||
| ./bench.sh data data_sorted_clickbench | ||
| ./bench.sh data clickbench_sorted |
There was a problem hiding this comment.
I’m assuming the new version is correct 👍
There was a problem hiding this comment.
Which issue does this PR close?
columns_sortedin row_filters #3476.Rationale for this change
Improving predicate ordering for predicate pushdown
What changes are included in this PR?
Building on changes from #3477 and #7528
columns_sortedfunctionshould_enable_page_indexto use index when choose to reorder predicates in configAre these changes tested?
Yes, unit tests
Are there any user-facing changes?
No