Skip to content

[Minor] add non topk benchmarks for utf8/utf8view string aggregates#21073

Open
buraksenn wants to merge 3 commits intoapache:mainfrom
buraksenn:add-non-top-k-benchmarks-to-compare
Open

[Minor] add non topk benchmarks for utf8/utf8view string aggregates#21073
buraksenn wants to merge 3 commits intoapache:mainfrom
buraksenn:add-non-top-k-benchmarks-to-compare

Conversation

@buraksenn
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Details are in #19713 but main idea is to compare non-topk and topk test results so that we can compare performances

What changes are included in this PR?

Added non topk benchmark tests.

Are these changes tested?

Only test changes

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Mar 20, 2026
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @buraksenn

Thanks for working on this.

let ctx = rt
.block_on(create_context(partitions, samples, asc, use_topk, use_view))
.unwrap();
c.bench_function(&name, |b| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new TopK-disabled string cases expand the matrix, but they still only go through run_string()/aggregate_string(), which currently checks row-count and whether the physical plan contains lim=[...]. That means this PR does not actually verify correctness across Utf8 and Utf8View group keys, even though that is part of the motivation. Can we strengthen the string benchmark path with an expected-result assertion (or add a dedicated helper/test that compares Utf8 vs Utf8View output for both TopK modes) so the new variants catch ordering/value regressions instead of only plan-shape changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrongly assumed we wanted to only check performance but I understand. I've added a assert to check results of each

let dir = if asc { "asc" } else { "desc" };
let topk_label = if use_topk { "TopK" } else { "no TopK" };
let name = format!("distinct {total_rows} rows {dir} [{topk_label}]");
let ctx = rt.block_on(async {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This refactor now rebuilds the DISTINCT input/context once per (use_topk, asc) pair, even though asc only affects the query text and the old version shared one context per TopK mode. It is outside the timed loop, so not a benchmark-result bug, but it does add a lot of setup work for 10M rows. Could we hoist context creation by use_topk again, or extract a small helper that caches the two contexts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand I think I've resolved this by using and cloning the session (shallow copy afais) but can you check if possible

"top k={limit} aggregate {rows} worst-case rows [Utf8View]",
),
];
for &(asc, use_topk, use_view, run_asc, name_tpl) in numeric_cases {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this tuple be simplified a bit? run_asc currently mirrors asc in every entry, so carrying both booleans makes the benchmark matrix harder to audit and easier to desynchronize later. Passing asc straight through (or switching to a small case struct with named fields) would make the intent clearer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I've got rid of run_asc and reordered cases. I can also introduce a struct if requested

@buraksenn
Copy link
Contributor Author

Thanks @kosiew for the detailed review

@buraksenn buraksenn changed the title [Minor] add non topk benchmarks for ut8/ut8view string aggregates [Minor] add non topk benchmarks for utf8/utf8view string aggregates Mar 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add non-TopK benchmark variants for Utf8/Utf8View string aggregates

2 participants