[Minor] add non topk benchmarks for utf8/utf8view string aggregates by buraksenn · Pull Request #21073 · apache/datafusion

buraksenn · 2026-03-20T07:02:33Z

Which issue does this PR close?

Closes #19713 (Add non-TopK benchmark variants for Utf8/Utf8View string aggregates).

Rationale for this change

Details are in #19713, but the main idea is to benchmark both the TopK and non-TopK paths so that their performance can be compared.

What changes are included in this PR?

Added non-TopK benchmark tests.

Are these changes tested?

This PR only changes test (benchmark) code.

Are there any user-facing changes?

No

kosiew

👋 @buraksenn

Thanks for working on this.

kosiew · 2026-03-20T08:57:31Z

datafusion/core/benches/topk_aggregate.rs

+        let ctx = rt
+            .block_on(create_context(partitions, samples, asc, use_topk, use_view))
+            .unwrap();
+        c.bench_function(&name, |b| {


These new TopK-disabled string cases expand the matrix, but they still only go through run_string()/aggregate_string(), which currently checks row-count and whether the physical plan contains lim=[...]. That means this PR does not actually verify correctness across Utf8 and Utf8View group keys, even though that is part of the motivation. Can we strengthen the string benchmark path with an expected-result assertion (or add a dedicated helper/test that compares Utf8 vs Utf8View output for both TopK modes) so the new variants catch ordering/value regressions instead of only plan-shape changes?

I wrongly assumed we only wanted to check performance, but I understand now. I've added an assertion to check the results of each case.
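A rough sketch of the kind of expected-result check discussed above, written in plain Rust so it stands alone. It assumes the benchmark query is a per-group MAX ordered by the aggregate with a LIMIT; `reference_topk` and `check_variant` are hypothetical helpers for illustration, not functions from `topk_aggregate.rs`. The idea is to compute the expected rows once and compare every variant (Utf8 or Utf8View, TopK on or off) against them.

```rust
use std::collections::HashMap;

/// Hypothetical reference result: MAX(value) per key, ordered by that
/// maximum (ties broken by key), truncated to `limit` groups.
fn reference_topk(rows: &[(&str, i64)], limit: usize, asc: bool) -> Vec<(String, i64)> {
    let mut max_per_key: HashMap<&str, i64> = HashMap::new();
    for &(key, value) in rows {
        max_per_key
            .entry(key)
            .and_modify(|m| *m = (*m).max(value))
            .or_insert(value);
    }
    let mut groups: Vec<(String, i64)> = max_per_key
        .into_iter()
        .map(|(k, v)| (k.to_string(), v))
        .collect();
    groups.sort_by(|a, b| {
        let by_value = if asc { a.1.cmp(&b.1) } else { b.1.cmp(&a.1) };
        // Tie-break on the key so the expected output is deterministic.
        by_value.then_with(|| a.0.cmp(&b.0))
    });
    groups.truncate(limit);
    groups
}

/// Compare one variant's rows against the expected rows, reporting
/// which variant diverged instead of only checking row count.
fn check_variant(
    label: &str,
    actual: &[(String, i64)],
    expected: &[(String, i64)],
) -> Result<(), String> {
    if actual == expected {
        Ok(())
    } else {
        Err(format!("{label}: expected {expected:?}, got {actual:?}"))
    }
}
```

In the real benchmark the `actual` rows would come from the collected record batches of each variant; how those are converted to comparable rows is left out here.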

kosiew · 2026-03-20T08:57:31Z

datafusion/core/benches/topk_aggregate.rs

+        let dir = if asc { "asc" } else { "desc" };
+        let topk_label = if use_topk { "TopK" } else { "no TopK" };
+        let name = format!("distinct {total_rows} rows {dir} [{topk_label}]");
+        let ctx = rt.block_on(async {


This refactor now rebuilds the DISTINCT input/context once per (use_topk, asc) pair, even though asc only affects the query text and the old version shared one context per TopK mode. It is outside the timed loop, so not a benchmark-result bug, but it does add a lot of setup work for 10M rows. Could we hoist context creation by use_topk again, or extract a small helper that caches the two contexts?

I understand. I think I've resolved this by reusing and cloning the session (a shallow copy, as far as I can see), but could you check if possible?
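A minimal sketch of the hoisting idea, using a stand-in type rather than the real session context. `BenchContext`, `build_context`, and `plan_cases` are hypothetical names; the sketch assumes the real context handle is cheap to clone because its heavy state sits behind a shared pointer, as the reply suggests. The expensive build happens once per `use_topk`, and each `asc` direction only clones the handle.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a session context: cloning copies the
/// `Arc` pointer, not the underlying data.
#[derive(Clone)]
struct BenchContext {
    use_topk: bool,
    data: Arc<Vec<u64>>,
}

/// Expensive setup; `builds` counts how often it runs.
fn build_context(use_topk: bool, rows: usize, builds: &AtomicUsize) -> BenchContext {
    builds.fetch_add(1, Ordering::Relaxed);
    BenchContext {
        use_topk,
        data: Arc::new((0..rows as u64).collect()),
    }
}

/// Build one context per TopK mode and share it across both sort
/// directions, since `asc` only changes the query text.
fn plan_cases(rows: usize, builds: &AtomicUsize) -> Vec<(String, BenchContext)> {
    let mut cases = Vec::new();
    for use_topk in [false, true] {
        let ctx = build_context(use_topk, rows, builds);
        for asc in [true, false] {
            let dir = if asc { "asc" } else { "desc" };
            let topk_label = if use_topk { "TopK" } else { "no TopK" };
            let name = format!("distinct {rows} rows {dir} [{topk_label}]");
            cases.push((name, ctx.clone()));
        }
    }
    cases
}
```

With this shape, four benchmark cases trigger only two builds, and the two cases of a given TopK mode point at the same underlying data.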

kosiew · 2026-03-20T08:57:31Z

datafusion/core/benches/topk_aggregate.rs

+            "top k={limit} aggregate {rows} worst-case rows [Utf8View]",
+        ),
+    ];
+    for &(asc, use_topk, use_view, run_asc, name_tpl) in numeric_cases {


Could this tuple be simplified a bit? run_asc currently mirrors asc in every entry, so carrying both booleans makes the benchmark matrix harder to audit and easier to desynchronize later. Passing asc straight through (or switching to a small case struct with named fields) would make the intent clearer.

Makes sense. I've got rid of run_asc and reordered the cases. I can also introduce a struct if requested.
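For reference, the named-field alternative mentioned in the review could look roughly like this. `BenchCase`, its `name` method, and `all_cases` are hypothetical, and the name format is only an illustration, not the benchmark's actual naming scheme. Generating the matrix from loops keeps every combination present exactly once, so there is no second boolean to drift out of sync.

```rust
/// Hypothetical benchmark case with named fields instead of a
/// positional tuple of booleans.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BenchCase {
    asc: bool,
    use_topk: bool,
    use_view: bool,
}

impl BenchCase {
    /// Illustrative benchmark name derived from the fields.
    fn name(&self, limit: usize, rows: usize) -> String {
        let dir = if self.asc { "asc" } else { "desc" };
        let topk = if self.use_topk { "TopK" } else { "no TopK" };
        let ty = if self.use_view { "Utf8View" } else { "Utf8" };
        format!("top k={limit} aggregate {rows} rows {dir} [{topk}] [{ty}]")
    }
}

/// Enumerate the full matrix: 2 string types x 2 TopK modes x 2 directions.
fn all_cases() -> Vec<BenchCase> {
    let mut cases = Vec::new();
    for use_view in [false, true] {
        for use_topk in [true, false] {
            for asc in [true, false] {
                cases.push(BenchCase { asc, use_topk, use_view });
            }
        }
    }
    cases
}
```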

buraksenn · 2026-03-20T18:37:43Z

Thanks @kosiew for the detailed review

added benchmarks and made separate tests

32464db

github-actions bot added the core Core DataFusion crate label Mar 20, 2026

kosiew requested changes Mar 20, 2026


buraksenn added 2 commits March 20, 2026 21:35

address reviews

f3e6f58

ordering change

f3b83ee

buraksenn changed the title ~~[Minor] add non topk benchmarks for ut8/ut8view string aggregates~~ [Minor] add non topk benchmarks for utf8/utf8view string aggregates Mar 20, 2026

buraksenn wants to merge 3 commits into apache:main from buraksenn:add-non-top-k-benchmarks-to-compare
