Skip to content

Use aggregate functions for stats dtype#7944

Open
AdamGS wants to merge 1 commit into
developfrom
adamg/use-aggregate-for-stat-type
Open

Use aggregate functions for stats dtype#7944
AdamGS wants to merge 1 commit into
developfrom
adamg/use-aggregate-for-stat-type

Conversation

@AdamGS
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS commented May 15, 2026

Summary

We shouldn't implement the logic twice, IDK how this will deal with existing files but I guess that's why we have those tests?

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@AdamGS AdamGS requested review from gatesn and joseph-isaacs May 15, 2026 11:44
@AdamGS AdamGS added the changelog/fix A bug fix label May 15, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 15, 2026

Merging this PR will degrade performance by 17.02%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 6 regressed benchmarks
✅ 1210 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation new_bp_prim_test_between[i16, 32768] 120.8 µs 135.1 µs -10.58%
Simulation new_bp_prim_test_between[i32, 16384] 95.2 µs 109.9 µs -13.35%
Simulation new_bp_prim_test_between[i64, 16384] 115.6 µs 145.4 µs -20.49%
Simulation new_alp_prim_test_between[f64, 16384] 127.5 µs 149.8 µs -14.9%
Simulation new_bp_prim_test_between[i32, 32768] 141.5 µs 170.7 µs -17.07%
Simulation new_bp_prim_test_between[i64, 32768] 178.5 µs 237.7 µs -24.92%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing adamg/use-aggregate-for-stat-type (95d691d) with develop (c0b6f31)

Open in CodSpeed

@AdamGS
Copy link
Copy Markdown
Contributor Author

AdamGS commented May 15, 2026

😢 need to figure out back compact here

@joseph-isaacs
Copy link
Copy Markdown
Contributor

Lets be careful about moving these over.

@gatesn
Copy link
Copy Markdown
Contributor

gatesn commented May 17, 2026

@claude figure out the right fix here

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 17, 2026

Claude finished @gatesn's task in 9m 49s —— View job


Fix for Stat::Min/Stat::Max backward compat + performance

  • Read the changed file and understand the PR
  • Analyze the backward compat root cause
  • Identified the fix (unable to push — read-only workflow)

Root Cause

The PR changes Stat::Min/Stat::Max.dtype() to extract the first field of the MinMax aggregate struct return type, which is always non-nullable:

Current PR:

Self::Min | Self::Max => MinMax.return_dtype(&EmptyOptions, data_type)
    .as_ref()
    .and_then(|t| t.as_struct_fields_opt())
    .and_then(|f| f.field_by_index(0)),  // → NonNullable element type

This introduces two problems:

1. Backward compat break:
Stat::Min.dtype(Primitive(I64, Nullable)) now returns Primitive(I64, NonNullable) instead of the original Primitive(I64, Nullable). Several callers use this dtype to reconstruct scalars from stored bytes:

  • StatsSet::from_flatbuffer (legacy format)
  • FileStatsLayoutReader::stats_ref (file pruning)
  • TypedStatsSetRef::get (in-memory stat access)

Files written with the old code stored min/max scalars with the nullable input dtype; reading them back with a different dtype can cause mismatch or silently wrong behavior.

2. Unnecessary allocation in hot paths:
MinMax::return_dtype() calls make_minmax_dtype() which allocates a DType::Struct with two fields every call, only to immediately discard it via .field_by_index(0). This runs in TypedStatsSetRef::get on every stat access.


The Fix

Fix this →

// vortex-array/src/expr/stats/mod.rs
Self::Min | Self::Max => aggregate_fn::fns::min_max::MinMax
    .return_dtype(&EmptyOptions, data_type)
    .map(|_| data_type.clone()),

This one-line change:

  • ✅ Uses MinMax::return_dtype() only as a support gate (no logic duplication — goal of the PR)
  • ✅ Returns data_type.clone() preserving original nullability → backward compat with existing files
  • ✅ Returns None for unsupported types (DType::Null, Struct, List, Union, Variant) — better than the old behavior which returned Some for all of these
  • ✅ No struct allocation for a value that's immediately discarded

Note: I was unable to push this change due to read-only workflow permissions. The commit d40433f is staged locally. @AdamGS can apply this as a one-liner.

@gatesn
Copy link
Copy Markdown
Contributor

gatesn commented May 17, 2026

Claude you are stupid

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants