Epic: Stats and AggregateFns #7707

@gatesn

Goal

Make Vortex statistics pluggable by modeling pruning stats as aggregate-function state exposed through physical bound expressions. The concrete success case is demonstrating a Bloom-filter zone-map stat for UTF-8 equality pruning added through plugins: a custom aggregate function, scalar function, and rewrite rule, without changing built-in pruning logic.

Direction

Stats are aggregate-function partials/results when the aggregate has pruning semantics. Not every aggregate is a useful pruning stat; the rewrite path should depend only on aggregates that can prove bounds.

Keep expressions physical for this epic. Falsification turns concrete predicates into normal Vortex expressions over physical scalar functions and stat(expr, AggregateFnRef). We are not adding a logical expression layer yet.

Use stat(expr, AggregateFnRef) as the bound-expression primitive. It returns the stat value for the current stats scope, or null when unavailable. Falsification produces expressions containing stat(...); simplification/execution decides whether anything is proven.
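As a rough illustration of the null-aware behavior described above, the sketch below models falsification of `col > threshold` against a max stat. It is a standalone model, not Vortex code: `StatValue`, `falsify_gt`, and `can_prune` are hypothetical names, and `Option` stands in for a nullable stat value.

```rust
/// Hypothetical stand-in for the result of `stat(expr, max)`.
/// `None` models the null returned when the stat is unavailable.
type StatValue = Option<i64>;

/// Falsifier for the predicate `col > threshold`, phrased over the max stat.
/// Some(true):  no row in the scope can match, so the predicate is falsified.
/// Some(false): the stat does not rule out a match.
/// None:        the stat was unavailable, so nothing is proven.
fn falsify_gt(max_stat: StatValue, threshold: i64) -> Option<bool> {
    max_stat.map(|max| max <= threshold)
}

/// A scope may be skipped only when falsification is proven.
/// Both "not falsified" and "unknown" keep the scope.
fn can_prune(proof: Option<bool>) -> bool {
    proof == Some(true)
}
```

In this model a missing stat never causes pruning: `can_prune(falsify_gt(None, 10))` is `false`, which mirrors the idea that falsification only builds the expression and a later step decides whether anything is proven.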

Aggregate functions advertise whether a stored aggregate can satisfy a requested aggregate through AggregateFnRef::can_satisfy(...). Exact descriptor matches are preferred; compatible approximate aggregates, such as bounded max satisfying max, may be used when the stored aggregate is a sound bound.
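A minimal sketch of the satisfaction check, assuming a toy `Aggregate` enum in place of `AggregateFnRef`. The enum, its variants, the `max_bytes` parameter, and `pick_stored` are illustrative assumptions; only the "exact match first, then sound approximate substitute" ordering comes from the text above.

```rust
/// Toy stand-in for an aggregate-function descriptor (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Aggregate {
    Min,
    Max,
    /// A max truncated to `max_bytes`, assumed to remain an upper bound.
    BoundedMax { max_bytes: usize },
}

impl Aggregate {
    /// Can `self` (a stored aggregate) answer a request for `requested`?
    fn can_satisfy(&self, requested: &Aggregate) -> bool {
        match (self, requested) {
            // An identical descriptor always satisfies the request.
            (stored, req) if stored == req => true,
            // A bounded max is still a sound upper bound for max.
            (Aggregate::BoundedMax { .. }, Aggregate::Max) => true,
            _ => false,
        }
    }
}

/// Prefer an exact descriptor match, then fall back to a compatible one.
fn pick_stored<'a>(stored: &'a [Aggregate], requested: &Aggregate) -> Option<&'a Aggregate> {
    stored
        .iter()
        .find(|s| *s == requested)
        .or_else(|| stored.iter().find(|s| s.can_satisfy(requested)))
}
```

With only `Min` and `BoundedMax` stored, a request for `Max` resolves to the bounded max; if an exact `Max` is also stored, it wins regardless of its position in the list.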

Zone maps store aggregate-function descriptors, using Display for AggregateFnRef, and use those descriptors as stats-table column names. At read time, the zone map lowers bound expressions by matching available descriptors against requested aggregates.
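The descriptor-as-column-name idea could look roughly like the following. This is a simplified model under stated assumptions: `Aggregate`, `ZoneStatsTable`, and `lower` are hypothetical, the descriptor strings `min` and `max` are placeholders rather than the real `Display` output, and stat values are reduced to `Option<i64>`.

```rust
use std::collections::HashMap;
use std::fmt;

/// Toy aggregate descriptor (hypothetical, not the Vortex type).
enum Aggregate {
    Min,
    Max,
}

impl fmt::Display for Aggregate {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Placeholder descriptor strings; the real format is not specified here.
        match self {
            Aggregate::Min => write!(f, "min"),
            Aggregate::Max => write!(f, "max"),
        }
    }
}

/// Stats table keyed by descriptor string, one nullable value per zone.
struct ZoneStatsTable {
    columns: HashMap<String, Vec<Option<i64>>>,
}

impl ZoneStatsTable {
    /// Lower a requested aggregate for one zone.
    /// A missing column, an out-of-range zone, or a null entry all yield None.
    fn lower(&self, requested: &Aggregate, zone: usize) -> Option<i64> {
        self.columns
            .get(&requested.to_string())?
            .get(zone)
            .copied()
            .flatten()
    }
}
```

Because the lookup key is just the `Display` output, a zone map that never stored a given aggregate needs no special handling: the request lowers to `None`, the model's equivalent of null.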

Expression expansion is acceptable for now. Rewrites may produce multiple physical proof expressions, and each zone map can lower unavailable aggregates to null.

All new stats-facing APIs should live under vortex-array/src/stats/. Scalar function implementations may live with scalar functions, but should be re-exported through vortex_array::stats.

Phase 1: Bound Expressions and Pruning Aggregates

Phase 2: Rewrite Registry

Phase 3: Built-In Rewrite Rules

LIKE pruning is tracked separately in #8026.

Phase 4: Zoned Layout Migration

Phase 5: Aggregate-Function Zoned Stats

WARNING: this is the phase that changes the ZonedLayout serialized form

  • Replace new zoned-layout stats configuration with aggregate-function descriptors. (Tracked in "Use aggregate descriptors for zoned stats", #7938.)
    • Configure stored zone stats with AggregateFnRef, not Stat enum values.
    • Use Display for AggregateFnRef as the descriptor string.
    • Use the descriptor string as the zone-map stats-table column name.
    • Keep Stat only as a compatibility bridge for existing array stats and legacy zoned metadata.
  • Compute per-zone aggregate partials at write time. (Tracked in "Use aggregate descriptors for zoned stats", #7938.)
    • Build the auxiliary stats table from each aggregate function's partial/state dtype.
    • Use a custom strategy/configuration hook for selecting aggregates before adding broader policy machinery.
  • Add a new zoned metadata format for aggregate-function stats. (Tracked in "Use aggregate descriptors for zoned stats", #7938.)
    • Current zoned metadata is raw bytes, not protobuf: zone_len followed by a legacy Stat bitset.
    • Add a version/magic marker so new metadata can be recognized as protobuf.
    • Store zone_len and present_aggregates: repeated string.
    • Preserve legacy metadata decoding by translating old Stat bitsets into built-in aggregate descriptor strings.
  • Lower stat(expr, aggregate_fn) at read time by matching aggregate-function descriptors in the zone stats table. (Tracked in "Use aggregate descriptors for zoned stats", #7938.)
    • Match exact descriptors first.
    • Use AggregateFnRef::can_satisfy(...) when a stored aggregate is a sound exact or approximate substitute for the requested aggregate.
    • Unavailable aggregate stats continue to lower to null values of a nullable type.
  • Remove zoned-layout schema special cases that are only needed because stats are modeled as Stat enum values. (Tracked in "Use aggregate descriptors for zoned stats", #7938.)
    • Any auxiliary proof state should be represented by the aggregate partial itself or by a dedicated aggregate.
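
One way to picture the legacy-compatibility step in the metadata bullets above is the sketch below. Everything in it is an assumption made for illustration: the `ZonedMetadata` enum, the 32-bit width of the bitset, and the `legacy_table` parameter that maps bit positions to descriptor strings. The actual bit layout and descriptor strings are not given in this issue, so the mapping is passed in rather than hard-coded.

```rust
/// Hypothetical in-memory view of the two metadata generations.
enum ZonedMetadata {
    /// Legacy form: zone_len followed by a Stat bitset.
    Legacy { zone_len: u32, stat_bits: u32 },
    /// New form: zone_len plus descriptor strings.
    AggregateFn { zone_len: u32, present_aggregates: Vec<String> },
}

impl ZonedMetadata {
    /// Zone length is common to both generations.
    fn zone_len(&self) -> u32 {
        match self {
            ZonedMetadata::Legacy { zone_len, .. } => *zone_len,
            ZonedMetadata::AggregateFn { zone_len, .. } => *zone_len,
        }
    }

    /// Descriptor strings for the stored aggregates.
    /// `legacy_table[i]` is the assumed descriptor for bit `i` of the bitset.
    fn present_aggregates(&self, legacy_table: &[&str]) -> Vec<String> {
        match self {
            ZonedMetadata::AggregateFn { present_aggregates, .. } => present_aggregates.clone(),
            ZonedMetadata::Legacy { stat_bits, .. } => legacy_table
                .iter()
                .take(32) // the modeled bitset has only 32 bits
                .enumerate()
                .filter(|(bit, _)| (*stat_bits & (1u32 << *bit)) != 0)
                .map(|(_, name)| name.to_string())
                .collect(),
        }
    }
}
```

After this translation, the read path only ever sees descriptor strings, so the descriptor matching described above does not need to know which metadata generation a file was written with.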

Phase 6: Plugin Bloom Proof

  • Add a plugin-provided Bloom aggregate for UTF-8 values.
  • Add the Bloom filter extension/storage type needed by the aggregate output.
  • Add a plugin-provided bloom_might_contain(filter, value) scalar function.
  • Register a plugin-provided rewrite for UTF-8 equality.
  • Store/load the Bloom stat through the aggregate-function zone-map path.
  • Demonstrate UTF-8 equality pruning without modifying built-in binary-expression pruning.
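
To make the pieces of this phase concrete, here is a self-contained toy version of the three plugin parts: an accumulating Bloom aggregate, a `bloom_might_contain(filter, value)` function, and an equality falsifier. It is a sketch only. The filter size, the hash count, the use of the standard library's `DefaultHasher`, and the names `Bloom`, `bit_index`, and `falsify_eq` are all assumptions, and none of this reflects the plugin API or the storage type the epic will define.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BITS: u64 = 1024; // assumed filter size in bits
const HASHES: u64 = 3; // assumed number of hash functions

/// Toy Bloom filter acting as the aggregate's partial state.
#[derive(Clone, Default)]
struct Bloom {
    words: [u64; 16], // 16 * 64 = 1024 bits
}

/// Bit position of `value` under the hash function identified by `seed`.
fn bit_index(value: &str, seed: u64) -> usize {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    value.hash(&mut hasher);
    (hasher.finish() % BITS) as usize
}

impl Bloom {
    /// Aggregate step: fold one UTF-8 value into the filter.
    fn accumulate(&mut self, value: &str) {
        for seed in 0..HASHES {
            let i = bit_index(value, seed);
            self.words[i / 64] |= 1u64 << (i % 64);
        }
    }
}

/// Scalar function: false means the value is definitely absent.
fn bloom_might_contain(filter: &Bloom, value: &str) -> bool {
    (0..HASHES).all(|seed| {
        let i = bit_index(value, seed);
        (filter.words[i / 64] & (1u64 << (i % 64))) != 0
    })
}

/// Rewrite target for `col = literal`: falsified when the filter rules the
/// literal out. A missing filter (None) proves nothing.
fn falsify_eq(filter: Option<&Bloom>, literal: &str) -> Option<bool> {
    filter.map(|f| !bloom_might_contain(f, literal))
}
```

A Bloom filter has no false negatives, so any value that was accumulated always reports "might contain" and its zone is never pruned. False positives only cost a missed pruning opportunity, which is why the filter is a sound proof source for equality.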

Phase 7: Satisfaction Follow-Up

  • Add satisfaction rewrite APIs and rules as new behavior.
    • Combine independent satisfiers with OR.
  • Teach filtering to use satisfied-zone masks to skip residual predicate evaluation for zones proven true.
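
The two bullets above could be modeled along these lines. This is an illustrative sketch with hypothetical names (`combine_satisfiers`, `filter_zones`) and plain `Vec`s in place of Vortex arrays and masks. It shows OR-combining per-zone satisfaction masks and then skipping predicate evaluation for zones already proven true.

```rust
/// OR-combine per-zone masks from independent satisfiers.
/// A zone is satisfied if any satisfier proves it; a missing entry counts as unproven.
fn combine_satisfiers(masks: &[Vec<bool>], zones: usize) -> Vec<bool> {
    (0..zones)
        .map(|z| masks.iter().any(|m| m.get(z).copied().unwrap_or(false)))
        .collect()
}

/// Filter rows zone by zone. Zones proven true are kept wholesale without
/// evaluating the predicate; all other zones are evaluated row by row.
/// Returns the kept rows and the number of predicate evaluations performed.
fn filter_zones(
    zones: &[Vec<i64>],
    satisfied: &[bool],
    predicate: impl Fn(i64) -> bool,
) -> (Vec<i64>, usize) {
    let mut kept = Vec::new();
    let mut evaluated = 0;
    for (i, zone) in zones.iter().enumerate() {
        if satisfied.get(i).copied().unwrap_or(false) {
            kept.extend_from_slice(zone);
        } else {
            for &value in zone {
                evaluated += 1;
                if predicate(value) {
                    kept.push(value);
                }
            }
        }
    }
    (kept, evaluated)
}
```

For the predicate `v > 5` over zones `[6, 7]` and `[1, 9]`, a satisfier that proves the first zone (for example because its min is above 5) lets the filter keep both rows of that zone and evaluate the predicate only on the two rows of the second zone.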

Phase 8: Cleanup

  • Remove duplicated legacy stat propagation once the new rewrite path is complete.
  • Retire the old StatsCatalog pruning path.
  • Move broader generic stats storage work to a follow-up epic if still needed.

Status

In progress.

Labels: epic (public roadmap umbrella for a major initiative, with work tracked in sub-issues)
