perf: Optimize some decimal expressions#3619
Conversation
Replace the 4-node expression tree (Cast→BinaryExpr→Cast→Cast) used for Decimal128 arithmetic that may overflow with a single fused expression that performs i256 register arithmetic directly. This reduces per-batch allocation from 4 intermediate arrays (112 bytes/elem) to 1 output array (16 bytes/elem). The new WideDecimalBinaryExpr evaluates children, performs add/sub/mul using i256 intermediates via try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a single Decimal128 array. Follows the same pattern as decimal_div.
Add benchmark comparing old Cast->BinaryExpr->Cast chain vs fused WideDecimalBinaryExpr for Decimal128 add/sub/mul. Covers four cases: add with same scale, add with different scales, multiply, and subtract.
cb52636 to
d7495bd
Compare
|
@sqlbenchmark run tpch --iterations 3 |
1 similar comment
|
@sqlbenchmark run tpch --iterations 3 |
Eliminate redundant CheckOverflow when wrapping WideDecimalBinaryExpr (which already handles overflow). Fuse Cast(Decimal128→Decimal128) + CheckOverflow into a single DecimalRescaleCheckOverflow expression that rescales and validates precision in one pass.
5a21500 to
91092a6
Compare
|
@sqlbenchmark run tpch --iterations 3 |
|
Benchmark job |
|
This is awesome ! |
| let rescale_divisor = if need_rescale { | ||
| i256_pow10((max_scale - s_out) as u32) | ||
| } else { | ||
| i256::ONE |
There was a problem hiding this comment.
Don't we need to scale up to s_out if s_out > max_scale?
There was a problem hiding this comment.
It is very possible. I will take another look on Monday. I will move to draft for now to avoid accidental merge.
native/spark-expr/src/math_funcs/internal/decimal_rescale_check.rs
Outdated
Show resolved
Hide resolved
native/spark-expr/src/math_funcs/internal/decimal_rescale_check.rs
Outdated
Show resolved
Hide resolved
- Handle scale-up when s_out > max(s1, s2) in add/subtract - Propagate errors in scalar path when fail_on_error=true - Guard against large scale delta (>38) overflow in rescale - Assert precision <= 38 in precision_bound - Assert exp <= 76 in i256_pow10 - Remove unnecessary _ prefix on used variables in planner - Use value.signum() instead of manual sign check - Verify Cast target type matches before fusing with CheckOverflow - Validate children count in with_new_children for both expressions - Add tests for scale-up, scalar error propagation, and large delta
|
Thanks for the review @martin-g and @parthchandra. I pushed a commit to address this feedback and also added some additional tests. |
- Validate WideDecimalBinaryExpr output type matches CheckOverflow data_type before bypassing the overflow check - Handle s_out > natural_scale (scale-up) in multiply path for consistency with add/subtract
|
Close+reopen to re-trigger CI. |
mbutrovich
left a comment
There was a problem hiding this comment.
LGTM, thanks @andygrove! I've got some ideas to optimize the fixed-point decimal stuff but that belongs as arrow-rs kernels. This optimization makes sense for Comet!
|
Moving to draft until I can address #3619 (comment) |
Summary
For some decimal operations, we cast the inputs to Decimal256, perform the math operation, and then cast the result back to Decimal128:
This causes intermediate allocations and can be optimized with a custom
WideDecimalBinaryExprthat performs i256 register arithmetic directly, reducing per-batch allocation from 4 intermediate arrays (3 Decimal256 @ 32 bytes/elem + 1 Decimal128 @ 16 bytes/elem = 112 bytes/elem) to 1 output array (16 bytes/elem)TPC-H q1
Before:
After:
Criterion benchmark results (8192 element batches)
How it works
WideDecimalBinaryExprevaluates left/right children, performs add/sub/mul usingi256intermediates viaarrow::compute::kernels::arity::try_binary, applies scale adjustment with HALF_UP rounding, checks precision bounds, and outputs a singleDecimal128array. Follows the same pattern asdecimal_divindiv.rs.Overflow handling matches existing behavior:
ArrowError::ComputeErrori128::MAXsentinel +null_if_overflow_precision