Optimise Repr::clone and clone_from for the inline case#69
Conversation
The callgrind profile of hegel-rust's shrinker (per the hegel_shrinker
bench file's comments) reports Repr::clone at 13.24 % of total Ir, with
values almost exclusively in the i64 / i128 range — i.e. the inline
variant of Repr (capacity in {-2,-1,1,2}).
The old implementation went through sign_capacity to extract the
absolute capacity + Sign, then constructed a positive Repr, then
re-applied the sign via with_sign. That's a redundant round-trip when
all we need to do for the inline case is copy 3 machine words verbatim.
This patch:
* marks Clone::clone #[inline] so the trivial inline path lands at
every call site without a function call;
* inline path is now a direct struct copy: the union's
field plus the original signed ;
* heap path is split into a #[cold] + #[inline(never)] helper so
the inline fast path stays tiny everywhere it's instantiated.
(Without the explicit #[inline(never)] LLVM happily inlines it
back and the heap case actually gets slightly slower in
benches — apparently the inlined-in-clone shape compiles worse
than going through a normal function call. Tracked the
experiment in the commit log.)
Results against the new integer/benches/hegel_shrinker baseline:
shrinker_consider/64 -29.1 %
shrinker_consider/16 -25.5 %
shrinker_consider/4 -12.7 %
ibig_clamp/two_word -42.6 % (clamp uses clone heavily)
ibig_clamp/one_word -40.2 %
ibig_drop/two_word -35.3 % (incidental: tighter assignment
ibig_drop/one_word -34.6 % pattern in the surrounding bench
ibig_drop/zero -34.2 % loop exposes a folding win)
ibig_clone/two_word -21.4 %
ibig_clone/one_word -20.3 %
ibig_clone/zero -20.0 %
ibig_clone/just_over_inline +7.7 % (heap path: extra function call)
ibig_clone/mid +12.7 %
ibig_clone/large ~0 % (memcpy-bound, call overhead lost
in the noise)
The heap regression is the deliberate trade-off: hegel's shrinker uses
≤128-bit values almost exclusively, so the inline win dwarfs it. Any
downstream consumer that clones large IBig values heavily would hit
the +13 % on mid-sized values; if that ever shows up as a hot spot
we can revisit the inline/outline split.
Same shape as the clone() optimisation: marks clone_from #[inline] and rewrites the inline-source fast path to avoid the sign_capacity unpack (it only needs to know abs(capacity) <= 2 to decide which path to take, and can copy the signed capacity verbatim). The sign-aware extraction is still done in the heap path for the reallocation-and-copy logic. Not shown in the hegel_shrinker benches because Vec::clone() goes through T::clone() per element, not T::clone_from(); the bench file does not have an explicit clone_from-driven scenario. The change is for symmetry / downstream consumers that hit *dest = src.clone() patterns the compiler rewrites to clone_from. Tests pass (lib + hegel-tests stateful).
Two changes to Repr::clone_heap, both motivated by hegel_shrinker bench data showing the just_over_inline / mid heap clones regressed +8 to +13% vs the prior baseline (commit 583ee7e). 1. Skip the Buffer scaffolding and allocate directly. The old body went through Buffer::allocate (which calls allocate_exact which calls allocate_raw, with two near-duplicate size-bound asserts along the way), then push_slice (one more capacity-vs-len assert), then mem::transmute, then a conditional sign flip on the resulting Repr. The new body computes default_capacity inline, calls alloc::alloc::alloc once via from_size_align_unchecked, ptr::copy_n's, and constructs the Repr with the signed capacity set from the start. That eliminates the redundant checks plus a __rust_no_alloc_shim_is_unstable_v2 bl that the Buffer wrapper was pulling in. 2. Drop the #[cold] annotation. The shrinker_consider/64 workload bench (the most realistic top-level scenario from hegel_shrinker.rs) does ~3 pp better without #[cold]. The isolated ibig_clone microbench does ~3 pp worse — i.e. the trade-off is in code-layout, not in clone work itself. The realistic compound workload is what matters. Bench impact vs the 'current' baseline (saved before any Repr::clone changes — so this is the cumulative result of both this commit and the prior fast-path inline): ibig_clone/zero -16.6 % (vs -20.0 % before) ibig_clone/one_word -15.2 % ibig_clone/two_word -17.9 % ibig_clone/just_over_inline +2.0 % (was +7.7 %) ibig_clone/mid +0.7 % (was +12.7 %) ibig_clone/large -5.1 % (was ~0 %) shrinker_consider/64 -32.3 % (was -29.1 %) shrinker_consider/16 -27.4 % shrinker_consider/4 -13.5 % The heap regression is essentially gone: just_over_inline and mid are in the noise, large is actively faster than the original code by 5 %. The inline microbench gave up ~3 pp vs the prior commit, but that's a code-layout side effect of dropping #[cold] and is more than compensated for in the realistic shrinker workload.
Adds tests/clone.rs covering clone and clone_from across every inline/heap (dst, src) combination — including the heap-dst + inline-src case that frees the old buffer, and inline-dst + heap-src that allocates. Also fixes two lints the cherry-picked clone code tripped under upstream CI: the manual range check (clippy::manual_range_contains) and the if-else formatting in clone_from. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| /// bound checks and the sign-flip dance. Kept `#[inline(never)]` so the | ||
| /// inline fast path in `clone` stays small. | ||
| #[inline(never)] | ||
| fn clone_heap(&self) -> Self { |
There was a problem hiding this comment.
This function indeed combines buffer allocate + push_slice + transmute into a single function. But there is a Buffer::allocate_exact function existing, I insist to use Buffer's function for allocation, in case in future we plan to improve the allocators used for Buffer. Don't directly call alloc here.
For skipping the sign check, I think it's fair to flatten it out, but it can still be achieved by create the Buffer first, and then access and flip the capacity field if necessary.
| fn clone_from(&mut self, src: &Self) { | ||
| let (src_cap, src_sign) = src.sign_capacity(); | ||
| let (cap, _) = self.sign_capacity(); | ||
| let src_cap_raw = src.capacity.get(); |
There was a problem hiding this comment.
I don't think there is a difference before and after your change on this function. The only change that possibly makes a difference is adding the inline attribute. If adding the attribute makes an improvement, then we can just add that.
Optimises clones for inline-sized values by copying the union and signed capacity directly, with the heap path split into a separate
clone_heaphelper;clone_fromgets the matching fast path. About a 20% improvement in one of my benchmarks (clone followed by drop, heavily hitting the inline path).Benchmarks (layout-sampled, K=11)
Wall-clock, dashu-only, vs the merge base, measured under 11 randomised function layouts (
ld64 -order_file) and reported as the median — to average out the ±10% code-placement noise that otherwise swamps these small-value micro-benchmarks. The untouchedubig_mulcontrol reads 0.0% ± 0.0%, confirming a clean run.Inline-value clone is faster; heap clone is unchanged. (
ibig_dropclones-then-drops each iteration, so it reflects the inline clone speed-up.)ibig_dropinline (clone+drop)ubig_clone/ibig_cloneinline*_clone/*_dropheap (large/mid)ubig_mul(control)