Optimise Repr::clone and clone_from for the inline case by DRMacIver · Pull Request #69 · cmpute/dashu

DRMacIver · 2026-06-02T06:20:31Z

Optimises clones for inline-sized values by copying the union and signed capacity directly, with the heap path split into a separate clone_heap helper; clone_from gets the matching fast path. About a 20% improvement in one of my benchmarks (clone followed by drop, heavily hitting the inline path).

Benchmarks (layout-sampled, K=11)

Wall-clock, dashu-only, vs the merge base, measured under 11 randomised function layouts (ld64 -order_file) and reported as the median — to average out the ±10% code-placement noise that otherwise swamps these small-value micro-benchmarks. The untouched ubig_mul control reads 0.0% ± 0.0%, confirming a clean run.

Inline-value clone is faster; heap clone is unchanged. (ibig_drop clones-then-drops each iteration, so it reflects the inline clone speed-up.)

Benchmark	Δ
`ibig_drop` inline (clone+drop)	−20%
`ubig_clone` / `ibig_clone` inline	~−5%
`_clone` / `_drop` heap (large/mid)	unchanged
`ubig_mul` (control)	0.0%

The callgrind profile of hegel-rust's shrinker (per the hegel_shrinker bench file's comments) reports Repr::clone at 13.24 % of total Ir, with values almost exclusively in the i64 / i128 range — i.e. the inline variant of Repr (capacity in {-2,-1,1,2}). The old implementation went through sign_capacity to extract the absolute capacity + Sign, then constructed a positive Repr, then re-applied the sign via with_sign. That's a redundant round-trip when all we need to do for the inline case is copy 3 machine words verbatim. This patch: * marks Clone::clone #[inline] so the trivial inline path lands at every call site without a function call; * inline path is now a direct struct copy: the union's field plus the original signed ; * heap path is split into a #[cold] + #[inline(never)] helper so the inline fast path stays tiny everywhere it's instantiated. (Without the explicit #[inline(never)] LLVM happily inlines it back and the heap case actually gets slightly slower in benches — apparently the inlined-in-clone shape compiles worse than going through a normal function call. Tracked the experiment in the commit log.) Results against the new integer/benches/hegel_shrinker baseline: shrinker_consider/64 -29.1 % shrinker_consider/16 -25.5 % shrinker_consider/4 -12.7 % ibig_clamp/two_word -42.6 % (clamp uses clone heavily) ibig_clamp/one_word -40.2 % ibig_drop/two_word -35.3 % (incidental: tighter assignment ibig_drop/one_word -34.6 % pattern in the surrounding bench ibig_drop/zero -34.2 % loop exposes a folding win) ibig_clone/two_word -21.4 % ibig_clone/one_word -20.3 % ibig_clone/zero -20.0 % ibig_clone/just_over_inline +7.7 % (heap path: extra function call) ibig_clone/mid +12.7 % ibig_clone/large ~0 % (memcpy-bound, call overhead lost in the noise) The heap regression is the deliberate trade-off: hegel's shrinker uses ≤128-bit values almost exclusively, so the inline win dwarfs it. Any downstream consumer that clones large IBig values heavily would hit the +13 % on mid-sized values; if that ever shows up as a hot spot we can revisit the inline/outline split.

Same shape as the clone() optimisation: marks clone_from #[inline] and rewrites the inline-source fast path to avoid the sign_capacity unpack (it only needs to know abs(capacity) <= 2 to decide which path to take, and can copy the signed capacity verbatim). The sign-aware extraction is still done in the heap path for the reallocation-and-copy logic. Not shown in the hegel_shrinker benches because Vec::clone() goes through T::clone() per element, not T::clone_from(); the bench file does not have an explicit clone_from-driven scenario. The change is for symmetry / downstream consumers that hit *dest = src.clone() patterns the compiler rewrites to clone_from. Tests pass (lib + hegel-tests stateful).

Two changes to Repr::clone_heap, both motivated by hegel_shrinker bench data showing the just_over_inline / mid heap clones regressed +8 to +13% vs the prior baseline (commit 583ee7e). 1. Skip the Buffer scaffolding and allocate directly. The old body went through Buffer::allocate (which calls allocate_exact which calls allocate_raw, with two near-duplicate size-bound asserts along the way), then push_slice (one more capacity-vs-len assert), then mem::transmute, then a conditional sign flip on the resulting Repr. The new body computes default_capacity inline, calls alloc::alloc::alloc once via from_size_align_unchecked, ptr::copy_n's, and constructs the Repr with the signed capacity set from the start. That eliminates the redundant checks plus a __rust_no_alloc_shim_is_unstable_v2 bl that the Buffer wrapper was pulling in. 2. Drop the #[cold] annotation. The shrinker_consider/64 workload bench (the most realistic top-level scenario from hegel_shrinker.rs) does ~3 pp better without #[cold]. The isolated ibig_clone microbench does ~3 pp worse — i.e. the trade-off is in code-layout, not in clone work itself. The realistic compound workload is what matters. Bench impact vs the 'current' baseline (saved before any Repr::clone changes — so this is the cumulative result of both this commit and the prior fast-path inline): ibig_clone/zero -16.6 % (vs -20.0 % before) ibig_clone/one_word -15.2 % ibig_clone/two_word -17.9 % ibig_clone/just_over_inline +2.0 % (was +7.7 %) ibig_clone/mid +0.7 % (was +12.7 %) ibig_clone/large -5.1 % (was ~0 %) shrinker_consider/64 -32.3 % (was -29.1 %) shrinker_consider/16 -27.4 % shrinker_consider/4 -13.5 % The heap regression is essentially gone: just_over_inline and mid are in the noise, large is actively faster than the original code by 5 %. The inline microbench gave up ~3 pp vs the prior commit, but that's a code-layout side effect of dropping #[cold] and is more than compensated for in the realistic shrinker workload.

Adds tests/clone.rs covering clone and clone_from across every inline/heap (dst, src) combination — including the heap-dst + inline-src case that frees the old buffer, and inline-dst + heap-src that allocates. Also fixes two lints the cherry-picked clone code tripped under upstream CI: the manual range check (clippy::manual_range_contains) and the if-else formatting in clone_from. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cmpute · 2026-06-03T02:10:27Z

+    /// bound checks and the sign-flip dance. Kept `#[inline(never)]` so the
+    /// inline fast path in `clone` stays small.
+    #[inline(never)]
+    fn clone_heap(&self) -> Self {


This function indeed combines buffer allocate + push_slice + transmute into a single function. But there is a Buffer::allocate_exact function existing, I insist to use Buffer's function for allocation, in case in future we plan to improve the allocators used for Buffer. Don't directly call alloc here.

For skipping the sign check, I think it's fair to flatten it out, but it can still be achieved by create the Buffer first, and then access and flip the capacity field if necessary.

cmpute · 2026-06-03T02:13:33Z

    fn clone_from(&mut self, src: &Self) {
-        let (src_cap, src_sign) = src.sign_capacity();
-        let (cap, _) = self.sign_capacity();
+        let src_cap_raw = src.capacity.get();


I don't think there is a difference before and after your change on this function. The only change that possibly makes a difference is adding the inline attribute. If adding the attribute makes an improvement, then we can just add that.

DRMacIver and others added 5 commits June 1, 2026 11:46

Trim verbose and profiling-context comments in clone path

59f6344

DRMacIver marked this pull request as ready for review June 2, 2026 09:17

cmpute requested changes Jun 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimise Repr::clone and clone_from for the inline case#69

Optimise Repr::clone and clone_from for the inline case#69
DRMacIver wants to merge 5 commits into
cmpute:masterfrom
DRMacIver:opt/p6-clone-fast-path

DRMacIver commented Jun 2, 2026 •

edited

Loading

Uh oh!

cmpute Jun 3, 2026

Uh oh!

cmpute Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

DRMacIver commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmpute Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

cmpute Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

DRMacIver commented Jun 2, 2026 •

edited

Loading