Skip to content

Optimise Repr::clone and clone_from for the inline case#69

Open
DRMacIver wants to merge 5 commits into
cmpute:masterfrom
DRMacIver:opt/p6-clone-fast-path
Open

Optimise Repr::clone and clone_from for the inline case#69
DRMacIver wants to merge 5 commits into
cmpute:masterfrom
DRMacIver:opt/p6-clone-fast-path

Conversation

@DRMacIver

@DRMacIver DRMacIver commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Optimises clones for inline-sized values by copying the union and signed capacity directly, with the heap path split into a separate clone_heap helper; clone_from gets the matching fast path. About a 20% improvement in one of my benchmarks (clone followed by drop, heavily hitting the inline path).

Benchmarks (layout-sampled, K=11)

Wall-clock, dashu-only, vs the merge base, measured under 11 randomised function layouts (ld64 -order_file) and reported as the median — to average out the ±10% code-placement noise that otherwise swamps these small-value micro-benchmarks. The untouched ubig_mul control reads 0.0% ± 0.0%, confirming a clean run.

Inline-value clone is faster; heap clone is unchanged. (ibig_drop clones-then-drops each iteration, so it reflects the inline clone speed-up.)

Benchmark Δ
ibig_drop inline (clone+drop) −20%
ubig_clone / ibig_clone inline ~−5%
*_clone / *_drop heap (large/mid) unchanged
ubig_mul (control) 0.0%

DRMacIver and others added 5 commits June 1, 2026 11:46
The callgrind profile of hegel-rust's shrinker (per the hegel_shrinker
bench file's comments) reports Repr::clone at 13.24 % of total Ir, with
values almost exclusively in the i64 / i128 range — i.e. the inline
variant of Repr (capacity in {-2,-1,1,2}).

The old implementation went through sign_capacity to extract the
absolute capacity + Sign, then constructed a positive Repr, then
re-applied the sign via with_sign. That's a redundant round-trip when
all we need to do for the inline case is copy 3 machine words verbatim.

This patch:

  * marks Clone::clone #[inline] so the trivial inline path lands at
    every call site without a function call;
  * inline path is now a direct struct copy: the union's
    field plus the original signed ;
  * heap path is split into a #[cold] + #[inline(never)] helper so
    the inline fast path stays tiny everywhere it's instantiated.
    (Without the explicit #[inline(never)] LLVM happily inlines it
    back and the heap case actually gets slightly slower in
    benches — apparently the inlined-in-clone shape compiles worse
    than going through a normal function call. Tracked the
    experiment in the commit log.)

Results against the new integer/benches/hegel_shrinker baseline:

  shrinker_consider/64       -29.1 %
  shrinker_consider/16       -25.5 %
  shrinker_consider/4        -12.7 %
  ibig_clamp/two_word        -42.6 %  (clamp uses clone heavily)
  ibig_clamp/one_word        -40.2 %
  ibig_drop/two_word         -35.3 %  (incidental: tighter assignment
  ibig_drop/one_word         -34.6 %   pattern in the surrounding bench
  ibig_drop/zero             -34.2 %   loop exposes a folding win)
  ibig_clone/two_word        -21.4 %
  ibig_clone/one_word        -20.3 %
  ibig_clone/zero            -20.0 %
  ibig_clone/just_over_inline +7.7 %  (heap path: extra function call)
  ibig_clone/mid             +12.7 %
  ibig_clone/large            ~0   %  (memcpy-bound, call overhead lost
                                       in the noise)

The heap regression is the deliberate trade-off: hegel's shrinker uses
≤128-bit values almost exclusively, so the inline win dwarfs it. Any
downstream consumer that clones large IBig values heavily would hit
the +13 % on mid-sized values; if that ever shows up as a hot spot
we can revisit the inline/outline split.
Same shape as the clone() optimisation: marks clone_from #[inline] and
rewrites the inline-source fast path to avoid the sign_capacity unpack
(it only needs to know abs(capacity) <= 2 to decide which path to take,
and can copy the signed capacity verbatim). The sign-aware extraction
is still done in the heap path for the reallocation-and-copy logic.

Not shown in the hegel_shrinker benches because Vec::clone() goes
through T::clone() per element, not T::clone_from(); the bench file
does not have an explicit clone_from-driven scenario. The change is
for symmetry / downstream consumers that hit *dest = src.clone()
patterns the compiler rewrites to clone_from.

Tests pass (lib + hegel-tests stateful).
Two changes to Repr::clone_heap, both motivated by hegel_shrinker bench
data showing the just_over_inline / mid heap clones regressed +8 to +13%
vs the prior baseline (commit 583ee7e).

1. Skip the Buffer scaffolding and allocate directly.

   The old body went through Buffer::allocate (which calls allocate_exact
   which calls allocate_raw, with two near-duplicate size-bound asserts
   along the way), then push_slice (one more capacity-vs-len assert),
   then mem::transmute, then a conditional sign flip on the resulting
   Repr. The new body computes default_capacity inline, calls
   alloc::alloc::alloc once via from_size_align_unchecked, ptr::copy_n's,
   and constructs the Repr with the signed capacity set from the start.
   That eliminates the redundant checks plus a __rust_no_alloc_shim_is_unstable_v2
   bl that the Buffer wrapper was pulling in.

2. Drop the #[cold] annotation.

   The shrinker_consider/64 workload bench (the most realistic top-level
   scenario from hegel_shrinker.rs) does ~3 pp better without #[cold].
   The isolated ibig_clone microbench does ~3 pp worse — i.e. the
   trade-off is in code-layout, not in clone work itself. The realistic
   compound workload is what matters.

Bench impact vs the 'current' baseline (saved before any Repr::clone
changes — so this is the cumulative result of both this commit and the
prior fast-path inline):

  ibig_clone/zero            -16.6 %  (vs -20.0 % before)
  ibig_clone/one_word        -15.2 %
  ibig_clone/two_word        -17.9 %
  ibig_clone/just_over_inline +2.0 %  (was +7.7 %)
  ibig_clone/mid             +0.7 %   (was +12.7 %)
  ibig_clone/large           -5.1 %   (was ~0 %)
  shrinker_consider/64       -32.3 %  (was -29.1 %)
  shrinker_consider/16       -27.4 %
  shrinker_consider/4        -13.5 %

The heap regression is essentially gone: just_over_inline and mid are
in the noise, large is actively faster than the original code by 5 %.
The inline microbench gave up ~3 pp vs the prior commit, but that's a
code-layout side effect of dropping #[cold] and is more than
compensated for in the realistic shrinker workload.
Adds tests/clone.rs covering clone and clone_from across every inline/heap
(dst, src) combination — including the heap-dst + inline-src case that frees
the old buffer, and inline-dst + heap-src that allocates.

Also fixes two lints the cherry-picked clone code tripped under upstream CI:
the manual range check (clippy::manual_range_contains) and the if-else
formatting in clone_from.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DRMacIver DRMacIver marked this pull request as ready for review June 2, 2026 09:17
Comment thread integer/src/repr.rs
/// bound checks and the sign-flip dance. Kept `#[inline(never)]` so the
/// inline fast path in `clone` stays small.
#[inline(never)]
fn clone_heap(&self) -> Self {

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function indeed combines buffer allocate + push_slice + transmute into a single function. But there is a Buffer::allocate_exact function existing, I insist to use Buffer's function for allocation, in case in future we plan to improve the allocators used for Buffer. Don't directly call alloc here.

For skipping the sign check, I think it's fair to flatten it out, but it can still be achieved by create the Buffer first, and then access and flip the capacity field if necessary.

Comment thread integer/src/repr.rs
fn clone_from(&mut self, src: &Self) {
let (src_cap, src_sign) = src.sign_capacity();
let (cap, _) = self.sign_capacity();
let src_cap_raw = src.capacity.get();

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a difference before and after your change on this function. The only change that possibly makes a difference is adding the inline attribute. If adding the attribute makes an improvement, then we can just add that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants