Int overflow check with a faster approach #394

Open
lemire wants to merge 3 commits into main from int-overflow-simdjson-approach

lemire commented Jun 14, 2026

Member

PR #393 by @sahvx655-wq found a bug in the overflow check when parsing integers. However, the corrected check re-parses the digits and is more expensive than it needs to be. This updated PR replaces it with a cheaper, constant-time check.

sahvx655-wq and others added 3 commits June 13, 2026 21:21
The previous commit detects multi-wrap u64 overflow at the max_digits
boundary by re-parsing the digits through a checked multiply-add loop
(O(max_digits)). Replace that with the constant-time check used in
simdjson: the leading digit plus a single threshold comparison.

Let ms = min_safe_u64(base) == base^(max_digits-1). For a
max_digits-length value, ms is the smallest such value and also the
width of each leading-digit band [d*ms, (d+1)*ms). Since that width is
< 2^64, the only band that can straddle 2^64 is d == dmax (the largest
leading digit with dmax*ms <= UINT64_MAX), and it straddles at most
once: a value in that band that wraps modulo 2^64 lands below dmax*ms,
while one that does not wrap stays at or above it. A single comparison
against the threshold dmax*ms therefore separates wrapped from
non-wrapped values. A leading digit above dmax always overflows; one
below dmax always fits. dmax and the threshold derive from the existing
min_safe_u64 table, so no new tables are needed and dmax*ms cannot
itself overflow.
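
The argument above can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not the code in this PR: `BaseLimits`, `limits_for`, `digit_value` and `overflows_u64` are hypothetical names, the per-base limits are recomputed in a loop rather than read from the library's table, and the input is assumed to contain only valid digits for the base.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the library's per-base table.
struct BaseLimits {
  int max_digits;          // digit count of UINT64_MAX written in this base
  std::uint64_t min_safe;  // base^(max_digits-1): smallest max_digits-length value
};

BaseLimits limits_for(unsigned base) {
  assert(base >= 2 && base <= 36);
  BaseLimits lim{1, 1};
  // Keep multiplying while the next power of `base` still fits in 64 bits.
  while (lim.min_safe <= UINT64_MAX / base) {
    lim.min_safe *= base;
    ++lim.max_digits;
  }
  return lim;
}

// Assumes `c` is a valid digit character (0-9, a-z, A-Z).
unsigned digit_value(char c) {
  if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
  if (c >= 'a' && c <= 'z') return static_cast<unsigned>(c - 'a') + 10;
  return static_cast<unsigned>(c - 'A') + 10;
}

// Returns true if the digit string does not fit in a uint64_t.
bool overflows_u64(const std::string& digits, unsigned base) {
  const BaseLimits lim = limits_for(base);
  const std::size_t start = digits.find_first_not_of('0');
  if (start == std::string::npos) return false;  // empty or all zeros
  const std::size_t count = digits.size() - start;
  if (count < static_cast<std::size_t>(lim.max_digits)) return false;
  if (count > static_cast<std::size_t>(lim.max_digits)) return true;

  // Exactly max_digits significant digits: accumulate with wrapping
  // (mod 2^64) arithmetic, as a fast parser would already have done.
  std::uint64_t wrapped = 0;
  for (std::size_t i = start; i < digits.size(); ++i) {
    wrapped = wrapped * base + digit_value(digits[i]);
  }

  const std::uint64_t lead = digit_value(digits[start]);
  const std::uint64_t dmax = UINT64_MAX / lim.min_safe;  // dmax*ms cannot overflow
  if (lead != dmax) return lead > dmax;  // above dmax: overflow; below: fits
  // lead == dmax: a wrapped result falls below the threshold dmax*ms.
  return wrapped < dmax * lim.min_safe;
}
```

In base 10, for example, `max_digits` is 20, `min_safe` is 10^19 and `dmax` is 1, so a 20-digit input starting with `1` overflows exactly when the wrapped result is below 10^19.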

Add a programmatic, self-verifying test for parse_int_string overflow
detection covering bases 2..36, complementing the hand-picked strings
added earlier. Every generated input is cross-checked against an
independent trusted oracle (a plain 64-bit checked multiply-add); on
success the parsed value is also compared exactly and full consumption
of the input is asserted.
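
A trusted oracle of the kind described above might look like the following sketch. The names `oracle_parse_u64` and `digit_value` are hypothetical and are not taken from the test file in this PR; the point is only that each multiply-add step is checked before it is performed, so the accumulator never wraps.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumes `c` is a valid digit character (0-9, a-z, A-Z).
unsigned digit_value(char c) {
  if (c >= '0' && c <= '9') return static_cast<unsigned>(c - '0');
  if (c >= 'a' && c <= 'z') return static_cast<unsigned>(c - 'a') + 10;
  return static_cast<unsigned>(c - 'A') + 10;
}

// Parses `digits` in the given base with a checked multiply-add per digit.
// Returns false on overflow; otherwise stores the exact value in `out`.
bool oracle_parse_u64(const std::string& digits, unsigned base,
                      std::uint64_t& out) {
  assert(base >= 2 && base <= 36);
  std::uint64_t value = 0;
  for (char c : digits) {
    const unsigned d = digit_value(c);
    assert(d < base);
    // value*base + d <= UINT64_MAX  <=>  value <= (UINT64_MAX - d) / base
    if (value > (UINT64_MAX - d) / base) return false;
    value = value * base + d;
  }
  out = value;
  return true;
}
```

This loop costs O(number of digits), which is acceptable for a test oracle even though it is the cost the constant-time check is meant to avoid in the parser itself.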

Per base it exercises:
  - an exact-boundary sweep of the 64 values straddling 2^64
    (UINT64_MAX-31 .. 2^64+31), built by walking the digit string;
  - UINT64_MAX, 2^64 and the all-max-digit value, each also with
    leading zeros;
  - random max_digits-length values across every leading digit, with
    the heaviest sampling on the lead == dmax band that straddles 2^64,
    and full coverage of lead > dmax (the multi-wrap region the naive
    min_safe check accepted by mistake);
  - max_digits-1 (never overflows) and max_digits+1 (always overflows).
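
One way to build the exact-boundary sweep without 128-bit arithmetic is to format UINT64_MAX-31 in the target base and then increment the digit string itself. The sketch below assumes lower-case digits `0-9a-z`; `digit_char`, `to_base`, `increment` and `boundary_sweep` are hypothetical helper names, not the ones used in the PR's test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Maps a digit value 0..35 to '0'..'9', 'a'..'z'.
char digit_char(unsigned d) {
  assert(d < 36);
  return static_cast<char>(d < 10 ? '0' + d : 'a' + (d - 10));
}

// Formats a 64-bit value in the given base (no leading zeros).
std::string to_base(std::uint64_t v, unsigned base) {
  assert(base >= 2 && base <= 36);
  std::string s;
  do {
    s.push_back(digit_char(static_cast<unsigned>(v % base)));
    v /= base;
  } while (v != 0);
  std::reverse(s.begin(), s.end());
  return s;
}

// Adds one to a digit string in place, growing it on carry-out.
void increment(std::string& s, unsigned base) {
  const char top = digit_char(base - 1);
  for (std::size_t i = s.size(); i-- > 0;) {
    if (s[i] != top) {
      s[i] = (s[i] == '9') ? 'a' : static_cast<char>(s[i] + 1);
      return;
    }
    s[i] = '0';  // digit rolls over, carry into the next position
  }
  s.insert(s.begin(), '1');
}

// The 64 consecutive values UINT64_MAX-31 .. 2^64+31 as digit strings.
// Entries 0..31 fit in a uint64_t; entries 32..63 must be rejected.
std::vector<std::string> boundary_sweep(unsigned base) {
  std::vector<std::string> out;
  std::string s = to_base(UINT64_MAX - 31, base);
  for (int i = 0; i < 64; ++i) {
    out.push_back(s);
    increment(s, base);
  }
  return out;
}
```

Because the increment works on digits, the sweep crosses 2^64 without ever holding an out-of-range value in an integer.
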
A small signed (int64_t) section checks the exact INT64_MIN/INT64_MAX
limits round-trip and that INT64_MAX+1 / INT64_MIN-1 are rejected in
every base.
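
The signed limits being checked can be stated as a rule on sign and magnitude: a non-negative value fits when its magnitude is at most INT64_MAX, and a negative value fits when its magnitude is at most INT64_MAX+1. The following is a minimal sketch of that rule; `fits_int64` is a hypothetical helper and does not claim to reflect how the library structures its signed path.

```cpp
#include <cstdint>

// Returns true and stores the signed value if (negative, magnitude)
// is representable as an int64_t; returns false otherwise.
bool fits_int64(bool negative, std::uint64_t magnitude, std::int64_t& out) {
  // INT64_MAX for non-negative values, INT64_MAX + 1 (== 2^63) for negatives.
  const std::uint64_t max_magnitude =
      static_cast<std::uint64_t>(INT64_MAX) + (negative ? 1u : 0u);
  if (magnitude > max_magnitude) return false;
  if (!negative) {
    out = static_cast<std::int64_t>(magnitude);
  } else if (magnitude == max_magnitude) {
    out = INT64_MIN;  // 2^63 cannot be negated as a positive int64_t
  } else {
    out = -static_cast<std::int64_t>(magnitude);
  }
  return true;
}
```

The special case for INT64_MIN avoids converting 2^63 to a signed type, which is not portable before C++20.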