Skip to content

refactor: SVS shared thread pool with rental model (MOD-9881)#925

Open
meiravgri wants to merge 23 commits intomainfrom
meiravg_adjustable_thpool
Open

refactor: SVS shared thread pool with rental model (MOD-9881)#925
meiravgri wants to merge 23 commits intomainfrom
meiravg_adjustable_thpool

Conversation

@meiravgri
Copy link
Copy Markdown
Collaborator

@meiravgri meiravgri commented Mar 31, 2026

Overview

Replaces per-index SVS thread pools with a single shared thread pool that can be physically resized at runtime, and introduces a rental-based execution model for parallel_for. This eliminates over-provisioning (previously K indexes × N threads = K×N OS threads) and enables the SVS pool to actually grow/shrink when the external pool capacity changes — something the old design could not do.

Also adds a new C API VecSim_UpdateThreadPoolSize() that combines write-mode toggling with pool resizing, and deprecates SVSParams::num_threads.

Key Guarantees and Assumptions

The fundamental scheduling invariant

The shared SVS thread pool has the same size as the RediSearch worker thread pool (search-workers). Every SVS multi-threaded operation (update, GC) is dispatched via SVSMultiThreadJob, which submits N reserve jobs to the RediSearch worker pool — one per worker thread. Only the threads that actually check in (are not busy with other work) participate in the SVS operation. This means:

  • The sum of all concurrent rent() requests can never exceed the pool capacity, because each rented SVS thread corresponds to a RediSearch worker that checked in via a reserve job. If a worker is busy, it doesn't check in, so it's never counted in availableThreads and never requested from the SVS pool.
  • setParallelism(n) is always called with n = min(availableThreads, problemSize), where availableThreads ≤ poolSize(). This is why the assertion n <= pool->size() is safe.
  • Shrink while threads are rented is safe by design — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and its OS thread) alive until ~RentedThreads releases it. Resize never blocks on in-flight operations.

Other invariants

  • Pool is always valid — the shared pool singleton is initialized with size 1 on first access (VecSimSVSThreadPoolImpl::instance()). There is no nullptr state; write-in-place mode uses a pool of size 1 (0 worker threads, calling thread only).
  • parallelism_ >= 1 — the calling thread always participates, so parallelism can never be 0.
  • Rental is safe during resize — rented ThreadSlots are held via shared_ptr; shrinking drops the pool's reference but the renter keeps the slot (and OS thread) alive until ~RentedThreads.
  • No cross-index state leakage — each index has its own shared_ptr<atomic<size_t>> parallelism_; setParallelism() on one index never affects another.

Relationship with RediSearch

  • RediSearch owns the scheduling — it decides when to dispatch SVS jobs and how many workers are available.
  • VecSim owns the threads — the SVS thread pool is a separate set of OS threads inside VecSim, distinct from RediSearch workers.
  • RediSearch's only responsibility toward the SVS pool is calling VecSim_UpdateThreadPoolSize(N) when the worker count changes, so VecSim can create/destroy the matching number of OS threads.
  • The two pools are always the same size: RediSearch workers = N, SVS threads = N. This 1:1 sizing is what makes the "sum of rents ≤ capacity" guarantee work.

Changes from Old Implementation

Aspect Old (per-index pools) New (shared pool + rental)
Pool ownership Each SVS index creates its own VecSimSVSThreadPoolImpl Singleton VecSimSVSThreadPoolImpl shared across all indexes via shared_ptr
Thread lifecycle Pre-allocated at max capacity; resize() clamped a logical counter Physical grow/shrink: resize() spawns/destroys OS threads
Pool could be null Yes (write-in-place = no pool) Never null; write-in-place = pool with size 1
parallel_for threading Threads used directly from the per-index pool Threads rented from shared pool under mutex, released lock-free via atomic<bool>
Concurrent index operations Each index had dedicated threads (K×N total) Indexes rent disjoint subsets from N shared threads
Per-index parallelism control setNumThreads(n) / resize(n) — mutated the pool setParallelism(n) — sets a per-index atomic<size_t>, pool unchanged
Pool size query capacity() (max pre-allocated) poolSize() (current shared pool size)
Error handling manage_exception_during_run() threw unconditionally manage_workers_after_run() only throws if an error occurred

API Changes

Renamed methods

Old New Scope
setNumThreads(n) setParallelism(n) SVSIndex, SVSIndexBase virtual interface
getNumThreads() getParallelism() SVSIndex, SVSIndexBase virtual interface
getThreadPoolCapacity() getPoolSize() SVSIndex, SVSIndexBase virtual interface

Removed methods

Method Reason
VecSimSVSThreadPool::capacity() No pre-allocation; use poolSize() for shared pool size
VecSimSVSThreadPool::resize() Wrappers don't own threads; use setParallelism() for per-index control, VecSimSVSThreadPool::resize(size_t) (static) for global pool

New C API

// Combines write-mode toggling + physical SVS pool resize.
// new_size == 0 → WriteInPlace mode, pool resized to 1
// new_size >  0 → WriteAsync mode, pool resized to new_size
void VecSim_UpdateThreadPoolSize(size_t new_size);

New internal APIs

API Purpose
VecSimSVSThreadPoolImpl::instance() Static singleton accessor; returns shared_ptr<VecSimSVSThreadPoolImpl> by value
VecSimSVSThreadPool(void* log_ctx) Constructor — uses shared pool singleton internally
VecSimSVSThreadPool::resize(size_t) Static method — resizes the shared pool singleton
VecSimSVSThreadPool::poolSize() Returns shared pool size (for scheduling reserve jobs)
VecSimSVSThreadPoolImpl::rent(count, log_ctx) Rents worker threads with RAII guard (RentedThreads)

Deprecated

API Replacement Notes
VecSim_SetWriteMode() VecSim_UpdateThreadPoolSize() Kept for internal VecSim usage (called inside VecSim_UpdateThreadPoolSize) and unit tests. Not called by RediSearch — RediSearch will use only VecSim_UpdateThreadPoolSize
SVSParams::num_threads Shared pool singleton Deprecated and ignored. Setting it to a non-zero value emits a deprecation warning log. Kept for backward compatibility with existing callers/tests. Pool size is controlled globally via VecSim_UpdateThreadPoolSize()

New types

  • ThreadSlot — wraps svs::threads::Thread + atomic<bool> occupied; non-copyable, non-movable
  • RentedThreads — RAII guard; move-only; releases slots via lock-free atomic stores in destructor
  • Deferred resize lifecycle — managed via a bool isScheduled flag on SVSMultiThreadJob; its destructor calls endScheduledJob() on the shared pool (no separate ScheduledJobToken class)

Behavioral Changes

Area Old behavior New behavior
setParallelism(0) resize(0) silently clamped to 1 Asserts (debug) / undefined (release) — reserving 0 threads is a bug
setParallelism(n > poolSize) Grew the per-index pool beyond search-workers Asserts n <= pool->size() — scheduling bug
n > parallelism in parallel_for Threw ThreadingException if n > size_ Allowed (no assertion) — supports deferred resize scenarios where the pool may have shrunk after job scheduling
parallelism_ initial value Implicitly matched pool size at construction Starts at 1 — safe for immediate write-in-place use
Fewer threads rented than requested N/A (no rental) Logs a warning via logCallback, asserts in debug, uses rented.count() for graceful degradation in release
Exception in parallel_for manage_exception_during_run() always threw manage_workers_after_run() only throws if an error actually occurred
updateSVSIndexWrapper Always locked backend even if no vectors to move Skips lock when labels_to_move is empty
SVSParams::num_threads set to non-zero Used as per-index pool size Ignored; logs a deprecation warning

Public info API fields — unchanged

The debug info field names numThreads and lastReservedThreads (and their string keys NUM_THREADS, LAST_RESERVED_NUM_THREADS) are unchanged. Their semantics shift slightly:

  • numThreads → now reports getPoolSize() (shared pool size, not per-index capacity)
  • lastReservedThreads → now reports getParallelism() (per-index parallelism for last operation)

What Does NOT Change

  • ScalableVectorSearch (deps/ScalableVectorSearch/) — no modifications
  • SVSMultiThreadJob pattern — still submits N reserve jobs to the external worker pool
  • HNSW tiered indexes — unaffected
  • SVS Thread state machine — used as-is; lifecycle managed via shared_ptr

Files Changed

File Changes
src/VecSim/algorithms/svs/svs_utils.h Core refactor: ThreadSlot, RentedThreads, rental-based VecSimSVSThreadPoolImpl, per-index VecSimSVSThreadPool wrapper
src/VecSim/algorithms/svs/svs.h Singleton accessor, API renames, constructor takes shared pool, num_threads deprecation warning
src/VecSim/algorithms/svs/svs_tiered.h API renames, empty-labels guard in updateSVSIndexWrapper, SVSMultiThreadJob uses bool isScheduled + destructor-based endScheduledJob() instead of ScheduledJobToken, createJobsImpl merged into createJobs
src/VecSim/vec_sim.h / vec_sim.cpp New VecSim_UpdateThreadPoolSize() C API
src/VecSim/vec_sim_common.h SVSParams::num_threads marked as deprecated in comments
tests/unit/test_svs_tiered.cpp Adapted to shared pool: VecSim_UpdateThreadPoolSize in setup, removed num_threads from params, updated assertions
tests/unit/test_svs_fp16.cpp Same test adaptations
tests/unit/test_svs.cpp numThreads info field expects default; new NumThreadsParamIgnored test
tests/benchmark/bm_utils.h / bm_vecsim_svs.h API renames in benchmark helpers

Testing

  • All existing SVS tiered tests updated to use the shared pool model
  • testThreadPool rewritten to cover: default parallelism, setParallelism boundary assertions (ASSERT_DEATH), parallel_for with n < parallelism, write-in-place mode, exception handling
  • TestDebugInfoThreadCount / TestDebugInfoThreadCountWriteInPlace updated: lastReservedThreads now starts at 1 (not num_threads)
  • ThreadsReservation test uses tieredIndexMock(num_threads) to properly size the shared pool
  • New: NumThreadsParamIgnored — verifies that setting SVSParams::num_threads emits a deprecation warning, does not affect the shared pool size, and that omitting it produces no warning
  • Tests that previously set num_threads = 1 in SVSParams now use tieredIndexMock(1) to resize the shared pool instead

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

Note

High Risk
High risk because it rewrites SVS threading and job scheduling, introducing shared mutable concurrency primitives (rental, deferred shrink, per-index parallelism) that can affect correctness and stability under load.

Overview
SVS threading is refactored to a shared, physically resizable thread pool with a rental-based parallel_for, allowing multiple indexes/jobs to concurrently rent disjoint worker threads and making shrink/grow safe during active operations.

The SVS public/virtual threading knobs are renamed from threads to parallelism (e.g., get/setParallelism, getPoolSize), SVSParams::num_threads is deprecated/ignored (with warning log when set), and SVS debug info now reports shared pool size as numThreads and last-used per-index parallelism as lastReservedThreads.

Adds VecSim_UpdateThreadPoolSize() C API to toggle write mode (0 → in-place, >0 → async) while resizing the shared SVS pool, updates tiered SVS update/GC scheduling to snapshot pool size and defer pool shrink until scheduled jobs complete, and expands/adjusts benchmarks and unit tests (including new test_svs_threadpool.cpp) to cover resize/rental/deferred-shrink behavior and the deprecated param handling.

Reviewed by Cursor Bugbot for commit df969a1. Bugbot is set up for automated code reviews on this repo. Configure here.

…l terminology**

Rename `setNumThreads`/`getNumThreads` → `setParallelism`/`getParallelism` and `getThreadPoolCapacity` → `getPoolSize` across VectorSimilarity. Public info API fields (`numThreads`, `lastReservedThreads`, `NUM_THREADS`, `LAST_RESERVED_NUM_THREADS`) remain unchanged. No behavioral changes.
@jit-ci
Copy link
Copy Markdown

jit-ci bot commented Mar 31, 2026

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 31, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.91%. Comparing base (f177c40) to head (df969a1).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/VecSim/algorithms/svs/svs_utils.h 92.85% 7 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #925      +/-   ##
==========================================
- Coverage   96.98%   96.91%   -0.08%     
==========================================
  Files         129      129              
  Lines        7574     7651      +77     
==========================================
+ Hits         7346     7415      +69     
- Misses        228      236       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

thpool tests
fix fp16 tests
@meiravgri meiravgri changed the title SVS Shared Thread Pool Refactor (MOD-9881) refactor: SVS shared thread pool with rental model (MOD-9881) Apr 7, 2026
rfsaliev
rfsaliev previously approved these changes Apr 8, 2026
Copy link
Copy Markdown
Collaborator

@rfsaliev rfsaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGFM, but some suggestions.

rfsaliev
rfsaliev previously approved these changes Apr 9, 2026
Copy link
Copy Markdown
Collaborator

@rfsaliev rfsaliev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGFM

@meiravgri meiravgri enabled auto-merge April 14, 2026 15:27
@meiravgri meiravgri disabled auto-merge April 14, 2026 15:27
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit df969a1. Configure here.

size_t num_threads = VecSimSVSThreadPoolImpl::instance()->beginScheduledJob();
return createJobs(allocator, jobType, callback, index, num_threads, threads_wait_timeout,
registry, /*scheduled=*/true);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaked pending_jobs_ counter if createJobs allocation fails

Medium Severity

createScheduledJobs calls beginScheduledJob() (incrementing pending_jobs_) before createJobs. If createJobs throws (e.g., std::bad_alloc from vector or new (allocator) for job objects), no SVSMultiThreadJob is created, so endScheduledJob() is never called. The pending_jobs_ counter stays permanently elevated, causing all future pool shrink operations to be deferred indefinitely — the pool can never shrink again for the lifetime of the process.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit df969a1. Configure here.

// Resize the shared SVS thread pool. resize() clamps to minimum size 1.
// new_size == 0 → pool size 1 (only calling thread, write-in-place).
// new_size > 0 → pool size new_size (new_size - 1 worker threads).
VecSimSVSThreadPool::resize(std::max(new_size, size_t{1}));
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Redundant std::max clamping before resize call

Low Severity

VecSim_UpdateThreadPoolSize applies std::max(new_size, size_t{1}) before calling VecSimSVSThreadPool::resize(), but VecSimSVSThreadPoolImpl::resize() already applies the identical clamp internally. The outer clamp is redundant and slightly obscures which layer is responsible for the minimum-size-1 invariant.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit df969a1. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants