Support IVF-RaBitQ in cuVS Library#1866

Open
Stardust-SJF wants to merge 143 commits into rapidsai:main from Stardust-SJF:cuvs_ivf_rabitq

Conversation

@Stardust-SJF

This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It achieves a strong recall–throughput trade-off while offering fast index builds and a small storage footprint.

jamxia155 and others added 30 commits November 3, 2025 08:03
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work
is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
* Fix cuVS build issues with RaBitQ

* Align line formatting && Delete unused variables in robust_prune.cuh
* Download Eigen automatically by rapids-cmake

* Disable FAISS and DISKANN benchmarks

* add config files and update readme

* Update Readme and openai_1M config

* Update python bench command line

* update README

* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
@tfeher
Contributor

tfeher commented Mar 4, 2026

/ok to test 994e951

Contributor

@tfeher left a comment


Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on public API and benchmark wrappers).

Comment thread cpp/bench/ann/src/cuvs/cuvs_benchmark.cu Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/src/neighbors/detail/vamana/robust_prune.cuh
Contributor

@tfeher left a comment


A few more comments for the build method.

Comment on lines +63 to +64
d_dataset_array = raft::make_device_mdarray<T>(
handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
Contributor


We should not copy the whole dataset to the index. I would prefer to process it similarly to how other IVF methods in cuVS work:

  1. Clusters are trained on a subset of the data; only the subset needs to be copied to the GPU.
  2. The dataset is compressed batch-wise; only one batch at a time is copied to the GPU.

Can the compression step be done batch-wise, or do we need to see the whole dataset for that?
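To make the batch-wise flow concrete, here is a minimal host-only sketch under stated assumptions: the function and the toy one-byte-per-row "quantizer" are invented for illustration and are not the cuVS API; in the real pipeline each batch span would be copied host-to-device and run through the RaBitQ quantizer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of batch-wise compression: only one batch of rows
// is "resident" at a time, instead of copying the whole dataset up front.
std::vector<uint8_t> build_batchwise(const std::vector<float>& host_dataset,
                                     int64_t n_rows, int64_t dim,
                                     int64_t batch_rows)
{
  std::vector<uint8_t> codes;
  codes.reserve(n_rows);  // toy quantizer emits 1 byte per row
  for (int64_t start = 0; start < n_rows; start += batch_rows) {
    int64_t rows = std::min(batch_rows, n_rows - start);
    // In the real pipeline this batch would be copied host->device here.
    for (int64_t r = 0; r < rows; ++r) {
      // Toy "quantization": sign bit of the first component of each row.
      codes.push_back(host_dataset[(start + r) * dim] >= 0.f ? 1 : 0);
    }
  }
  return codes;
}
```

The point of the structure is that peak memory is bounded by `batch_rows * dim` regardless of `n_rows`, which is what enables out-of-core index construction.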

Contributor


We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.

In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be accelerated but at least enables building of large datasets.

Author


From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature in the following releases (rather than the initial release)?

At the same time, the IVF-RaBitQ (GPU) has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. For CPU-based index construction, we need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.

Contributor


Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU so it won't serve as a stopgap for out-of-core building.

Author

@Stardust-SJF commented Mar 19, 2026


The older construct method is written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.

Contributor


I pushed a construct_on_gpu_streaming method that streams in batches of vectors from a dataset on host for index construction. Index construction is slower by about 3X, with multithreading used for the host-side gathering step (as tested on a CPU with 24/48 physical/logical cores). However, the subsampling for k-means clustering is much slower (by around 10X) when running from host data. Having said that, I think this bottleneck will be resolved once out-of-core clustering is available.
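As a rough illustration of the host-side gathering step mentioned above (the function name and layout here are invented, not the code from this PR): rows belonging to one batch are scattered through the host dataset, so they are gathered into a contiguous staging buffer before the host-to-device copy, and the gather loop can be parallelized with OpenMP.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of parallel host-side gathering: copy the rows
// listed in row_ids into a contiguous staging buffer. The OpenMP pragma
// is a no-op when compiled without OpenMP support.
void gather_rows(const float* host_dataset, int64_t dim,
                 const std::vector<int64_t>& row_ids, float* staging)
{
#pragma omp parallel for
  for (int64_t i = 0; i < static_cast<int64_t>(row_ids.size()); ++i) {
    std::memcpy(staging + i * dim,
                host_dataset + row_ids[i] * dim,
                dim * sizeof(float));
  }
}
```

Each output row depends only on its own index, so the iterations are independent and the loop parallelizes trivially; throughput is then bounded by host memory bandwidth, which matches the "efficient memory bandwidth utilization" goal in the commit message above.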

Comment thread cpp/src/neighbors/ivf_rabitq.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq.cu Outdated
// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
Contributor


Do we need the else branch? k-means should support different input types.

Contributor


Thanks for pointing this out! Removed unnecessary casting.

jamxia155 and others added 20 commits March 18, 2026 06:13
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
  Implement streaming index construction for IVF-RaBitQ to handle datasets
  that exceed available GPU memory. This enables building indices for
  large datasets by processing data in batches streamed from host memory.

  Key features:
  - Automatic detection based on dataset size vs available workspace
  - Complete-cluster batching strategy (no partial clusters across batches)
  - OpenMP parallel host data gathering with persistent thread pool
  - Contiguous data handling in quantizer for improved performance
  - Configurable batch size via streaming_batch_size parameter

The implementation uses omp_get_max_threads() to scale with available
  hardware while maintaining efficient memory bandwidth utilization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Add an optional parameter to force streaming construction regardless of
  dataset size. This provides users with explicit control over the
  construction method for testing or specific use cases.

  When force_streaming is enabled:
  - Streaming construction is used even if dataset fits in GPU memory
  - Distinct log message indicates explicit vs automatic decision to
  use streaming construction

  Default behavior (force_streaming=false) remains unchanged, with
  automatic detection based on dataset size vs available workspace.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Enable control of the force_streaming parameter through JSON benchmark
  configuration files. Users can now specify force_streaming in the
  build_param section of their benchmark configs.

  Example usage:
    "build_param": {
      "nlist": 10000,
      "force_streaming": true,
      ...
    }

  This allows benchmark configurations to explicitly control streaming
  construction for performance testing and comparison.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Update documentation to note that force_streaming has no effect when
  the dataset is already in device memory, as streaming construction is
  only applicable for host-to-device data transfer.
  Adds build_forced_streaming test case that explicitly enables streaming
  construction even for small datasets that fit in GPU memory. This validates
  the streaming code path with dynamic batch sizing and ensures compatibility
  with serialization/deserialization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.

Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
  ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
  data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)

Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
  * data_transformation_batch_opt()
  * data_transformation_batch_opt_contiguous()
  * rabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables

Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation

Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
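A toy example of the "private method to free function in an anonymous namespace" pattern described in the refactor above. The member and method names here are invented for illustration; the real class keeps its own members, which are likewise passed to the free functions as explicit parameters.

```cpp
namespace {
// Formerly a private method; now a free function with internal linkage,
// receiving the class member it used (scale) as an explicit parameter.
float transform(float value, float scale) { return value * scale; }
}  // namespace

// Illustrative stand-in for the quantizer class: the public header no
// longer needs to declare the helper, hiding the implementation detail.
class ToyQuantizer {
 public:
  explicit ToyQuantizer(float scale) : scale_(scale) {}
  float quantize(float value) const { return transform(value, scale_); }

 private:
  float scale_;
};
```

Because the helpers live in an anonymous namespace inside the .cu file, they disappear from the public header, which is what yields the smaller header and faster compiles listed under "Benefits".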
@tfeher
Contributor

tfeher commented Mar 25, 2026

/ok to test 52793fc


Labels

C++ · feature request (New feature or request) · non-breaking (Introduces a non-breaking change)

Projects

Status: In Progress


5 participants