Support IVF-RaBitQ in cuVS Library#1866

Open
Stardust-SJF wants to merge 143 commits into rapidsai:main from Stardust-SJF:cuvs_ivf_rabitq

Conversation

@Stardust-SJF

This PR introduces IVF-RaBitQ, a GPU-native ANNS solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It achieves a strong recall–throughput trade-off while offering fast index builds and a small storage footprint.

jamxia155 and others added 30 commits November 3, 2025 08:03
- Currently built as a separate library.
- To be merged with existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work
is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of ANN benchmarking suite
- `bits_per_dim` = `ex_bits` + 1
- Also update supported range of `bits_per_dim` to 2-9 inclusive
* Fix cuVS build issues with RaBitQ

* Align line formatting && Delete unused variables in robust_prune.cuh
* Download Eigen automatically by rapids-cmake

* Disable FAISS and DISKANN benchmarks

* add config files and update readme

* Update Readme and openai_1M config

* Update python bench command line

* update README

* update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
@tfeher
Contributor

tfeher commented Mar 4, 2026

/ok to test 994e951

Contributor

@tfeher left a comment


Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on public API and benchmark wrappers).

Comment thread cpp/bench/ann/src/cuvs/cuvs_benchmark.cu Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/bench/ann/src/cuvs/cuvs_ivf_rabitq_wrapper.h Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/include/cuvs/neighbors/ivf_rabitq.hpp Outdated
Comment thread cpp/src/neighbors/detail/vamana/robust_prune.cuh
Contributor

@tfeher left a comment


A few more comments for the build method.

Comment on lines +63 to +64
d_dataset_array = raft::make_device_mdarray<T>(
handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
Contributor


We should not copy the whole dataset to the index. I would prefer to process it similarly to how other IVF methods in cuVS work:

  1. Clusters are trained on a subset of the data; only the subset needs to be copied to the GPU.
  2. The dataset is compressed batch-wise; only one batch at a time is copied to the GPU.

Can the compression step be done batch-wise, or do we need to see the whole dataset for that?
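To make the batch-wise flow concrete, here is a minimal host-only sketch under stated assumptions: the function and the toy one-byte-per-row "quantizer" are invented for illustration and are not the cuVS API; in the real pipeline each batch span would be copied host-to-device and run through the RaBitQ quantizer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch of batch-wise compression: only one batch of rows
// is "resident" at a time, instead of copying the whole dataset up front.
std::vector<uint8_t> build_batchwise(const std::vector<float>& host_dataset,
                                     int64_t n_rows, int64_t dim,
                                     int64_t batch_rows)
{
  std::vector<uint8_t> codes;
  codes.reserve(n_rows);  // toy quantizer emits 1 byte per row
  for (int64_t start = 0; start < n_rows; start += batch_rows) {
    int64_t rows = std::min(batch_rows, n_rows - start);
    // In the real pipeline this batch would be copied host->device here.
    for (int64_t r = 0; r < rows; ++r) {
      // Toy "quantization": sign bit of the first component of each row.
      codes.push_back(host_dataset[(start + r) * dim] >= 0.f ? 1 : 0);
    }
  }
  return codes;
}
```

The point of the structure is that peak memory is bounded by `batch_rows * dim` regardless of `n_rows`, which is what enables out-of-core index construction.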

Contributor


We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.

In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be accelerated but at least enables building of large datasets.

Author


From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature in the following releases (rather than the initial release)?

At the same time, the IVF-RaBitQ (GPU) has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. For CPU-based index construction, we need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.

Contributor


Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU so it won't serve as a stopgap for out-of-core building.

Author

@Stardust-SJF commented Mar 19, 2026


The older construct method is written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.

Contributor


I pushed a construct_on_gpu_streaming method that streams in batches of vectors from a dataset on host for index construction. Index construction is slower by about 3X, with multithreading used for the host-side gathering step (as tested on a CPU with 24/48 physical/logical cores). However, the subsampling for k-means clustering is much slower (by around 10X) when running from host data. Having said that, I think this bottleneck will be resolved once out-of-core clustering is available.
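As a rough illustration of the host-side gathering step mentioned above (the function name and layout here are invented, not the code from this PR): rows belonging to one batch are scattered through the host dataset, so they are gathered into a contiguous staging buffer before the host-to-device copy, and the gather loop can be parallelized with OpenMP.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of parallel host-side gathering: copy the rows
// listed in row_ids into a contiguous staging buffer. The OpenMP pragma
// is a no-op when compiled without OpenMP support.
void gather_rows(const float* host_dataset, int64_t dim,
                 const std::vector<int64_t>& row_ids, float* staging)
{
#pragma omp parallel for
  for (int64_t i = 0; i < static_cast<int64_t>(row_ids.size()); ++i) {
    std::memcpy(staging + i * dim,
                host_dataset + row_ids[i] * dim,
                dim * sizeof(float));
  }
}
```

Each output row depends only on its own index, so the iterations are independent and the loop parallelizes trivially; throughput is then bounded by host memory bandwidth, which matches the "efficient memory bandwidth utilization" goal in the commit message above.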

Comment thread cpp/src/neighbors/ivf_rabitq.cu Outdated
Comment thread cpp/src/neighbors/ivf_rabitq.cu Outdated
// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
Contributor


Do we need the else branch? k-means should support different input types.

Contributor


Thanks for pointing this out! Removed unnecessary casting.

jamxia155 and others added 20 commits March 18, 2026 06:13
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
  Implement streaming index construction for IVF-RaBitQ to handle datasets
  that exceed available GPU memory. This enables building indices for
  large datasets by processing data in batches streamed from host memory.

  Key features:
  - Automatic detection based on dataset size vs available workspace
  - Complete-cluster batching strategy (no partial clusters across batches)
  - OpenMP parallel host data gathering with persistent thread pool
  - Contiguous data handling in quantizer for improved performance
  - Configurable batch size via streaming_batch_size parameter

The implementation uses omp_get_max_threads() to scale with available
  hardware while maintaining efficient memory bandwidth utilization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Add an optional parameter to force streaming construction regardless of
  dataset size. This provides users with explicit control over the
  construction method for testing or specific use cases.

  When force_streaming is enabled:
  - Streaming construction is used even if dataset fits in GPU memory
  - Distinct log message indicates explicit vs automatic decision to
  use streaming construction

  Default behavior (force_streaming=false) remains unchanged, with
  automatic detection based on dataset size vs available workspace.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Enable control of the force_streaming parameter through JSON benchmark
  configuration files. Users can now specify force_streaming in the
  build_param section of their benchmark configs.

  Example usage:
    "build_param": {
      "nlist": 10000,
      "force_streaming": true,
      ...
    }

  This allows benchmark configurations to explicitly control streaming
  construction for performance testing and comparison.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
  Update documentation to note that force_streaming has no effect when
  the dataset is already in device memory, as streaming construction is
  only applicable for host-to-device data transfer.
  Adds build_forced_streaming test case that explicitly enables streaming
  construction even for small datasets that fit in GPU memory. This validates
  the streaming code path with dynamic batch sizing and ensures compatibility
  with serialization/deserialization.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.

Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
  ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
  data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)

Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
  * data_transformation_batch_opt()
  * data_transformation_batch_opt_contiguous()
  * rabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused()
  * exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables

Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation

Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
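A toy example of the "private method to free function in an anonymous namespace" pattern described in the refactor above. The member and method names here are invented for illustration; the real class keeps its own members, which are likewise passed to the free functions as explicit parameters.

```cpp
namespace {
// Formerly a private method; now a free function with internal linkage,
// receiving the class member it used (scale) as an explicit parameter.
float transform(float value, float scale) { return value * scale; }
}  // namespace

// Illustrative stand-in for the quantizer class: the public header no
// longer needs to declare the helper, hiding the implementation detail.
class ToyQuantizer {
 public:
  explicit ToyQuantizer(float scale) : scale_(scale) {}
  float quantize(float value) const { return transform(value, scale_); }

 private:
  float scale_;
};
```

Because the helpers live in an anonymous namespace inside the .cu file, they disappear from the public header, which is what yields the smaller header and faster compiles listed under "Benefits".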
@tfeher
Contributor

tfeher commented Mar 25, 2026

/ok to test 52793fc


Labels

C++ · feature request (New feature or request) · non-breaking (Introduces a non-breaking change)

Projects

Status: In Progress


5 participants