Support IVF-RaBitQ in cuVS Library #1866
Stardust-SJF wants to merge 143 commits into rapidsai:main
Conversation
- Currently built as a separate library.
- To be merged with the existing `cuvs_objs` library.
- Dependency on `Eigen` yet to be removed.
- RABITQ_BENCH_TEST for standalone testing; to be removed as integration work is completed.
- CUVS_IVF_RABITQ_ANN_BENCH for benchmarking as part of the ANN benchmarking suite.
- `bits_per_dim` = `ex_bits` + 1
- Also update the supported range of `bits_per_dim` to 2-9 inclusive.
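The parameter relation above can be sketched as follows. This is an illustrative helper, not the cuVS API; the function names are hypothetical. Each dimension stores one sign bit (the base RaBitQ code) plus `ex_bits` extended bits, so `bits_per_dim = ex_bits + 1`, and a supported range of 2-9 for `bits_per_dim` corresponds to 1-8 for `ex_bits`.

```cpp
#include <cstdint>

// Hypothetical helpers illustrating the parameter relation described above.
// One sign bit per dimension plus ex_bits extended bits:
inline uint32_t bits_per_dim_from_ex_bits(uint32_t ex_bits) { return ex_bits + 1; }

// Supported range of bits_per_dim is 2..9 inclusive, i.e. ex_bits in 1..8.
inline bool is_valid_bits_per_dim(uint32_t bits_per_dim)
{
  return bits_per_dim >= 2 && bits_per_dim <= 9;
}
```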
* Fix cuVS build issues with RaBitQ
* Align line formatting and delete unused variables in robust_prune.cuh
…q' into jamxia_cuvs_ivf_rabitq
* Download Eigen automatically via rapids-cmake
* Disable FAISS and DISKANN benchmarks
* Add config files and update README
* Update README and openai_1M config
* Update the Python bench command line
* Update README

---------

Co-authored-by: James Xia <jamxia@nvidia.com>
- Error-checking
- Stream-ordered CUDA calls
/ok to test 994e951
tfeher
left a comment
Thanks @Stardust-SJF for opening the PR! We are excited to have a GPU accelerated IVF-RaBitQ method in cuVS. Also thanks to @jamxia155 for working on the cuVS integration. Here is my first batch of comments (focusing on public API and benchmark wrappers).
tfeher
left a comment
A few more comments for the build method.
d_dataset_array = raft::make_device_mdarray<T>(
  handle, big_memory_resource, raft::make_extents<int64_t>(n_rows, dim));
We should not copy the whole dataset into the index. I would prefer that we process it similarly to how other IVF methods in cuVS work:
- Clusters are trained on a subset of the data; only that subset needs to be copied to the GPU.
- The dataset is compressed batch-wise; only one batch at a time is copied to the GPU.
Can the compression step be done batch-wise, or do we need to see the whole dataset for that?
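The batch-wise scheme proposed above could look roughly like this. This is a minimal CPU-only sketch of the idea (not the cuVS implementation): only one batch of rows is resident in the "device" buffer at a time, and `compress_batchwise` with its toy sign-bit quantizer is a hypothetical stand-in for the RaBitQ encoder.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of batch-wise compression: copy one batch in, quantize it, append
// its codes to the index storage. In the real pipeline the copy would be a
// stream-ordered cudaMemcpyAsync into a device buffer.
std::vector<uint8_t> compress_batchwise(const std::vector<float>& host_data,
                                        int64_t n_rows,
                                        int64_t dim,
                                        int64_t batch_rows)
{
  std::vector<uint8_t> codes;
  std::vector<float> device_batch(batch_rows * dim);  // stands in for a GPU buffer

  for (int64_t row0 = 0; row0 < n_rows; row0 += batch_rows) {
    int64_t rows = std::min(batch_rows, n_rows - row0);
    std::copy_n(host_data.data() + row0 * dim, rows * dim, device_batch.data());
    for (int64_t r = 0; r < rows; ++r) {
      for (int64_t d = 0; d < dim; ++d) {
        // Toy 1-bit-per-dim quantizer: sign bit only.
        codes.push_back(device_batch[r * dim + d] >= 0.0f ? 1 : 0);
      }
    }
  }
  return codes;
}
```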
We had previously aligned on the need for out-of-core building but did not find it realistic to target for the initial release.
In the meantime, would it make sense to re-include the CPU-based index construction function as a stopgap? It won't be GPU-accelerated, but it would at least enable building indexes for large datasets.
From my point of view, the build algorithm is designed to quantize data cluster by cluster, so it would be OK to transfer data between CPU and GPU at the cluster level (at the cost of making it a bit slower). Can we target this feature in the following releases (rather than the initial release)?
At the same time, the GPU IVF-RaBitQ implementation has a redesigned data layout and quantization pipeline for GPUs, and, unfortunately, there is currently no CPU-based index construction for it. To support CPU-based index construction, we would need to reorganize the parallel granularity and rewrite the build process with SIMD-accelerated instructions.
Sorry, I was confused by my recollection of the older construct method that required the dataset to be on host. But @Stardust-SJF is right that even that method would internally copy the entire dataset to the GPU so it won't serve as a stopgap for out-of-core building.
The older construct method is written for cases where the clustering results are on disk or in main memory. Sorry for the inconvenience caused by the lack of necessary comments.
I pushed a construct_on_gpu_streaming method that streams batches of vectors from a host-resident dataset for index construction. Index construction is about 3x slower, even with multithreading for the host-side gathering step (tested on a CPU with 24 physical / 48 logical cores). However, the subsampling for k-means clustering is much slower (around 10x) when running from host data. That said, I think this bottleneck will be resolved once out-of-core clustering is available.
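The host-side gathering step mentioned above could be sketched as follows. This is illustrative, not the cuVS code: rows belonging to the clusters of the current batch are copied from the (possibly huge) host dataset into a contiguous staging buffer before the host-to-device transfer. The function name `gather_rows` is hypothetical; the OpenMP pragma is a no-op if OpenMP is not enabled.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of multithreaded host-side gathering: copy the selected rows into a
// contiguous staging buffer, parallelized across rows with OpenMP.
void gather_rows(const float* host_dataset,
                 int64_t dim,
                 const std::vector<int64_t>& row_ids,
                 float* staging_out)
{
#pragma omp parallel for
  for (int64_t i = 0; i < static_cast<int64_t>(row_ids.size()); ++i) {
    std::memcpy(staging_out + i * dim,
                host_dataset + row_ids[i] * dim,
                sizeof(float) * dim);
  }
}
```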
// TODO: a proper sampling
if constexpr (std::is_same_v<T, float>) {
  raft::matrix::sample_rows<T, int64_t>(handle, random_state, dataset, trainset.view());
} else {
Do we need the else branch? k-means should support different input types.
Thanks for pointing this out! Removed unnecessary casting.
- Remove commented-out code
- Rename a variable
- Update index type for device matrix views
Replace `kmeans_trainset_fraction` with `max_points_per_cluster`
Implement streaming index construction for IVF-RaBitQ to handle datasets that exceed available GPU memory. This enables building indices for large datasets by processing data in batches streamed from host memory.

Key features:
- Automatic detection based on dataset size vs. available workspace
- Complete-cluster batching strategy (no partial clusters across batches)
- OpenMP parallel host data gathering with a persistent thread pool
- Contiguous data handling in the quantizer for improved performance
- Configurable batch size via the streaming_batch_size parameter

The implementation uses omp_get_max_threads() to scale with available hardware while maintaining efficient memory bandwidth utilization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Add an optional parameter to force streaming construction regardless of dataset size. This gives users explicit control over the construction method for testing or specific use cases.

When force_streaming is enabled:
- Streaming construction is used even if the dataset fits in GPU memory
- A distinct log message indicates an explicit vs. automatic decision to use streaming construction

Default behavior (force_streaming=false) remains unchanged, with automatic detection based on dataset size vs. available workspace.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Enable control of the force_streaming parameter through JSON benchmark
configuration files. Users can now specify force_streaming in the
build_param section of their benchmark configs.
Example usage:
"build_param": {
"nlist": 10000,
"force_streaming": true,
...
}
This allows benchmark configurations to explicitly control streaming
construction for performance testing and comparison.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update documentation to note that force_streaming has no effect when the dataset is already in device memory, as streaming construction is only applicable for host-to-device data transfer.
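The decision logic described across these commits could be sketched as below. This is an illustrative sketch with hypothetical names, not the cuVS implementation: streaming is used when the host dataset does not fit in the available device workspace or when the user forces it, and when the dataset is already in device memory there is nothing to stream, so force_streaming has no effect.

```cpp
#include <cstdint>

// Sketch of the construction-path decision: streaming only applies to
// host-resident data; otherwise choose based on the override flag or an
// automatic size check against the device workspace.
bool use_streaming_build(bool dataset_on_device,
                         int64_t dataset_bytes,
                         int64_t workspace_bytes,
                         bool force_streaming)
{
  if (dataset_on_device) return false;     // already resident on the GPU
  if (force_streaming) return true;        // explicit user override
  return dataset_bytes > workspace_bytes;  // automatic detection
}
```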
Adds a build_forced_streaming test case that explicitly enables streaming construction even for small datasets that fit in GPU memory. This validates the streaming code path with dynamic batch sizing and ensures compatibility with serialization/deserialization.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Remove batch_flag member variables and associated dead code, then refactor
DataQuantizerGPU to move private methods to free functions for better
encapsulation.
Part 1: Remove batch_flag dead code
- Remove IVFGPU::batch_flag and DataQuantizerGPU::batch_flag_dq
- Remove dual code path conditionals (AoS vs SoA layouts)
- Simplify helper methods: first_block_batch() → first_block(),
ex_factor_batch() → ex_factor()
- Simplify GetExFactorBytes() and block_bytes() to single return
- Maintain backward compatibility in save/load (legacy flag handling)
- Remove dead methods: quantize(), quantize_contiguous(),
data_transformation(), data_transformation_contiguous()
- Remove 936 lines of dead code from quantizer_gpu_fast.cu (96% reduction)
Part 2: Move private methods to free functions
- Convert 5 private methods to free functions in anonymous namespace:
* data_transformation_batch_opt()
* data_transformation_batch_opt_contiguous()
* rabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused()
* exrabitq_codes_and_factors_fused_ori()
- Pass all needed class members as explicit parameters
- Remove ~40 lines from public header (quantizer_gpu.cuh)
- Clean up unused variables
Benefits:
- Eliminates confusing dual code paths
- Cleaner public API with implementation details hidden
- Faster compilation for files including headers
- Better separation of interface and implementation
Files modified:
- cpp/src/neighbors/ivf_rabitq/gpu_index/{ivf_gpu.cuh,ivf_gpu.cu}
- cpp/src/neighbors/ivf_rabitq/gpu_index/{quantizer_gpu.cuh,quantizer_gpu.cu,quantizer_gpu_fast.cu}
- cpp/src/neighbors/ivf_rabitq.cu
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
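The "private methods to free functions" pattern from Part 2 can be illustrated with a generic before/after. The names below (`DataQuantizer`, `transform_batch`) are generic placeholders, not the actual cuVS symbols: the former private method lives in an anonymous namespace in the .cu file, all the class members it touched become explicit parameters, and the public header no longer exposes the implementation detail.

```cpp
#include <cstdint>
#include <vector>

namespace {

// Free function in the translation unit: no class state, all inputs explicit.
// In the real refactor this lives in the .cu file, out of the public header.
void transform_batch(const float* in, float* out, int64_t n, float scale)
{
  for (int64_t i = 0; i < n; ++i) { out[i] = in[i] * scale; }
}

}  // namespace

class DataQuantizer {
 public:
  explicit DataQuantizer(float scale) : scale_(scale) {}

  // The public method forwards its state to the free function explicitly,
  // keeping the implementation detail hidden from header consumers.
  void transform(const std::vector<float>& in, std::vector<float>& out) const
  {
    out.resize(in.size());
    transform_batch(in.data(), out.data(), static_cast<int64_t>(in.size()), scale_);
  }

 private:
  float scale_;
};
```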
/ok to test 52793fc
This PR introduces IVF-RaBitQ, a GPU-native approximate nearest neighbor search (ANNS) solution that integrates the cluster-based IVF method with RaBitQ quantization into an efficient GPU index build/search pipeline. It achieves a strong recall-throughput trade-off while offering fast index build times and a small storage footprint.