feat: support `AllToAll` with OpenMPI backend implementation by halfman510 · Pull Request #14 · InfiniTensor/InfiniCCL

halfman510 · 2026-05-19T11:01:53Z

Summary

This PR introduces an OpenMPI-based implementation of AllToAll, along with a complete example program for functionality verification and basic performance evaluation.

Changes

OpenMPI-based AllToAll Implementation
- add the basic OpenMPI implementation for infiniAllToAll(), including:
  - the core interface src/base/all_to_all.h;
  - the OpenMPI backend implementation in src/ompi/impl/all_to_all.h;
  - the public API declaration in include/infiniccl.h.
- add an example program examples/all_to_all.cc similar to examples/all_reduce.cc for correctness verification and simple performance testing.
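
For readers checking the example's output, the data movement an AllToAll performs can be sketched on the host without MPI. This is a minimal reference, assuming the same block layout as `MPI_Alltoall`: each rank's send buffer holds one block of `count` elements per peer, block `j` goes to rank `j`, and block `i` of the receive buffer comes from rank `i`. The name `ReferenceAllToAll` is hypothetical and is not part of InfiniCCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of InfiniCCL.
// send[r] is rank r's send buffer: world_size blocks of `count` elements.
// Returns recv, where recv[r] is what rank r should hold after the exchange.
template <typename T>
std::vector<std::vector<T>> ReferenceAllToAll(
    const std::vector<std::vector<T>>& send, std::size_t count) {
  const std::size_t world_size = send.size();
  std::vector<std::vector<T>> recv(world_size,
                                   std::vector<T>(world_size * count));
  for (std::size_t src = 0; src < world_size; ++src) {
    assert(send[src].size() == world_size * count);
    for (std::size_t dst = 0; dst < world_size; ++dst) {
      // Block `dst` of rank `src` lands in block `src` of rank `dst`.
      for (std::size_t k = 0; k < count; ++k) {
        recv[dst][src * count + k] = send[src][dst * count + k];
      }
    }
  }
  return recv;
}
```

With two ranks and `count = 2`, send buffers `{0, 1, 2, 3}` and `{10, 11, 12, 13}` become receive buffers `{0, 1, 10, 11}` and `{2, 3, 12, 13}`.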

Known Issues & Future Work

The current OpenMPI AllToAll implementation uses the blocking MPI_Alltoall, which prevents overlap between communication and computation. Future work may introduce the non-blocking collective (MPI_Ialltoall) and stream-aware asynchronous execution to improve concurrency and performance.
The current implementation allocates temporary host staging buffers using malloc/free on every invocation. This may introduce noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
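
One possible shape for the reusable staging buffer mentioned above is a grow-only buffer that reallocates only when a request exceeds its current capacity. This is a sketch under stated assumptions, not the PR's code: the class name `StagingBuffer` and its interface are hypothetical, it uses plain `malloc` rather than pinned memory, and it is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical helper, not part of InfiniCCL: a grow-only host staging buffer.
class StagingBuffer {
 public:
  StagingBuffer() = default;
  StagingBuffer(const StagingBuffer&) = delete;
  StagingBuffer& operator=(const StagingBuffer&) = delete;
  ~StagingBuffer() { std::free(data_); }

  // Returns a pointer to at least `bytes` bytes.
  // Reallocates only when the request exceeds the current capacity;
  // previous contents are not preserved across a reallocation.
  void* Reserve(std::size_t bytes) {
    if (bytes > capacity_) {
      std::free(data_);
      data_ = std::malloc(bytes);
      if (data_ == nullptr) {
        capacity_ = 0;
        throw std::bad_alloc();
      }
      capacity_ = bytes;
    }
    return data_;
  }

  std::size_t capacity() const { return capacity_; }

 private:
  void* data_ = nullptr;
  std::size_t capacity_ = 0;
};
```

A caller would keep one such object per communicator (or per thread) and call `Reserve` at the start of each collective, so repeated calls with the same message size perform no allocation after the first.
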
For the heterogeneous UCX + InfiniBand cluster used in testing, large AllToAll messages (e.g., 1 << 20 elements) may fail with mlx5 RC RDMA_READ errors due to UCX rendezvous RDMA_READ path limitations. This requires setting UCX_RNDV_SCHEME=put_zcopy to force a safe put-based transfer protocol. Without this setting, large-message AllToAll execution is unstable on some NIC configurations.
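
The setting is normally exported in the launch environment. As an alternative sketch, a program could set it for itself before MPI is initialized, assuming UCX reads its configuration from the process environment during initialization. The helper name `EnsureUcxPutZcopy` is hypothetical, and `setenv` is POSIX-only.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <stdlib.h>  // setenv, unsetenv (POSIX)

// Hypothetical helper, not part of InfiniCCL.
// Sets UCX_RNDV_SCHEME=put_zcopy unless the user already exported a value.
// Assumption: this must run before MPI_Init so that UCX sees the variable.
// Returns true if the call succeeded.
inline bool EnsureUcxPutZcopy() {
  // overwrite = 0 keeps any value the user set explicitly.
  return setenv("UCX_RNDV_SCHEME", "put_zcopy", /*overwrite=*/0) == 0;
}
```

Because the helper does not overwrite an existing value, a user can still select a different rendezvous scheme from the command line when debugging.
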
For extremely large AllToAll messages exceeding INT_MAX elements, execution is rejected because the classic MPI interface takes counts as int. Future work may introduce chunked transfers or the MPI-4 large-count interface (MPI_Count arguments, e.g. MPI_Alltoall_c) to handle very large tensors safely.
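
The chunked-transfer idea can be illustrated by the planning step alone: splitting a per-peer element count into pieces that each fit in an `int`. This is a minimal sketch, not the PR's implementation; `Chunk` and `PlanChunks` are hypothetical names, and a 64-bit `std::size_t` is assumed.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical types, not part of InfiniCCL.
struct Chunk {
  std::size_t offset;  // element offset within one peer's block
  int count;           // number of elements in this chunk, fits in int
};

// Splits `count` elements into chunks of at most `max_chunk` elements.
inline std::vector<Chunk> PlanChunks(std::size_t count,
                                     std::size_t max_chunk = INT_MAX) {
  assert(max_chunk > 0 && max_chunk <= static_cast<std::size_t>(INT_MAX));
  std::vector<Chunk> chunks;
  for (std::size_t offset = 0; offset < count; offset += max_chunk) {
    const std::size_t remaining = count - offset;
    const std::size_t n = remaining < max_chunk ? remaining : max_chunk;
    chunks.push_back({offset, static_cast<int>(n)});
  }
  return chunks;
}
```

For example, `PlanChunks(10, 4)` yields chunks of 4, 4, and 2 elements at offsets 0, 4, and 8. Since each peer's block is contiguous in the send buffer, an actual chunked AllToAll would also need to repack or stride the data so that every call sees one chunk per peer; that step is not shown here.
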
The current implementation performs GPU-to-Host and Host-to-GPU copies for all data. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers (GPUDirect RDMA) to reduce memory traffic and improve throughput.
The operation currently performs a single-block AllToAll without pipelining. For large messages, this may lead to high memory consumption and limited scalability. Future work may add chunked or pipelined transfers to reduce peak memory usage and improve scalability across many ranks.
Error checking is limited to null pointers and datatype validation. Future work may include more fine-grained validation and enhanced error reporting to improve robustness in heterogeneous or multi-node deployments.

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

Modified file:
- `include/comm.h`

Added files:
- `src/base/all_to_all.h`
- `src/ompi/impl/all_to_all.h`
- `examples/all_to_all.cc`

halfman510 force-pushed the feat/support-alltoall branch from 66d49c2 to f091b46 Compare May 19, 2026 13:16

halfman510 added 2 commits May 19, 2026 15:16

feat: support alltoall with OpenMPI backend implementation

e2d75d2

fix: use Validator::ValidateResult() in the examples/all_to_all.cc.

c78d7d7

halfman510 force-pushed the feat/support-alltoall branch from e16a878 to c78d7d7 Compare May 19, 2026 15:38

Ziminli requested changes May 19, 2026

Comment thread examples/all_to_all.cc Outdated

Comment thread examples/all_to_all.cc Outdated

Comment thread src/ompi/impl/all_to_all.h Outdated

halfman510 changed the title ~~feat: support alltoall with OpenMPI backend implementation~~ feat: support AllToAll with OpenMPI backend implementation May 19, 2026

fix: correct examples/all_to_all.cc and src/ompi/impl/all_to_all.h

4435a16

Ziminli requested changes May 19, 2026

Comment thread examples/all_to_all.cc Outdated

Comment thread examples/all_to_all.cc Outdated

fix: correct examples/all_to_all.cc

ae64369

Ziminli approved these changes May 19, 2026

Ziminli merged commit abef6b2 into InfiniTensor:master May 19, 2026

Ziminli merged 4 commits into InfiniTensor:master from halfman510:feat/support-alltoall