feat: support AllToAll with OpenMPI backend implementation#14

Merged
Ziminli merged 4 commits into InfiniTensor:master from halfman510:feat/support-alltoall
May 19, 2026

Conversation

@halfman510
Collaborator

@halfman510 halfman510 commented May 19, 2026

Summary

This PR introduces an OpenMPI-based implementation of AllToAll, along with a complete example program for functionality verification and basic performance evaluation.

Changes

  • OpenMPI-based AllToAll Implementation
    • add the basic OpenMPI implementation for infiniAllToAll(), including:
      • the core interface src/base/all_to_all.h;
      • the OpenMPI backend implementation in src/ompi/impl/all_to_all.h;
      • the public API declaration in include/infiniccl.h.
    • add an example program examples/all_to_all.cc similar to examples/all_reduce.cc for correctness verification and simple performance testing.
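
For reference when checking correctness, the data movement that AllToAll performs can be emulated in a single process. The sketch below is not taken from the PR: `SimulateAllToAll` is a hypothetical helper, and it assumes every rank sends the same number of elements (`count`) to every peer, as with `MPI_Alltoall`. Block `d` of rank `s`'s send buffer ends up as block `s` of rank `d`'s receive buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the PR: emulates AllToAll in one process.
// send[r] is rank r's send buffer and must hold nranks blocks of `count`
// elements. Block dst of send[src] is delivered to rank dst, where it is
// stored as block src of that rank's receive buffer.
std::vector<std::vector<int>> SimulateAllToAll(
    const std::vector<std::vector<int>>& send, std::size_t count) {
  const std::size_t nranks = send.size();
  std::vector<std::vector<int>> recv(nranks,
                                     std::vector<int>(nranks * count));
  for (std::size_t src = 0; src < nranks; ++src) {
    // Every rank must provide exactly one block per peer.
    assert(send[src].size() == nranks * count);
    for (std::size_t dst = 0; dst < nranks; ++dst) {
      for (std::size_t i = 0; i < count; ++i) {
        recv[dst][src * count + i] = send[src][dst * count + i];
      }
    }
  }
  return recv;
}
```

With two ranks and `count = 2`, send buffers `{0, 1, 2, 3}` and `{10, 11, 12, 13}` yield receive buffers `{0, 1, 10, 11}` and `{2, 3, 12, 13}`. A reference like this could be compared against the output of the real collective, although how `examples/all_to_all.cc` actually verifies results is not shown here.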

Known Issues & Future Work

  • The current OpenMPI AllToAll implementation uses the blocking MPI_Alltoall, which prevents overlap between communication and computation. Future work may introduce non-blocking collectives (MPI_Ialltoall) and stream-aware asynchronous execution to improve concurrency and performance.
  • The current implementation allocates temporary host staging buffers using malloc/free on every invocation. This may introduce noticeable overhead in high-frequency workloads. Future work may add reusable buffer pools, allocator caching, and pinned host memory support to improve transfer efficiency and reduce allocation overhead.
  • On the heterogeneous UCX + InfiniBand cluster used in testing, large AllToAll messages (e.g., 1 << 20 elements) may fail with mlx5 RC RDMA_READ errors due to limitations in the UCX rendezvous RDMA_READ path. Setting UCX_RNDV_SCHEME=put_zcopy works around this by forcing a put-based transfer protocol. Without this setting, large-message AllToAll execution is unstable on some NIC configurations.
  • AllToAll messages exceeding INT_MAX elements are rejected, because the count arguments of MPI_Alltoall are of type int. Future work may introduce chunked transfers or the MPI-4 large-count bindings (e.g., MPI_Alltoall_c, which takes MPI_Count) to handle very large tensors safely.
  • The current implementation performs GPU-to-Host and Host-to-GPU copies for all data. While functionally correct, this is not optimal for GPU-intensive workloads. Future work may implement zero-copy GPU-GPU transfers (GPUDirect RDMA) to reduce memory traffic and improve throughput.
  • The operation currently performs a single-block AllToAll without pipelining. For large messages, this may lead to high memory consumption and limited scalability. Future work may add chunked or pipelined transfers to reduce peak memory usage and improve scalability across many ranks.
  • Error checking is limited to null pointers and datatype validation. Future work may include more fine-grained validation and enhanced error reporting to improve robustness in heterogeneous or multi-node deployments.
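
The chunked-transfer idea mentioned for messages beyond INT_MAX elements can be sketched as a range-splitting step. This is only an illustration under assumptions: `Chunk` and `PlanChunks` are hypothetical names that do not appear in the PR, and the sketch covers only how an element range might be cut into pieces whose counts fit in an `int`. How each piece would then be mapped onto MPI calls is left open.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Hypothetical type, not part of the PR: one piece of a larger transfer.
struct Chunk {
  std::size_t offset;  // element offset of this piece within the full range
  int count;           // number of elements in this piece, fits in an int
};

// Hypothetical helper: splits `total` elements into consecutive pieces of at
// most `max_chunk` elements each. `max_chunk` defaults to INT_MAX, the largest
// count an int-based MPI call can express.
std::vector<Chunk> PlanChunks(std::size_t total,
                              std::size_t max_chunk = INT_MAX) {
  assert(max_chunk > 0 && max_chunk <= static_cast<std::size_t>(INT_MAX));
  std::vector<Chunk> chunks;
  for (std::size_t offset = 0; offset < total;) {
    const std::size_t n = std::min(max_chunk, total - offset);
    chunks.push_back({offset, static_cast<int>(n)});
    offset += n;
  }
  return chunks;
}
```

For example, `PlanChunks(10, 4)` produces three pieces covering offsets 0, 4, and 8 with counts 4, 4, and 2. A smaller `max_chunk` would also bound the size of any staging buffer needed per piece, which relates to the peak-memory concern listed above.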

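One possible shape for the reusable host staging buffer suggested in the known issues is a grow-only buffer that is reallocated only when a request exceeds its current capacity. The class below is a hypothetical sketch, not code from the PR. It uses plain `malloc`/`free`, does not cover pinned memory, and is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical class, not part of the PR: a grow-only host staging buffer
// that avoids a malloc/free pair on every collective call.
class StagingBuffer {
 public:
  StagingBuffer() = default;
  StagingBuffer(const StagingBuffer&) = delete;
  StagingBuffer& operator=(const StagingBuffer&) = delete;
  ~StagingBuffer() { std::free(data_); }

  // Returns a buffer of at least `bytes` bytes, or nullptr if allocation
  // fails. Existing contents are not preserved when the buffer grows.
  void* Acquire(std::size_t bytes) {
    assert(bytes > 0);
    if (bytes > capacity_) {
      void* fresh = std::malloc(bytes);
      if (fresh == nullptr) return nullptr;  // keep the old buffer intact
      std::free(data_);
      data_ = fresh;
      capacity_ = bytes;
      ++allocations_;
    }
    return data_;
  }

  std::size_t capacity() const { return capacity_; }
  // Number of times the underlying storage was (re)allocated.
  std::size_t allocations() const { return allocations_; }

 private:
  void* data_ = nullptr;
  std::size_t capacity_ = 0;
  std::size_t allocations_ = 0;
};
```

Repeated calls with the same or a smaller size return the same pointer without touching the allocator, so a workload that issues many equally sized AllToAll calls would allocate once. Whether one buffer per communicator, per thread, or a shared pool fits the project best is a separate design question.
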
Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

@halfman510 halfman510 force-pushed the feat/support-alltoall branch from 66d49c2 to f091b46 May 19, 2026 13:16
Modified file:
- `include/comm.h`

Added files:
- `src/base/all_to_all.h`
- `src/ompi/impl/all_to_all.h`
- `examples/all_to_all.cc`
@halfman510 halfman510 force-pushed the feat/support-alltoall branch from e16a878 to c78d7d7 May 19, 2026 15:38
Comment thread examples/all_to_all.cc Outdated
Comment thread examples/all_to_all.cc Outdated
Comment thread src/ompi/impl/all_to_all.h Outdated
@halfman510 halfman510 changed the title from "feat: support alltoall with OpenMPI backend implementation" to "feat: support AllToAll with OpenMPI backend implementation" May 19, 2026
Comment thread examples/all_to_all.cc Outdated
Comment thread examples/all_to_all.cc Outdated
@Ziminli Ziminli merged commit abef6b2 into InfiniTensor:master May 19, 2026

2 participants