fix/comm-destroy-communicator-leak by GordonYang1 · Pull Request #15 · InfiniTensor/InfiniCCL

GordonYang1 · 2026-05-19T11:34:44Z

Summary

Fixes the Communicator lifetime leak noted in the "Known Issues &
Future Work" section of #10.

CommInitAll::Execute allocates the outer Communicator via new,
but the previous CommDestroy only tore down the backend instance
and never deleted the outer object. Every
infiniCommInitAll / infiniCommDestroy pair leaked one
Communicator.

Changes

src/base/comm_destroy.h:
- Narrow Execute to a concrete void *comm_handle signature
  (mirrors CommInitAll::Execute's void **comm_handle).
- Add a null-handle check, consistent with CommInitAll.
- delete the Communicator after the backend Apply returns,
  regardless of its return status, so error paths do not leak.

Backend impls are intentionally untouched: ownership of the outer
Communicator now lives entirely in the base layer, symmetric with
the allocation in CommInitAll. Future backends (NCCL, MCCL, HCCL,
...) get correct lifetime for free without each having to remember
to delete.

Out of scope

Tracked separately to keep this PR minimal:

RAII-fying OmpiInstance::Destroy() into ~OmpiInstance().
FinalizeImpl not calling MPI_Finalize (noted in
DEVELOPER_NOTES_OMPI.md).

Test environment

Validated on a heterogeneous 2-node cluster with container-to-container
direct connection over RDMA:

Node	Hardware	Container	IP
A	NVIDIA A100	`iccl-nvidia`	`192.168.163.40`
B	MetaX C550	`iccl-metax-clean`	`192.168.162.49`

OS / container: --network host --ipc host --privileged,
/dev/infiniband mounted on both sides.
OpenMPI: 4.1.6 (/opt/openmpi-4.1.6), built with
--with-ucx=/opt/ucx-1.17.0.
UCX: 1.17.0 (/opt/ucx-1.17.0), built with --with-verbs --with-rdmacm.
CUDA Toolkit on Node A; MACA SDK at /opt/maca on Node B.
Inter-container SSH direct connection on port 22222
(no docker exec wrapper required for mpirun or icclrun --build).
UCX transport configuration:
- UCX_NET_DEVICES=mlx5_0:1
- UCX_TLS=rc,rc_verbs,self,sm
- UCX_RNDV_SCHEME=put_zcopy
World size: 16 ranks (8 NVIDIA + 8 MetaX).

Build configuration via icclrun --build:

Node A: -DWITH_NVIDIA=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF
Node B: -DWITH_METAX=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF

Test results

Heterogeneous run across NVIDIA + MetaX over RDMA, both examples
passed end-to-end with icclrun --build:

icclrun --config examples/cluster.rdma.direct-build.yaml --build all_reduce
icclrun --config examples/cluster.rdma.direct-build.yaml --build all_gather

all_reduce (8 NVIDIA + 8 MetaX): correctness YES, no MPI
finalize warnings, clean exit.
all_gather (8 NVIDIA + 8 MetaX): correctness YES, no MPI
finalize warnings, clean exit.

No regressions observed on either platform.

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log
all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

fix(base): delete communicator in CommDestroy

4d19750

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix/comm-destroy-communicator-leak#15

fix/comm-destroy-communicator-leak#15
GordonYang1 wants to merge 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy

GordonYang1 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

GordonYang1 commented May 19, 2026

Summary

Changes

Out of scope

Test environment

Test results

Logs & Screenshots

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant