Skip to content

fix/comm-destroy-communicator-leak#15

Open
GordonYang1 wants to merge 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy
Open

fix/comm-destroy-communicator-leak#15
GordonYang1 wants to merge 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy

Conversation

@GordonYang1
Copy link
Copy Markdown
Collaborator

Summary

Fixes the Communicator lifetime leak noted in the "Known Issues &
Future Work" section of #10.

CommInitAll::Execute allocates the outer Communicator via new,
but the previous CommDestroy only tore down the backend instance
and never deleted the outer object. Every
infiniCommInitAll / infiniCommDestroy pair leaked one
Communicator.

Changes

  • src/base/comm_destroy.h:
    • Narrow Execute to a concrete void *comm_handle signature
      (mirrors CommInitAll::Execute's void **comm_handle).
    • Add a null-handle check, consistent with CommInitAll.
    • delete the Communicator after the backend Apply returns,
      regardless of its return status, so error paths do not leak.

Backend impls are intentionally untouched: ownership of the outer
Communicator now lives entirely in the base layer, symmetric with
the allocation in CommInitAll. Future backends (NCCL, MCCL, HCCL,
...) get correct lifetime for free without each having to remember
to delete.

Out of scope

Tracked separately to keep this PR minimal:

  • RAII-fying OmpiInstance::Destroy() into ~OmpiInstance().
  • FinalizeImpl not calling MPI_Finalize (noted in
    DEVELOPER_NOTES_OMPI.md).

Test environment

Validated on a heterogeneous 2-node cluster with container-to-container
direct connection over RDMA:

Node Hardware Container IP
A NVIDIA A100 iccl-nvidia 192.168.163.40
B MetaX C550 iccl-metax-clean 192.168.162.49
  • OS / container: --network host --ipc host --privileged,
    /dev/infiniband mounted on both sides.
  • OpenMPI: 4.1.6 (/opt/openmpi-4.1.6), built with
    --with-ucx=/opt/ucx-1.17.0.
  • UCX: 1.17.0 (/opt/ucx-1.17.0), built with --with-verbs --with-rdmacm.
  • CUDA Toolkit on Node A; MACA SDK at /opt/maca on Node B.
  • Inter-container SSH direct connection on port 22222
    (no docker exec wrapper required for mpirun or icclrun --build).
  • UCX transport configuration:
    • UCX_NET_DEVICES=mlx5_0:1
    • UCX_TLS=rc,rc_verbs,self,sm
    • UCX_RNDV_SCHEME=put_zcopy
  • World size: 16 ranks (8 NVIDIA + 8 MetaX).

Build configuration via icclrun --build:

  • Node A: -DWITH_NVIDIA=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF
  • Node B: -DWITH_METAX=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF

Test results

Heterogeneous run across NVIDIA + MetaX over RDMA, both examples
passed end-to-end with icclrun --build:

icclrun --config examples/cluster.rdma.direct-build.yaml --build all_reduce
icclrun --config examples/cluster.rdma.direct-build.yaml --build all_gather
  • all_reduce (8 NVIDIA + 8 MetaX): correctness YES, no MPI
    finalize warnings, clean exit.
  • all_gather (8 NVIDIA + 8 MetaX): correctness YES, no MPI
    finalize warnings, clean exit.

No regressions observed on either platform.

Logs & Screenshots

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant