fix/comm-destroy-communicator-leak#15
Open
GordonYang1 wants to merge 1 commit into
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the
Communicatorlifetime leak noted in the "Known Issues &Future Work" section of #10.
CommInitAll::Executeallocates the outerCommunicatorvianew,but the previous
CommDestroyonly tore down the backend instanceand never deleted the outer object. Every
infiniCommInitAll/infiniCommDestroypair leaked oneCommunicator.Changes
src/base/comm_destroy.h:Executeto a concretevoid *comm_handlesignature(mirrors
CommInitAll::Execute'svoid **comm_handle).CommInitAll.deletetheCommunicatorafter the backendApplyreturns,regardless of its return status, so error paths do not leak.
Backend impls are intentionally untouched: ownership of the outer
Communicatornow lives entirely in the base layer, symmetric withthe allocation in
CommInitAll. Future backends (NCCL, MCCL, HCCL,...) get correct lifetime for free without each having to remember
to
delete.Out of scope
Tracked separately to keep this PR minimal:
OmpiInstance::Destroy()into~OmpiInstance().FinalizeImplnot callingMPI_Finalize(noted inDEVELOPER_NOTES_OMPI.md).Test environment
Validated on a heterogeneous 2-node cluster with container-to-container
direct connection over RDMA:
iccl-nvidia192.168.163.40iccl-metax-clean192.168.162.49--network host --ipc host --privileged,/dev/infinibandmounted on both sides./opt/openmpi-4.1.6), built with--with-ucx=/opt/ucx-1.17.0./opt/ucx-1.17.0), built with--with-verbs --with-rdmacm./opt/macaon Node B.22222(no
docker execwrapper required formpirunoricclrun --build).UCX_NET_DEVICES=mlx5_0:1UCX_TLS=rc,rc_verbs,self,smUCX_RNDV_SCHEME=put_zcopyBuild configuration via
icclrun --build:-DWITH_NVIDIA=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF-DWITH_METAX=ON -DWITH_OMPI=ON -DWITH_NCCL=OFFTest results
Heterogeneous run across NVIDIA + MetaX over RDMA, both examples
passed end-to-end with
icclrun --build:all_reduce(8 NVIDIA + 8 MetaX): correctnessYES, no MPIfinalize warnings, clean exit.
all_gather(8 NVIDIA + 8 MetaX): correctnessYES, no MPIfinalize warnings, clean exit.
No regressions observed on either platform.
Logs & Screenshots
all_reduce.log
all_gather.log