[NPU] loss becomes NaN when dp_replicate_size > 1 due to missing stream synchronization #59

@danieldale2026

Environment

  • Ascend NPU
  • torch-npu
  • accelerate
  • feat/npu branch

Symptom

Training loss becomes NaN when dp_replicate_size > 1.
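
To pin down the first step at which the symptom appears, a small guard like the one below could be called on the scalar loss. This is only a sketch: `assert_finite_loss` and its parameters are hypothetical names, not part of the trainer.

```python
import math


def assert_finite_loss(loss_value: float, step: int) -> None:
    """Raise as soon as the loss is NaN or infinite (hypothetical helper)."""
    if not math.isfinite(loss_value):
        raise FloatingPointError(f"non-finite loss {loss_value!r} at step {step}")
```

In a training loop this would be called as `assert_finite_loss(loss.item(), step)`. Note that `loss.item()` copies the value to the host, which may itself act as a synchronization point and hide the problem, so it is better suited to confirming the symptom than to ruling it out.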

Root cause

The backward pass runs asynchronously on NPU. Without an explicit stream synchronization after it, subsequent operations can be ordered incorrectly relative to backward when training with multiple DP replicas (dp_replicate_size > 1).

Proposed fix

Call torch_npu.npu.current_stream().synchronize() immediately after self.accelerator.backward(loss).

Affected file

mova/engine/trainer/accelerate/accelerate_trainer.py

Suggested patch snippet

import torch_npu

...

self.accelerator.backward(loss)
# Synchronize NPU stream to avoid async ordering issues with DP replicate training.
torch_npu.npu.current_stream().synchronize()
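
If the trainer also has to run on machines without an NPU, the synchronization could be wrapped in a guard so the import does not fail elsewhere. The following is a minimal sketch under that assumption; `sync_npu_stream_if_available` is a hypothetical helper name, and the `torch_npu.npu.is_available()` check is assumed to behave like its CUDA counterpart.

```python
import importlib.util


def sync_npu_stream_if_available() -> bool:
    """Synchronize the current NPU stream when torch_npu is usable.

    Returns True if a synchronization was issued, False otherwise.
    Hypothetical helper, not part of the existing trainer.
    """
    # Skip quietly on installations that do not ship torch_npu.
    if importlib.util.find_spec("torch_npu") is None:
        return False

    import torch_npu  # imported lazily so non-NPU setups never load it

    # Assumption: is_available() reports whether an NPU device can be used.
    if not torch_npu.npu.is_available():
        return False

    torch_npu.npu.current_stream().synchronize()
    return True
```

With this helper, the patched call site would read `self.accelerator.backward(loss)` followed by `sync_npu_stream_if_available()`. Whether to also gate the call on dp_replicate_size > 1 is a design choice left open here, since the extra synchronization may cost some throughput.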
