[NPU] loss becomes NaN when dp_replicate_size > 1 due to missing stream synchronization #59

## Environment

- Ascend NPU
- `torch-npu`
- `accelerate`
- `feat/npu` branch

## Symptom

Training loss becomes NaN when `dp_replicate_size > 1`.

## Root cause

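On NPU, the backward pass is launched asynchronously, so `self.accelerator.backward(loss)` can return before the queued work has finished. With more than one DP replica (`dp_replicate_size > 1`), the missing stream synchronization can cause later operations to execute out of order relative to backward, and the loss becomes NaN.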

## Proposed fix

Call `torch_npu.npu.current_stream().synchronize()` immediately after `self.accelerator.backward(loss)`.

## Affected file

`mova/engine/trainer/accelerate/accelerate_trainer.py`

## Suggested patch snippet

```python
import torch_npu

...

self.accelerator.backward(loss)
# Synchronize NPU stream to avoid async ordering issues with DP replicate training.
torch_npu.npu.current_stream().synchronize()
```
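
If the trainer also has to run on hosts without an NPU, the synchronization could be wrapped in a guard so that the unconditional `import torch_npu` does not break other backends. The following is a minimal sketch under stated assumptions, not the patch itself: `synchronize_npu_stream` is a hypothetical helper name that does not exist in the repository, and the sketch assumes `torch_npu.npu.is_available()` reports whether an NPU device can be used.

```python
def synchronize_npu_stream() -> bool:
    """Wait for the current NPU stream to finish; return True if a sync ran.

    Hypothetical helper for illustration only, not part of the repository.
    """
    try:
        # torch_npu is only installed on Ascend hosts; skip everywhere else.
        import torch_npu
    except ImportError:
        return False

    # Assumption: is_available() is False when no NPU device is usable.
    if not torch_npu.npu.is_available():
        return False

    # Same call as in the proposed fix: block the host until all work
    # queued on the current stream (including backward) has completed.
    torch_npu.npu.current_stream().synchronize()
    return True


# Intended call site inside the trainer (shown as a comment because the
# trainer class itself is not reproduced here):
#
#     self.accelerator.backward(loss)
#     synchronize_npu_stream()
```

On a machine without `torch_npu` the helper returns `False` and does nothing, so the training step behaves exactly as before on other backends.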
