Conversation
```python
gather_frequency = n_samples

gathered = []
n_chunks = n_samples // self.gather_frequency + 1
```
@krishansubudhi For my understanding, why did we have to chunk before? I assumed it was to avoid exceeding the GPU memory limit, but it looks like we only move tensors to the GPU in this loop and never out of it.
Adding @aminsaied, who also initially created the DDP Trainer backend and the chunking logic.
The loop first moves the tensors to the GPU, then does an all_gather op, then moves the gathered tensors back to CPU. I believe this was at the request of @gshruti95 at the time, for a specific workload that was being tested (keep me honest, Shruti).
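The move-to-GPU, all_gather, move-back-to-CPU pattern described above can be sketched roughly as below. This is a minimal illustration, not the actual Trainer backend code: the `gather_chunk` helper is hypothetical, and a single-process `gloo` group stands in for a real multi-GPU DDP setup so the snippet runs anywhere (on a CPU-only machine the "GPU" move is a no-op).

```python
import torch
import torch.distributed as dist

# Single-process "gloo" group so this sketch runs without a real cluster.
# In actual DDP training the group would already exist with world_size > 1.
dist.init_process_group(
    "gloo", init_method="tcp://127.0.0.1:29513", rank=0, world_size=1
)

device = "cuda" if torch.cuda.is_available() else "cpu"

def gather_chunk(chunk: torch.Tensor) -> list:
    """Move one chunk to the device, all_gather it across ranks,
    and return the gathered tensors back on CPU."""
    chunk = chunk.to(device)
    out = [torch.empty_like(chunk) for _ in range(dist.get_world_size())]
    dist.all_gather(out, chunk)          # collect this chunk from every rank
    return [t.cpu() for t in out]        # free device memory before next chunk

gathered = gather_chunk(torch.arange(4, dtype=torch.float32))
dist.destroy_process_group()
```

Moving each chunk back to CPU before gathering the next one is what keeps the peak device-memory footprint bounded by one chunk rather than the full sample set.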
We decided to introduce chunking to guard against potential memory or timeout issues when trying to all_gather for pretraining workloads.
I believe this logic will have to change later: chunking needs to be implemented correctly rather than hard-coded to a single chunk, but it can stay as-is for the time being (it will just be slow, I think).