on_policy_distillation training is unstable #7121

@Jim2016713

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
The training process is very unstable:
W1218 12:19:35.164000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7756 via 1, forcefully exiting via 9
W1218 12:19:35.824000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7757 via 1, forcefully exiting via 9
W1218 12:19:36.496000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7759 via 1, forcefully exiting via 9
W1218 12:19:37.204000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7760 via 1, forcefully exiting via 9
W1218 12:19:37.790000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7761 via 1, forcefully exiting via 9
W1218 12:19:38.483000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7762 via 1, forcefully exiting via 9
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 715, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 879, in _invoke_run
    time.sleep(monitor_interval)
  File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 7689 got signal: 1
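
For anyone decoding the log above: the numbers in "via 1", "via 9", and "got signal: 1" are POSIX signal numbers. A minimal sketch, assuming a Linux host, that maps them to names using only the Python standard library (the helper name `describe_signal` is hypothetical and is not part of torch or swift):

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number, or a fallback string."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Raised when the number is not a signal known on this platform.
        return f"unknown signal {signum}"


# The two signal numbers that appear in the log.
for num in (1, 9):
    print(num, describe_signal(num))
```

On Linux, 1 is SIGHUP and 9 is SIGKILL. SIGHUP is commonly delivered when the controlling terminal or SSH session closes, so one possible reading (not confirmed by anything in this report) is that the launcher was hung up from outside rather than crashing inside the training loop. Launching under `nohup` or `tmux` would be one way to check that.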

Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version

Additional context
Add any other context about the problem here
