Description
Describe the bug
What the bug is and how to reproduce it, preferably with screenshots
The training process is very unstable:
W1218 12:19:35.164000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7756 via 1, forcefully exiting via 9
W1218 12:19:35.824000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7757 via 1, forcefully exiting via 9
W1218 12:19:36.496000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7759 via 1, forcefully exiting via 9
W1218 12:19:37.204000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7760 via 1, forcefully exiting via 9
W1218 12:19:37.790000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7761 via 1, forcefully exiting via 9
W1218 12:19:38.483000 7689 site-packages/torch/distributed/elastic/multiprocessing/api.py:919] Unable to shutdown process 7762 via 1, forcefully exiting via 9
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 138, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 715, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 879, in _invoke_run
time.sleep(monitor_interval)
File "/opt/miniconda/envs/swift/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 7689 got signal: 1
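For anyone reading the log: the numbers in `got signal: 1` and `forcefully exiting via 9` are POSIX signal numbers. Below is a minimal sketch for decoding them, assuming a Linux host. The helper name `describe_signal` is hypothetical and is not part of torch.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name for a signal number, or a placeholder if unknown."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal on this platform.
        return f"unknown signal {signum}"


# Signal numbers copied from the log above.
launcher_signal = describe_signal(1)     # the signal the launcher (PID 7689) received
worker_kill_signal = describe_signal(9)  # the signal used to force-stop the workers

print(f"launcher received: {launcher_signal}")
print(f"workers were force-killed with: {worker_kill_signal}")
```

On Linux, signal 1 is SIGHUP and signal 9 is SIGKILL. SIGHUP is commonly delivered when the controlling terminal or SSH session closes, so one possible cause is that the shell that launched the job was disconnected, rather than a failure inside the training code. This is an assumption and the log alone does not confirm it.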
Your hardware and system info
Write your system info here, such as CUDA version, operating system, GPU model, and torch version
Additional context
Add any other context about the problem here