Training failed in single GPU #102

Open
opened 2026-01-29 21:41:57 +00:00 by claunia · 10 comments
Owner

Originally created by @nowfalcodmeric on GitHub (Nov 9, 2021).

File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/basicsr/data/degradations.py", line 746, in add_jpg_compression
_, encimg = cv2.imencode('.jpg', img * 255., encode_param)
cv2.error: OpenCV(4.5.4-dev) :-1: error: (-5:Bad argument) in function 'imencode'

Overload resolution failed:

  • Can't parse 'params'. Sequence item with index 1 has a wrong type
  • Can't parse 'params'. Sequence item with index 1 has a wrong type

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91253) of binary: /home/codmeric/PycharmProjects/pikifix_env/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

gfpgan/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2021-11-09_11:48:26
host : codmeric-B365M-DS3H
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 91253)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


@nowfalcodmeric commented on GitHub (Nov 9, 2021):

python -m torch.distributed.launch --nproc_per_node=1 --master_port=22021 gfpgan/train.py -opt options/train_gfpgan_v1.yml --launcher pytorch


@stabling commented on GitHub (Nov 11, 2021):

Have you solved this problem? I am running into the same one.


@YaZlob commented on GitHub (Nov 12, 2021):

Try changing line 764 of /opt/conda/lib/python3.8/site-packages/basicsr/data/degradations.py to:
quality = int(np.random.uniform(quality_range[0], quality_range[1]))
It works for me.
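
The cast matters because `cv2.imencode` expects its `params` argument to be a sequence of integers, while `np.random.uniform` returns a float; that mismatch is what the "Sequence item with index 1 has a wrong type" message in the original traceback points at. Below is a minimal sketch of the idea, written with the standard library only so that it runs without OpenCV or NumPy installed. `build_jpeg_params` is a hypothetical helper for illustration, not a BasicSR function, and the constant `IMWRITE_JPEG_QUALITY = 1` is assumed to mirror the value of `cv2.IMWRITE_JPEG_QUALITY`.

```python
import random

# Mirrors the value of cv2.IMWRITE_JPEG_QUALITY, so the sketch runs without OpenCV.
IMWRITE_JPEG_QUALITY = 1


def build_jpeg_params(quality_range, rng=random):
    """Return an all-integer [flag, value] list for a JPEG encoder.

    Hypothetical helper for illustration; it is not part of BasicSR.
    """
    low, high = quality_range
    # uniform() returns a float, which an integer-only parameter list rejects.
    quality = rng.uniform(low, high)
    # Truncate to an int, which is what the one-line fix above does.
    return [IMWRITE_JPEG_QUALITY, int(quality)]


params = build_jpeg_params((60, 100), random.Random(0))
# With OpenCV installed, this list would be passed as the third argument:
#   _, encimg = cv2.imencode('.jpg', img * 255., params)
```

Truncating with `int()` keeps the sampled quality inside the requested range; rounding would also work if the upper bound is a valid JPEG quality.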


@stabling commented on GitHub (Nov 14, 2021):

> try changing the file /opt/conda/lib/python3.8/site-packages/basicsr/data/degradations.py in line 764: quality = int(np.random.uniform(quality_range[0], quality_range[1])) it works for me

It also works for me. Thank you.


@danyow-cheung commented on GitHub (Apr 18, 2024):

Changing `degradations.py` does not work for me:

(env) PS E:\Code\wav2lip_288x288\GFPGAN> python -m torch.distributed.launch --nproc_per_node=1 --master_port=22021 gfpgan/train.py -opt options/train_gfpgan_v1.yml --launcher pytorch
NOTE: Redirects are currently not supported in Windows or MacOs.
E:\Code\Personal\env\lib\site-packages\torch\distributed\launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ieonline.microsoft.com]:22021 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [ieonline.microsoft.com]:22021 (system error: 10049 - The requested address is not valid in its context.).
E:\Code\Personal\env\lib\site-packages\torchvision\transforms\functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be **removed in 0.17**. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
  warnings.warn(
usage: train.py [-h] -opt OPT [--launcher {none,pytorch,slurm}] [--auto_resume] [--debug] [--local_rank LOCAL_RANK] [--force_yml FORCE_YML [FORCE_YML ...]]
train.py: error: unrecognized arguments: --local-rank=0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 13988) of binary: E:\Code\Personal\env\Scripts\python.exe
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\launch.py", line 196, in <module>
    main()
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\launch.py", line 192, in main
    launch(args)
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\launch.py", line 177, in launch
    run(args)
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "E:\Code\Personal\env\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
gfpgan/train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-18_18:10:23
  host      : PC-20230813YPQQ
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 13988)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
@doniaa24 commented on GitHub (May 10, 2024):

@danyow-cheung Hello, have you solved this problem? I am facing the same issue.


@Knzaytsev commented on GitHub (May 15, 2024):

@doniaa24
In `basicsr/utils/options.py`, just change the `local_rank` argument to `local-rank`.
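
The rename is needed because the launcher in the log above passes `--local-rank=0` (with a hyphen), while the script's parser only declares `--local_rank`, hence the "unrecognized arguments" error. Below is a minimal sketch of a parser that accepts both spellings and falls back to the `LOCAL_RANK` environment variable mentioned in the deprecation warning. `build_parser` is a hypothetical stand-in written for illustration, not the actual code in `basicsr/utils/options.py`, and only a few of the options from the usage line are reproduced.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser accepting both --local_rank and --local-rank."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-opt', type=str, required=True,
                        help='Path to the option YAML file.')
    parser.add_argument('--launcher', choices=['none', 'pytorch', 'slurm'],
                        default='none')
    # Register both spellings under one destination; default to the
    # LOCAL_RANK environment variable when neither flag is given.
    parser.add_argument('--local_rank', '--local-rank', dest='local_rank',
                        type=int,
                        default=int(os.environ.get('LOCAL_RANK', 0)))
    return parser


args = build_parser().parse_args([
    '-opt', 'options/train_gfpgan_v1.yml',
    '--launcher', 'pytorch',
    '--local-rank=0',
])
```

Accepting both option strings keeps the script usable with launchers that pass either form, so the same file can be shared across PyTorch versions.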


@doniaa24 commented on GitHub (May 23, 2024):

@Knzaytsev I'm encountering an issue during the testing phase. After training completed, the process had saved numerous model checkpoints for various components at different training iterations. However, I did not find a checkpoint specifically named gfpgan_model.pth, which I expected to use for straightforward testing of the overall model, as shown in this screenshot. Any help, please?
Screenshot from 2024-05-22 20-43-22


@danyow-cheung commented on GitHub (May 27, 2024):

> @Knzaytsev I'm encountering an issue during the testing phase. After training completed, the process had saved numerous model checkpoints for various components at different training iterations. However, I did not find a checkpoint specifically named gfpgan_model.pth, which I expected to use for straightforward testing of the overall model, as shown in this screenshot. Any help, please? Screenshot from 2024-05-22 20-43-22

Check https://github.com/TencentARC/GFPGAN/blob/7552a7791caad982045a7bbe5634bbf1cd5c8679/inference_gfpgan.py#L81 ; you need to edit some code there.
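
One way to choose a file to point the inference script at is to take the generator weights from the latest training iteration. Below is a minimal sketch, assuming the generator checkpoints are saved as `net_g_<iteration>.pth` in a single folder. That naming pattern and the function name `latest_generator_checkpoint` are assumptions made for illustration, so check them against the files your training run actually produced.

```python
import re
import tempfile
from pathlib import Path

# Assumed naming pattern for generator checkpoints: net_g_<iteration>.pth
_CKPT_PATTERN = re.compile(r'^net_g_(\d+)\.pth$')


def latest_generator_checkpoint(models_dir):
    """Return the generator checkpoint with the highest iteration, or None."""
    best_path, best_iter = None, -1
    for path in Path(models_dir).iterdir():
        match = _CKPT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best_iter:
            best_path, best_iter = path, int(match.group(1))
    return best_path


# Demo on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('net_g_5000.pth', 'net_g_10000.pth', 'net_d_10000.pth'):
        (Path(tmp) / name).touch()
    chosen = latest_generator_checkpoint(tmp).name
```

The selected path could then be used in place of the default model path in the inference script, which is the part the comment above suggests editing.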


@CoolStarmoon commented on GitHub (Nov 28, 2024):

> @danyow-cheung Hello, have you solved this problem? I am facing the same issue.

I am facing the same issue. How did you solve this?


Reference: TencentARC/GFPGAN#102