mirror of
https://github.com/TencentARC/GFPGAN.git
synced 2026-02-20 08:21:05 +00:00
Training failed in single GPU #102
Originally created by @nowfalcodmeric on GitHub (Nov 9, 2021).
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/basicsr/data/degradations.py", line 746, in add_jpg_compression
_, encimg = cv2.imencode('.jpg', img * 255., encode_param)
cv2.error: OpenCV(4.5.4-dev) :-1: error: (-5:Bad argument) in function 'imencode'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91253) of binary: /home/codmeric/PycharmProjects/pikifix_env/bin/python
Traceback (most recent call last):
File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/codmeric/PycharmProjects/pikifix_env/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
gfpgan/train.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2021-11-09_11:48:26
host : codmeric-B365M-DS3H
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 91253)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
@nowfalcodmeric commented on GitHub (Nov 9, 2021):
python -m torch.distributed.launch --nproc_per_node=1 --master_port=22021 gfpgan/train.py -opt options/train_gfpgan_v1.yml --launcher pytorch
@stabling commented on GitHub (Nov 11, 2021):
Have you solved this problem? I am running into the same problem.
@YaZlob commented on GitHub (Nov 12, 2021):
Try changing line 764 of /opt/conda/lib/python3.8/site-packages/basicsr/data/degradations.py to:
quality = int(np.random.uniform(quality_range[0], quality_range[1]))
It works for me.
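A minimal sketch of why the int() cast may help, assuming the "Bad argument" error comes from a float quality value ending up in the encode_param list that is passed to cv2.imencode. The helper names sample_jpeg_quality and build_encode_param are hypothetical and not part of basicsr, random.uniform stands in for np.random.uniform, and quality_flag stands in for cv2.IMWRITE_JPEG_QUALITY so that the sketch runs without NumPy or OpenCV installed:

```python
import random


def sample_jpeg_quality(quality_range):
    """Draw a JPEG quality from a (low, high) range and return it as a plain int."""
    low, high = quality_range
    # random.uniform returns a float, as np.random.uniform does;
    # int() truncates it so the encoder receives an integer quality value.
    return int(random.uniform(low, high))


def build_encode_param(quality, quality_flag=1):
    """Build the [flag, value] list used as the third argument of cv2.imencode.

    quality_flag is a stand-in for cv2.IMWRITE_JPEG_QUALITY (an assumption made
    so that this sketch does not need OpenCV).
    """
    return [int(quality_flag), int(quality)]


quality = sample_jpeg_quality((60, 100))
encode_param = build_encode_param(quality)
print(encode_param)
```

With the real libraries, the same idea amounts to making sure every entry of encode_param is an integer before calling cv2.imencode('.jpg', img * 255., encode_param).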
@stabling commented on GitHub (Nov 14, 2021):
It also works for me. Thank you.
@danyow-cheung commented on GitHub (Apr 18, 2024):
Changing degradations.py fails.
@doniaa24 commented on GitHub (May 10, 2024):
@danyow-cheung Hello, have you solved this problem? I am facing the same issue.
@Knzaytsev commented on GitHub (May 15, 2024):
@doniaa24 in basicsr/utils/options.py, just change the local_rank argument to local-rank.
@doniaa24 commented on GitHub (May 23, 2024):
@Knzaytsev I'm encountering an issue during the testing phase. After training completed, the process had saved numerous model checkpoints for various components at different training iterations. However, I did not find a checkpoint specifically named gfpgan_model.pth, which I expected to use for straightforward testing of the overall model, as shown in this screenshot. Any help, please?

@danyow-cheung commented on GitHub (May 27, 2024):
Check this:
7552a7791c/inference_gfpgan.py (L81). You need to edit some code.
@CoolStarmoon commented on GitHub (Nov 28, 2024):
I am facing the same issue. How did you solve this?
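On the basicsr/utils/options.py change suggested earlier in the thread by @Knzaytsev: a minimal sketch, assuming the failure comes from the launcher passing --local-rank (as newer PyTorch releases appear to do) while the script only declares --local_rank. The build_parser function is hypothetical and only illustrates the argparse pattern; it is not the actual basicsr parser. Registering both spellings on one argument lets the same script run under either launcher behaviour:

```python
import argparse


def build_parser():
    """Return a parser that accepts both --local_rank and --local-rank."""
    parser = argparse.ArgumentParser()
    # Both option strings map to the same destination, args.local_rank,
    # so the rest of the training code does not need to change.
    parser.add_argument(
        '--local_rank', '--local-rank',
        dest='local_rank', type=int, default=0,
        help='Rank of the process on the local node, set by the launcher.')
    return parser


args = build_parser().parse_args(['--local-rank', '0'])
print(args.local_rank)
```

Renaming the argument outright, as suggested above, should work as well; accepting both spellings only avoids breaking older launchers that still pass the underscore form.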