RuntimeError: "Distributed package doesn't have NCCL" ??? #45
Originally created by @ghost on GitHub (Aug 10, 2021).
How do I train a custom model under Windows 10 with Miniconda?
Inference works great, but when I try to start a custom training run I only get errors.
The latest RTX/Quadro driver, the Nvidia CUDA Toolkit 11.3, cuDNN for CUDA 11.3, and the MS VS Build Tools are installed.
My Miniconda Env:

Training:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=22021 gfpgan\train.py -opt c:\GFPGAN\options\test.yml --launcher pytorch
Train_Error.txt
@xinntao commented on GitHub (Aug 11, 2021):
I have not tried training on Windows.
It seems that you either have not installed NCCL or are using a PyTorch build that was not compiled with NCCL support.
BTW, if you only have one GPU, you do not need distributed training.
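A quick way to confirm the NCCL point above (a hedged sketch; NCCL ships only with Linux builds of PyTorch, so Windows wheels report it as unavailable and gloo is the usual Windows backend for torch.distributed):

import torch
import torch.distributed as dist

# CUDA itself can work fine while the NCCL distributed backend is missing.
print(torch.__version__, "CUDA available:", torch.cuda.is_available())
print("NCCL backend available:", dist.is_nccl_available())
print("Gloo backend available:", dist.is_gloo_available())

With a single GPU the launcher can be skipped entirely, e.g. (reusing the option file from above):

python gfpgan\train.py -opt c:\GFPGAN\options\test.yml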
@ghost commented on GitHub (Aug 11, 2021):
No idea what I am doing wrong. Under Windows and in Google Colab I only get errors when trying to train.
inference_gfpgan.py works under both Windows and Google Colab. Other projects, e.g. Nvidia StyleGAN2-ADA PyTorch, also work with their CUDA ops built at runtime.
Error on Win 10 with Conda:
set BASICSR_JIT=True && python gfpgan\train.py -opt c:\Users\Chaos\Downloads\test.yml
or
python gfpgan\train.py -opt c:\Users\Chaos\Downloads\test.yml
2021-08-11 10:36:58,307 INFO: Model [GFPGANModel] is created.
2021-08-11 10:37:05,481 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan\train.py", line 11, in
train_pipeline(root_path)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "d:!_ai!_repo\gfpgan\gfpgan\models\gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "d:!_ai!_repo\gfpgan\gfpgan\archs\gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined
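One Windows-specific gotcha worth ruling out here (an assumption, not something confirmed in the thread): with cmd.exe, set BASICSR_JIT=True && python ... stores the value as "True " with a trailing space, so a string comparison against 'True' inside BasicSR would quietly fail and fused_act_ext would never be JIT-built, which matches the NameError above. Setting the variable on its own line avoids the problem:

set BASICSR_JIT=True
rem Optional: confirm what Python actually sees for the variable
python -c "import os; print(repr(os.getenv('BASICSR_JIT')))"
python gfpgan\train.py -opt c:\Users\Chaos\Downloads\test.yml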
Error on Google Colab:
!BASICSR_JIT=True python gfpgan/train.py -opt /content/test.yml
2021-08-11 08:33:04,594 INFO: Model [GFPGANModel] is created.
2021-08-11 08:33:04,654 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan/train.py", line 11, in
train_pipeline(root_path)
File "/usr/local/lib/python3.7/dist-packages/basicsr/train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "/content/GFPGAN/gfpgan/models/gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/GFPGAN/gfpgan/archs/gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
RuntimeError: input must be a CUDA tensor
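With BASICSR_JIT=True the extension does load, but the JIT-built fused ops are CUDA-only, so every tensor that reaches them must already be on the GPU. A minimal illustration, assuming the fused_leaky_relu import path shown in the traceback (hypothetical shapes, not the training code):

import torch
from basicsr.ops.fused_act import fused_leaky_relu

x = torch.randn(1, 64, 32, 32)   # CPU tensor
bias = torch.zeros(64)
# fused_leaky_relu(x, bias)      # raises "input must be a CUDA tensor"
out = fused_leaky_relu(x.cuda(), bias.cuda())  # fine once both live on the GPU

In a training run this usually points at the model input (self.lq here) never being moved to CUDA, e.g. if the option file requests CPU training; the num_gpu entry in test.yml is worth checking.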
Error on Google Colab:
!python gfpgan/train.py -opt /content/test.yml
2021-08-11 08:35:29,867 INFO: Model [GFPGANModel] is created.
2021-08-11 08:35:29,924 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan/train.py", line 11, in
train_pipeline(root_path)
File "/usr/local/lib/python3.7/dist-packages/basicsr/train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "/content/GFPGAN/gfpgan/models/gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/GFPGAN/gfpgan/archs/gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined
@xinntao commented on GitHub (Aug 13, 2021):
It is related to the BASICSR_JIT env variable. You can check how it is handled in BasicSR, or you can force building the CUDA ops at runtime.
For the "input must be a CUDA tensor" error, check whether you have put the tensor on the GPU.
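Putting both hints together for a Colab run, a minimal sketch (the %env magic and the num_gpu key are assumptions about the setup, not taken from the thread):

# Set BASICSR_JIT before any basicsr import so the CUDA ops are built at runtime.
%env BASICSR_JIT=True
# Confirm the notebook actually has a GPU attached (Runtime -> Change runtime type).
!nvidia-smi
# test.yml should request GPU training (e.g. num_gpu: 1) so batches are moved to CUDA.
!python gfpgan/train.py -opt /content/test.yml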