RuntimeError:"Distributed package doesn't have NCCL" ??? #45

Open
opened 2026-01-29 21:39:30 +00:00 by claunia · 3 comments
Owner

Originally created by @ghost on GitHub (Aug 10, 2021).

How to train a custom model under Windows 10 with miniconda?
Inference works great but when I try to start a custom training only errors come up.
The latest RTX/Quadro driver, NVIDIA CUDA Toolkit 11.3 + cuDNN 11.3, and the MS VS Build Tools are installed.

My Miniconda Env:
[image: pytorchconda]

Training:
python -m torch.distributed.launch --nproc_per_node=4 --master_port=22021 gfpgan\train.py -opt c:\GFPGAN\options\test.yml --launcher pytorch

Train_Error.txt


@xinntao commented on GitHub (Aug 11, 2021):

I have not tried on Windows for training.
It seems that you have not installed NCCL, or you have installed a PyTorch version that was not built with NCCL.

BTW, if you only have one GPU, you do not need to use distributed training.
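A quick way to confirm the diagnosis above is to ask the installed PyTorch build which distributed backends it was compiled with. The sketch below is only an illustration: `pick_backend` and `detect_backend` are hypothetical helper names, not part of GFPGAN or BasicSR. NCCL is not available on Windows, so on the setup described in this issue the check would most likely report `gloo` or `none`. Whether the training script lets you select a backend other than NCCL is not covered here.

```python
def pick_backend(nccl_available: bool, gloo_available: bool) -> str:
    """Return a torch.distributed backend name, or 'none' for single-process training."""
    if nccl_available:
        return "nccl"
    if gloo_available:
        return "gloo"
    return "none"


def detect_backend() -> str:
    """Query the installed PyTorch build; returns 'none' if torch is missing."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "none"
    # The backend helpers are only usable when the distributed package is built in.
    if not dist.is_available():
        return "none"
    return pick_backend(dist.is_nccl_available(), dist.is_gloo_available())


if __name__ == "__main__":
    print("usable distributed backend:", detect_backend())
```

If this reports anything other than `nccl`, launching through `torch.distributed.launch` with an NCCL backend can be expected to fail with the error in the title; on a single GPU, running `gfpgan\train.py` directly avoids the distributed package altogether.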


@ghost commented on GitHub (Aug 11, 2021):

No idea what I am doing wrong. Both under Windows and in Google Colab, I only get errors when trying to train.
inference_gfpgan.py works under Windows and Google Colab. With other projects, e.g. Nvidia StyleGAN2-ADA PyTorch, it works with the CUDA ops built at runtime.

Error on Win 10 with Conda:
set BASICSR_JIT=True && python gfpgan\train.py -opt c:\Users\Chaos\Downloads\test.yml
or
python gfpgan\train.py -opt c:\Users\Chaos\Downloads\test.yml

2021-08-11 10:36:58,307 INFO: Model [GFPGANModel] is created.
2021-08-11 10:37:05,481 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan\train.py", line 11, in <module>
train_pipeline(root_path)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "d:\!_ai\!_repo\gfpgan\gfpgan\models\gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "d:\!_ai\!_repo\gfpgan\gfpgan\archs\gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\container.py", line 139, in forward
input = module(input)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "C:\Users\Chaos\miniconda3\envs\GFPGAN\lib\site-packages\basicsr\ops\fused_act\fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined

Error on Google Colab:
!BASICSR_JIT=True python gfpgan/train.py -opt /content/test.yml

2021-08-11 08:33:04,594 INFO: Model [GFPGANModel] is created.
2021-08-11 08:33:04,654 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan/train.py", line 11, in <module>
train_pipeline(root_path)
File "/usr/local/lib/python3.7/dist-packages/basicsr/train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "/content/GFPGAN/gfpgan/models/gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/GFPGAN/gfpgan/archs/gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
RuntimeError: input must be a CUDA tensor

Error on Google Colab:
!python gfpgan/train.py -opt /content/test.yml

2021-08-11 08:35:29,867 INFO: Model [GFPGANModel] is created.
2021-08-11 08:35:29,924 INFO: Start training from epoch: 0, iter: 0
Traceback (most recent call last):
File "gfpgan/train.py", line 11, in <module>
train_pipeline(root_path)
File "/usr/local/lib/python3.7/dist-packages/basicsr/train.py", line 167, in train_pipeline
model.optimize_parameters(current_iter)
File "/content/GFPGAN/gfpgan/models/gfpgan_model.py", line 305, in optimize_parameters
self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/content/GFPGAN/gfpgan/archs/gfpganv1_arch.py", line 347, in forward
feat = self.conv_body_first(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 91, in forward
return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 95, in fused_leaky_relu
return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
File "/usr/local/lib/python3.7/dist-packages/basicsr/ops/fused_act/fused_act.py", line 65, in forward
out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined


@xinntao commented on GitHub (Aug 13, 2021):

  1. On Windows with conda: you may need to check the BASICSR_JIT environment variable. You can check how it is used in BasicSR:
    [image]

     Or you can force the CUDA ops to be built at runtime.

  2. Google Colab: RuntimeError: input must be a CUDA tensor

     Check whether the tensor has been moved to the GPU.
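The two checks suggested above can be sketched as follows. This is an illustration under stated assumptions, not BasicSR code: `jit_flag_problem` and `non_cuda_tensors` are hypothetical helpers, and the first one assumes that BasicSR compares the variable against the exact string `True`. One possible cause on Windows is worth ruling out: in cmd.exe, `set BASICSR_JIT=True && python ...` stores everything up to the `&&`, including the trailing space, so the value becomes `True ` rather than `True`. Writing `set "BASICSR_JIT=True"` avoids that.

```python
import os
from types import SimpleNamespace


def jit_flag_problem(env=os.environ):
    """Explain why BASICSR_JIT would not be read as 'True', or return None if it is fine.

    Assumption: the consumer compares the value to the exact string 'True'.
    """
    raw = env.get("BASICSR_JIT")
    if raw is None:
        return "BASICSR_JIT is not set"
    if raw == "True":
        return None
    if raw.strip() == "True":
        return f"BASICSR_JIT has stray whitespace: {raw!r}"
    return f"BASICSR_JIT has unexpected value: {raw!r}"


def non_cuda_tensors(named_tensors):
    """Return the names of tensors that are not on a GPU.

    Relies only on the `is_cuda` attribute that torch.Tensor provides.
    """
    return [name for name, t in named_tensors.items()
            if not getattr(t, "is_cuda", False)]


if __name__ == "__main__":
    print(jit_flag_problem() or "BASICSR_JIT looks fine")
    # Stand-ins for real tensors; in training code pass e.g. {"lq": self.lq}.
    fake = {"lq": SimpleNamespace(is_cuda=False), "gt": SimpleNamespace(is_cuda=True)}
    print("tensors still on CPU:", non_cuda_tensors(fake))
```

In a real run, `non_cuda_tensors` would be called with the actual input tensors just before the forward pass; a non-empty result would be consistent with the `input must be a CUDA tensor` error, for example when the Colab runtime has no GPU attached.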

Reference: TencentARC/GFPGAN#45