Issues with PyTorch Distributed Training on Google Colab #487

Open
opened 2026-01-29 21:48:08 +00:00 by claunia · 12 comments
Owner

Originally created by @doniaa24 on GitHub (May 10, 2024).

Hello everyone,

Screenshot from 2024-05-10 20-16-55: https://github.com/TencentARC/GFPGAN/assets/107725595/78b5a5a5-0ea3-4f50-8a0b-97640b851e48

I'm encountering errors while training a GFPGAN model using PyTorch's distributed training setup on Google Colab. The primary issues are a ChildFailedError with a SIGKILL signal and multiple plugin registration errors for cuDNN, cuFFT, and cuBLAS.

Any help would be appreciated!


@jean943 commented on GitHub (May 15, 2024):

When I run it on Google's server, I also get an error at the "Inference" step. I'm a layperson in this area, but I really wanted it to work there so I could restore some photos.


@doniaa24 commented on GitHub (May 15, 2024):

@jean943 What error did you encounter?


@doniaa24 commented on GitHub (May 15, 2024):

I solved this issue by removing "torch.distributed.launch --nproc_per_node=4 --master_port=22021" from the command. To train the GFPGAN model you can simply run: !torchrun gfpgan/train.py -opt options/train_gfpgan_v1.yml
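
One possible explanation, which the thread does not confirm, is that the original command started four worker processes while a Colab runtime usually exposes a single GPU. The sketch below builds a torchrun command with at most one process per GPU. The helper name `build_train_command` is hypothetical, and appending `--launcher pytorch` for the multi-GPU case is an assumption based on the upstream README's training command.

```python
import shlex


def build_train_command(num_gpus, opt_path="options/train_gfpgan_v1.yml"):
    """Return a torchrun command that starts at most one process per GPU."""
    # Never ask for fewer than one process, even if no GPU is reported.
    nproc = max(1, num_gpus)
    cmd = ["torchrun", f"--nproc_per_node={nproc}", "gfpgan/train.py", "-opt", opt_path]
    if nproc > 1:
        # Assumption: multi-process runs need the distributed launcher flag.
        cmd += ["--launcher", "pytorch"]
    return shlex.join(cmd)


# On a typical single-GPU Colab runtime:
print(build_train_command(1))
```

In a notebook, `num_gpus` could come from `torch.cuda.device_count()`, and the printed string can be run in a cell prefixed with `!`.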


@doniaa24 commented on GitHub (May 15, 2024):

If you get the error "name 'fused_act_ext' is not defined", run: !BASICSR_JIT=True torchrun gfpgan/train.py -opt options/train_gfpgan_v1.yml
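
The command above passes BASICSR_JIT as an environment variable. If the training is launched from Python instead of a shell line, the same variable can be set on a copy of the environment, as in this minimal sketch. The helper name `with_basicsr_jit` is hypothetical, and the exact effect of the variable inside BasicSR is not described in this thread.

```python
import os


def with_basicsr_jit(env=None):
    """Return a copy of the environment with BASICSR_JIT set to "True"."""
    # Copy so the caller's mapping (or os.environ) is left untouched.
    new_env = dict(os.environ if env is None else env)
    new_env["BASICSR_JIT"] = "True"
    return new_env


# The result can be passed to subprocess.run(..., env=with_basicsr_jit()).
print(with_basicsr_jit({"PATH": "/usr/bin"})["BASICSR_JIT"])
```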


@jean943 commented on GitHub (May 16, 2024):

Forgive my ignorance; I'm not a programmer, just someone curious about the subject. I'm trying to restore some images using Google Colab. I don't have torch or Python installed; I run everything in the cloud. I'll copy the error that appears when I try to process the images. If you can, try uploading an image at the link below and check whether the same error happens for you at "3. Inference".

This is the Colab notebook: https://colab.research.google.com/drive/1sVsoBd9AjckIXThgtZhGrHRfFI6UUYOo

This is the error at step "3. Inference":

# Now we use the GFPGAN to restore the above low-quality images
# We use Real-ESRGAN (https://github.com/xinntao/Real-ESRGAN) for enhancing the background (non-face) regions
# You can find the different models in https://github.com/TencentARC/GFPGAN#european_castle-model-zoo
!rm -rf results
!python inference_gfpgan.py -i inputs/upload -o results -v 1.3 -s 2 --bg_upsampler realesrgan

# Usage: python inference_gfpgan.py -i inputs/whole_imgs -o results -v 1.3 -s 2 [options]...
#
#  -h                 show this help
#  -i input           Input image or folder. Default: inputs/whole_imgs
#  -o output          Output folder. Default: results
#  -v version         GFPGAN model version. Option: 1 | 1.2 | 1.3. Default: 1.3
#  -s upscale         The final upsampling scale of the image. Default: 2
#  -bg_upsampler      background upsampler. Default: realesrgan
#  -bg_tile           Tile size for background sampler, 0 for no tile during testing. Default: 400
#  -suffix            Suffix of the restored faces
#  -only_center_face  Only restore the center face
#  -aligned           Input are aligned faces
#  -ext               Image extension. Options: auto | jpg | png, auto means using the same extension as inputs. Default: auto

!ls results/cmp

Traceback (most recent call last):
  File "/content/GFPGAN/inference_gfpgan.py", line 7, in <module>
    from basicsr.utils import imwrite
  File "/usr/local/lib/python3.10/dist-packages/basicsr/__init__.py", line 4, in <module>
    from .data import *
  File "/usr/local/lib/python3.10/dist-packages/basicsr/data/__init__.py", line 22, in <module>
    _dataset_modules = [importlib.import_module(f'basicsr.data.{file_name}') for file_name in dataset_filenames]
  File "/usr/local/lib/python3.10/dist-packages/basicsr/data/__init__.py", line 22, in <listcomp>
    _dataset_modules = [importlib.import_module(f'basicsr.data.{file_name}') for file_name in dataset_filenames]
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.10/dist-packages/basicsr/data/realesrgan_dataset.py", line 11, in <module>
    from basicsr.data.degradations import circular_lowpass_kernel, random_mixed_kernels
  File "/usr/local/lib/python3.10/dist-packages/basicsr/data/degradations.py", line 8, in <module>
    from torchvision.transforms.functional_tensor import rgb_to_grayscale
ModuleNotFoundError: No module named 'torchvision.transforms.functional_tensor'
ls: cannot access 'results/cmp': No such file or directory

I put the last two lines of the output in bold because I believe the error is there, but I don't know how to fix it. Could you help me?
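
The ModuleNotFoundError in the traceback says that torchvision.transforms.functional_tensor cannot be imported. This is consistent with that module having been removed in torchvision 0.17; that version number is stated here as an assumption, not something taken from the thread. A minimal sketch for checking the installed version, with hypothetical helper names:

```python
from importlib import metadata


def functional_tensor_removed(version):
    """Return True if the version string is 0.17 or newer (assumed cutoff)."""
    # Drop a local suffix such as "+cu121", then compare major and minor only.
    major, minor = (int(part) for part in version.split("+")[0].split(".")[:2])
    return (major, minor) >= (0, 17)


def installed_torchvision():
    """Return the installed torchvision version string, or None if absent."""
    try:
        return metadata.version("torchvision")
    except metadata.PackageNotFoundError:
        return None


version = installed_torchvision()
if version is None:
    print("torchvision is not installed")
else:
    print(version, "-> old import path removed:", functional_tensor_removed(version))
```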


@doniaa24 commented on GitHub (May 16, 2024):

@jean943 In the file degradations.py, on line 8, simply change:
from torchvision.transforms.functional_tensor import rgb_to_grayscale
to:
from torchvision.transforms.functional import rgb_to_grayscale


@jean943 commented on GitHub (May 16, 2024):

@doniaa24, can I change that line of code from within Google Colab itself? Could you point me in the right direction? The code is generated when I press play on "3. Inference".


@shishirahm3d commented on GitHub (May 16, 2024):

@jean943 Run this code in Colab before running "3. Inference". It replaces line 8 of degradations.py, changing
from torchvision.transforms.functional_tensor import rgb_to_grayscale
to:
from torchvision.transforms.functional import rgb_to_grayscale

# Define the file path
file_path = '/usr/local/lib/python3.10/dist-packages/basicsr/data/degradations.py'

# Define the new import statement
new_import_statement = "from torchvision.transforms.functional import rgb_to_grayscale\n"

# Read the content of the file
with open(file_path, 'r') as file:
    lines = file.readlines()

# Modify the desired line (line 8 in this case)
if len(lines) >= 8:
    lines[7] = new_import_statement  # Index 7 corresponds to line 8 (0-based indexing)

# Write the modified content back to the file
with open(file_path, 'w') as file:
    file.writelines(lines)

print("Replacement completed successfully!")
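
The script above overwrites line 8 whatever it contains and hard-codes the Python 3.10 path from the traceback. A variant that only swaps the import when it is actually present could look like the sketch below. The helper names `patch_degradations` and `find_degradations` are hypothetical, and the `data/degradations.py` location inside the package is taken from the traceback in this thread.

```python
from importlib import util
from pathlib import Path

OLD_IMPORT = "from torchvision.transforms.functional_tensor import rgb_to_grayscale"
NEW_IMPORT = "from torchvision.transforms.functional import rgb_to_grayscale"


def patch_degradations(path):
    """Swap the old import for the new one; return True if the file changed."""
    path = Path(path)
    text = path.read_text()
    if OLD_IMPORT not in text:
        # Already patched, or a different version of the file: leave it alone.
        return False
    path.write_text(text.replace(OLD_IMPORT, NEW_IMPORT))
    return True


def find_degradations():
    """Locate basicsr/data/degradations.py without importing basicsr."""
    spec = util.find_spec("basicsr")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "data" / "degradations.py"


target = find_degradations()
if target is None or not target.exists():
    print("basicsr is not installed here; nothing to patch")
else:
    print("patched" if patch_degradations(target) else "no change needed")
```

Running the cell a second time reports that no change is needed, so it is safe to re-run after a runtime restart.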

@doniaa24 commented on GitHub (May 16, 2024):

@jean943 Try the code given by @shishirahm3d; I did the same.


@jean943 commented on GitHub (May 18, 2024):

Everything worked; you were the only one who managed to help me!! You can't imagine how grateful I am for your help!!


@jean943 commented on GitHub (May 18, 2024):

I'm restoring a VHS tape. I exported all the frames as JPG so that I could process them in GFPGAN. I had seen some videos showing the results and got excited, but unfortunately the results fell short of expectations. I don't know whether you know of another AI tool that could help me, but I'm open to suggestions. I selected a few frames for testing and, unfortunately, the result was quite disappointing. :(


@doniaa24 commented on GitHub (May 18, 2024):

@jean943 Which version of GFPGAN were you testing?

Reference: TencentARC/GFPGAN#487