Bug when resuming training from a checkpoint #23

Closed
opened 2026-01-29 21:37:45 +00:00 by claunia · 8 comments

Originally created by @SimKarras on GitHub (Jul 11, 2021).

To resume training from a checkpoint, I changed the following in the .yml file:

# path
path:
  pretrain_network_g: experiments/train_GFPGANv1_512/models/net_g_490000.pth
  param_key_g: params_ema
  strict_load_g: ~
  pretrain_network_d: experiments/train_GFPGANv1_512/models/net_d_490000.pth
  pretrain_network_d_left_eye: experiments/train_GFPGANv1_512/models/net_d_left_eye_490000.pth
  pretrain_network_d_right_eye: experiments/train_GFPGANv1_512/models/net_d_right_eye_490000.pth
  pretrain_network_d_mouth: experiments/train_GFPGANv1_512/models/net_d_mouth_490000.pth
  pretrain_network_identity: experiments/pretrained_models/arcface_resnet18.pth
  # resume
  resume_state: experiments/train_GFPGANv1_512/training_states/490000.state
  ignore_resume_networks: ['network_identity']

I did not change the pretrain_network_identity entry.
But I then got this error:

FileNotFoundError: [Errno 2] No such file or directory: 'GFPGAN/experiments/train_GFPGANv1_512/models/net_identity_490000.pth'

I'm completely baffled...
Looking at the configuration printed at the start of the log, pretrain_network_identity had already been changed by that point:

2021-07-11 22:21:11,000 INFO: Loading ResNetArcFace model from GFPGAN/experiments/train_GFPGANv1_512/models/net_identity_490000.pth.

How did that happen...?


@xinntao commented on GitHub (Jul 12, 2021):

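@JiaweiShiCV This is a bug in basicsr. You can update basicsr (v1.3.3.5).

The root cause is addressed in this commit: https://github.com/xinntao/BasicSR/commit/4a9671282797fcc8675ed5c25299364953d660f7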
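
For readers following along, here is a minimal sketch of the kind of path rewriting the symptoms point to: on resume, each pretrain_network_* entry is replaced with a checkpoint path built from the resume iteration, unless the network is listed in ignore_resume_networks. This illustrates the intended behaviour only. The function name rewrite_resume_paths and the "models" key are assumptions made for this example, and the code is not taken from basicsr.

```python
def rewrite_resume_paths(path_opt, network_keys, resume_iter):
    """Hypothetical helper (not basicsr code): point each network at its
    checkpoint for `resume_iter`, skipping networks the user asked to ignore."""
    ignored = path_opt.get("ignore_resume_networks") or []
    updated = dict(path_opt)  # leave the caller's dict untouched
    for key in network_keys:
        if key in ignored:
            # e.g. network_identity keeps its original pretrained weights
            continue
        basename = key.replace("network_", "", 1)
        updated[f"pretrain_{key}"] = (
            f"{path_opt['models']}/net_{basename}_{resume_iter}.pth"
        )
    return updated


if __name__ == "__main__":
    path_opt = {
        "models": "experiments/train_GFPGANv1_512/models",  # assumed key name
        "pretrain_network_identity": "experiments/pretrained_models/arcface_resnet18.pth",
        "ignore_resume_networks": ["network_identity"],
    }
    resumed = rewrite_resume_paths(
        path_opt, ["network_g", "network_d", "network_identity"], 490000
    )
    for name in sorted(resumed):
        print(name, "->", resumed[name])
```

With the ignore list honoured, pretrain_network_identity stays at arcface_resnet18.pth. The error in the first post is what you would see if that entry were rewritten to net_identity_490000.pth anyway.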


@SimKarras commented on GitHub (Jul 12, 2021):

@xinntao After upgrading with pip install basicsr --upgrade, processing an image fails with this error:

(BasicSR) ➜  GFPGAN git:(master) ✗ python inference_gfpgan_full.py --model_path experiments/pretrained_models/G8/net_g_480000.pth --test_path inputs/whole_imgs --paste_back
Processing 112.jpg ...
Traceback (most recent call last):
  File "inference_gfpgan_full.py", line 129, in <module>
    restoration(
  File "inference_gfpgan_full.py", line 52, in restoration
    output = gfpgan(cropped_face_t, return_rgb=False)[0]
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/文档/GFPGAN/archs/gfpganv1_arch.py", line 348, in forward
    feat = self.conv_body_first(x)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 85, in forward
    return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 89, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/home/sjw/anaconda3/envs/BasicSR/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 59, in forward
    out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined

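I then tried uninstalling basicsr and reinstalling it with the environment variable set:
BASICSR_EXT=True pip install basicsr
The error is still the same...
I've switched back to 1.3.3.4 for now.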
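
The NameError above suggests the compiled extension was never imported, rather than failing when it was called. A minimal sketch for checking that directly follows. The module path basicsr.ops.fused_act.fused_act_ext is an assumption inferred from the file paths in the traceback, and check_extension is a hypothetical helper, not part of basicsr.

```python
import importlib


def check_extension(module_name):
    """Try to import a module and report (ok, error_text).

    Any import-time failure is caught so the underlying reason
    (missing file, unresolved symbol, wrong CUDA build) can be printed.
    """
    try:
        importlib.import_module(module_name)
        return True, ""
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Assumed module path, inferred from basicsr/ops/fused_act/fused_act.py
    name = "basicsr.ops.fused_act.fused_act_ext"
    ok, err = check_extension(name)
    print(f"{name}: {'importable' if ok else 'NOT importable'}")
    if not ok:
        print(err)
```

If the import fails here, the printed exception is usually more informative than the later NameError.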


@SimKarras commented on GitHub (Jul 12, 2021):

With the new version (1.3.3.5), the StyleGAN fused_act_ext extension fails to compile, so training cannot start.


@xinntao commented on GitHub (Jul 12, 2021):

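The code related to this was not changed in this release.

You can build from a git clone instead, which makes the problem easier to pinpoint:

  1. Uninstall the existing basicsr first
  2. git clone https://github.com/xinntao/BasicSR.git
  3. Enter the BasicSR directory and build: BASICSR_EXT=True python setup.py develop

If anything goes wrong, please paste the output. 1.3.3.5 really shouldn't have affected this =-=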


@SimKarras commented on GitHub (Jul 12, 2021):

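@xinntao haha, I just tried this on two machines. For both inference and training, 1.3.3.5 fails with NameError: name 'fused_act_ext' is not defined. Switching back to 1.3.3.4 works as before; the only thing that fails on 1.3.3.4 is resuming training from a checkpoint.

Here is the error from multi-GPU training on 1.3.3.5 (the same as for inference):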

Traceback (most recent call last):
  File "train.py", line 10, in <module>
    train_pipeline(root_path)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/train.py", line 166, in train_pipeline
    model.optimize_parameters(current_iter)
  File "/home/shijiawei/data-vol-1/GFPGAN/models/gfpgan_model.py", line 307, in optimize_parameters
    self.output, out_rgbs = self.net_g(self.lq, return_rgb=True)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 684, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/shijiawei/data-vol-1/GFPGAN/archs/gfpganv1_arch.py", line 348, in forward
    feat = self.conv_body_first(x)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 85, in forward
    return fused_leaky_relu(input, self.bias, self.negative_slope, self.scale)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 89, in fused_leaky_relu
    return FusedLeakyReLUFunction.apply(input, bias, negative_slope, scale)
  File "/opt/conda/lib/python3.8/site-packages/basicsr/ops/fused_act/fused_act.py", line 59, in forward
    out = fused_act_ext.fused_bias_act(input, bias, empty, 3, 0, negative_slope, scale)
NameError: name 'fused_act_ext' is not defined

@SimKarras commented on GitHub (Jul 12, 2021):

@xinntao Building it the way you described above seems to have fixed it...


@xinntao commented on GitHub (Jul 12, 2021):

OK, the earlier uninstall may not have been clean.

Or the build under BASICSR_EXT=True pip install basicsr may have failed. You can check the output with BASICSR_EXT=True pip -vvv install basicsr
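
One rough way to check for an unclean uninstall is to compare what the packaging metadata reports with what Python would actually import. The sketch below uses only the standard library; describe_install is a hypothetical helper written for this example, and it does not import the package itself.

```python
import importlib.util
from importlib import metadata


def describe_install(package):
    """Return the installed version (from metadata) and the import location.

    A mismatch, such as a location with no version, hints at leftover
    files from an earlier install.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        spec = None
    location = spec.origin if spec is not None else None
    return {"version": version, "location": location}


if __name__ == "__main__":
    print(describe_install("basicsr"))
```

If the location points somewhere unexpected (for example an old site-packages copy rather than the cloned source tree), removing that copy before rebuilding is a reasonable next step.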


@SimKarras commented on GitHub (Jul 12, 2021):

> OK, the earlier uninstall may not have been clean.
>
> Or the build under BASICSR_EXT=True pip install basicsr may have failed. You can check the output with BASICSR_EXT=True pip -vvv install basicsr

OK, thx!

Reference: TencentARC/GFPGAN#23