unable to start training : NCCL library error #155

opened 2026-01-29 21:44:15 +00:00 by claunia · 2 comments

Originally created by @nowfalcodmeric on GitHub (Feb 3, 2022).

RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

(The original log shows this same error three times, interleaved across worker processes; the message is deduplicated here.)

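A first step for errors like this is to re-run the launch with NCCL's own debug logging enabled, which reports which collective call and which device triggered `ncclInvalidUsage`. This is a minimal sketch; `train.py` and its flags stand in for the actual GFPGAN training command, which is not shown in the issue.

```shell
# Enable verbose NCCL logging before launching (real NCCL env vars;
# the training command itself is a placeholder).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
python -m torch.distributed.launch --nproc_per_node=1 train.py
```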

@Asuka001100 commented on GitHub (Feb 8, 2022):

Maybe the number of GPUs in your parameter settings does not match the GPUs actually available.

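The comment above is suggesting a mismatch between the launched process count and the visible GPUs: if more ranks are started (e.g. via `--nproc_per_node`) than there are devices, two ranks end up on one GPU and NCCL collectives can fail with `ncclInvalidUsage`. A minimal sketch of that sanity check, with all names illustrative (in a real script `visible_gpus` would come from `torch.cuda.device_count()`):

```python
def check_launch_config(nproc_per_node: int, visible_gpus: int) -> int:
    """Validate the per-node process count against visible GPUs.

    Hypothetical helper: raises if more worker processes were requested
    than there are devices, which would force two ranks onto one GPU.
    """
    if nproc_per_node > visible_gpus:
        raise ValueError(
            f"Launched {nproc_per_node} processes but only "
            f"{visible_gpus} GPU(s) are visible; lower --nproc_per_node "
            f"or adjust CUDA_VISIBLE_DEVICES."
        )
    return nproc_per_node
```

Running this check before `init_process_group` turns a cryptic NCCL failure into an explicit configuration error.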

@ucalyptus2 commented on GitHub (Dec 7, 2022):

@Asuka001100 I couldn't understand what you meant.


Reference: TencentARC/GFPGAN#155