ZeroDivisionError: integer division or modulo by zero #11

Open
underwhitee opened this issue Mar 6, 2023 · 11 comments

@underwhitee

per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
How should I set the number of processes and clients so that update_forward_grad_ps does not end up as an empty array?
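For reference, a minimal illustration of how the error arises (the values here are made up, this is not the repo's actual data):

worker_batches = [list(range(4)) for _ in range(8)]    # stand-in for the real batches
update_forward_grad_ps = []                            # stays empty if no worker process was created
per_proc = len(worker_batches) // len(update_forward_grad_ps)   # raises ZeroDivisionError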

@kiddyboots216
Owner

Hi, are you getting the error where that's an empty array currently? Could you share your setup details? I have typically only seen this problem transiently, and it is fixed by increasing the number of workers; it can also happen if computing the gradient takes a very long time.
To be honest, this code was written three years ago, when multiprocessing libraries were in a very different state. If I were going to write the code again, or even use it for another paper, I would use libraries that don't expose the user to as much churn from lower-level processes.

@underwhitee
Author

underwhitee commented Mar 8, 2023 via email

@kiddyboots216
Owner

Alright, I think the issue is that you've got 20 clients and 20 workers, which means you're trying to get through the entire dataset at each iteration. Can you try, say, 100 clients and 20 workers? You can also try increasing the timeout, to 900 s for example.
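To make the arithmetic behind that suggestion concrete, here is a rough sketch; the parameter names are taken from the argparse options quoted later in this thread, and the per-round sampling is simplified, so treat it as an assumption rather than the repo's actual logic:

# Each round, roughly num_workers clients are processed. If num_clients equals
# num_workers, every client -- and hence the whole federated dataset -- is
# processed every iteration, which can exceed the default queue timeout.
num_clients, num_workers = 100, 20
clients_per_round = min(num_workers, num_clients)                # 20
fraction_of_data_per_round = clients_per_round / num_clients     # 0.2 here, versus 1.0 with 20/20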

@xiayuanj

xiayuanj commented Aug 2, 2023

Sorry to take so long to reply to you. My parameter settings are as follows:

mode = sketch
num_clients = 20
num_workers = 20
num_devices = 1
share_ps_gpu (a store_true flag)

The error occurs when running cv_train.py:

  File "CommEfficient\fed_aggregator.py", line 232, in _call_train
    per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero

I think these parameters may be the cause. If you need any other parameter settings, please let me know. Thank you very much!


Hello, my device only has one GPU, and this problem also occurs when executing the code. Have you solved it?

@kiddyboots216
Owner

Hi, this error occurs when the worker processes do not enqueue to update_forward_grad_ps in time. Can you try increasing the timeout or increasing the number of clients? If you try, for example, clients=1 and workers=1, then you're trying to process the entire dataset at each iteration, and the default timeout is (perhaps) not long enough to get through the entire dataset with only 1 DataLoader worker.
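A minimal sketch of the failure mode described above, assuming a simple register-with-timeout pattern; this is not the repo's actual code, and names like register_queue and collect_worker_handles are made up for illustration:

import multiprocessing as mp
import queue

def collect_worker_handles(register_queue, n_expected, timeout_s):
    # Wait for worker processes to announce themselves. If none arrive before
    # the timeout, the returned list is empty, and the later division
    # len(worker_batches) // len(update_forward_grad_ps) raises ZeroDivisionError.
    handles = []
    for _ in range(n_expected):
        try:
            handles.append(register_queue.get(timeout=timeout_s))
        except queue.Empty:
            break  # a longer timeout (e.g. 900 s) gives slow workers more time
    return handles

# e.g. collect_worker_handles(mp.Queue(), n_expected=20, timeout_s=900)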

@xiayuanj

xiayuanj commented Aug 3, 2023


Thanks for your reply. I've tried increasing the number of clients and workers, and I still get this problem. I think the number of devices is what causes it, as shown below. My device has only one GPU, so I set num_devices to 1. During execution, if num_devices=1 and share_ps_gpu=False, then n_worker_gpus=0, which means the for loop below never executes, so the update_forward_grad_ps list stays empty.

[screenshot: the code in fed_aggregator.py that computes n_worker_gpus and spawns worker processes in a for loop]
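A sketch of the logic described above; the exact formula for n_worker_gpus is assumed from the screenshot and may differ slightly from the repo:

num_devices = 1
share_ps_gpu = False

# One GPU is reserved for the parameter server unless it is shared with the workers.
n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1  # -> 0 here

update_forward_grad_ps = []
for _ in range(n_worker_gpus):                 # range(0): the body never runs
    update_forward_grad_ps.append(object())    # stand-in for spawning a worker process

print(len(update_forward_grad_ps))  # 0, so the later integer division fails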

@kiddyboots216
Owner

Oh I see! Yeah, so you need to set share_ps_gpu=True when you run the code. That way, the workers can share a GPU with the parameter server. This will limit the size of the model you're able to run, since you have to hold two copies in memory at the same time, but it's necessary if you are running on one GPU.
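Under the same assumed formula as the sketch above, sharing the GPU restores one worker process (on the command line this would mean passing the store_true flag share_ps_gpu quoted earlier):

num_devices, share_ps_gpu = 1, True
n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1  # -> 1
# One worker process now gets created, so update_forward_grad_ps is no longer empty.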

@xiayuanj

xiayuanj commented Aug 3, 2023


I tried that, but ran into a new problem, shown below.
[screenshot: "NCCL error: invalid usage" raised from torch.distributed.reduce]
So I modified the torch.distributed.reduce(sum_g, 0) call, as follows.
[screenshot: the modified reduce call]
After the modification, the code seemed to hang in an infinite loop.

@kiddyboots216
Owner

Could you revert the change to torch.distributed.reduce and add these lines?
Could you also try export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL?
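If it is more convenient, the same effect can be had from inside the script by setting these standard NCCL environment variables before NCCL is first initialized (i.e. before torch.distributed.init_process_group is called); this is a general NCCL debugging technique, not something specific to this repo:

import os

# Must be set before torch.distributed initializes NCCL.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"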

@xiayuanj

xiayuanj commented Aug 3, 2023

I have tried these commands and they don't help.

torch.distributed.reduce(sum_g, 0)

I tried export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL as you suggested, but I still get the same problem.

When I use my modified torch.distributed.reduce(sum_g, 0), the command line prints "batch queue was empty".
[screenshot: console output showing the "batch queue was empty" message]

@kiddyboots216
Owner

The export commands just add some environment variables to make the error messages more useful. The "NCCL error: invalid usage" message you were originally getting is not descriptive; it could, for example, be caused by a version mismatch.
