ZeroDivisionError: integer division or modulo by zero #11

Open
underwhitee opened this issue Mar 6, 2023 · 11 comments

@underwhitee

per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
How should I set the number of processes and clients so that update_forward_grad_ps does not end up as an empty array?
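For reference, a minimal illustration of how the error arises (the values here are made up, this is not the repo's actual data):

worker_batches = [list(range(4)) for _ in range(8)]    # stand-in for the real batches
update_forward_grad_ps = []                            # stays empty if no worker process was created
per_proc = len(worker_batches) // len(update_forward_grad_ps)   # raises ZeroDivisionError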

@kiddyboots216
Owner

Hi, are you getting the error where that's an empty array currently? Could you share your setup details? I have typically only seen this problem transiently, and it is fixed by increasing the number of workers; it can also happen if computing the gradient takes a very long time.
To be honest, this code was written three years ago, when multiprocessing libraries were in a very different state. If I were going to write the code again, or even use it for another paper, I would use libraries that don't expose the user to as much churn from lower-level processes.

@underwhitee
Author

underwhitee commented Mar 8, 2023 via email

@kiddyboots216
Owner

Alright, I think the issue is that you've got 20 clients and 20 workers, which means you're trying to get through the entire dataset at each iteration. Can you try, say, 100 clients and 20 workers? You can also try increasing the timeout, to 900 s for example.
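To make the arithmetic behind that suggestion concrete, here is a rough sketch; the parameter names are taken from the argparse options quoted later in this thread, and the per-round sampling is simplified, so treat it as an assumption rather than the repo's actual logic:

# Each round, roughly num_workers clients are processed. If num_clients equals
# num_workers, every client -- and hence the whole federated dataset -- is
# processed every iteration, which can exceed the default queue timeout.
num_clients, num_workers = 100, 20
clients_per_round = min(num_workers, num_clients)                # 20
fraction_of_data_per_round = clients_per_round / num_clients     # 0.2 here, versus 1.0 with 20/20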

@xiayuanj

xiayuanj commented Aug 2, 2023

Sorry to take so long to reply to you. My parameter settings are as follows:

mode = sketch
num_clients = 20
num_workers = 20
num_devices = 1
share_ps_gpu (a store_true flag)

The error occurs when running cv_train.py:

  File "CommEfficient\fed_aggregator.py", line 232, in _call_train
    per_proc = len(worker_batches) // len(self.update_forward_grad_ps)
ZeroDivisionError: integer division or modulo by zero

I think these parameters may be the cause. If you need any other parameter settings, please let me know. Thank you very much!


Hello, my device only has one GPU, and this problem also occurs when executing the code. Have you solved it?

@kiddyboots216
Owner

Hi, this error occurs when the worker processes do not enqueue to update_forward_grad_ps in time. Can you try increasing the timeout or increasing the number of clients? If you try, for example, clients=1 and workers=1, then you're trying to process the entire dataset at each iteration, and the default timeout is (perhaps) not long enough to get through the entire dataset with only 1 DataLoader worker.
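A minimal sketch of the failure mode described above, assuming a simple register-with-timeout pattern; this is not the repo's actual code, and names like register_queue and collect_worker_handles are made up for illustration:

import multiprocessing as mp
import queue

def collect_worker_handles(register_queue, n_expected, timeout_s):
    # Wait for worker processes to announce themselves. If none arrive before
    # the timeout, the returned list is empty, and the later division
    # len(worker_batches) // len(update_forward_grad_ps) raises ZeroDivisionError.
    handles = []
    for _ in range(n_expected):
        try:
            handles.append(register_queue.get(timeout=timeout_s))
        except queue.Empty:
            break  # a longer timeout (e.g. 900 s) gives slow workers more time
    return handles

# e.g. collect_worker_handles(mp.Queue(), n_expected=20, timeout_s=900)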

@xiayuanj

xiayuanj commented Aug 3, 2023


Thanks for your reply. I've tried increasing the number of clients and workers, and I still get this problem. I think the number of devices is what causes it, as shown below. My device has only one GPU, so I set num_devices to 1. During execution, if num_devices=1 and share_ps_gpu=False, then n_worker_gpus=0, which means the for loop below never executes, so the update_forward_grad_ps list stays empty.

[screenshot: the code in fed_aggregator.py that computes n_worker_gpus and spawns worker processes in a for loop]
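A sketch of the logic described above; the exact formula for n_worker_gpus is assumed from the screenshot and may differ slightly from the repo:

num_devices = 1
share_ps_gpu = False

# One GPU is reserved for the parameter server unless it is shared with the workers.
n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1  # -> 0 here

update_forward_grad_ps = []
for _ in range(n_worker_gpus):                 # range(0): the body never runs
    update_forward_grad_ps.append(object())    # stand-in for spawning a worker process

print(len(update_forward_grad_ps))  # 0, so the later integer division fails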

@kiddyboots216
Owner

Oh I see! Yeah, so you need to set share_ps_gpu=True when you run the code. That way, the workers can share a GPU with the parameter server. This will limit the size of the model you're able to run, since you have to hold two copies in memory at the same time, but it's necessary if you are running on one GPU.
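Under the same assumed formula as the sketch above, sharing the GPU restores one worker process (on the command line this would mean passing the store_true flag share_ps_gpu quoted earlier):

num_devices, share_ps_gpu = 1, True
n_worker_gpus = num_devices if share_ps_gpu else num_devices - 1  # -> 1
# One worker process now gets created, so update_forward_grad_ps is no longer empty.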

@xiayuanj

xiayuanj commented Aug 3, 2023


I tried that, but ran into a new problem, shown below.
[screenshot: "NCCL error: invalid usage" raised from torch.distributed.reduce]
So I modified the torch.distributed.reduce(sum_g, 0) call, as follows.
[screenshot: the modified reduce call]
After the modification, the code seemed to hang in an infinite loop.

@kiddyboots216
Owner

Could you revert the change to torch.distributed.reduce and add these lines?
Could you also try export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL?
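If it is more convenient, the same effect can be had from inside the script by setting these standard NCCL environment variables before NCCL is first initialized (i.e. before torch.distributed.init_process_group is called); this is a general NCCL debugging technique, not something specific to this repo:

import os

# Must be set before torch.distributed initializes NCCL.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"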

@xiayuanj

xiayuanj commented Aug 3, 2023

I have tried these commands and they don't help.

torch.distributed.reduce(sum_g, 0)

I tried export NCCL_DEBUG=INFO and export NCCL_DEBUG_SUBSYS=ALL as you suggested, but I still get the same problem.

When I use my modified torch.distributed.reduce(sum_g, 0), the command line prints "batch queue was empty".
[screenshot: console output showing the "batch queue was empty" message]

@kiddyboots216
Owner

The export commands just add some environment variables to make the error messages more useful. The "NCCL error: invalid usage" message you were originally getting is not descriptive; it could, for example, be caused by a version mismatch.
