

fix PLE network bug & sort file list for ps trainer #932

Open

liangzhenduo wants to merge 7 commits into master
Conversation

liangzhenduo

i denotes the task index, so it needs to be multiplied by the number of experts per task, not by the number of tasks.

i is the task index; multiply it by the number of experts per task, not by the task count.
liangzhenduo changed the title from "Update net.py fix a network bug" to "Update net.py fix PLE network bug" on Jun 8, 2023

CLAassistant commented Jun 8, 2023

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

fixed code style
The file order may differ across nodes, so split_file_list can assign the same file to multiple nodes. Sorting first guarantees every node sees the file list in the same order, so after the split no node reads a duplicate file.
liangzhenduo changed the title from "Update net.py fix PLE network bug" to "fix PLE network bug & sort file list for ps trainer" on Apr 15, 2024
liangzhenduo (Author) commented

The reader is changed because, in distributed training, the file order may differ across nodes (each node's list is unordered), so after split_file_list different nodes can end up reading the same files. Sorting guarantees every node has the file list in the same order, so after the split no node reads duplicate files.
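
A minimal sketch of the idea, using a hypothetical get_file_shard helper (PaddleRec's actual reader and split_file_list API may differ): sort the file list before splitting, so every trainer partitions the identical ordered list.

import os

def get_file_shard(data_dir, trainer_id, trainer_num):
    # hypothetical helper for illustration only; not PaddleRec's real API
    # sorting guarantees every trainer sees the identical file order,
    # so the round-robin split below assigns each file to exactly one trainer
    file_list = sorted(os.listdir(data_dir))
    # trainer k takes files k, k + trainer_num, k + 2 * trainer_num, ...
    return file_list[trainer_id::trainer_num]

# usage: with 2 trainers the shards are disjoint and together cover every file once
# get_file_shard("./train_data", 0, 2) and get_file_shard("./train_data", 1, 2)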


dachr8 commented Sep 4, 2024

The PR is correct; in addition, the task_init and exp_init parts have the same problem.

@@ -179,7 +179,7 @@ def forward(self, input_data):
         # task-specific expert part
         for i in range(0, self.task_num):
             for j in range(0, self.exp_per_task):
-                linear_out = self._param_expert[i * self.task_num + j](
+                linear_out = self._param_expert[i * self.exp_per_task + j](
Correct.
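
To see why the fixed index is right, here is a minimal sketch (plain Python; it assumes the flat expert list is built task by task, as the diff implies, and stands in strings for the real Linear sublayers):

task_num = 3      # number of tasks
exp_per_task = 2  # task-specific experts per task

# experts are appended task by task: [t0e0, t0e1, t1e0, t1e1, t2e0, t2e1]
_param_expert = [f"task{i}_expert{j}"
                 for i in range(task_num)
                 for j in range(exp_per_task)]

for i in range(task_num):
    for j in range(exp_per_task):
        # the buggy index i * task_num + j lands in other tasks' experts and,
        # whenever task_num > exp_per_task, runs past the end of the list
        assert _param_expert[i * exp_per_task + j] == f"task{i}_expert{j}"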
