
Question: how to configure multiple replicas for pretraining #6674

Closed
ltm920716 opened this issue Jan 16, 2025 · 4 comments
Labels
duplicate (This issue or pull request already exists)

Comments

@ltm920716

Reminder

  • I have read the above rules and searched the existing issues.

System Info

Reproduction

Put your message here.

Others

Hello,
I have a question: if I have multiple GPU nodes and a subset of them is already enough to train one copy of the model, how do I set up parallel training with multiple model replicas? I did not find a corresponding example among the provided examples. Thanks!

ltm920716 added the labels bug (Something isn't working) and pending (This problem is yet to be addressed) on Jan 16, 2025
@hiyouga
Owner

hiyouga commented Jan 17, 2025

See the documentation.

hiyouga closed this as completed on Jan 17, 2025
hiyouga added the labels solved (This problem has been already solved) and duplicate (This issue or pull request already exists), and removed the labels bug, pending, and solved on Jan 17, 2025
@ltm920716
Author

https://llamafactory.readthedocs.io/zh-cn/latest/advanced/distributed.html

Sorry about that. I have read the documentation, but perhaps I misunderstood it: the multi-node, multi-GPU examples there all seem to use a single replica (i.e., one model sharded across several machines). That is why I am asking how to configure multi-replica parallelism.
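For reference, the multi-node recipe in the linked documentation follows the pattern below; this is only a sketch, and the node count, addresses, and YAML path are placeholders rather than values from this thread. With a ZeRO-3 DeepSpeed config, such a launch shards one model across every participating rank, which is the "single replica split across machines" setup described in this comment.

```bash
# Sketch of the documented multi-node launch (placeholder values):
# on node 0
FORCE_TORCHRUN=1 NNODES=4 NODE_RANK=0 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
  llamafactory-cli train path/to/pretrain_config.yaml
# on node 1 (and NODE_RANK=2,3 on the remaining nodes)
FORCE_TORCHRUN=1 NNODES=4 NODE_RANK=1 MASTER_ADDR=192.168.0.1 MASTER_PORT=29500 \
  llamafactory-cli train path/to/pretrain_config.yaml
```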

@hiyouga
Owner

hiyouga commented Jan 17, 2025

Do you mean ZeRO++ or hybrid shard?
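As background on the two options mentioned here: ZeRO-3 already performs data parallelism (every rank processes different data) while sharding parameters, gradients, and optimizer states across all ranks; ZeRO++ adds hierarchical ("secondary") weight partitioning so a full copy of the weights is kept inside each sub-group of ranks, reducing cross-node all-gathers. Below is a minimal sketch of a DeepSpeed JSON config enabling it; the group size of 8 (one group per 8-GPU node) and the quantization flags are assumptions, not values from this thread. As far as I know, such a file can be referenced through the `deepspeed` field of a LLaMA-Factory training config.

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": { "enabled": true },
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```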

@ltm920716
Author

LLaMA-Factory's default parallelism is DP, and there are no configuration examples for TP or PP. So I would like to understand: if 4 nodes running LLaMA-Factory with DeepSpeed can train one Qwen2-72B, and I have another 16 GPUs available, can I stack 4 extra replicas on top of the current DP setup for parallel pretraining? Or can I only use all 20 nodes to train a single replica? Or do I have to implement this myself with DeepSpeed's TP parallelism? Thanks!
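The layout being asked about here (one full model copy sharded within each 4-node group, with the groups acting as ordinary data-parallel replicas of each other) is what PyTorch FSDP calls hybrid sharding (HSDP); DeepSpeed's closest analogue is the ZeRO++ hierarchical partitioning sketched above. I cannot confirm whether LLaMA-Factory exposes this directly, so the following is only a standalone conceptual sketch, assuming a recent PyTorch (>= 2.2), a torchrun launch, and 8 GPUs per node; none of the names or sizes below are LLaMA-Factory parameters.

```python
# Minimal HSDP sketch: shard parameters inside each group, replicate across groups.
# Assumes torchrun launch with world_size == num_replicas * shard_group_size.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_hybrid(model: torch.nn.Module, num_replicas: int, shard_group_size: int) -> FSDP:
    """Shard parameters within each group of `shard_group_size` ranks and
    replicate (plain data parallelism) across the `num_replicas` groups."""
    # 2-D mesh: outer dim = replica groups, inner dim = ranks sharing one model copy.
    mesh = init_device_mesh(
        "cuda",
        (num_replicas, shard_group_size),
        mesh_dim_names=("replicate", "shard"),
    )
    return FSDP(
        model,
        device_mesh=mesh,
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,  # shard within group, replicate across groups
        device_id=torch.cuda.current_device(),
    )

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(4096, 4096)  # stand-in for the real model
    # Assumption: 5 replica groups of 4 nodes x 8 GPUs each (160 ranks total).
    model = wrap_hybrid(model, num_replicas=5, shard_group_size=32)
```

The key design point is the 2-D device mesh: the inner dimension defines which ranks jointly hold one sharded copy of the model, and the outer dimension defines how many such copies run in parallel on different data.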
