Run Fluid with EDL #35

Open
10 tasks
typhoonzero opened this issue Jul 6, 2018 · 2 comments

Comments

@typhoonzero
Collaborator

Tasks

  • full fault-tolerant training
  • dynamic trainer count on the pserver side, so that gradients can be averaged according to the current trainer count
  • upgrade the EDL controller to CRD so that we can support Kubernetes versions higher than v1.8
  • a tutorial on running the distributed lookup sparse table with EDL
  • update experiment report, https://github.com/PaddlePaddle/cloud/tree/develop/doc/edl/experiment
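
To illustrate the "dynamic trainer count" task above, here is a minimal sketch of averaging gradients over whichever trainers actually reported in a step, rather than over a fixed, configured count. This is not Fluid's pserver code; the function name `average_gradients` and the dict-of-lists layout are assumptions made for illustration only.

```python
from typing import Dict, List


def average_gradients(grads_by_trainer: Dict[str, List[float]]) -> List[float]:
    """Average gradients element-wise over the trainers that reported this step.

    Hypothetical helper: keys are trainer names, values are flat gradient lists.
    """
    if not grads_by_trainer:
        raise ValueError("no trainers reported gradients")

    grads = list(grads_by_trainer.values())
    dim = len(grads[0])
    if any(len(g) != dim for g in grads):
        raise ValueError("gradient shapes differ between trainers")

    # Divide by the current trainer count, not by a count fixed at job start.
    count = len(grads)
    return [sum(g[i] for g in grads) / count for i in range(dim)]


if __name__ == "__main__":
    step = {"trainer-0": [1.0, 2.0], "trainer-1": [3.0, 4.0]}
    print(average_gradients(step))

    # A third trainer joins after the job is scaled up.
    step["trainer-2"] = [5.0, 6.0]
    print(average_gradients(step))
```

The point of the sketch is only that the divisor follows the set of live trainers, so scaling the job up or down does not skew the averaged update.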
@typhoonzero
Collaborator Author

Update:

  • add ctr model to models and test with EDL

@seiriosPlus
Contributor

seiriosPlus commented Jul 11, 2018

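Some thoughts on EDL (corrections welcome)

As I understand it, EDL should consist of two major parts:

  1. Global elastic scheduling (EDL-Controller)
  2. Fault tolerance for scaling an individual training job (MASTER)

Master: functions and current problems

Functions

  1. Job control
  2. Assigning TR/PS IDs
  3. Assigning the ProgramDesc
  4. Pluggable storage (ETCD/LOCAL/ZOOKEEPER)
  5. Exception handling (job failure -> Kill)
  6. Global commands

Problems

  1. The current functionality is simple enough to be handled directly by ETCD

Thoughts

  1. Distributed training is coupled to the master, so distributed training must also use a master-slave structure
  2. Is the master over-engineered?
  3. EDL is a feature provided to the first-level scheduler

Conclusions

  1. As long as the Master's functionality is not extended, the Master can be removed and ETCD used directly
  2. If the distributed functionality becomes richer and more complete later on, a Master role will still be needed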
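
As a rough illustration of the first conclusion, assigning TR/PS IDs needs little more than an atomic "create this key only if it does not exist" operation on a shared key-value store. The sketch below uses an in-memory stand-in for that store; `InMemoryKV`, `claim_id`, `release_id`, and the `/edl/...` key layout are all hypothetical names, not an etcd client API and not existing EDL code.

```python
import threading
from typing import Dict, Optional


class InMemoryKV:
    """Hypothetical stand-in for a shared store such as etcd (not a real client)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, value: str) -> bool:
        """Atomically create the key; return False if it already exists."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key: str) -> None:
        with self._lock:
            self._data.pop(key, None)


def claim_id(kv: InMemoryKV, role: str, owner: str, max_ids: int) -> Optional[int]:
    """Claim the lowest free ID for a role ("trainer" or "pserver"), or None if full."""
    for candidate in range(max_ids):
        if kv.put_if_absent(f"/edl/{role}/{candidate}", owner):
            return candidate
    return None


def release_id(kv: InMemoryKV, role: str, assigned: int) -> None:
    """Free an ID so that a replacement process can claim it."""
    kv.delete(f"/edl/{role}/{assigned}")


if __name__ == "__main__":
    kv = InMemoryKV()
    print(claim_id(kv, "trainer", "pod-a", max_ids=2))
    print(claim_id(kv, "trainer", "pod-b", max_ids=2))
    print(claim_id(kv, "trainer", "pod-c", max_ids=2))  # no free slot left
    release_id(kv, "trainer", 0)
    print(claim_id(kv, "trainer", "pod-c", max_ids=2))  # reuses the freed slot
```

Against a real etcd cluster, the put-if-absent step would typically be expressed as a transaction, and key leases could free the IDs of crashed processes; both are left out here to keep the sketch self-contained.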
