Update 05_ddp.md #525

Open · wants to merge 2 commits into `master`
cn/docs/parallelism/05_ddp.md (7 changes: 6 additions & 1 deletion)
@@ -35,8 +35,10 @@
    download=True,
)

+sampler = flow.utils.data.distributed.DistributedSampler(training_data)
+
train_dataloader = flow.utils.data.DataLoader(
-    training_data, BATCH_SIZE, shuffle=True
+    training_data, BATCH_SIZE, shuffle=(sampler is None), sampler=sampler
)

model = flowvision.models.mobilenet_v2().to(DEVICE)
@@ -48,6 +50,7 @@

for t in range(EPOCH_NUM):
    print(f"Epoch {t+1}\n-------------------------------")
+    train_dataloader.sampler.set_epoch(t)
    size = len(train_dataloader.dataset)
    for batch, (x, y) in enumerate(train_dataloader):
        x = x.to_global(placement=PLACEMENT, sbp=S0)
@@ -88,6 +91,8 @@
        y = y.to_global(placement=PLACEMENT, sbp=S0)
```

- Note that in distributed parallel training, the `BATCH_SIZE` set in the code is the local value on each machine rather than the `GLOBAL_BATCH_SIZE`, so running the code above on a single machine with two GPUs and `BATCH_SIZE=64` has the same effect as a single machine with one GPU and `BATCH_SIZE=128`.
Collaborator:

Suggested change
- 需要注意的是,在进行分布式并行训练时,代码中规定的`BATCH_SIZE`为每一台机器的本地值而非`GLOBAL_BATCH_SIZE`,故上述代码单机双卡`BATCH_SIZE=64`的训练效果与单机单卡`BATCH_SIZE=128`一致。
- 需要注意的是,在进行分布式并行训练时,代码中规定的 `BATCH_SIZE` 为每一台机器的本地值而非`GLOBAL_BATCH_SIZE`,故上述代码单机双卡 `BATCH_SIZE=64` 的训练效果与单机单卡 `BATCH_SIZE=128` 一致。

There should be a space between Chinese and English text, and between Chinese characters and digits.

Collaborator:

Actually, I don't think this sentence needs to be added here: anyone who understands global tensors should already understand this point. If it really needs explaining, wouldn't it be better to expand the global tensor article instead, explaining the shape of a global tensor after `to_global` under the various sbp settings?

Reply: OK. The global tensor documentation already has the corresponding explanations and examples of how tensor shapes change. I added this here because a customer asked in a WeChat conversation whether this `BATCH_SIZE=64` is local or global, so I thought it was worth explaining once more at this point.
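
For readers following this thread, here is a minimal, hypothetical sketch of why a local `BATCH_SIZE` of 64 corresponds to a global batch of 128. Two ranks, the tutorial's `PLACEMENT`/`S0` naming, and the launch command in the comment are assumptions, not part of this PR:

```python
import oneflow as flow

# Assumed launch: one process per GPU, e.g.
#   python3 -m oneflow.distributed.launch --nproc_per_node 2 check_batch.py
PLACEMENT = flow.placement("cuda", ranks=[0, 1])  # both GPUs participate
S0 = flow.sbp.split(0)                            # shard along dim 0

BATCH_SIZE = 64                                   # local batch per process
x = flow.randn(BATCH_SIZE, 3, 32, 32)             # each rank holds 64 samples

# split(0) concatenates the per-rank shards along dim 0, so the logical
# (global) tensor seen by the model has a batch dimension of 2 * 64 = 128.
x_global = x.to_global(placement=PLACEMENT, sbp=S0)
print(x_global.shape)  # oneflow.Size([128, 3, 32, 32]) on two ranks
```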


In this way, following the introduction in [Common Distributed Parallel Strategies](./01_introduction.md), we have carried out distributed data parallel training by splitting the data with `split(0)` and broadcasting the model.
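
For reference, here is a self-contained sketch that stitches the scattered hunks of this diff into one script. It is not part of the PR: the dataset, transforms, hyperparameters, and the `PLACEMENT`/`S0`/`BROADCAST` definitions are assumptions filled in to make the pattern runnable, following the conventions the tutorial appears to use:

```python
import oneflow as flow
import flowvision
from flowvision import transforms

BATCH_SIZE = 64   # local batch size per process, not the global batch size
EPOCH_NUM = 1     # placeholder
DEVICE = "cuda"
PLACEMENT = flow.placement("cuda", ranks=[0, 1])  # assumed two-GPU placement
S0 = flow.sbp.split(0)
BROADCAST = flow.sbp.broadcast

training_data = flowvision.datasets.CIFAR10(
    root="data",                     # placeholder path
    train=True,
    transform=transforms.ToTensor(),
    download=True,
)

# The sampler shards the dataset across ranks; since it takes over shuffling,
# the DataLoader itself must not shuffle when a sampler is supplied.
sampler = flow.utils.data.distributed.DistributedSampler(training_data)
train_dataloader = flow.utils.data.DataLoader(
    training_data, BATCH_SIZE, shuffle=(sampler is None), sampler=sampler
)

model = flowvision.models.mobilenet_v2().to(DEVICE)
model = model.to_global(placement=PLACEMENT, sbp=BROADCAST)  # replicate the model
loss_fn = flow.nn.CrossEntropyLoss().to(DEVICE)
optimizer = flow.optim.SGD(model.parameters(), lr=1e-3)

for t in range(EPOCH_NUM):
    # Re-seed the sampler every epoch so each epoch uses a different shuffle
    # while all ranks still agree on the same permutation.
    train_dataloader.sampler.set_epoch(t)
    for x, y in train_dataloader:
        x = x.to_global(placement=PLACEMENT, sbp=S0)  # shard the batch along dim 0
        y = y.to_global(placement=PLACEMENT, sbp=S0)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```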

## Using DistributedDataParallel for data parallel training