[not for land] TE experiments, take 2 #614

Open: vkuzo wants to merge 1 commit into main
@vkuzo commented Oct 14, 2024

Summary:

Test Plan:

// on a machine with 8 H100 GPUs...
// PT - bf16 + compile
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --activation_checkpoint.mode none --training.seq_len 6144 --training.compile
// PT - f8 compute + compile
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --activation_checkpoint.mode none --training.seq_len 6144 --training.compile --float8.enable_float8_linear
// PT - f8 compute + comms + compile
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --activation_checkpoint.mode none --training.seq_len 6144 --training.compile --float8.enable_float8_linear --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp
// TE + compile - doesn't work out of the box, needs a workaround
// TE - bf16
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.use_te --activation_checkpoint.mode none --training.seq_len 6144
// TE - float8
with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.use_te --training.use_te_float8 --activation_checkpoint.mode none --training.seq_len 6144
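
The variants above share a config file and two common flags, so they can be driven from one loop. The following is a minimal sketch rather than part of the PR: the variant names, `build_cmd`, `run_variants`, and the `DRY_RUN` switch are hypothetical helpers introduced here, and the `with-proxy` wrapper is omitted on the assumption that it is specific to the author's environment. All flags are copied from the test plan; the sketch places the common flags first for every variant, which assumes flag order does not matter to the launcher. By default it only prints each command.

```shell
#!/bin/sh
# Dry-run sketch: prints one launch command per variant from the test plan.
# Set DRY_RUN=0 to actually launch (assumes ./run_llama_train.sh is in the cwd).
set -eu

CONFIG_FILE="./train_configs/llama3_8b.toml"
COMMON_FLAGS="--activation_checkpoint.mode none --training.seq_len 6144"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: assemble the command line for one set of extra flags.
build_cmd() {
  printf 'CONFIG_FILE="%s" ./run_llama_train.sh %s %s\n' \
    "$CONFIG_FILE" "$COMMON_FLAGS" "$1"
}

# Hypothetical helper: iterate over "name|extra flags" pairs.
run_variants() {
  while IFS='|' read -r name extra_flags; do
    cmd="$(build_cmd "$extra_flags")"
    printf '[%s] %s\n' "$name" "$cmd"
    if [ "$DRY_RUN" != "1" ]; then
      # Redirect stdin so the launched job cannot consume the variant list.
      eval "$cmd" < /dev/null
    fi
  done <<'EOF'
pt_bf16_compile|--training.compile
pt_f8_compute_compile|--training.compile --float8.enable_float8_linear
pt_f8_compute_comms_compile|--training.compile --float8.enable_float8_linear --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp
te_bf16|--training.use_te
te_float8|--training.use_te --training.use_te_float8
EOF
}

run_variants
```

Keeping the variants in a single list makes it harder for the shared settings (sequence length, activation checkpointing mode) to drift between runs that are meant to be compared.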

Reviewers:

Subscribers:

Tasks:

Tags:

@facebook-github-bot added the CLA Signed label Oct 14, 2024
@vkuzo force-pushed the te_experiments branch 5 times, most recently from 08fc333 to 8a005ec, December 4, 2024 00:40
@vkuzo force-pushed the te_experiments branch 3 times, most recently from 4ce0086 to af6d1e9, December 6, 2024 16:43
@vkuzo force-pushed the te_experiments branch 2 times, most recently from 3e09ba9 to 96469bb, December 19, 2024 03:13
Summary:

Test Plan:

```
with-proxy CUDA_VISIBLE_DEVICES=4,5,6,7 NGPU=4 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.use_te
```

Reviewers:

Subscribers:

Tasks:

Tags: