From 3a43e5d5bcb34675ce2353c1143fc8f42d71efc7 Mon Sep 17 00:00:00 2001 From: Eddie Mattia Date: Thu, 11 Apr 2024 10:41:11 -0700 Subject: [PATCH] rename files --- .../{parallel-compute => compute}/README.md | 0 .../multi-node.md | 32 +++--- .../use-accelerators.md} | 97 ++++++++++++------- 3 files changed, 84 insertions(+), 45 deletions(-) rename docs/scaling/{parallel-compute => compute}/README.md (100%) rename docs/scaling/{parallel-compute => compute}/multi-node.md (80%) rename docs/scaling/{parallel-compute/use-gpus.md => compute/use-accelerators.md} (73%) diff --git a/docs/scaling/parallel-compute/README.md b/docs/scaling/compute/README.md similarity index 100% rename from docs/scaling/parallel-compute/README.md rename to docs/scaling/compute/README.md diff --git a/docs/scaling/parallel-compute/multi-node.md b/docs/scaling/compute/multi-node.md similarity index 80% rename from docs/scaling/parallel-compute/multi-node.md rename to docs/scaling/compute/multi-node.md index 9cfd897d..69783b32 100644 --- a/docs/scaling/parallel-compute/multi-node.md +++ b/docs/scaling/compute/multi-node.md @@ -1,8 +1,10 @@ # Running Multi-node Tasks -The `@parallel` decorator is like the [foreach](https://docs.metaflow.org/metaflow/basics#foreach) in that it allows Metaflow users to launch multiple runtime tasks based on the same step function process end users write. The primary difference is that `@parallel` enables inter-process communication (IPC) between tasks, often referred to as “multi-node computation” in HPC and deep learning contexts. +## From foreach to @parallel -This basic difference is simple to visualize. First consider the embarassingly parallel `foreach`: +The `@parallel` decorator, and the `num_parallel=N` construct attached to it, is like the [`foreach`](https://docs.metaflow.org/metaflow/basics#foreach) construct in that each mode allows Metaflow users to launch multiple runtime tasks based on one `@step` function. The primary difference is `@parallel` enables inter-process communication (IPC) between tasks. + +This difference is best visualized by comparing `foreach` to `num_parallel` and `@parallel`. First consider the embarassingly parallel `foreach`, where there are no inter-process communications expected: ```mdx-code-block import ReactPlayer from 'react-player'; ``` @@ -14,11 +16,11 @@ import ReactPlayer from 'react-player'; And now note the difference with the `num_parallel` case: -IPC is needed in cases that require computers on different machines pooling their resources to complete one job, such as a multi-node process that where each worker does some computing and then results are synced in an [all reduce](https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/). This computation pattern primarily appears in distributed AI training and numerical simulations in traditional HPC contexts. To support these use cases, the `@parallel` decorator provides a Metaflow API to launch `num_parallel` tasks that can communicate directly with each other. +IPC is needed in cases that require computers on different machines pooling their resources to complete one job, such as a multi-node process where each worker computes something and results are synced via an [all reduce](https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/) or similar aggregation. This computation pattern primarily appears in distributed model training and numerical simulations in the AI and HPC fields. 
To support these use cases, the `@parallel` decorator provides a Metaflow API to launch `num_parallel` tasks that can communicate directly with each other. -The main reason to use Metaflow and the `@parallel` decorator for this style of compute is that the implementation works with standard tools in the distributed computing ecosystem like Ray and PyTorch distributed, and is additive with the typical benefits of Metaflow, for example, packaging code and dependencies across worker nodes and seamlessly moving between compute providers. +The implementation works with standard tools in the distributed computing ecosystem like MPI, Ray, and PyTorch distributed, and is additive with the typical benefits of Metaflow, for example, packaging code and dependencies across worker nodes and seamlessly moving between compute providers. -To implement the idea, Metaflow has a decorator called `@parallel` that automatically forms a set of [gang-scheduled](https://en.wikipedia.org/wiki/Gang_scheduling) tasks that can coordinate on a single job defined or called in the `@step` function. +## How it works ### Unbounded foreach The unbounded foreach is the fundamental mechanism that underlies the `@parallel` implementation and enables gang-scheduled Metaflow tasks. It provides context to other Metaflow decorators that need to be aware of the multi-node setting. @@ -102,7 +104,7 @@ if __name__ == "__main__": MPI4PyFlow() ``` -### User-facing abstractions +## Higher-level abstractions The MPI decorator shown above is one of several existing Metaflow plugins demonstrating how to extend the `@parallel` functionality. You can look into the repositories in the decorator implementation column of this table to see the pattern you can follow to modify or implement your own decorator for any multi-node computing framework. :::note @@ -110,14 +112,14 @@ If preferred, you can also use the `@parallel` decorator itself and manually con ::: | Decorator Implementation | UX | Description | PyPi Release | Example | -| :---: | :---: | :---: | :---: | :---: | +| :---: | --- | --- | :---: | :---: | | [`@torchrun`](https://github.com/outerbounds/metaflow-torchrun) | Use `current.torch.run` to submit your `torch.distributed` program. No need to log into each node, call the code once in `@step`. | A [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) command that runs `@step` function code on each node. [Torch distributed](https://pytorch.org/tutorials/beginner/dist_overview.html) is used under the hood to handle communication between nodes. | [`metaflow-torchrun`](https://pypi.org/project/metaflow-torchrun/) | [MinGPT](https://github.com/outerbounds/metaflow-torchrun/blob/main/examples/min-gpt/flow.py) | -| [`@tensorflow`](https://github.com/outerbounds/metaflow-tensorflow/tree/main) | Put TensorFlow code in a distributed strategy scope, and call it from step function. | Run the `@step` function code on each node. This means the user picks the appropriate [strategy](https://www.tensorflow.org/guide/distributed_training#types_of_strategies) in their code. | [`metaflow-tensorflow`](https://pypi.org/project/metaflow-tensorflow/) | [Keras Distributed](https://github.com/outerbounds/metaflow-tensorflow/tree/main/examples/multi-node) | +| [`@deepspeed`](https://github.com/outerbounds/metaflow-deepspeed) | Exposes `current.deepspeed.run`
Requires OpenSSH and OpenMPI installed in the Metaflow task container. | Form MPI cluster with passwordless SSH configured at task runtime (to reduce the risk of leaking private keys). Submit the Deepspeed program and run. | [`metaflow-deepspeed`](https://pypi.org/project/metaflow-deepspeed/) | [Bert](https://github.com/outerbounds/metaflow-deepspeed/tree/main/examples/bert) & [Dolly](https://github.com/outerbounds/metaflow-deepspeed/tree/main/examples/dolly) | | [`@metaflow_ray`](https://github.com/outerbounds/metaflow-ray/tree/main) | Write a Ray program locally or call script from `@step` function, `@metaflow_ray` takes care of forming the Ray cluster. | Forms a [Ray cluster](https://docs.ray.io/en/latest/cluster/getting-started.html) dynamically. Runs the `@step` function code on the control task as Ray’s “head node”. | [`metaflow-ray`](https://pypi.org/project/metaflow-ray/) | [GPT-J](https://github.com/outerbounds/metaflow-ray/tree/main/examples/ray-fine-tuning-gpt-j) & [Distributed XGBoost](https://github.com/outerbounds/metaflow-ray/tree/main/examples/train) | +| [`@tensorflow`](https://github.com/outerbounds/metaflow-tensorflow/tree/main) | Put TensorFlow code in a distributed strategy scope, and call it from step function. | Run the `@step` function code on each node. This means the user picks the appropriate [strategy](https://www.tensorflow.org/guide/distributed_training#types_of_strategies) in their code. | [`metaflow-tensorflow`](https://pypi.org/project/metaflow-tensorflow/) | [Keras Distributed](https://github.com/outerbounds/metaflow-tensorflow/tree/main/examples/multi-node) | | [`@mpi`](https://github.com/outerbounds/metaflow-mpi) | Exposes `current.mpi.cc`, `current.mpi.broadcast_file`, `current.mpi.run`, `current.mpi.exec`. Cluster SSH config is handled automatically inside the decorator. Requires OpenSSH and an MPI implementation are installed in the Metaflow task container. It was tested against OpenMPI, which you can find a sample Dockerfile for [here](https://github.com/outerbounds/metaflow-mpi/blob/main/examples/Dockerfile). | Forms an MPI cluster with passwordless SSH configured at task runtime. Users can submit a `mpi4py` program or compile, broadcast, and submit a C program. | [`metaflow-mpi`](https://pypi.org/project/metaflow-mpi/) | [Libgrape](https://github.com/outerbounds/metaflow-mpi/tree/main/examples/libgrape-ldbc-graph-benchmark) | -| [`@deepspeed`](https://github.com/outerbounds/metaflow-deepspeed) | Exposes `current.deepspeed.run`
Requires OpenSSH and OpenMPI installed in the Metaflow task container. | Form MPI cluster with passwordless SSH configured at task runtime (to reduce the risk of leaking private keys). Submit the Deepspeed program and run. | [`metaflow-deepspeed`](https://pypi.org/project/metaflow-deepspeed/) | [Bert](https://github.com/outerbounds/metaflow-deepspeed/tree/main/examples/bert) & [Dolly](https://github.com/outerbounds/metaflow-deepspeed/tree/main/examples/dolly) |

-## Compute environment considerations
+## Preparing multi-node infrastructure

 Depending on the distributed computing frameworks and job types you want to use, there are various network adapters and HPC services that you may want to install into the Metaflow deployment.

@@ -155,7 +157,7 @@ When you pick the instance types you want in your AWS Batch compute environment,
 The reason to do this is that latency between nodes is much faster when all worker nodes are in the same AWS Availability Zone, which will not necessarily happen without a Cluster Placement Group.

 #### Intranode communication with shared memory
-AWS Batch has a parameter called `shared_memory` that allows multiple processors on the same compute node to communicate using a memory (RAM) portion that is shared. _This feature works independently of the multi-node setting_, but can have additive benefits. This value can be tuned to your applications, and this [AWS blog](https://aws.amazon.com/blogs/compute/using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch/) suggests a reasonable starting value of 4096 MB for most cases. In Metaflow, you can set this value like any other argument to the batch decorator:
+AWS Batch has a parameter called `shared_memory` that allows multiple processors on the same compute node to communicate using a memory (RAM) portion that is shared. _This feature works independently of the multi-node setting_, but can have additive benefits and may resolve errors that can appear in the multi-node setting. This value can be tuned to your applications, and this [AWS blog](https://aws.amazon.com/blogs/compute/using-shared-memory-for-low-latency-intra-node-communication-in-aws-batch/) suggests a reasonable starting value of 4096 MB for most cases. In Metaflow, you can set this value like any other argument to the batch decorator:

 ```python
 from metaflow import FlowSpec, step, resources
@@ -185,5 +187,13 @@ if __name__ == '__main__':
     SharedMemory()
 ```

+#### Intranode communication with AWS Elastic Fabric Adapter (EFA)
+
+[Some AWS EC2 instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types) have network devices built for high-performance computing use cases. In order to use these, the AWS Batch Compute Environment that is connected to the AWS Batch Job Queue needs to install the necessary EFA drivers in a Launch Template. You can see an example CloudFormation template with a Launch Template that installs EFA devices [here](https://github.com/outerbounds/metaflow-trainium/blob/main/cfn/trn1_batch_resources.yaml#L294-L320). Then, Metaflow users can specify `@batch(efa=8, ...)` to attach 8 network interfaces to each node in a multi-node batch job.
+
+#### Limitations
+
+AWS Batch Multi-node is not integrated with AWS Step Functions, so you cannot use the Metaflow Step Functions integration when using `@parallel` decorators and `@batch` together.
+
 ### Kubernetes
 Multi-node support for Metaflow tasks run on Kubernetes clusters is a work-in-progress. 
Please reach us on [Slack](http://slack.outerbounds.co/) to access the latest experimental implementation and/or co-develop it with the Outerbounds team. \ No newline at end of file diff --git a/docs/scaling/parallel-compute/use-gpus.md b/docs/scaling/compute/use-accelerators.md similarity index 73% rename from docs/scaling/parallel-compute/use-gpus.md rename to docs/scaling/compute/use-accelerators.md index 9e43d790..523e017a 100644 --- a/docs/scaling/parallel-compute/use-gpus.md +++ b/docs/scaling/compute/use-accelerators.md @@ -1,17 +1,19 @@ -# Using GPUs +# Using Accelerators -Metaflow allows access to GPUs on your local workstation, [AWS Batch](/scaling/remote-tasks/aws-batch), or [Kubernetes](/scaling/remote-tasks/kubernetes) cluster with the change of a few characters in your Python code or Metaflow run command. By changing parameters in either of these ways, Metaflow tasks can run on: -- Single GPUs -- Single instances with many GPUs -- Many instances with single GPUs (see: [Running Multi-node Tasks](./multi-node.md)) -- Many instances with many GPUs (see: [Running Multi-node Tasks](./multi-node.md)) +Metaflow enables access to accelerators on your local workstation, [AWS Batch](/scaling/remote-tasks/aws-batch), or [Kubernetes](/scaling/remote-tasks/kubernetes) cluster with the change of a few characters in your Python code or Metaflow run command. By changing parameters in either of these ways, Metaflow tasks can run on: +- Single accelerators +- Single instances with multiple accelerators +- Multiple instances with multiple accelerators (see: [Running Multi-node Tasks](./multi-node.md)) -This page provides a collection of tips for how to do the above, and how to configure your GPU environments to use other Metaflow features that make these workflows more robust and portable. +This page provides a collection of tips for how to do the above, and how to configure your compute environments to use other Metaflow features that make these workflows more robust and performant. -## Requesting GPUs in Metaflow steps + +## Graphics Processing Units (GPUs) + +### Requesting GPUs in Metaflow steps Whether running in a local workstation or [executing tasks remotely](https://docs.metaflow.org/scaling/remote-tasks/introduction), when requesting resources with Metaflow, you can use the `@resources` decorator to change properties of the compute environment like the amount of memory and number of CPUs and GPUs (the purpose of this page) contributing to the task. -Consider this flow, where the start step calls the `my_gpu_routine` function on 1 GPU. +Consider this flow, where the start step calls the `my_gpu_routine` function on a single GPU. ```python # generic_gpu_routine.py from metaflow import FlowSpec, step, resources @@ -33,16 +35,16 @@ if __name__ == '__main__': GPUFlow() ``` -## Preparing compute environments for GPU scheduling -In order to enable the end user experience shown above, the Metaflow admin needs to set up their cluster to run GPU jobs. +### Preparing infrastructure dependencies +In order to enable the end user experience shown above, the Metaflow admin needs to set up their cluster to run GPU jobs. This section explains how to get started. -### AWS Batch +#### AWS Batch -This section assumes that you have a basic familiarity with deploying Metaflow. 
If you have never done this or want a refresher, please see [this guide to deploying Metaflow](https://outerbounds.com/engineering/welcome/) or directly inspect [this repository](https://github.com/outerbounds/metaflow-tools/) that shows various ways to deploy Metaflow. +If you have never deployed Metaflow or want a refresher, please see [this guide to deploying Metaflow](https://outerbounds.com/engineering/welcome/) or directly inspect [this repository](https://github.com/outerbounds/metaflow-tools/) before continuing. -To use GPUs in your Metaflow tasks that run on AWS Batch, you need to run the flow in a [Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) that is attached to a [Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) with GPU instances. +To use GPUs in Metaflow tasks that run on AWS Batch, you need to run the flow in a [Job Queue](https://docs.aws.amazon.com/batch/latest/userguide/job_queues.html) that is attached to a [Compute Environment](https://docs.aws.amazon.com/batch/latest/userguide/compute_environments.html) with GPU instances. -To set this up, you can either modify the allowable instances in a [Metaflow AWS deployment template](https://github.com/outerbounds/metaflow-tools/tree/master/aws) or manually modify an existing deployment from the AWS console. The steps are: +To set this up, you can either modify the allowable instances in a [Metaflow AWS deployment template](https://github.com/outerbounds/metaflow-tools/tree/master/aws) or manually add such a compute environment from the AWS console. The steps are: 1. Create a compute environment with GPU-enabled EC2 instances. 2. Attach the compute environment to a new Job Queue - for example named `my-gpu-queue`. @@ -65,7 +67,7 @@ from metaflow import FlowSpec, step, batch class GPUBatchFlow(FlowSpec): - @batch(memory=32000, cpu=4, gpu=1, queue=”my-gpu-queue”) + @batch(gpu=1, queue='my-gpu-queue') @step def start(self): from my_script import my_gpu_routine @@ -80,12 +82,14 @@ if __name__ == '__main__': GPUBatchFlow() ``` -### Kubernetes +You can alternatively set the queue in your Metaflow config file, or using the `METAFLOW_BATCH_JOB_QUEUE` environment variable. + +#### Kubernetes Metaflow compute tasks can run on any Kubernetes cluster. You can use this section to help set up GPU nodes in your Kubernetes cluster, but it is not intended to be a complete guide. It is meant to help you: 1. understand GPU requirements in a Kubernetes cluster and how they relate to how Metaflow users declare a task needs GPUs, and 2. find resources to guide decisions around configuring nodes in Kubernetes clusters for GPU use. -#### Device plugins for scheduling GPU pods +##### Device plugins for scheduling GPU pods Metaflow tasks that run with Kubernetes as a compute backend are run as Pods. To access GPUs, Kubernetes Pods need to be configured in a special way that allows them to access specific GPU hardware features. If you are a Metaflow workflow developer or data scientist, you can skip ahead to the [next section](#what-resources-can-i-declare-in-metaflow-steps) of this page, unless you want to understand what happens behind the scenes. @@ -94,16 +98,16 @@ If you are the administrator of the Kubernetes cluster powering Metaflow tasks a The rest of this section points to additional resources for specific Kubernetes cluster providers. 
Each of them makes some use of the [NVIDIA Device Plugin implementation](https://github.com/NVIDIA/k8s-device-plugin), which is also suggested for Kubernetes GPU admins. -#### Amazon Web Services EKS +##### Amazon Web Services EKS Amazon has prepared the [EKS-optimized accelerated Amazon Linux AMIs](https://docs.aws.amazon.com/eks/latest/userguide/eks-optimized-ami.html#gpu-ami). Read the linked guide to install the hardware dependencies and choose the AMI you want to run on your GPU node group. You will need to make the suggested modifications to the [Kubernetes cluster deployed as part of your Metaflow AWS deployment](https://github.com/outerbounds/terraform-aws-metaflow/blob/master/examples/eks_argo/eks.tf). -#### Google Cloud Platform GKE +##### Google Cloud Platform GKE Read GCP’s guide about [GPUs on GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/gpus). You will need to make the suggested modifications to the [Kubernetes cluster deployed as part of your Metaflow GCP deployment](https://github.com/outerbounds/metaflow-tools/blob/master/gcp/terraform/infra/kubernetes.tf). -#### Microsoft Azure AKS +##### Microsoft Azure AKS Read Azure’s guide about [GPUs on AKS](https://learn.microsoft.com/en-us/azure/aks/gpu-cluster). You will need to make the suggested modifications to the [Kubernetes cluster deployed as part of your Metaflow Azure deployment](https://github.com/outerbounds/metaflow-tools/blob/master/azure/terraform/infra/kubernetes.tf). -### What resources can I declare in Metaflow steps? +#### What resources can I declare in Metaflow steps? A potential for end-users to waste time emerges if the workflow developer does not understand what resources their Metaflow steps can declare. For example, you can run into problems like AWS Batch jobs being stuck in a `RUNNABLE` or `STARTING` state or Kubernetes jobs being stuck in `PENDING` state for unclear reasons if you are not aware of what machines are available in your Metaflow deployment compute environment. @@ -116,7 +120,7 @@ Here is a suggested recipe you can iterate on with respect to this dynamic: 3. Based on 2, someone decides which compute instances (EC2 instances, on-premise data center VMs, etc.) to attach to a Kubernetes Node Group or AWS Batch Job Queue. 4. The resources in the compute environments set up in step 3 determine what an end user can write in decorators like `@resources`, `@batch`, and `@kubernetes` and have their job get scheduled. -#### Common Issues +##### Common Issues **Issue**: My `@batch` task has been stuck in `RUNNABLE` forever. **Check**: Does the Batch job queue you are trying to run the Metaflow task in have a compute environment with EC2 instances with the resources requested? For example, if your job queue is connected to a single compute environment that only has `p3.2xlarge` as a GPU instance, and a user requests 2 GPUs, that job will never get scheduled because `p3.2xlarge` only have 1 GPU per instance. @@ -127,14 +131,14 @@ Here is a suggested recipe you can iterate on with respect to this dynamic: **Issue**: My `@kubernetes` task has been stuck in `PENDING` forever. **Check**: Are the resources requested in your Metaflow code/command sufficient? Especially when using custom GPU images, you might need to increase the requested memory to pull the container image into your compute environment. -## Preparing Python environments to run on GPUs +### Preparing code dependencies -There are a variety of ways Metaflow allows you to [specify and version dependencies](/scaling/dependencies). 
-This section shares some observations and examples to get you started with each. +There are several ways Metaflow allows you to [specify and version dependencies](/scaling/dependencies). +This section shares observations and examples to get you started with each. -### Conda -When using `@conda` or `@conda_base` to run GPU Metaflow tasks, you will want to remember that you can declare Conda channels. -Often, you will want to request packages from specific channels like the following flow outline demonstrates: +#### Conda +When using `@conda` or `@conda_base` to run GPU Metaflow tasks, you may need to declare Conda channels. +Often, you will want to request packages from specific channels like the following flow pseudocode demonstrates: ```python from metaflow import FlowSpec, step, conda_base @@ -160,10 +164,15 @@ Remember that if you run workflows from a machine with a different operating sys - you can go to the [conda-forge website](https://conda-forge.org/feedstock-outputs/) and find which package versions are available across each platform, and - it might help to you to use `@conda` localized to Metaflow steps, instead of `@conda_base` applied to all flow steps. -### Docker +#### Docker If you make your own GPU image, we suggest building from the standard images released by a trusted GPU or framework provider. -#### Example: Official PyTorch Images +**Example: Nvidia Container Registry (NVCR)** + +Nvidia builds and supports many images and microservices through [NVCR catalogs](https://catalog.ngc.nvidia.com/containers). + +**Example: Official PyTorch Images** + See more tags on the [PyTorch DockerHub Registry](https://hub.docker.com/r/pytorch/pytorch). ```Dockerfile @@ -173,7 +182,8 @@ RUN pip install -r /tmp/requirements.txt ... ``` -#### Example: Official TensorFlow Images +**Example: Official TensorFlow Images** + See more tags on the [TensorFlow DockerHub Registry](https://hub.docker.com/r/tensorflow/tensorflow). ```Dockerfile @@ -183,5 +193,24 @@ RUN pip install -r /tmp/requirements.txt ... ``` -#### Example: AWS Deep Learning Containers (DLCs) -AWS maintains a [registry of Docker images for deep learning](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), many of which have CPU and GPU versions. These can be helpful guides if your team decides to craft images for deep learning tasks to run on Metaflow. \ No newline at end of file +**Example: AWS Deep Learning Containers (DLCs)** + +AWS maintains a [registry of Docker images for deep learning](https://github.com/aws/deep-learning-containers/blob/master/available_images.md), many of which have CPU and GPU versions. These can be helpful guides if your team decides to craft images for deep learning tasks to run on Metaflow. + +### GPU Access + +If you do not have access to the GPUs you need in your cloud account, [Outerbounds may be able to help](https://outerbounds.com/blog/nvidia-cloud-gpu-announcement/). We can help you find A100s, H100s, L40s, and more GPU-card types, often at rates signficantly lower than on-demand cloud prices. + +## Application-specific Integrated Circuits (ASICs) + +### AWS Trainium and Inferentia + +Metaflow supports the following arguments: +* `@batch(trainium=16)` +* `@batch(inferentia=16)` + +For detailed guides, visit the [`metaflow-trainium` repository](https://github.com/outerbounds/metaflow-trainium/tree/main). + +### GCP Tensor Processing Units (TPUs) + +Contact Outerbounds to learn more about how to get started using TPUs on Google cloud with Metaflow. 
\ No newline at end of file
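
The renamed `multi-node.md` above describes launching gang-scheduled tasks with the bare `@parallel` decorator and `num_parallel`. A minimal sketch of that pattern is shown below. It is illustrative rather than part of the patch itself; it assumes the `current.parallel.node_index` and `current.parallel.num_nodes` fields exposed by Metaflow's `@parallel` support, and the flow and step names are hypothetical.

```python
from metaflow import FlowSpec, step, parallel, current


class ParallelSketchFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into a gang of 4 tasks that are scheduled together.
        self.next(self.work, num_parallel=4)

    @parallel
    @step
    def work(self):
        # Each task in the gang sees its own index and the gang size,
        # which is the information frameworks like MPI, Ray, or
        # torch.distributed use to rendezvous.
        self.node_index = current.parallel.node_index
        self.num_nodes = current.parallel.num_nodes
        self.next(self.join)

    @step
    def join(self, inputs):
        # Like a foreach, the gang is joined in a single step where
        # per-worker artifacts can be aggregated.
        self.worker_indices = sorted(inp.node_index for inp in inputs)
        self.next(self.end)

    @step
    def end(self):
        print("workers that participated:", self.worker_indices)


if __name__ == "__main__":
    ParallelSketchFlow()
```

Running `python parallel_sketch.py run` exercises the fan-out locally; the same flow can be combined with `@batch` or `@kubernetes`, or with the higher-level decorators listed in the table above, as described in the patched docs.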