[DOCS] Troubleshooting Steps for MPI Operator #1756

Merged
merged 1 commit into from
Oct 21, 2024
34 changes: 34 additions & 0 deletions examples/kfmpi_plugin/README.md
@@ -55,3 +55,37 @@ pyflyte run --remote \
```{auto-examples-toc}
mpi_mnist
```

## MPI Plugin Troubleshooting Guide

This section covers common issues encountered during the setup of the MPI operator for distributed training jobs on Flyte.

**Worker Pods Failing to Start (Insufficient Resources)**

MPI worker pods may fail to start or remain unscheduled, causing the job to time out or fail. This usually happens when the cluster does not have enough free CPU, memory, or GPU to satisfy the pods' resource requests.

1. Adjust Resource Requests:
Ensure that each worker pod has sufficient resources. You can adjust the resource requests in your task definition:

```python
requests=Resources(cpu="<your_cpu_request>", mem="<your_mem_request>")
```

Set the CPU and memory values so that each worker's request fits within what a node in your cluster can actually allocate. A request larger than the free capacity of every node leaves the pod unschedulable.

2. Check Pod Logs for Errors:
If the worker pods still fail to start, check the logs for any related errors:

```bash
kubectl logs <pod-name> -n <namespace>
```

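Look for resource allocation or worker communication errors. A pod that is stuck in `Pending` has not started a container and so has no logs yet; in that case `kubectl describe pod <pod-name> -n <namespace>` shows its scheduling events instead.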
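
The resource advice in step 1 can be turned into a rough pre-flight check. The Python sketch below is not part of Flyte or the MPI operator: the helper names `parse_cpu`, `parse_mem`, and `max_schedulable_workers` are hypothetical, and the free capacity of each node is assumed to be copied by hand (for example from `kubectl describe node`). It only estimates how many workers with a given request could be placed, and it ignores GPUs, taints, affinity rules, and the launcher pod.

```python
from decimal import Decimal

# Common Kubernetes memory suffixes (binary and decimal); larger units
# such as Pi or Ei are left out of this sketch.
_MEM_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity: str) -> Decimal:
    """Convert a CPU quantity such as "500m" or "2" into cores."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def parse_mem(quantity: str) -> int:
    """Convert a memory quantity such as "4Gi" or "512M" into bytes."""
    for suffix, factor in _MEM_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))


def max_schedulable_workers(nodes, cpu_request: str, mem_request: str) -> int:
    """Estimate how many workers fit, given (free_cpu, free_mem) per node.

    A pod must fit on a single node, so capacity is counted node by node
    rather than summed across the cluster.
    """
    cpu = parse_cpu(cpu_request)
    mem = parse_mem(mem_request)
    total = 0
    for free_cpu, free_mem in nodes:
        by_cpu = parse_cpu(free_cpu) // cpu  # workers limited by CPU
        by_mem = parse_mem(free_mem) // mem  # workers limited by memory
        total += int(min(by_cpu, by_mem))
    return total


# Hypothetical free capacity for a two-node cluster.
nodes = [("3500m", "12Gi"), ("1500m", "6Gi")]
print(max_schedulable_workers(nodes, "2", "4Gi"))  # 1
print(max_schedulable_workers(nodes, "1", "4Gi"))  # 4
```

In this made-up cluster, a job with two workers that each request 2 CPUs would leave one worker unscheduled, while lowering the request to 1 CPU leaves room for up to four.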

**Workflow Registration Method Errors (Timeouts or Deadlocks)**

If your MPI workflow hangs or times out, the cause may be an incorrect workflow registration method.

1. Verify Registration Method:
When using a custom image, refer to the Flyte documentation on [Registering workflows](https://docs.flyte.org/en/latest/user_guide/flyte_fundamentals/registering_workflows.html#registration-patterns) to ensure you're following the correct registration method.
