Feature Request: dynocli support for checking process status in on-demand profiling tool #265

Open
yingjun8 opened this issue May 31, 2024 · 3 comments

@yingjun8

Hello dynolog maintainers,

I've recently integrated dynolog with Kubernetes (k8s) to create an on-demand profiling tool for GPU training clusters. This tool is designed to help us gain insights into the performance of our training processes.

However, I've encountered an issue where the tool fails to provide a clear error message when a user's training process isn't running. At the moment, when a profiling request is sent and the target process does not exist, the lack of a clear indication makes it challenging for users to understand the reason behind the failed profiling attempt.

Therefore, I would like to request a feature improvement for dynocli that checks the status of the target process before attempting to profile it.

Ideally, the functionality would:

  • Verify whether the specified process is running before initiating profiling (a rough sketch of such a check is included below).
  • Provide an informative error message if the process is not found or not running, making the tool more robust and user-friendly.

This feature would greatly assist us in troubleshooting and enhance the overall usability of our on-demand profiling tool.
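
A minimal sketch of what such a pre-flight check could look like (Python purely for illustration; the dynolog daemon itself is C++, and ensure_process_running / ProcessNotRunningError are hypothetical names, not existing dynolog APIs):

```python
import os


class ProcessNotRunningError(RuntimeError):
    """Hypothetical error type for a missing profiling target."""


def ensure_process_running(target_pid: int) -> None:
    """Fail fast, with a clear message, if the target PID is not alive."""
    try:
        # Signal 0 performs the existence/permission checks without actually
        # delivering a signal to the process.
        os.kill(target_pid, 0)
    except ProcessLookupError:
        raise ProcessNotRunningError(
            f"Process {target_pid} is not running; nothing to profile."
        )
    except PermissionError:
        # The process exists but is owned by another user; it is still alive.
        pass
```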

Thank you for considering this feature request. I am willing to collaborate or assist in testing if needed.

@briancoutinho
Contributor

@yingjun8 this is a reasonable request; we can collaborate on this.

Let's classify the errors:

  1. Process is not running
  2. Process ran but trace collection failed

What is the current error for (1)? Right now I think it should just be that no process matches.

@yingjun8
Author

yingjun8 commented Jun 21, 2024

For case 1, while injecting the SLURM_JOB_ID environment variable covers some scenarios, there remain cases that we cannot adequately handle.
Consider a scenario in which a process with PID 41001 receives consecutive gputrace requests:

  • The first request completes successfully.
  • The PID 41001 process is then terminated.
  • When the second request is initiated, the heartbeat registered by PID 41001 has not yet expired, leading dynocli to incorrectly report
    "Trace output files will be written to: ...." even though the process is no longer active.

Regarding case 2, we've encountered issues in scenarios where multiple triggers for Kineto coexist:

  • Users manually invoke torch.profiler within their code (a typical invocation is sketched below).
  • Tracing is initiated by sending signal 2.
  • Tracing is triggered by injecting environment variables via dynolog.

When these methods are used concurrently, conflicts arise, leaving users unaware of the precise status of the trace they attempted to initiate or the reason it failed.
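
For context, the "manual" trigger in the first bullet above is ordinary in-code use of torch.profiler, roughly as follows (CPU-only so the snippet runs anywhere; the matrix multiply is just a stand-in for a training step):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Standard in-code profiling. While this context manager is active it drives
# the same Kineto backend that dynolog's signal and environment-variable
# triggers target, which is where the conflicts described above come from.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    x = torch.randn(1024, 1024)
    y = x @ x  # placeholder workload standing in for a training step

prof.export_chrome_trace("manual_trace.json")
```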

In summary, our primary desire is to have visibility into the exact status of each trace request—whether it’s successful, failed, timed out, or still in progress. To this end, I propose that every time a tracing request is made, it generates a unique trace-id. A complementary CLI/API could then utilize this trace-id for querying the status of the dispatched trace, thereby enhancing our user experience significantly.
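
A rough sketch of the shape such an interface could take (everything below is hypothetical and not an existing dynolog/dynocli API; it only illustrates the trace-id idea):

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class TraceStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


@dataclass
class TraceRequest:
    pid: int
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: TraceStatus = TraceStatus.PENDING
    detail: str = ""


# In-memory registry standing in for whatever store the daemon would use.
_requests: dict[str, TraceRequest] = {}


def submit_trace(pid: int) -> str:
    """Issue a gputrace-style request and hand the caller its trace-id."""
    request = TraceRequest(pid=pid)
    _requests[request.trace_id] = request
    return request.trace_id


def query_trace(trace_id: str) -> TraceRequest:
    """What a status-query subcommand keyed by trace-id could return."""
    return _requests[trace_id]
```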

@briancoutinho
Contributor

@yingjun8 thanks for sharing details on both 1 and 2 :)

For 1., we will add a simple check that the process still exists before returning a match and a trace link.

For 2., yes, the case you see is a conflict. PyTorch will resolve it by giving priority to users manually invoking torch.profiler in their code. That will also override an existing trace request via signal/dynolog.

But broadly speaking, the idea is to:

  1. Pass a UUID on every trace request that is propagated to Kineto.
  2. Then have some status prints in the PyTorch application to show the progress of trace collection.

Unfortunately, we have to use the application logs: PyTorch queries dynolog and not vice versa, so we cannot get a status update from dynolog. Hope this makes sense.
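
To make that concrete, the "status prints" could be as simple as the training process writing log lines keyed by the propagated UUID (purely illustrative; Kineto/PyTorch do not emit these today):

```python
import logging

logger = logging.getLogger("on_demand_trace")


def report_trace_status(trace_uuid: str, status: str, detail: str = "") -> None:
    # The training application owns these logs, so this is where a propagated
    # trace UUID can surface without dynolog having to push status back.
    logger.info("on-demand trace %s: %s %s", trace_uuid, status, detail)


# For example:
#   report_trace_status("4f1c9e", "collection started")
#   report_trace_status("4f1c9e", "trace written", "/tmp/trace_4f1c9e.json")
```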

We can get started on fixing case (1) and then share potential PRs for (2), which will be a more complex fix since it also involves changing kineto/pytorch.

  • @jj10306 could you help take a look at (1)?
