Feature Request: dynocli support for checking process status in on-demand profiling tool #265

Open
yingjun8 opened this issue May 31, 2024 · 3 comments

@yingjun8

Hello dynolog maintainers,

I've recently integrated dynolog with Kubernetes (k8s) to create an on-demand profiling tool for GPU training clusters. This tool is designed to help us gain insights into the performance of our training processes.

However, I've encountered an issue where the tool fails to provide a clear error message when a user's training process isn't running. At the moment, when a profiling request is sent and the target process does not exist, the lack of a clear indication makes it challenging for users to understand the reason behind the failed profiling attempt.

Therefore, I would like to request a feature improvement for dynocli that checks the status of the target process before attempting to profile it.

Ideally, the functionality would:

  • Verify whether the specified process is running before initiating profiling (a rough sketch of such a check is included below).
  • Provide an informative error message if the process is not found or not running, making the tool more robust and user-friendly.

This feature would greatly assist us in troubleshooting and enhance the overall usability of our on-demand profiling tool.
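
A minimal sketch of what such a pre-flight check could look like (Python purely for illustration; the dynolog daemon itself is C++, and ensure_process_running / ProcessNotRunningError are hypothetical names, not existing dynolog APIs):

```python
import os


class ProcessNotRunningError(RuntimeError):
    """Hypothetical error type for a missing profiling target."""


def ensure_process_running(target_pid: int) -> None:
    """Fail fast, with a clear message, if the target PID is not alive."""
    try:
        # Signal 0 performs the existence/permission checks without actually
        # delivering a signal to the process.
        os.kill(target_pid, 0)
    except ProcessLookupError:
        raise ProcessNotRunningError(
            f"Process {target_pid} is not running; nothing to profile."
        )
    except PermissionError:
        # The process exists but is owned by another user; it is still alive.
        pass
```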

Thank you for considering this feature request. I am willing to collaborate or assist in testing if needed.

@briancoutinho
Contributor

@yingjun8 this is a reasonable request; we can collaborate on this.

Let's classify the errors:

  1. Process is not running
  2. Process ran but trace collection failed

What is the current error for (1)? Right now I think it should just be that no process matches.

@yingjun8
Author

yingjun8 commented Jun 21, 2024

For case 1, while injecting the SLURM_JOB_ID environment variable covers some scenarios, there remain cases that we cannot adequately handle.
Consider a scenario in which a process with PID 41001 receives consecutive gputrace requests:

  • The first request completes successfully.
  • The PID 41001 process is then terminated.
  • When the second request is initiated, the heartbeat registered by PID 41001 has not yet expired, leading dynocli to incorrectly report
    "Trace output files will be written to: ...." even though the process is no longer active.

Regarding case 2, we've encountered issues in scenarios where multiple triggers for Kineto coexist:

  • Users manually invoke torch.profiler within their code (a typical invocation is sketched below).
  • Tracing is initiated by sending signal 2.
  • Tracing is triggered by injecting environment variables via dynolog.

When these methods are used concurrently, conflicts arise, leaving users unaware of the precise status of the trace they attempted to initiate or the reason it failed.
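
For context, the "manual" trigger in the first bullet above is ordinary in-code use of torch.profiler, roughly as follows (CPU-only so the snippet runs anywhere; the matrix multiply is just a stand-in for a training step):

```python
import torch
from torch.profiler import ProfilerActivity, profile

# Standard in-code profiling. While this context manager is active it drives
# the same Kineto backend that dynolog's signal and environment-variable
# triggers target, which is where the conflicts described above come from.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    x = torch.randn(1024, 1024)
    y = x @ x  # placeholder workload standing in for a training step

prof.export_chrome_trace("manual_trace.json")
```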

In summary, our primary desire is to have visibility into the exact status of each trace request—whether it’s successful, failed, timed out, or still in progress. To this end, I propose that every time a tracing request is made, it generates a unique trace-id. A complementary CLI/API could then utilize this trace-id for querying the status of the dispatched trace, thereby enhancing our user experience significantly.
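
A rough sketch of the shape such an interface could take (everything below is hypothetical and not an existing dynolog/dynocli API; it only illustrates the trace-id idea):

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum


class TraceStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


@dataclass
class TraceRequest:
    pid: int
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: TraceStatus = TraceStatus.PENDING
    detail: str = ""


# In-memory registry standing in for whatever store the daemon would use.
_requests: dict[str, TraceRequest] = {}


def submit_trace(pid: int) -> str:
    """Issue a gputrace-style request and hand the caller its trace-id."""
    request = TraceRequest(pid=pid)
    _requests[request.trace_id] = request
    return request.trace_id


def query_trace(trace_id: str) -> TraceRequest:
    """What a status-query subcommand keyed by trace-id could return."""
    return _requests[trace_id]
```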

@briancoutinho
Contributor

@yingjun8 thanks for sharing details on both 1 and 2 :)

For 1., we will add a simple check that the process still exists before returning a match and a trace link.

For 2., yes, the case you see is a conflict. PyTorch will resolve it by giving priority to users manually invoking torch.profiler in their code. That will also override an existing trace request via signal/dynolog.

But broadly speaking, the idea is to:

  1. Pass a UUID on every trace request that is propagated to Kineto.
  2. Then have some status prints in the PyTorch application to show the progress of trace collection.

Unfortunately, we have to use the application logs: PyTorch queries dynolog and not vice versa, so we cannot get a status update from dynolog. Hope this makes sense.
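
To make that concrete, the "status prints" could be as simple as the training process writing log lines keyed by the propagated UUID (purely illustrative; Kineto/PyTorch do not emit these today):

```python
import logging

logger = logging.getLogger("on_demand_trace")


def report_trace_status(trace_uuid: str, status: str, detail: str = "") -> None:
    # The training application owns these logs, so this is where a propagated
    # trace UUID can surface without dynolog having to push status back.
    logger.info("on-demand trace %s: %s %s", trace_uuid, status, detail)


# For example:
#   report_trace_status("4f1c9e", "collection started")
#   report_trace_status("4f1c9e", "trace written", "/tmp/trace_4f1c9e.json")
```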

We can get started on fixing case (1) and then share potential PRs for (2), which will be a more complex fix since it also involves changing kineto/pytorch.

  • @jj10306 could you help take a look at (1)?
