Feature Request: dynocli support for checking process status in on-demand profiling tool #265
Comments
@yingjun8 this is a reasonable request; we can collaborate on this. Let's classify the errors.
What is the current error for case 1? I believe it should be failing to match any process right now.
For case 1, injecting the SLURM_JOB_ID environment variable addresses some of the scenarios, but there remain instances that we cannot adequately handle.
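The environment-variable approach works by matching the target process against an injected variable such as SLURM_JOB_ID. A minimal sketch of how such a match could be done on Linux (the `envMatches` helper is hypothetical, not a dynolog API; it scans the NUL-separated entries in `/proc/<pid>/environ`):

```cpp
#include <fstream>
#include <string>
#include <sys/types.h>

// Hypothetical helper: returns true if the process's environment
// contains the entry KEY=VALUE. /proc/<pid>/environ holds the
// environment as NUL-separated "KEY=VALUE" strings.
bool envMatches(pid_t pid, const std::string& key, const std::string& value) {
  std::ifstream environ_file("/proc/" + std::to_string(pid) + "/environ");
  if (!environ_file) {
    return false;  // process is gone or its environ is unreadable
  }
  const std::string wanted = key + "=" + value;
  std::string entry;
  while (std::getline(environ_file, entry, '\0')) {  // '\0' is the delimiter
    if (entry == wanted) {
      return true;
    }
  }
  return false;
}
```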
Regarding case 2, we've encountered issues in scenarios where multiple triggers for Kineto coexist.
In summary, our primary desire is visibility into the exact status of each trace request: whether it succeeded, failed, timed out, or is still in progress. To this end, I propose that every tracing request generate a unique trace-id. A complementary CLI/API could then use this trace-id to query the status of the dispatched trace, which would significantly improve the user experience.
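To make the trace-id proposal concrete, here is a minimal sketch of the bookkeeping the daemon could keep; `TraceRegistry`, `TraceStatus`, and all member names are illustrative assumptions, not existing dynolog code:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <unordered_map>

enum class TraceStatus { kUnknown, kInProgress, kSucceeded, kFailed, kTimedOut };

// Hypothetical registry mapping trace-ids to their current status.
class TraceRegistry {
 public:
  // Register a new trace request and hand back its unique id.
  uint64_t startTrace() {
    const uint64_t id = next_id_.fetch_add(1);
    std::lock_guard<std::mutex> guard(mutex_);
    status_[id] = TraceStatus::kInProgress;
    return id;
  }

  // Called when the trace finishes, fails, or times out.
  void setStatus(uint64_t id, TraceStatus status) {
    std::lock_guard<std::mutex> guard(mutex_);
    status_[id] = status;
  }

  // Backs a hypothetical "query status by trace-id" CLI/API call.
  TraceStatus getStatus(uint64_t id) {
    std::lock_guard<std::mutex> guard(mutex_);
    const auto it = status_.find(id);
    return it == status_.end() ? TraceStatus::kUnknown : it->second;
  }

 private:
  std::atomic<uint64_t> next_id_{1};
  std::mutex mutex_;
  std::unordered_map<uint64_t, TraceStatus> status_;
};
```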
@yingjun8 thanks for sharing details on both 1 and 2 :) For 1, we will add a simple check that the process still exists before returning a match and a trace link. For 2, yes, the case you see is a conflict. PyTorch will resolve it by giving priority to users manually invoking the profiler in torch.profiler code; that will also override an existing trace request made via signal/dynolog. But broadly speaking, the idea is to surface trace status through the application logs. We have to use the application logs, unfortunately, as PyTorch queries dynolog and not vice versa, so we cannot get a status update from dynolog itself. Hope this makes sense. We can get started on fixing case (1) and then share potential PRs for (2), which will involve changing Kineto/PyTorch as well, so it is a more complex fix.
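For reference, the existence check for case (1) can be as simple as a `kill(pid, 0)` probe on Linux; a minimal sketch (`processExists` is an illustrative name, not the actual dynolog change):

```cpp
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

// Returns true if the process is still alive. kill() with signal 0
// performs the existence and permission checks without delivering
// any signal to the target.
bool processExists(pid_t pid) {
  if (kill(pid, 0) == 0) {
    return true;
  }
  // EPERM means the process exists but we lack permission to signal it.
  return errno == EPERM;
}
```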
Hello dynolog maintainers,
I've recently integrated dynolog with Kubernetes (k8s) to build an on-demand profiling tool for GPU training clusters. This tool is designed to help us gain insights into the performance of our training processes.
However, I've encountered an issue where the tool fails to provide a clear error message when a user's training process isn't running. At the moment, when a profiling request is sent and the target process does not exist, the lack of a clear indication makes it challenging for users to understand the reason behind the failed profiling attempt.
Therefore, I would like to request a feature improvement for dynocli that allows checking the process status before attempting to profile it.
Ideally, the functionality would:
- verify whether the specified process is running before initiating the profiling;
- provide an informative error message if the process is not found or not running, enhancing the user experience by making the tool more robust and user-friendly (see the sketch after this list).
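As a hypothetical sketch of the requested CLI behavior (this is not dynocli's actual code path; `dispatchProfilingRequest` stands in for the real profiling logic), reusing the same `kill(pid, 0)` probe sketched earlier:

```cpp
#include <cstdio>
#include <cstdlib>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
    return 1;
  }
  const pid_t pid = static_cast<pid_t>(std::atoi(argv[1]));
  // Probe for existence first; EPERM still means the process exists.
  if (kill(pid, 0) != 0 && errno != EPERM) {
    std::fprintf(stderr, "error: process %d is not running; nothing to profile\n",
                 static_cast<int>(pid));
    return 1;
  }
  // dispatchProfilingRequest(pid);  // hypothetical: the normal profiling path
  std::printf("process %d is running; dispatching profiling request\n",
              static_cast<int>(pid));
  return 0;
}
```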
This feature would greatly assist us in troubleshooting and enhance the overall usability of our on-demand profiling tool.
Thank you for considering this feature request. I am willing to collaborate or assist in testing if needed.