Remote errors in interactive mode when larger workflows are run #174

Closed
JaGeo opened this issue Sep 7, 2024 · 8 comments

@JaGeo
Collaborator

JaGeo commented Sep 7, 2024

I have been running into many remote errors when I start a larger workflow in the runner's interactive mode, but this does not happen when I start, for example, the Phonon workflow. I currently suspect that this is related to the fact that only a single connection to the remote cluster is established in interactive mode. If I rerun the jobs, they eventually run through.
Sometimes downloads also fail, and restarting the job lets the run complete.

Could we do anything about this? For example, could we add the possibility to use more than 1 connection in interactive mode to make it more stable? I would be fine with adding more than one OTP if it helps with execution.

@gpetretto
Contributor

Hi @JaGeo,
can you provide more details to better understand the problem?

  • Which states are affected the most? Is it the download phase?
  • Which errors do you get as remote errors? Is it always the same one, or does it change?
  • Is the workflow large in the sense that it has many jobs, that its jobs have a lot of data, or both?
  • Is it clear how this is related to the fact that the workflow is large? From what you have seen, would you expect the same kind of error if many smaller workflows were submitted instead?
  • When these errors happen, do you need to restart the runner? (I am asking because, in principle, if the connection drops in interactive mode, reconnecting would require reinserting the OTP. So, if you don't need to restart the runner, the connection is at least still alive, even if an error happened.)

I think it could be possible to enable multiple connections by inserting the OTP multiple times. I will investigate how to do that.

@JaGeo
Collaborator Author

JaGeo commented Sep 7, 2024

@gpetretto Thank you for your response.

I will make a few additional tests and then answer your questions in more detail.

With regard to the size of the workflow: I was referring to one with many jobs. The size of the data per job would not be bigger than a PhononDos object from the phonon workflow or standard VASP outputs.

Restarting the job solves the issue. I don't need to restart the runner. I mostly get REMOTE_ERROR, and sometimes a process stops in the middle (e.g., it gets stuck when it downloads the data).

An additional suspicion that I have is that there could be a connectivity error within the flow.

@JaGeo
Collaborator Author

JaGeo commented Sep 13, 2024

I looked more closely into the errors: it sometimes seems to pick up an old project and the paths of the outputs. Maybe related to #177

@gpetretto
Contributor

Thanks for the updates. When you mention an "old project", do you really mean a different project, with a different configuration file present in the ~/.jfremote folder? Or another workflow in the same project?
In principle, the case in #177 is really more of an issue if the user tries to insert the same instance of the Flow multiple times. Otherwise, a random collision between uuids should be extremely unlikely, and it seems very unlikely that this happened in your case more than once.
Anyway, it should be relatively easy to check: if you query the jobs that had an issue by their uuids, it will show whether there is more than one with the same uuid (except those with a different job index).
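
The uuid check described above can be sketched in plain Python. This is only an illustration, not the jobflow-remote API: it assumes the job documents have already been fetched from the queue DB as dictionaries, and the field names `uuid`, `index` and `db_id`, the helper `find_uuid_collisions` and the sample records are all hypothetical.

```python
from collections import defaultdict


def find_uuid_collisions(jobs):
    """Return {(uuid, index): [db_id, ...]} for every key seen more than once.

    Jobs that share a uuid but differ in job index are not reported,
    since that is the expected situation for reruns of the same job.
    """
    seen = defaultdict(list)
    for job in jobs:
        seen[(job["uuid"], job["index"])].append(job["db_id"])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


# Hypothetical records standing in for documents from the queue DB.
jobs = [
    {"db_id": "1", "uuid": "aaa", "index": 1},
    {"db_id": "2", "uuid": "aaa", "index": 2},  # same uuid, different index: fine
    {"db_id": "3", "uuid": "bbb", "index": 1},
    {"db_id": "4", "uuid": "bbb", "index": 1},  # same uuid and index: collision
]

print(find_uuid_collisions(jobs))  # {('bbb', 1): ['3', '4']}
```

An empty result would suggest that duplicated uuids are not the cause of the errors.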

Do you maybe have the stack trace reported when the jobs got into the REMOTE_ERROR state?

@JaGeo
Collaborator Author

JaGeo commented Sep 13, 2024

I am really referring to an old project. I will check if there is still an old jf runner running on a different computer and get back to you...

@gpetretto
Contributor

Thanks for the clarification. Do they use the same queue DB?
Indeed, checking whether an old runner is still active is a good idea. If that is the problem, #150 could prevent such an occurrence.

@JaGeo changed the title from "Remote errors in interactive mode when larger workflows are rund" to "Remote errors in interactive mode when larger workflows are run" on Sep 13, 2024
@JaGeo
Collaborator Author

JaGeo commented Sep 13, 2024

I think we can close this. I think there was simply a leftover jf remote process from mid-August still running in the background on the other cluster, even after I had logged out of the cluster.
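
For anyone hitting the same symptom, one way to look for such a leftover process is to filter the output of `ps` on each machine that may have run a runner. The sketch below is only an assumption about how one might do that: the helpers `find_processes` and `list_my_processes`, the search string `"jf runner"` and the sample `ps` output are hypothetical, and the actual command line of a runner process may look different.

```python
import getpass
import subprocess


def find_processes(ps_output, needle):
    """Return (pid, elapsed, command) for each ps row whose command contains needle."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, elapsed time, full command line
        if len(parts) == 3 and needle in parts[2]:
            matches.append((int(parts[0]), parts[1], parts[2]))
    return matches


def list_my_processes():
    """Run ps for the current user, printing pid, elapsed time and command line."""
    result = subprocess.run(
        ["ps", "-u", getpass.getuser(), "-o", "pid,etime,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical ps output; on a real machine use list_my_processes() instead.
sample = """\
  PID     ELAPSED COMMAND
 4242 27-06:12:40 python /home/user/bin/jf runner start
 5151       00:03 bash
"""

for pid, elapsed, command in find_processes(sample, "jf runner"):
    print(pid, elapsed, command)
```

A long elapsed time (here 27 days in the made-up sample) is a hint that the process predates the current session.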

@JaGeo closed this as completed on Sep 13, 2024
@gpetretto
Contributor

Thanks for the update.
