
WIP: Check only one runner #150

Open · wants to merge 16 commits into develop
Conversation

@davidwaroquiers (Member) commented:

This PR adds info to the auxiliary collection when the runner is started. The idea is to warn the user that another runner may already be running, in which case bad things could happen.

Currently it just inserts the info when starting the runner and removes it when stopping it; a sketch of the mechanism follows the list below.

Still to be done:

  • Deal with the locked document (if two people try to start the runner at exactly the same time; probably not very common, but still possible)
  • Deal with the specific goal of this PR: what do we do when this happens?
  • Deal with the case where the document with the runner info is not present in the auxiliary collection
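
A minimal sketch of that mechanism, assuming pymongo; the collection name and the fields stored in the document are hypothetical, not necessarily the ones this PR uses:

import datetime
import os
import socket

from pymongo import MongoClient

# Hypothetical database/collection names and document layout.
aux = MongoClient()["jobflow_remote"]["auxiliary"]

def register_runner() -> None:
    """Record that a runner was started on this machine."""
    aux.update_one(
        {"running_runner": {"$exists": True}},
        {
            "$set": {
                "running_runner": {
                    "hostname": socket.gethostname(),
                    "pid": os.getpid(),
                    "start_time": datetime.datetime.utcnow().isoformat(),
                }
            }
        },
        upsert=True,
    )

def unregister_runner() -> None:
    """Clear the record when the runner is stopped."""
    aux.update_one(
        {"running_runner": {"$exists": True}},
        {"$set": {"running_runner": None}},
    )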

@codecov-commenter commented Jul 18, 2024:

Codecov Report

Attention: Patch coverage is 36.13445% with 152 lines in your changes missing coverage. Please review.

Project coverage is 47.03%. Comparing base (29eb2dc) to head (c0d07de).

Files with missing lines                    Patch %   Lines
src/jobflow_remote/jobs/daemon.py           45.13%    52 Missing and 10 partials ⚠️
src/jobflow_remote/jobs/jobcontroller.py    26.25%    59 Missing ⚠️
src/jobflow_remote/cli/admin.py             17.85%    23 Missing ⚠️
src/jobflow_remote/utils/db.py              38.46%    8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #150      +/-   ##
===========================================
- Coverage    47.17%   47.03%   -0.15%     
===========================================
  Files           44       44              
  Lines         5314     5492     +178     
  Branches      1164     1208      +44     
===========================================
+ Hits          2507     2583      +76     
- Misses        2549     2645      +96     
- Partials       258      264       +6     
Files with missing lines                    Coverage Δ
src/jobflow_remote/cli/utils.py             31.95% <100.00%> (ø)
src/jobflow_remote/jobs/runner.py           67.01% <ø> (ø)
src/jobflow_remote/utils/db.py              49.31% <38.46%> (-0.69%) ⬇️
src/jobflow_remote/cli/admin.py             24.56% <17.85%> (-2.19%) ⬇️
src/jobflow_remote/jobs/jobcontroller.py    33.33% <26.25%> (-0.58%) ⬇️
src/jobflow_remote/jobs/daemon.py           48.77% <45.13%> (+5.54%) ⬆️

@davidwaroquiers (Member Author) commented:

Hi @gpetretto

There are still a few things to be done, but before going on I think it would be better to have an expert's eye on this ;)

Resolved review threads: src/jobflow_remote/cli/admin.py (outdated), src/jobflow_remote/jobs/daemon.py (three threads, two outdated)
db_filter = {"running_runner": {"$exists": True}}
with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
    if lock is None:
        # print('Handle case where the document is not present!')
Contributor commented:

To avoid repeating the same pieces of code, you could add a method to DaemonManager that is a context manager (e.g. lock_runner_doc) and contains all the logic that needs to be checked. That will make it easier to update if something changes.

Member Author replied:

Indeed, I was planning to.
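
A sketch of what such a context manager on DaemonManager could look like; the name lock_runner_doc is taken from the suggestion above, and raising on a missing document is illustrative, since the actual handling is still an open point of this PR:

from contextlib import contextmanager

@contextmanager
def lock_runner_doc(self):
    """Lock the auxiliary document holding the runner info (sketch).

    Centralizes the missing-document handling so call sites do not
    have to repeat it.
    """
    db_filter = {"running_runner": {"$exists": True}}
    with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
        if lock is None or lock.locked_document is None:
            raise RuntimeError(
                "No runner document found in the auxiliary collection"
            )
        yield lock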

Resolved review threads: src/jobflow_remote/jobs/jobcontroller.py (four outdated threads), tests/db/cli/test_admin.py (outdated)
@gpetretto (Contributor) left a comment:

I have added some more comments on the latest changes.

    ] = None,
) -> None:
    """
    Upgrade the jobflow database.
Contributor commented:

Maybe clarify what the "upgrade" means here: it can/should be run when upgrading the jobflow-remote version.

Also, should we suggest creating a backup of the db before proceeding? Not sure if that is needed, or possible at all... Maybe we should have a backup command as well, using MongoDB to dump the collections to JSON files?

Member Author replied:

Sure, I'll clarify what it means.
About the backup command: good idea. I propose we leave that for a different PR; I'll open an issue for it.
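
For reference, a backup command along those lines can be sketched independently of jobflow-remote's own API; everything below is hypothetical, using only pymongo and bson.json_util:

from pathlib import Path

from bson.json_util import dumps
from pymongo import MongoClient

def backup_db(uri: str, db_name: str, out_dir: str) -> None:
    """Dump every collection of a database to one JSON file each."""
    db = MongoClient(uri)[db_name]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name in db.list_collection_names():
        docs = list(db[name].find())
        # bson.json_util serializes ObjectId, datetime, etc. correctly.
        (out / f"{name}.json").write_text(dumps(docs, indent=2))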

        return True
    db_filter = {"running_runner": {"$exists": True}}
    with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
        if lock is None:
Contributor commented:

Probably here you meant to check lock.locked_document?

Member Author replied:

No, I think it was indeed lock is None: in the case where the document is not found, the lock itself returns None.

Contributor replied:

In principle lock should always be an instance of MongoLock. If not, I would say it is a bug.

Member Author replied:

Hmm, maybe I was not awake enough. I'll check and see whether it should indeed be lock.locked_document (it probably should).
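
For readers following the exchange: the semantics under discussion are that the context manager always yields a MongoLock instance, and the outcome of the document lookup lives on that object. A self-contained illustration of the two attributes (FakeLock is a stand-in, not the real class):

class FakeLock:
    """Stand-in for MongoLock, only to illustrate the two attributes."""
    def __init__(self, locked_document=None, unavailable_document=None):
        self.locked_document = locked_document            # doc we acquired
        self.unavailable_document = unavailable_document  # doc held elsewhere

lock = FakeLock()  # the context manager yielded a lock, but no document
if lock.locked_document is None and lock.unavailable_document is None:
    print("document not present in the collection")
elif lock.locked_document is None:
    print("document exists but is locked by another process")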

if lock is None:
    # print('Handle case where the document is not present!')
    pass
if lock.is_locked:
Contributor commented:

If it gets here, lock.is_locked is probably always True. You need to check unavailable_document instead (and lock with get_locked_doc=True?).
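
A sketch of the suggested check, assuming lock_auxiliary accepts get_locked_doc=True and that the lock exposes locked_document and unavailable_document (names taken from the comment above):

def check_runner_doc(job_controller):
    """Sketch: distinguish a missing document from one locked elsewhere."""
    db_filter = {"running_runner": {"$exists": True}}
    with job_controller.lock_auxiliary(
        filter=db_filter, get_locked_doc=True
    ) as lock:
        if lock.locked_document is None:
            if lock.unavailable_document is not None:
                raise RuntimeError("runner document locked by another process")
            raise RuntimeError("no runner document in the auxiliary collection")
        # safe to proceed: we hold the lock on the document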

        DaemonStatus.SHUT_DOWN,
    ):
        lock.update_on_release = {"$set": {"running_runner": None}}
        return True
Contributor commented:

Could there be a fault in the logic here? If the runner is started on one machine and "stop" is then run from another, the DaemonStatus on the second machine will be SHUT_DOWN, but I suppose running_runner should not be set to None in that case.

Member Author replied:

Indeed... I guess it should only set it to None if the status was not already stopped. Or should I match the information in the running_runner document against the information about the runner on the local machine?

Member Author added:

Or actually just do nothing...

@gpetretto (Contributor) commented Aug 22, 2024:

I don't know. What happens if I have only one machine that runs the Runner (which is likely 99% of the use cases)? It is started, but the machine goes down abruptly (e.g. a power cut). The document will not be updated, but the daemon is shut down. What happens then? How am I supposed to deal with that situation? I likely don't want to go through an annoying procedure of having to reset the document manually before I can restart the runner. Could it make sense to have a global check: whenever the runner is not active and the document matches the local information, clean up the content of the document?

Actually, this is also linked to how the start method works. In a scenario like the one I just described, I would currently never be able to start the runner again, right?
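
A sketch of such a global check: clear the document only when the daemon is down locally and the stored info points at this machine. The hostname/pid fields are hypothetical, matching the sketch earlier in this thread:

import os
import socket

def runner_doc_is_stale(doc: dict) -> bool:
    """Return True if the running_runner entry refers to this machine
    and its process is gone, so the entry can safely be cleared."""
    info = doc.get("running_runner")
    if not info:
        return False  # nothing to clean up
    return (
        info.get("hostname") == socket.gethostname()
        and not _pid_alive(info.get("pid"))
    )

def _pid_alive(pid) -> bool:
    """Best-effort liveness check for a local process id."""
    if pid is None:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence
    except OSError:
        return False
    return True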

status = self.check_status()
if status == DaemonStatus.SHUT_DOWN:
    logger.info("supervisord is not running. No process is running")
    lock.update_on_release = {"$set": {"running_runner": None}}
Contributor commented:

As in the comment above, if this is executed on a machine where the runner is SHUT_DOWN, running_runner will be set to None even if it is actually running on another machine.

    + jobflow_check
    + full_check
    + "Carefully check this is ok and use 'jf admin upgrade --force'"
)
Contributor commented:

I already raised a doubt in the previous review, but I still have doubts about this part of the code. In the previous message you said that this check is disabled by default, but as far as I can tell it will always run unless the --force option is set.
Moreover, the message pushes the user to check the upgrades, but I doubt any user will have a clue how to determine whether this could be an issue. It will be really difficult to say whether a difference in some package will impact the upgrade of the jobflow-remote DB/configuration. And even if there is some kind of issue, it is already present in the fact that the packages in the Python environment have already been changed.
I would find it more useful to have a sort of dry run here instead, which would report which operations are going to be performed. At least one would then have a better idea of what to back up and what is going to happen. Would that be feasible?
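
A sketch of what such a dry run could report; UpgradeOp and the collected operations are hypothetical, only the shape of the output matters:

from dataclasses import dataclass

@dataclass
class UpgradeOp:
    """One planned upgrade step (hypothetical structure)."""
    collection: str
    description: str

def print_dry_run(ops: list[UpgradeOp]) -> None:
    """Report what an upgrade would do without changing anything."""
    print(f"{len(ops)} operation(s) would be performed:")
    for op in ops:
        print(f"  [{op.collection}] {op.description}")
    print("No changes applied. Run 'jf admin upgrade' to apply them.")

# Example output for two made-up operations:
print_dry_run([
    UpgradeOp("auxiliary", "add a 'running_runner' document"),
    UpgradeOp("jobs", "rename field 'x' to 'y'"),
])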

@@ -3570,10 +3626,128 @@ def remove_batch_process(self, process_id: str, worker: str) -> dict:
            upsert=True,
        )

    def upgrade_check_jobflow(self):
Contributor commented:

Why does jobflow have its own specific function? Can't this be in upgrade_full_check like the other dependencies?

            return msg
        return ""

    def upgrade(self, previous_version, this_version):
Contributor commented:

Specify the return type. Is there any case where this returns False?
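
For instance (a signature sketch only; the str arguments and the bool return are assumptions based on the surrounding code):

def upgrade(self, previous_version: str, this_version: str) -> bool:
    """Upgrade the jobflow database.

    Returns True on success; whether a False path exists is exactly
    the question above.
    """
    ...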

jc = get_job_controller()
if not force:
    jobflow_check = jc.upgrade_check_jobflow()
    full_check = jc.upgrade_full_check()
Contributor commented:

Maybe the checks should go through the upgrader, rather than directly through the JobController? Or, since these are just checks on the db, should they go directly through the JobController?

_wait_daemon_started(daemon_manager)

mocker.patch.object(
    daemon_manager, "check_status", return_value="FAKE_NOT_RUNNING_LOCALLY"
Contributor commented:

Wouldn't it be better to return a proper DaemonStatus value, e.g. SHUT_DOWN?
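
That would look roughly like this; mocker and daemon_manager are the fixtures from the surrounding test, and the import path is an assumption based on the files touched by this PR:

from jobflow_remote.jobs.daemon import DaemonStatus

mocker.patch.object(
    daemon_manager, "check_status", return_value=DaemonStatus.SHUT_DOWN
)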

Development

Successfully merging this pull request may close these issues.

Check there is not already a runner running
3 participants