WIP: Check only one runner #150
base: develop
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #150 +/- ##
===========================================
- Coverage 47.17% 47.03% -0.15%
===========================================
Files 44 44
Lines 5314 5492 +178
Branches 1164 1208 +44
===========================================
+ Hits 2507 2583 +76
- Misses 2549 2645 +96
- Partials 258 264 +6
Force-pushed from e8a9534 to e663546 (Compare)
Hi @gpetretto Still a few things to be done, but I think before going on it would be better to have the expert's eye on this ;)
db_filter = {"running_runner": {"$exists": True}}
with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
    if lock is None:
        # print('Handle case where the document is not present!')
To avoid repeating the same pieces of code, you could add a method to DaemonManager that is a context manager (e.g. lock_runner_doc) that contains all the logic that needs to be checked. That will make it easier to update if something changes.
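A minimal sketch of what such a context manager could look like, assuming the lock_auxiliary API used in the snippet above (the name lock_runner_doc and the error handling are illustrative, not the final implementation):

```python
from contextlib import contextmanager


@contextmanager
def lock_runner_doc(self):
    """Hypothetical DaemonManager helper that centralizes the runner-document locking."""
    db_filter = {"running_runner": {"$exists": True}}
    with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
        if lock.locked_document is None:
            # Document missing or already locked: handle it in one place
            # instead of repeating the check at every call site.
            raise RuntimeError("Could not acquire the running_runner document")
        yield lock
```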
Indeed, I was planning to.
I have added some more comments to the latest changes.
] = None,
) -> None:
    """
    Upgrade the jobflow database.
Maybe clarify what the "upgrade" means here: it can/should be run when upgrading the jobflow-remote version.
Also suggesting to create a backup of the db before proceeding? Not sure if needed or possible at all... Maybe we should have a backup command as well? Using MongoDB to dump the collections to json files?
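If a backup command is added, a rough sketch of what it could do, assuming pymongo/bson are available and dumping each collection to its own JSON file (function and argument names are illustrative):

```python
from pathlib import Path

from bson.json_util import dumps
from pymongo import MongoClient


def backup_collections(uri: str, db_name: str, out_dir: str) -> None:
    """Dump every collection of the given database to one JSON file per collection."""
    db = MongoClient(uri)[db_name]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name in db.list_collection_names():
        docs = list(db[name].find())
        (out / f"{name}.json").write_text(dumps(docs, indent=2))
```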
Sure, I'll clarify what it means.
About the backup command, good idea. I propose we leave that for a different PR. I'll open an issue for it.
return True
db_filter = {"running_runner": {"$exists": True}}
with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
    if lock is None:
Probably here you meant to check if lock.locked_document?
No, I think it was indeed lock is None: in the case where the document is not found, the lock itself returns None.
In principle lock should always be an instance of MongoLock: see jobflow-remote/src/jobflow_remote/utils/db.py, line 266 in 45fd5ec (return self).
If not I would say it is a bug.
Hmm, maybe I was not awake enough. I'll check and see if it should indeed be lock.locked_document (probably should).
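For reference, the corrected check being discussed would then look roughly like this (a sketch based on the snippet above, not the final code):

```python
db_filter = {"running_runner": {"$exists": True}}
with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
    if lock.locked_document is None:
        # The document was not found (or could not be acquired); handle it here
        # instead of testing `lock is None`, which should never happen.
        ...
```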
if lock is None:
    # print('Handle case where the document is not present!')
    pass
if lock.is_locked:
If it gets here, lock.is_locked is probably always True. Need to check unavailable_document (and lock with get_locked_doc=True?).
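A sketch of the suggested pattern; get_locked_doc and unavailable_document are taken from the comment above and assumed to be part of the lock_auxiliary/MongoLock API:

```python
with self.job_controller.lock_auxiliary(
    filter=db_filter, get_locked_doc=True
) as lock:
    if lock.locked_document is None:
        if lock.unavailable_document is not None:
            # The document exists but is locked by another process,
            # e.g. another runner is currently holding it.
            ...
        else:
            # The document does not exist at all.
            ...
```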
    DaemonStatus.SHUT_DOWN,
):
    lock.update_on_release = {"$set": {"running_runner": None}}
    return True
Could there be a fault in the logic here? If the runner is started on one machine and "stop" is then run from another, the DaemonStatus will be SHUT_DOWN, but I suppose the running_runner should not be set to None in that case.
Indeed... I guess it should only set it to None if the status was not stopped before. Or should I match the information in the running runner against the information about the runner on the machine?
Or actually just not do anything ...
I don't know. What happens if I have only one machine that runs the Runner (which is likely 99% of the use cases)? It is started but the machine goes down abruptly (e.g. a power cut). The document will not be updated but the daemon is shut down. What happens then? How am I supposed to deal with that situation? I likely don't want to go through an annoying procedure of having to reset the document manually before I can restart the db. Could it make sense to have a global check so that whenever the runner is not active and the document matches the local information, it cleans up the content of the document?
Actually this is also linked to how the start method works. Currently, in a scenario like the one I just described, I would never be able to start the runner again, right?
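One possible shape for the "match the local information" idea; purely illustrative, and field names such as hostname are assumptions rather than the actual document schema:

```python
import socket

doc = lock.locked_document or {}
running = doc.get("running_runner") or {}
# Only clean up the record if it refers to *this* machine, so that a stopped
# or crashed local daemon cannot wipe the entry of a runner alive elsewhere,
# while a machine that died abruptly can still recover on restart.
if running.get("hostname") == socket.gethostname():
    lock.update_on_release = {"$set": {"running_runner": None}}
```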
status = self.check_status()
if status == DaemonStatus.SHUT_DOWN:
    logger.info("supervisord is not running. No process is running")
    lock.update_on_release = {"$set": {"running_runner": None}}
As in the comment above, if this is executed on a machine where the runner is SHUT_DOWN, the running_runner will be set to None even if it is actually running on another machine.
+ jobflow_check
+ full_check
+ "Carefully check this is ok and use 'jf admin upgrade --force'"
)
I already raised a doubt in the previous review, but I still have doubts about this part of the code. In the previous message you said that this check is disabled by default, but to me it will always run if the --force option is not set.
Moreover, the message pushes the user to check the upgrades, but I doubt any user will have a clue how to determine if this could be an issue. It will be really difficult to say whether the difference in some package will impact the upgrade of the jobflow-remote DB/configuration. And even if there is any kind of issue, it is already present in the fact that the packages in the python environment have already been changed.
I would find it more useful to have instead a sort of dry-run here that will tell which operations are going to be performed. At least one could then have a better idea of what to back up and what is going to happen. Would that be feasible?
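A minimal sketch of what such a dry-run could report, assuming the upgrade operations can be described before being applied (describe_upgrade_actions and the dry_run flag are hypothetical names):

```python
if dry_run:
    # Hypothetical helper: list the operations the upgrade would perform
    # without modifying the database.
    actions = jc.describe_upgrade_actions(previous_version, this_version)
    print("The following operations would be performed:")
    for action in actions:
        print(f"  - {action}")
    return
```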
@@ -3570,10 +3626,128 @@ def remove_batch_process(self, process_id: str, worker: str) -> dict:
    upsert=True,
)

def upgrade_check_jobflow(self):
Why does jobflow have its own specific function? Can't this be in upgrade_full_check like the other dependencies?
    return msg
return ""

def upgrade(self, previous_version, this_version):
Specify the return type. Is there any case where this returns False?
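For example, something along these lines; whether a False return path exists is exactly the open question:

```python
def upgrade(self, previous_version: str, this_version: str) -> bool:
    """Upgrade the database between the two given versions.

    Returns True on success; if no code path can return False, consider
    returning None (and raising on error) instead of a bool.
    """
    ...
```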
jc = get_job_controller()
if not force:
    jobflow_check = jc.upgrade_check_jobflow()
    full_check = jc.upgrade_full_check()
Maybe the checks should go through the upgrader, rather than directly to the JobController? Or, since these are just checks on the db, should they go directly through the JobController?
_wait_daemon_started(daemon_manager)

mocker.patch.object(
    daemon_manager, "check_status", return_value="FAKE_NOT_RUNNING_LOCALLY"
)
Wouldn't it be better to return a proper DaemonStatus value, e.g. SHUT_DOWN?
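For instance (assuming DaemonStatus is already imported in the test module):

```python
# Returning a real enum member keeps the test aligned with the production
# type instead of a placeholder string.
mocker.patch.object(
    daemon_manager, "check_status", return_value=DaemonStatus.SHUT_DOWN
)
```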
Added RunnersLockedError.
Force-pushed from e7112a7 to c0d07de (Compare)
This PR adds info to the auxiliary collection when the runner is started. The idea would be to warn the user that there may be another runner running already, in which case bad things could happen.
Currently just inserts the info when starting the runner and removes it when stopping it.
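Schematically, the behaviour described above could look like the following; the content of the running_runner entry (hostname, start_time) is an assumption for illustration, and aux_collection stands for whatever handle the JobController exposes for the auxiliary collection:

```python
import datetime
import socket

# On `jf runner start`: record which machine/process is running the runner.
runner_info = {
    "hostname": socket.gethostname(),
    "start_time": datetime.datetime.utcnow().isoformat(),
}
aux_collection.update_one(
    {"running_runner": {"$exists": True}},
    {"$set": {"running_runner": runner_info}},
)

# On `jf runner stop`: clear the record again.
aux_collection.update_one(
    {"running_runner": {"$exists": True}},
    {"$set": {"running_runner": None}},
)
```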
Still to be done: