
WIP: Check only one runner #150

Open · wants to merge 16 commits into develop
Conversation

@davidwaroquiers (Member) commented:

This PR adds info to the auxiliary collection when the runner is started. The idea is to warn the user that another runner may already be running, in which case bad things could happen.

Currently it just inserts the info when starting the runner and removes it when stopping it; a sketch of the mechanism follows the list below.

Still to be done:

  • Deal with the locked document (if two people try to start the runner at exactly the same time; probably not very common, but still possible)
  • Deal with the specific goal of this PR: what do we do when this happens?
  • Deal with the case where the document with the runner info is not present in the auxiliary collection
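
A minimal sketch of that mechanism, assuming pymongo; the collection name and the fields stored in the document are hypothetical, not necessarily the ones this PR uses:

import datetime
import os
import socket

from pymongo import MongoClient

# Hypothetical database/collection names and document layout.
aux = MongoClient()["jobflow_remote"]["auxiliary"]

def register_runner() -> None:
    """Record that a runner was started on this machine."""
    aux.update_one(
        {"running_runner": {"$exists": True}},
        {
            "$set": {
                "running_runner": {
                    "hostname": socket.gethostname(),
                    "pid": os.getpid(),
                    "start_time": datetime.datetime.utcnow().isoformat(),
                }
            }
        },
        upsert=True,
    )

def unregister_runner() -> None:
    """Clear the record when the runner is stopped."""
    aux.update_one(
        {"running_runner": {"$exists": True}},
        {"$set": {"running_runner": None}},
    )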

@codecov-commenter commented Jul 18, 2024:

Codecov Report

Attention: Patch coverage is 36.13445% with 152 lines in your changes missing coverage. Please review.

Project coverage is 47.03%. Comparing base (29eb2dc) to head (c0d07de).

Files with missing lines                    Patch %   Lines
src/jobflow_remote/jobs/daemon.py           45.13%    52 Missing and 10 partials ⚠️
src/jobflow_remote/jobs/jobcontroller.py    26.25%    59 Missing ⚠️
src/jobflow_remote/cli/admin.py             17.85%    23 Missing ⚠️
src/jobflow_remote/utils/db.py              38.46%    8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #150      +/-   ##
===========================================
- Coverage    47.17%   47.03%   -0.15%     
===========================================
  Files           44       44              
  Lines         5314     5492     +178     
  Branches      1164     1208      +44     
===========================================
+ Hits          2507     2583      +76     
- Misses        2549     2645      +96     
- Partials       258      264       +6     
Files with missing lines                    Coverage Δ
src/jobflow_remote/cli/utils.py             31.95% <100.00%> (ø)
src/jobflow_remote/jobs/runner.py           67.01% <ø> (ø)
src/jobflow_remote/utils/db.py              49.31% <38.46%> (-0.69%) ⬇️
src/jobflow_remote/cli/admin.py             24.56% <17.85%> (-2.19%) ⬇️
src/jobflow_remote/jobs/jobcontroller.py    33.33% <26.25%> (-0.58%) ⬇️
src/jobflow_remote/jobs/daemon.py           48.77% <45.13%> (+5.54%) ⬆️

@davidwaroquiers (Member Author) commented:

Hi @gpetretto

There are still a few things to be done, but before going on I think it would be better to have an expert's eye on this ;)

Resolved review threads: src/jobflow_remote/cli/admin.py (outdated), src/jobflow_remote/jobs/daemon.py (three threads, two outdated)
db_filter = {"running_runner": {"$exists": True}}
with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
    if lock is None:
        # print('Handle case where the document is not present!')
Contributor commented:

To avoid repeating the same pieces of code, you could add a method to DaemonManager that is a context manager (e.g. lock_runner_doc) and contains all the logic that needs to be checked. That will make it easier to update if something changes.

Member Author replied:

Indeed, I was planning to.
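
A sketch of what such a context manager on DaemonManager could look like; the name lock_runner_doc is taken from the suggestion above, and raising on a missing document is illustrative, since the actual handling is still an open point of this PR:

from contextlib import contextmanager

@contextmanager
def lock_runner_doc(self):
    """Lock the auxiliary document holding the runner info (sketch).

    Centralizes the missing-document handling so call sites do not
    have to repeat it.
    """
    db_filter = {"running_runner": {"$exists": True}}
    with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
        if lock is None or lock.locked_document is None:
            raise RuntimeError(
                "No runner document found in the auxiliary collection"
            )
        yield lock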

Resolved review threads: src/jobflow_remote/jobs/jobcontroller.py (four outdated threads), tests/db/cli/test_admin.py (outdated)
@gpetretto (Contributor) left a comment:

I have added some more comments on the latest changes.

    ] = None,
) -> None:
    """
    Upgrade the jobflow database.
Contributor commented:

Maybe clarify what the "upgrade" means here: it can/should be run when upgrading the jobflow-remote version.

Also, should we suggest creating a backup of the db before proceeding? Not sure if that is needed, or possible at all... Maybe we should have a backup command as well, using MongoDB to dump the collections to JSON files?

Member Author replied:

Sure, I'll clarify what it means.
About the backup command: good idea. I propose we leave that for a different PR; I'll open an issue for it.
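
For reference, a backup command along those lines can be sketched independently of jobflow-remote's own API; everything below is hypothetical, using only pymongo and bson.json_util:

from pathlib import Path

from bson.json_util import dumps
from pymongo import MongoClient

def backup_db(uri: str, db_name: str, out_dir: str) -> None:
    """Dump every collection of a database to one JSON file each."""
    db = MongoClient(uri)[db_name]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name in db.list_collection_names():
        docs = list(db[name].find())
        # bson.json_util serializes ObjectId, datetime, etc. correctly.
        (out / f"{name}.json").write_text(dumps(docs, indent=2))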

        return True
    db_filter = {"running_runner": {"$exists": True}}
    with self.job_controller.lock_auxiliary(filter=db_filter) as lock:
        if lock is None:
Contributor commented:

Probably here you meant to check lock.locked_document?

Member Author replied:

No, I think it was indeed lock is None: in the case where the document is not found, the lock itself returns None.

Contributor replied:

In principle lock should always be an instance of MongoLock. If not, I would say it is a bug.

Member Author replied:

Hmm, maybe I was not awake enough. I'll check and see whether it should indeed be lock.locked_document (it probably should).
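
For readers following the exchange: the semantics under discussion are that the context manager always yields a MongoLock instance, and the outcome of the document lookup lives on that object. A self-contained illustration of the two attributes (FakeLock is a stand-in, not the real class):

class FakeLock:
    """Stand-in for MongoLock, only to illustrate the two attributes."""
    def __init__(self, locked_document=None, unavailable_document=None):
        self.locked_document = locked_document            # doc we acquired
        self.unavailable_document = unavailable_document  # doc held elsewhere

lock = FakeLock()  # the context manager yielded a lock, but no document
if lock.locked_document is None and lock.unavailable_document is None:
    print("document not present in the collection")
elif lock.locked_document is None:
    print("document exists but is locked by another process")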

if lock is None:
    # print('Handle case where the document is not present!')
    pass
if lock.is_locked:
Contributor commented:

If it gets here, lock.is_locked is probably always True. You need to check unavailable_document instead (and lock with get_locked_doc=True?).
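
A sketch of the suggested check, assuming lock_auxiliary accepts get_locked_doc=True and that the lock exposes locked_document and unavailable_document (names taken from the comment above):

def check_runner_doc(job_controller):
    """Sketch: distinguish a missing document from one locked elsewhere."""
    db_filter = {"running_runner": {"$exists": True}}
    with job_controller.lock_auxiliary(
        filter=db_filter, get_locked_doc=True
    ) as lock:
        if lock.locked_document is None:
            if lock.unavailable_document is not None:
                raise RuntimeError("runner document locked by another process")
            raise RuntimeError("no runner document in the auxiliary collection")
        # safe to proceed: we hold the lock on the document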

        DaemonStatus.SHUT_DOWN,
    ):
        lock.update_on_release = {"$set": {"running_runner": None}}
        return True
Contributor commented:

Could there be a fault in the logic here? If the runner is started on one machine and "stop" is then run from another, the DaemonStatus on the second machine will be SHUT_DOWN, but I suppose running_runner should not be set to None in that case.

Member Author replied:

Indeed... I guess it should only set it to None if the status was not already stopped. Or should I match the information in the running_runner document against the information about the runner on the local machine?

Member Author added:

Or actually just do nothing...

@gpetretto (Contributor) commented Aug 22, 2024:

I don't know. What happens if I have only one machine that runs the Runner (which is likely 99% of the use cases)? It is started, but the machine goes down abruptly (e.g. a power cut). The document will not be updated, but the daemon is shut down. What happens then? How am I supposed to deal with that situation? I likely don't want to go through an annoying procedure of having to reset the document manually before I can restart the runner. Could it make sense to have a global check: whenever the runner is not active and the document matches the local information, clean up the content of the document?

Actually, this is also linked to how the start method works. In a scenario like the one I just described, I would currently never be able to start the runner again, right?
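
A sketch of such a global check: clear the document only when the daemon is down locally and the stored info points at this machine. The hostname/pid fields are hypothetical, matching the sketch earlier in this thread:

import os
import socket

def runner_doc_is_stale(doc: dict) -> bool:
    """Return True if the running_runner entry refers to this machine
    and its process is gone, so the entry can safely be cleared."""
    info = doc.get("running_runner")
    if not info:
        return False  # nothing to clean up
    return (
        info.get("hostname") == socket.gethostname()
        and not _pid_alive(info.get("pid"))
    )

def _pid_alive(pid) -> bool:
    """Best-effort liveness check for a local process id."""
    if pid is None:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence
    except OSError:
        return False
    return True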

status = self.check_status()
if status == DaemonStatus.SHUT_DOWN:
    logger.info("supervisord is not running. No process is running")
    lock.update_on_release = {"$set": {"running_runner": None}}
Contributor commented:

As in the comment above, if this is executed on a machine where the runner is SHUT_DOWN, running_runner will be set to None even if it is actually running on another machine.

    + jobflow_check
    + full_check
    + "Carefully check this is ok and use 'jf admin upgrade --force'"
)
Contributor commented:

I already raised a doubt in the previous review, but I still have doubts about this part of the code. In the previous message you said that this check is disabled by default, but as far as I can tell it will always run unless the --force option is set.
Moreover, the message pushes the user to check the upgrades, but I doubt any user will have a clue how to determine whether this could be an issue. It will be really difficult to say whether a difference in some package will impact the upgrade of the jobflow-remote DB/configuration. And even if there is some kind of issue, it is already present in the fact that the packages in the Python environment have already been changed.
I would find it more useful to have a sort of dry run here instead, which would report which operations are going to be performed. At least one would then have a better idea of what to back up and what is going to happen. Would that be feasible?
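
A sketch of what such a dry run could report; UpgradeOp and the collected operations are hypothetical, only the shape of the output matters:

from dataclasses import dataclass

@dataclass
class UpgradeOp:
    """One planned upgrade step (hypothetical structure)."""
    collection: str
    description: str

def print_dry_run(ops: list[UpgradeOp]) -> None:
    """Report what an upgrade would do without changing anything."""
    print(f"{len(ops)} operation(s) would be performed:")
    for op in ops:
        print(f"  [{op.collection}] {op.description}")
    print("No changes applied. Run 'jf admin upgrade' to apply them.")

# Example output for two made-up operations:
print_dry_run([
    UpgradeOp("auxiliary", "add a 'running_runner' document"),
    UpgradeOp("jobs", "rename field 'x' to 'y'"),
])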

@@ -3570,10 +3626,128 @@ def remove_batch_process(self, process_id: str, worker: str) -> dict:
            upsert=True,
        )

    def upgrade_check_jobflow(self):
Contributor commented:

Why does jobflow have its own specific function? Can't this be in upgrade_full_check like the other dependencies?

            return msg
        return ""

    def upgrade(self, previous_version, this_version):
Contributor commented:

Specify the return type. Is there any case where this returns False?
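
For instance (a signature sketch only; the str arguments and the bool return are assumptions based on the surrounding code):

def upgrade(self, previous_version: str, this_version: str) -> bool:
    """Upgrade the jobflow database.

    Returns True on success; whether a False path exists is exactly
    the question above.
    """
    ...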

jc = get_job_controller()
if not force:
    jobflow_check = jc.upgrade_check_jobflow()
    full_check = jc.upgrade_full_check()
Contributor commented:

Maybe the checks should go through the upgrader, rather than directly through the JobController? Or, since these are just checks on the db, should they go directly through the JobController?

_wait_daemon_started(daemon_manager)

mocker.patch.object(
    daemon_manager, "check_status", return_value="FAKE_NOT_RUNNING_LOCALLY"
Contributor commented:

Wouldn't it be better to return a proper DaemonStatus value, e.g. SHUT_DOWN?
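
That would look roughly like this; mocker and daemon_manager are the fixtures from the surrounding test, and the import path is an assumption based on the files touched by this PR:

from jobflow_remote.jobs.daemon import DaemonStatus

mocker.patch.object(
    daemon_manager, "check_status", return_value=DaemonStatus.SHUT_DOWN
)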

Development

Successfully merging this pull request may close these issues.

Check there is not already a runner running
3 participants