Add more explanation to some section as per comments

Epistimio · Jul 22, 2022 · 527825a · 527825a
1 parent 80515d4
commit 527825a
Showing 1 changed file with 46 additions and 7 deletions.
diff --git a/docs/src/developer/plan.rst b/docs/src/developer/plan.rst
@@ -16,16 +16,27 @@ The code in :py:func:`orion.core.cli.main` will parse the command line
 arguments and route to :py:func:`orion.core.cli.hunt.main`.
 
 The command line arguments are passed to
-:py:func:`orion.core.io.experiment_builder.build_from_args`. This will
-massage the parsed command line arguments and merge that configuration
-with the config file and the defaults with various helpers from
-:py:mod:`orion.core.io.resolve_config` to build the final
-configuration. The result is eventually handled off to
+:py:func:`orion.core.io.experiment_builder.build_from_args`, which
+does some setup and hands over the arguments to
+:py:func:`orion.core.io.experiment_builder.build`. This will hand over
+the configuration to
+:py:func:`orion.core.io.experiment_builder.consolidate_config` which
+will look up the experiment in the configured storage to see if it's
+already there and merge the loaded configuration with the provided one
+with various helpers from :py:mod:`orion.core.io.resolve_config` to
+build the final configuration. The result is eventually handled off to
 :py:func:`orion.core.io.experiment_builder.create_experiment` to
 create an :py:class:`orion.core.worker.experiment.Experiment` and set
 its properties.
 
-The created experiments finds its way back to
+If the experiment is new, meaning it has no storage id, then it will
+attempt to save it to storage, which may conflict in case another
+instance of ``orion hunt`` is doing the same thing. The storage is
+responsible for repoting conflicts and
+:py:func:`orion.core.io.experiment_builder.build` is called again
+recursively in that case to retry the whole operation.
+
+The created experiment finds its way back to
 :py:func:`orion.core.cli.hunt.main` and is handed off to
 :py:func:`orion.core.cli.hunt.workon` along with some more
 configuration for the workers.
@@ -60,7 +71,7 @@ This will first check if any trials are available in the storage using
 :py:meth:`orion.core.worker.experiment.Experiment.reserve_trial`.
 
 If none are available, it will produce new trials using
-:py:meth:`orion.core.worker.producer.Producer.produce()` which loads
+:py:meth:`orion.core.worker.producer.Producer.produce` which loads
 the state of the algorithm from the storage, runs it to suggest new
 :py:class:`orion.core.worker.trial.Trial` and saves both the new
 trials and the new algorithm state to the storage. This is protected
@@ -95,3 +106,31 @@ the count of broken trials if they did not finish successfully.
 
 Finally we monitor the total amount of time spent waiting for trials
 to finish.
+
+
+Stopping criteria
+~~~~~~~~~~~~~~~~~
+
+There are multiple criteria that are monitored to stop the
+experiment.
+
+The first obvious one is the configured maximum number of trials to
+run. If this is reached, then we stop running more. This is checked at
+the beginning of the loop with
+:py:attr:`orion.client.runner.Runner.is_running`.
+
+The experiment can also stop if too many trials fail, either because
+they fail to start, they crashed, were killed (like by an external job
+scheduler) or the take too much time to complete. This is checked in
+:py:meth:`orion.client.runner.Runner.gather` with
+:py:attr:`orion.client.runner.Runner.is_broken`.
+
+If one of the workers returns an unexpected result the experiment is
+also stop immediately because it is assume that something is wrong
+with either the code or the configuration and spending more time
+computing stuff will not fix it. This is also checked for in
+:py:meth:`orion.client.runner.Runner.gather`.
+
+Finaly if the loop spends too much time waiting and nothing happens
+the experiment is considered stalled and will also stop. This is
+checked at the end of :py:meth:`orion.client.runner.Runner.run`.