Add more explanation to some section as per comments
abergeron committed Jul 22, 2022
1 parent 80515d4 commit 527825a
Showing 1 changed file with 46 additions and 7 deletions.
53 changes: 46 additions & 7 deletions docs/src/developer/plan.rst
@@ -16,16 +16,27 @@ The code in :py:func:`orion.core.cli.main` will parse the command line
arguments and route to :py:func:`orion.core.cli.hunt.main`.

The command line arguments are passed to
-:py:func:`orion.core.io.experiment_builder.build_from_args`. This will
-massage the parsed command line arguments and merge that configuration
-with the config file and the defaults with various helpers from
-:py:mod:`orion.core.io.resolve_config` to build the final
-configuration. The result is eventually handled off to
+:py:func:`orion.core.io.experiment_builder.build_from_args`, which
+does some setup and hands over the arguments to
+:py:func:`orion.core.io.experiment_builder.build`. This will hand over
+the configuration to
+:py:func:`orion.core.io.experiment_builder.consolidate_config` which
+will look up the experiment in the configured storage to see if it's
+already there and merge the loaded configuration with the provided one
+with various helpers from :py:mod:`orion.core.io.resolve_config` to
+build the final configuration. The result is eventually handed off to
:py:func:`orion.core.io.experiment_builder.create_experiment` to
create an :py:class:`orion.core.worker.experiment.Experiment` and set
its properties.

-The created experiments finds its way back to
+If the experiment is new, meaning it has no storage id, then it will
+attempt to save it to storage, which may conflict if another
+instance of ``orion hunt`` is doing the same thing. The storage is
+responsible for reporting conflicts and
+:py:func:`orion.core.io.experiment_builder.build` is called again
+recursively in that case to retry the whole operation.
+
+The created experiment finds its way back to
:py:func:`orion.core.cli.hunt.main` and is handed off to
:py:func:`orion.core.cli.hunt.workon` along with some more
configuration for the workers.
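
The look-up, merge, save and retry flow described in this hunk can be sketched as follows. This is only an illustration under stated assumptions, not Orion's code: ``InMemoryStorage``, ``ConflictError`` and this toy ``build`` are hypothetical names, and the merge order (defaults, then stored values, then provided values) is an assumption made for the example.

```python
class ConflictError(Exception):
    """Hypothetical error a storage raises when a save collides."""


class InMemoryStorage:
    """Toy stand-in for a storage backend; not Orion's storage API."""

    def __init__(self):
        self._experiments = {}

    def fetch(self, name):
        # Return the stored configuration, or None if the experiment is unknown.
        return self._experiments.get(name)

    def create(self, name, config):
        # The storage is the one that detects and reports the conflict.
        if name in self._experiments:
            raise ConflictError(name)
        stored = dict(config, id=len(self._experiments) + 1)
        self._experiments[name] = stored
        return stored


def build(storage, name, config, defaults=None):
    """Look up, merge, and save if new; retry the whole operation on conflict."""
    stored = storage.fetch(name) or {}
    # Assumed precedence: defaults < stored configuration < provided configuration.
    merged = {**(defaults or {}), **stored, **config}
    if "id" in merged:
        # Already known to the storage, nothing to save.
        return merged
    try:
        return storage.create(name, merged)
    except ConflictError:
        # Another instance saved first: start over so its version is loaded.
        return build(storage, name, config, defaults)
```

Retrying the whole operation, rather than only the save, means the second pass sees the configuration that the competing instance stored and merges against it.
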
@@ -60,7 +71,7 @@ This will first check if any trials are available in the storage using
:py:meth:`orion.core.worker.experiment.Experiment.reserve_trial`.

If none are available, it will produce new trials using
-:py:meth:`orion.core.worker.producer.Producer.produce()` which loads
+:py:meth:`orion.core.worker.producer.Producer.produce` which loads
the state of the algorithm from the storage, runs it to suggest new
:py:class:`orion.core.worker.trial.Trial` and saves both the new
trials and the new algorithm state to the storage. This is protected
@@ -95,3 +106,31 @@ the count of broken trials if they did not finish successfully.

Finally we monitor the total amount of time spent waiting for trials
to finish.
+
+
+Stopping criteria
+~~~~~~~~~~~~~~~~~
+
+Several criteria are monitored to decide when to stop the
+experiment.
+
+The first, most obvious one is the configured maximum number of trials
+to run. Once it is reached, no more trials are started. This is checked
+at the beginning of the loop with
+:py:attr:`orion.client.runner.Runner.is_running`.
+
+The experiment can also stop if too many trials fail, either because
+they fail to start, crash, are killed (for example by an external job
+scheduler) or take too much time to complete. This is checked in
+:py:meth:`orion.client.runner.Runner.gather` with
+:py:attr:`orion.client.runner.Runner.is_broken`.
+
+If one of the workers returns an unexpected result, the experiment
+also stops immediately because it is assumed that something is wrong
+with either the code or the configuration, and spending more time
+computing will not fix it. This is also checked for in
+:py:meth:`orion.client.runner.Runner.gather`.
+
+Finally, if the loop spends too much time waiting and nothing happens,
+the experiment is considered stalled and will also stop. This is
+checked at the end of :py:meth:`orion.client.runner.Runner.run`.
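
As a rough illustration of how the four stopping criteria added above relate to one another, here is a minimal sketch. ``ToyRunner`` is hypothetical and is not the real ``orion.client.runner.Runner``: only the names ``is_running`` and ``is_broken`` come from the text, while the fields, the ``is_stalled`` property and the exact thresholds (``>=`` for broken trials, ``>`` for idle time) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ToyRunner:
    """Hypothetical model of the stopping criteria; not Orion's Runner."""

    max_trials: int            # configured maximum number of trials
    max_broken: int            # assumed limit on failed trials
    idle_timeout: float        # assumed limit on time spent waiting (seconds)
    completed: int = 0
    broken: int = 0
    idle_time: float = 0.0
    unexpected_result: bool = False

    @property
    def is_running(self):
        # Checked at the beginning of the loop.
        return self.completed < self.max_trials

    @property
    def is_broken(self):
        # Checked while gathering results.
        return self.broken >= self.max_broken

    @property
    def is_stalled(self):
        # Checked at the end of the loop.
        return self.idle_time > self.idle_timeout

    def stop_reason(self):
        """Return why the experiment should stop, or None to keep going."""
        if not self.is_running:
            return "max_trials"
        if self.is_broken:
            return "broken"
        if self.unexpected_result:
            return "unexpected_result"
        if self.is_stalled:
            return "stalled"
        return None
```

For instance, ``ToyRunner(max_trials=10, max_broken=3, idle_timeout=60.0, broken=3).stop_reason()`` reports the broken criterion, whereas a freshly created runner reports no reason to stop.
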
