diff --git a/docs/index.rst b/docs/index.rst index f9b5f54f3..4aef38f33 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -3,11 +3,11 @@ You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. -Welcome to ALF's documentation! -=============================== +ALF documentation +================= .. toctree:: - :maxdepth: 2 + :maxdepth: 1 overview tutorial diff --git a/docs/tutorial.rst b/docs/tutorial.rst index 855a83762..c8eb3d96c 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -4,7 +4,7 @@ Tutorial ALF is designed with **modularization** in mind. Unlike most RL libraries or frameworks which implement different algorithms by repeating the almost entire RL pipeline in separate source code files with little code reuse, ALF categorizes RL algorithms -and distills the common structure and logic within each caterogy, so +and distills the common structure and logic within each category, so that each algorithm only needs to implement or override its own exclusive logic. Usually to create an ALF job, a user is expected to: @@ -32,7 +32,7 @@ provides at least two benefits: ensures the remaining part of the pipeline is unaffected. 2. *Reusing ALF's carefully designed training pipeline which contains a ton - of crtical details and tricks that help an algorithm's training.* For example, + of critical details and tricks that help an algorithm's training.* For example, * Careful handling of environment step types and their discounts, * Temporally independent training of a rollout trajectory if no episodic memory @@ -42,23 +42,31 @@ provides at least two benefits: * Automatically applying various input data transformers during rollout and training, * Specifying different optimizers for different sub-algorithms, - * Exploiting a variety of tensorboard summary utils, + * Exploiting a variety of Tensorboard summary utils, * and many more... Below are a series of examples for writing training files using ALF, -from simple to advanced usage. Each section is a detailed, step-by-step guide -walking through key ALF cencepts. All the tutorial code files can +from simple to advanced usage. Each chapter is a detailed, step-by-step guide +walking through key ALF concepts. All the tutorial code files can be found under ``/alf/examples/tutorial``. +.. note:: + + This tutorial won't cover the technical details of different algorithms and + models, as we assume the user learns them from other resources, e.g., the + original papers. We only focus on how to use ALF as a tool to write them. + .. - The following section schedule might evolve as the tutorial proceeds + The following chapter schedule might evolve as the tutorial proceeds .. toctree:: :maxdepth: 3 tutorial/a_minimal_working_example tutorial/understanding_ALF_via_the_minimal_working_example - tutorial/configuring_existing_algorithms - tutorial/customize_environment_and_wrappers + tutorial/algorithm_interfaces + tutorial/summary_metrics_and_tensorboard + tutorial/customize_environments_and_wrappers tutorial/customize_algorithms + tutorial/customize_training_pipeline tutorial/advanced_play_and_alf_snapshot \ No newline at end of file diff --git a/docs/tutorial/a_minimal_working_example.rst b/docs/tutorial/a_minimal_working_example.rst index 4062005a5..1084f3b47 100644 --- a/docs/tutorial/a_minimal_working_example.rst +++ b/docs/tutorial/a_minimal_working_example.rst @@ -2,19 +2,16 @@ A minimal working example ========================= We start with a minimal working example of ALF. 
The example, as a pure ALF -configuration file, is located at ``/alf/examples/tutorial/minimal_example_conf.py``, +configuration file, is :mod:`alf.examples.tutorial.minimal_example_conf`, and consists of only 8 lines. -Train and play --------------- - -Let's ignore its content for a moment (see the next section +Let's ignore its content for a moment (see the next chapter :doc:`./understanding_ALF_via_the_minimal_working_example` for an explanation of the configuration content), and just focus on how to launch the training, interpret the output training messages, and evaluate a trained model. Train from scratch -^^^^^^^^^^^^^^^^^^ +------------------ We can train from scratch by @@ -31,7 +28,7 @@ assuming ``/tmp/alf_tutorial1`` doesn't exist or is empty. output log, etc) are stored. The training will finish in several seconds, but with some informative messages -shown in the terminal. First of all, you should see a message from ``checkpoint_utils.py`` +shown in the terminal. First of all, you should see a message from :mod:`.checkpoint_utils` like :: @@ -40,7 +37,7 @@ like from scratch which basically confirms that the training is from scratch and all algorithm parameters -and states are randomly initialized. Also ``policy_trainer.py`` will output +and states are randomly initialized. Also :mod:`.policy_trainer` will output message lines like :: @@ -63,7 +60,7 @@ as the training finishes. Here we have the checkpoint numbered by the training iteration, which is '1' because only one iteration is performed by this example. Train from a checkpoint -^^^^^^^^^^^^^^^^^^^^^^^ +----------------------- By launching the same command again, this time the checkpoint messages are different. First it should say @@ -75,7 +72,7 @@ First it should say which means the training is no longer from scratch, but instead reads the saved checkpoint from the last run. By default ALF reads the most recent checkpoint in a training root dir if multiple checkpoints exist. Also at the end of training, -``checkpoint_utils.py`` outputs: +:mod:`.checkpoint_utils` outputs: :: @@ -93,16 +90,17 @@ While the training is ongoing, we can monitor the real-time progress by tensorboard --logdir /tmp/alf_tutorial1 -We leave the interpretation of various Tensorboard statistics to later sections. +We leave the interpretation of various Tensorboard statistics to a later chapter +:doc:`./summary_metrics_and_tensorboard`. Play from a checkpoint -^^^^^^^^^^^^^^^^^^^^^^ +---------------------- ALF defines the term *play* as evaluating a model on a task and possibly also visualizing the evaluation process, for example, by rendering environment frames or various inference statistics. -Here we only introduce three basic usages of the ALF ``play`` module. For advanced +Here we only introduce three basic usages of the ALF :mod:`.play` module. For advanced play (e.g., rendering customized model inference results, play from an ALF snapshot, headless rendering, etc), we refer the reader to :doc:`./advanced_play_and_alf_snapshot`. @@ -125,7 +123,7 @@ Or you can save the rendered result to a ``mp4`` video file: python -m alf.bin.play --root_dir /tmp/alf_tutorial1 --record_file /tmp/alf_tutorial1.mp4 -We recommend the reader to read the various commandline flags in ``/alf/bin/play.py``, +We recommend the reader to read the various commandline flags in :mod:`.play`, for specifying different options such as checkpoint number and number of episodes to evaluate. 
@@ -133,7 +131,7 @@ Summary ------- So far, we've talked about how to train a conf file and play the trained model, -with very basic options of ``train.py`` and ``play.py``. This covers a usual +with very basic options of :mod:`.train` and :mod:`.play.py`. This covers a usual command-line usage of ALF. We really haven't explained the content of the -example and the ALF pipeline yet. In the next section, we will try to get a +example and the ALF pipeline yet. In the next chapter, we will try to get a rough picture of ALF through the lens of this minimal working example. \ No newline at end of file diff --git a/docs/tutorial/algorithm_interfaces.rst b/docs/tutorial/algorithm_interfaces.rst new file mode 100644 index 000000000..4f302dd92 --- /dev/null +++ b/docs/tutorial/algorithm_interfaces.rst @@ -0,0 +1,2 @@ +Algorithm interfaces +==================== \ No newline at end of file diff --git a/docs/tutorial/customize_algorithms.rst b/docs/tutorial/customize_algorithms.rst new file mode 100644 index 000000000..5410f27d1 --- /dev/null +++ b/docs/tutorial/customize_algorithms.rst @@ -0,0 +1,2 @@ +Customize algorithms +==================== \ No newline at end of file diff --git a/docs/tutorial/customize_environments_and_wrappers.rst b/docs/tutorial/customize_environments_and_wrappers.rst new file mode 100644 index 000000000..92950df03 --- /dev/null +++ b/docs/tutorial/customize_environments_and_wrappers.rst @@ -0,0 +1,2 @@ +Customize environments and wrappers +=================================== \ No newline at end of file diff --git a/docs/tutorial/customize_training_pipeline.rst b/docs/tutorial/customize_training_pipeline.rst new file mode 100644 index 000000000..3f15413f6 --- /dev/null +++ b/docs/tutorial/customize_training_pipeline.rst @@ -0,0 +1,2 @@ +Customize a training pipeline +============================= \ No newline at end of file diff --git a/docs/tutorial/images/alf_diagram.png b/docs/tutorial/images/alf_diagram.png new file mode 100644 index 000000000..9792a0a3f Binary files /dev/null and b/docs/tutorial/images/alf_diagram.png differ diff --git a/docs/tutorial/images/pipeline.png b/docs/tutorial/images/pipeline.png new file mode 100644 index 000000000..efe38d8fc Binary files /dev/null and b/docs/tutorial/images/pipeline.png differ diff --git a/docs/tutorial/summary_metrics_and_tensorboard.rst b/docs/tutorial/summary_metrics_and_tensorboard.rst new file mode 100644 index 000000000..e0f53ffa5 --- /dev/null +++ b/docs/tutorial/summary_metrics_and_tensorboard.rst @@ -0,0 +1,2 @@ +Summary, metrics, and Tensorboard +================================= \ No newline at end of file diff --git a/docs/tutorial/understanding_ALF_via_the_minimal_working_example.rst b/docs/tutorial/understanding_ALF_via_the_minimal_working_example.rst index f09f7a6c8..20a18f975 100644 --- a/docs/tutorial/understanding_ALF_via_the_minimal_working_example.rst +++ b/docs/tutorial/understanding_ALF_via_the_minimal_working_example.rst @@ -1,2 +1,431 @@ Understanding ALF via the minimal working example -================================================= \ No newline at end of file +================================================= + +In the previous chapter :doc:`./a_minimal_working_example`, we talked about how +to train a minimal conf file and play the trained model. The 8-line code in it +is just the tip of the iceberg: a very sophisticated RL training pipeline runs +under the surface. In this chapter, based on that minimal example, we will +go through some major concepts of ALF and grasp the big picture. 
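+
+For reference, the example conf is essentially the following (a sketch pieced
+together from the snippets discussed below; see
+:mod:`alf.examples.tutorial.minimal_example_conf` for the exact 8 lines):
+
+.. code-block:: python
+
+    from functools import partial
+
+    import alf
+    from alf.algorithms.actor_critic_algorithm import ActorCriticAlgorithm
+
+    alf.config('TrainerConfig',
+               algorithm_ctor=partial(
+                   ActorCriticAlgorithm, optimizer=alf.optimizers.Adam(lr=1e-3)),
+               num_iterations=1)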
+
+ALF design overview
+-------------------
+
+As a good analogy, you can think of ALF as a very flexible
+*circuit board*. A circuit board electrically connects electronic components using
+conductive tracks, and the various components on the board altogether fulfill some
+kind of complex function. Similarly, ALF is a framework that logically connects
+algorithmic components and fulfills RL training tasks.
+
+.. image:: images/alf_diagram.png
+    :width: 500
+    :align: center
+    :alt: ALF as a circuit board
+
+However, unlike a circuit board, ALF's components are highly customizable. In the
+figure above, the temperature color (redness) represents how frequently each
+component is expected to be modified by a user in typical ALF use cases.
+
+.. note::
+    For example:
+
+    1. If training an existing algorithm (e.g., SAC) on existing environments/tasks
+       (e.g., ``Hopper-v3``), only "data transformer" and "models&networks" need to
+       be modified.
+    2. If a new algorithm is tried on existing environments/tasks, the
+       "algorithm hierarchy" additionally needs to be modified.
+    3. In rare cases, the "trainer" needs to be modified, if the desired pipeline is
+       vastly different from what ALF provides by default.
+
+Importantly, whichever component is changed, ALF's design makes sure that the
+pipeline still runs with the other components unchanged.
+
+The customization is also very easy to implement. The user only needs to write a
+configuration file in Python that specifies which components need what changes. Being a
+typical Python file, this conf file also supports writing new code for any ALF
+component (e.g., defining new algorithms, models, environments, etc.).
+
+What the minimal example does
+-----------------------------
+
+Let's go back to our first example, which trains an existing algorithm
+:class:`.ActorCriticAlgorithm` on ``CartPole-v0``. To achieve this, normally we would
+need to configure the environment name by
+
+.. code-block:: python
+
+    alf.config("create_environment",
+               env_name="CartPole-v0",
+               num_parallel_environments=30)
+
+which tells ALF to use ``CartPole-v0`` and create 30 environments in parallel for
+rollout data collection. The :func:`.create_environment` function is defined as:
+
+.. code-block:: python
+
+    @alf.configurable
+    def create_environment(env_name='CartPole-v0',
+                           env_load_fn=suite_gym.load,
+                           num_parallel_environments=30,
+                           nonparallel=False,
+                           seed=None,
+                           batched_wrappers=())
+
+We can see that because the default values of ``env_name`` and ``num_parallel_environments``
+are already what we want, the example conf skips configuring them. However, it is
+recommended to always specify them explicitly in a conf for readability.
+On the other hand, ``env_load_fn`` is the function that loads ``env_name``.
+Usually :func:`.suite_gym.load` can load most built-in Gym environments. For
+extra Gym environments or user-customized environments, this argument value
+should be set accordingly. For instance, see :func:`.suite_mario.load` and
+:func:`.suite_simple.load`.
+
+.. note::
+    **ALF configuration** is one of the secret sauces that make ALF flexible.
+
+    For any function decorated by :func:`~alf.config_util.configurable`, we can
+    configure its argument values **before** that function is actually evaluated.
+    If configured, the default value will be overwritten by the configured value.
+    :func:`~alf.config_util.config` can be called multiple times on the same
+    function.
+
+.. note::
+    When ``nonparallel=False``, ALF always creates a **batched environment**. This
+    env accepts batched actions and returns batched observations/rewards/info.
+    The first dim of these tensors is the batch size, equal to ``num_parallel_environments``.
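+
+To make the configuration mechanism concrete, here is a minimal sketch using a
+hypothetical function ``greet`` (not part of ALF); only ``alf.configurable`` and
+``alf.config`` are assumed from ALF itself:
+
+.. code-block:: python
+
+    import alf
+
+    @alf.configurable
+    def greet(name="world"):
+        return "Hello, " + name
+
+    # Configure the argument *before* ``greet`` is actually called;
+    # the configured value overrides the default.
+    alf.config("greet", name="ALF")
+
+    print(greet())  # -> "Hello, ALF"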
+
+The example conf file configures the algorithm, number of training iterations,
+and the optimizer by
+
+.. code-block:: python
+
+    alf.config('TrainerConfig',
+               algorithm_ctor=partial(
+                   ActorCriticAlgorithm, optimizer=alf.optimizers.Adam(lr=1e-3)),
+               num_iterations=1)
+
+The algorithm and training iterations are configured through a global object
+:class:`.TrainerConfig`, which is supposed to be passed from the trainer to algorithms.
+One important hyperparameter that's skipped in the conf file is ``unroll_length``.
+We simply use its default value, which is equivalent to doing
+
+.. code-block:: python
+
+    alf.config("TrainerConfig", unroll_length=8)
+
+This specifies how many rollout steps are performed in *each* environment before
+updating parameters (in total :math:`30\times 8=240` steps).
+
+The algorithm itself is configurable, too. Because ALF allows defining a hierarchy
+of algorithms (e.g., an RL algorithm with an auxiliary self-supervised learning
+child algorithm), and each algorithm can have a different optimizer, the optimizer
+configuration is always done through the algorithm interface. Here we use Adam with a
+learning rate of :math:`10^{-3}`.
+
+.. note::
+    :class:`.TrainerConfig` is a very important concept in ALF. It allows customizing
+    many crucial parameters of the training pipeline, for example, random seed, number
+    of checkpoints, summary interval, rollout length, etc. We highly recommend
+    reading the API doc of this class.
+
+Everything can be configured!
+-----------------------------
+
+If you look at the algorithm class definition,
+
+.. code-block:: python
+
+    @alf.configurable
+    class ActorCriticAlgorithm(OnPolicyAlgorithm):
+        """Actor critic algorithm."""
+
+        def __init__(self,
+                     observation_spec,
+                     action_spec,
+                     reward_spec=TensorSpec(()),
+                     actor_network_ctor=ActorDistributionNetwork,
+                     value_network_ctor=ValueNetwork,
+                     epsilon_greedy=None,
+                     env=None,
+                     config: TrainerConfig = None,
+                     loss=None,
+                     loss_class=ActorCriticLoss,
+                     optimizer=None,
+                     debug_summaries=False,
+                     name="ActorCriticAlgorithm")
+
+its arguments are also configurable. Notably, ``actor_network_ctor`` and
+``value_network_ctor`` allow configuring the actor and value networks, respectively.
+By default :class:`.ActorDistributionNetwork` is used. This class can potentially be
+replaced by a user's custom actor network class. By further looking into
+
+.. code-block:: python
+
+    @alf.configurable
+    class ActorDistributionNetwork(Network):
+        """Network which outputs temporally uncorrelated action distributions."""
+
+        def __init__(self,
+                     input_tensor_spec,
+                     action_spec,
+                     input_preprocessors=None,
+                     preprocessing_combiner=None,
+                     conv_layer_params=None,
+                     fc_layer_params=None,
+                     activation=torch.relu_,
+                     kernel_initializer=None,
+                     use_fc_bn=False,
+                     discrete_projection_net_ctor=CategoricalProjectionNetwork,
+                     continuous_projection_net_ctor=NormalProjectionNetwork,
+                     name="ActorDistributionNetwork"):
+
+you'll realize that the actor network is also configurable, including its layers,
+input preprocessors, kernel initializer, projection network, etc. If we keep
+going deeper, the projection network can also be configured (assuming we have
+continuous actions):
+
+.. code-block:: python
+
+    @alf.configurable
+    class NormalProjectionNetwork(Network):
+        def __init__(self,
+                     input_size,
+                     action_spec,
+                     activation=math_ops.identity,
+                     projection_output_init_gain=0.3,
+                     std_bias_initializer_value=0.0,
+                     squash_mean=True,
+                     state_dependent_std=False,
+                     std_transform=nn.functional.softplus,
+                     scale_distribution=False,
+                     dist_squashing_transform=dist_utils.StableTanh(),
+                     name="NormalProjectionNetwork"):
+
+In the above example conf, we didn't bother configuring all these one by one. The
+default argument values were used.
+
+``alf.config`` vs. ``partial``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+One good thing about ALF configuration is that you can easily configure something
+that is deep in the calling tree with one line, e.g.,
+
+.. code-block:: python
+
+    alf.config("alf.networks.projection_networks.NormalProjectionNetwork",
+               activation=torch.tanh)
+
+.. note::
+    In fact, you can also specify a shorter name for the class/function to be
+    configured, as long as the specified name is an unambiguous suffix of a
+    complete path under ALF. For example,
+    ``alf.config("NormalProjectionNetwork", activation=torch.tanh)`` will also
+    work.
+
+Compared to passing a huge config dictionary from the main function to other places
+in the code, this makes the code less cluttered. However, one side effect is that
+the configuration takes place *globally*. That is, if there are multiple places
+that create :class:`.NormalProjectionNetwork`, they will share the same configured
+values.
+
+There are two ways of overwriting the globally configured values. One is to
+manually overwrite argument values where the configured values are not
+needed, e.g.,
+
+.. code-block:: python
+
+    # the hard-coded ``torch.relu`` will shadow the configured ``torch.tanh``
+    proj_net = NormalProjectionNetwork(activation=torch.relu, ...)
+
+In this case, the configuration ``activation=torch.tanh`` becomes *inoperative*.
+
+.. note::
+    **Inoperative vs operative**
+
+    There are two types of configured values. An *operative* config value is one
+    that is eventually used when calling a class or function. This includes
+    default config values (not necessarily provided by a user's conf file).
+    In contrast, an *inoperative* config value is one that is overwritten by another
+    value, e.g., by a hard-coded value in the code.
+
+    This distinction between the two config types is useful for debugging,
+    because it helps avoid the case where a user thinks a provided config should
+    take effect but in fact it's shadowed. You can find this information in the
+    "TEXT/config" tab in Tensorboard. For details, see the next chapter
+    :doc:`./summary_metrics_and_tensorboard`.
+
+The other way is to use `partial <https://docs.python.org/3/library/functools.html#functools.partial>`_,
+which is a Python built-in helper function from the ``functools`` package:
+
+::
+
+    The partial() is used for partial function application which "freezes"
+    some portion of a function's arguments and/or keywords resulting in a new
+    object with a simplified signature.
+
+In short, ``partial`` creates a `closure <https://en.wikipedia.org/wiki/Closure_(computer_programming)>`_
+(a local named scope) that partially binds some arguments with the provided values.
+So to achieve the same purpose, alternatively we could do
+
+.. code-block:: python
+
+    alf.config('ActorDistributionNetwork',
+               continuous_projection_net_ctor=partial(
+                   NormalProjectionNetwork,
+                   activation=torch.tanh))
+
+This avoids globally changing the activation function of :class:`.NormalProjectionNetwork`.
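+
+As a side note, here is a plain-Python sketch of what ``partial`` does, independent
+of ALF (the function ``linear`` is made up purely for illustration):
+
+.. code-block:: python
+
+    from functools import partial
+
+    def linear(x, w=1.0, b=0.0):
+        return w * x + b
+
+    # ``partial`` freezes ``w``; the returned callable only needs the
+    # remaining arguments.
+    scaled = partial(linear, w=3.0)
+
+    scaled(2.0)         # 6.0
+    scaled(2.0, b=1.0)  # 7.0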
+
+Moreover, to avoid globally changing anything about the algorithm, the entire
+calling path can use ``partial`` in a nested way:
+
+.. code-block:: python
+
+    alf.config('TrainerConfig',
+               algorithm_ctor=partial(
+                   ActorCriticAlgorithm,
+                   optimizer=alf.optimizers.Adam(lr=1e-3),
+                   actor_network_ctor=partial(
+                       ActorDistributionNetwork,
+                       continuous_projection_net_ctor=partial(
+                           NormalProjectionNetwork,
+                           activation=torch.tanh))))
+
+Of course, with ``partial``, you can also assign a partially evaluated class/function
+to a variable and pass this closure around (e.g., to other conf files):
+
+.. code-block:: python
+
+    algorithm_ctor = partial(ActorCriticAlgorithm, optimizer=alf.optimizers.Adam(lr=1e-3))
+    alf.config('TrainerConfig', algorithm_ctor=algorithm_ctor)
+
+.. note::
+    We recommend always using ``partial`` whenever possible in order to
+    avoid global side effects. However, if you are sure that only one object instance
+    is going to be created or that no harmful side effect will take place (e.g.,
+    :class:`.TrainerConfig` and :func:`.create_environment`), then
+    :func:`~alf.config_util.config` will be more convenient.
+
+The big picture of ALF
+----------------------
+
+From the previous section, you've probably already got a good idea of how easily
+each component of ALF can be customized, just like unplugging and plugging an
+electrical component on a circuit board. In fact, a conf file can do more than this,
+by defining completely new environments (:doc:`./customize_environments_and_wrappers`)
+and algorithms (:doc:`./customize_algorithms`).
+
+Once a conf file is provided to the ALF trainer, the RL pipeline runs according to
+the configuration. In general, there are two types of pipelines: on-policy and
+off-policy, corresponding to on-policy algorithms (e.g., :class:`.ActorCriticAlgorithm`)
+and off-policy algorithms (e.g., :class:`.SacAlgorithm`).
+
+Either pipeline type follows a simple alternation between "unroll"
+(online data collection) and "update" (parameter updates).
+
+.. image:: images/pipeline.png
+    :width: 800
+    :align: center
+    :alt: ALF pipeline
+
+1. "unroll": in this phase, a behavior policy generates a batch of actions, each
+   output to one of the parallel environments, to collect a batch of experience
+   data per time step. The policy rolls out multiple time steps for data collection
+   before transitioning to "update". For on-policy algorithms, an inference
+   computational graph with grads is preserved and passed to "update". For
+   off-policy algorithms, no computational graph is preserved and the data directly
+   go to a replay buffer.
+2. "update": a loop of parameter updates is performed. On-policy algorithms compute
+   losses on all samples in the temporary buffer, while off-policy algorithms compute
+   losses on mini-batch samples from a replay buffer. *The loop length is forced to
+   be 1 for on-policy algorithms.*
+
+.. note::
+    The concept of "episode" is orthogonal to the pipeline. A training iteration
+    might divide an episode into multiple segments. In other words, a parameter
+    update could happen before a complete episode finishes.
+
+A conf file usually
+
+1. tweaks the schedule of a pipeline by changing the "unroll" interval (:attr:`.TrainerConfig.unroll_length`),
+   the "update" loop (:attr:`.TrainerConfig.num_updates_per_train_iter`), the mini-batch
+   shape (:attr:`.TrainerConfig.mini_batch_size` x :attr:`.TrainerConfig.mini_batch_length`),
+   etc. (see the sketch right after this list);
+2. specifies how on-policy/off-policy losses are computed, for example, which
+   algorithms use which networks to compute which losses, as demonstrated in the
+   previous section.
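+
+For instance, a conf for a hypothetical off-policy algorithm might schedule the
+pipeline as follows (a sketch only; the values are arbitrary, and the attribute
+names are the :class:`.TrainerConfig` attributes referenced in item 1 above):
+
+.. code-block:: python
+
+    alf.config("TrainerConfig",
+               unroll_length=8,               # "unroll": rollout steps per env per iteration
+               num_updates_per_train_iter=4,  # "update": loop length (forced to 1 for on-policy)
+               mini_batch_size=256,           # mini-batch shape for off-policy updates
+               mini_batch_length=2)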
+
+Although rarely needed, a user can also customize a new training pipeline. We will
+talk about this in :doc:`./customize_training_pipeline`.
+
+Which pipeline is used will be automatically determined based on the root algorithm
+configured in :class:`.TrainerConfig`. The above example conf tells ALF to use the
+on-policy training pipeline because an on-policy algorithm :class:`.ActorCriticAlgorithm`
+is configured. In this very simple example, after 30 environments unroll 8 steps,
+the trainer updates the model parameters once and the training finishes.
+
+ALF is flexible
+---------------
+
+Now let's try another arbitrary environment which has continuous actions.
+To do so, we just append
+
+.. code-block:: python
+
+    alf.config("create_environment", env_name="LunarLanderContinuous-v2")
+
+to the example conf file, to replace the default ``CartPole-v0`` environment
+with ``LunarLanderContinuous-v2``. The conf file can still be trained successfully.
+In this training, the :class:`.ActorCriticAlgorithm` algorithm is again used, but on
+continuous actions. It turns out that ALF can automatically adapt to different
+action types without the user telling it what to do!
+
+As another example, we replace the algorithm with PPO by appending:
+
+.. code-block:: python
+
+    from alf.algorithms.ppo_algorithm import PPOAlgorithm
+    alf.config("TrainerConfig",
+               algorithm_ctor=partial(
+                   PPOAlgorithm, optimizer=alf.optimizers.Adam(lr=1e-3)))
+
+The conf file still works without any problem.
+
+ALF's flexibility goes beyond this. In fact, ALF can adapt to different observations
+(e.g., image vs. vector), rewards (e.g., scalar vs. vector), and actions (e.g.,
+discrete vs. continuous). The reason is that ALF hard-codes very few things, and
+it always assumes the most general scenario when handling observations, rewards,
+and actions. The secret weapon for supporting this flexibility is :class:`.TensorSpec`.
+A :class:`.TensorSpec` allows an API to describe the tensors that it accepts or returns,
+before those tensors exist. This allows dynamic and flexible graph construction and
+configuration.
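+
+As a rough illustration of this idea (a sketch only; the exact constructor
+arguments of ``BoundedTensorSpec`` are assumed here and may differ slightly from
+ALF's actual API):
+
+.. code-block:: python
+
+    from alf.tensor_specs import BoundedTensorSpec, TensorSpec
+
+    # Describe an 84x84 RGB observation and a 2-dim continuous action,
+    # before any actual tensor exists.
+    observation_spec = TensorSpec((3, 84, 84))
+    action_spec = BoundedTensorSpec((2, ), minimum=-1.0, maximum=1.0)
+
+    # Components exchange such specs so that networks can be built with
+    # matching input/output shapes before seeing any data.
+    print(observation_spec.shape, action_spec.shape)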
+
+In summary, different components in an ALF pipeline are connected by using
+:class:`.TensorSpec` to specify their I/O specs. This also happens within a component,
+for example, between child algorithms, between networks, etc.
+
+More than training pipelines
+----------------------------
+
+Another major effort of ALF is providing an extensive set of high-quality tools
+for RL research, including various algorithms, networks, layers, and environments:
+
+.. code-block:: bash
+
+    alf/algorithms/
+    alf/networks/
+    alf/layers.py
+    alf/environments/
+
+A user can easily experiment with them via the conf file. In the example conf,
+:class:`.ActorCriticAlgorithm`, :class:`.ActorDistributionNetwork`, and
+:class:`.NormalProjectionNetwork` are representative examples.
+
+Summary
+-------
+
+In this chapter we've talked about ALF configuration and the training pipeline,
+based on the minimal example. We've shown that ALF is essentially a pipeline that
+connects different components which can be customized by a conf file. Moreover, ALF
+provides various tools for doing RL research.
+
+It might still be unclear to a user what role an algorithm plays in the training
+pipeline. In the next chapter :doc:`./algorithm_interfaces`, we will explain
+the most important common algorithm interfaces to fill this gap.