Add portfolio optimization environment, architectures and algorithm (#1146)

* Add portfolio optimization env
* Refactor POE
* Add algorithms for portfolio optimization
* Add parameters to EIIE
* Update portfolio optimization example
* Add readme to portfolio optimization agents
* Update readme
* Update portfolio optimization readme
* Format code
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add commentary to portfolio optimization example

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent a2863a9 · commit b800999

Showing 10 changed files with 3,927 additions and 0 deletions.
examples/FinRL_PortfolioOptimizationEnv_Demo.ipynb: 2,465 additions & 0 deletions (large diffs are not rendered by default).
@@ -0,0 +1,87 @@
# Portfolio Optimization Agents

This directory contains architectures and algorithms commonly used in portfolio optimization agents.

To instantiate the model, it's necessary to have an instance of [PortfolioOptimizationEnv](/finrl/meta/env_portfolio_optimization/). In the example below, we use the `DRLAgent` class to instantiate a policy gradient ("pg") model. With the dictionary `model_kwargs`, we can set the `PolicyGradient` class parameters, and with the dictionary `policy_kwargs`, we can change the parameters of the chosen architecture.

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE

# set PolicyGradient algorithm arguments
model_kwargs = {
    "lr": 0.01,
    "policy": EIIE,
}

# set EIIE architecture arguments
policy_kwargs = {
    "k_size": 4
}

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```

In the example below, the model is trained for 5 episodes (we define an episode as one complete run through the period covered by the environment).

```python
DRLAgent.train_model(model, episodes=5)
```
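
After training, you can also evaluate the model on held-out data using its `test` method (implemented in the `PolicyGradient` class), which keeps updating the policy through online learning. This is a minimal sketch, assuming `model` is the policy gradient model created above and `test_env` is a hypothetical environment built from a test-period dataframe:

```python
test_env = ...  # hypothetical PortfolioOptimizationEnv built from held-out data

# evaluate the policy on the test environment, updating it online
# every 10 timesteps (the default online_training_period)
model.test(test_env, online_training_period=10)
```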

It's important that the architecture and the environment have the same `time_window` defined. By default, both of them use 50 timesteps as `time_window`. For more details about what a time window is, check this [article](https://doi.org/10.5753/bwaif.2023.231144).
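
The sketch below shows one way to keep the two consistent. It is only an illustration: the environment's module path is assumed, `df_train` and the constructor values are hypothetical, and it assumes both `PortfolioOptimizationEnv` and `EIIE` accept a `time_window` argument:

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE
from finrl.meta.env_portfolio_optimization.env_portfolio_optimization import (
    PortfolioOptimizationEnv,
)

TIME_WINDOW = 50  # must be the same for the environment and the architecture

df_train = ...  # hypothetical preprocessed dataframe with date, tic and price features

train_env = PortfolioOptimizationEnv(
    df_train,
    initial_amount=100_000,
    time_window=TIME_WINDOW,
)

model_kwargs = {"lr": 0.01, "policy": EIIE}
policy_kwargs = {"time_window": TIME_WINDOW}  # match the environment's time window

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```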

### Policy Gradient Algorithm

The `PolicyGradient` class implements the policy gradient algorithm used in the *Jiang et al* paper. This algorithm is inspired by DDPG (deep deterministic policy gradient), but there are a couple of differences:
- DDPG is an actor-critic algorithm, so it has an actor and a critic neural network. The algorithm below, however, has no critic neural network and uses the portfolio value as the value function: the policy is updated to maximize the portfolio value.
- DDPG usually adds noise to the action during training to create exploratory behavior. The PG algorithm, on the other hand, takes a full-exploit approach.
- DDPG randomly samples experiences from its replay buffer. The implemented policy gradient, however, samples a sequential batch of experiences in time, so that the variation of the portfolio value over the batch can be calculated and used as the value function (see the sketch after this list).
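
To make the last point concrete, here is a minimal illustrative sketch of sequential (time-ordered) sampling; it is not the `ReplayBuffer` class used by the implementation, just a toy version of the idea:

```python
from collections import deque


class SequentialReplayBuffer:
    """Toy buffer: experiences are kept in insertion (time) order and
    returned as a single chronological batch, never randomly sampled."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self):
        # the whole batch, in chronological order
        return list(self.buffer)

    def __len__(self):
        return len(self.buffer)
```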

The algorithm was implemented as follows:
1. Initialize the policy network and the replay buffer;
2. For each episode, do the following:
    1. For each period of `batch_size` timesteps, do the following:
        1. For each timestep, define an action to be performed, simulate the timestep and save the experience in the replay buffer;
        2. After `batch_size` timesteps are simulated, sample the replay buffer;
        3. Calculate the value function $V = \sum\limits_{t=1}^{batch\_size} \ln(\mu_{t}(W_{t} \cdot P_{t}))$, where $W_{t}$ is the action performed at timestep $t$, $P_{t}$ is the price variation vector at timestep $t$ and $\mu_{t}$ is the transaction remainder factor at timestep $t$. Check the *Jiang et al* paper for more details;
        4. Perform gradient ascent on the policy network (a sketch of this and the previous step appears after this list).
    2. If, at the end of the episode, there is a sequence of remaining experiences in the replay buffer, perform the sampling, value-function and gradient-ascent steps above with them.
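
As a rough sketch of how the value-function and gradient-ascent steps translate into code (mirroring the `_gradient_ascent` method in the `models.py` diff later in this commit), the value function is negated into a loss so that a standard optimizer step performs gradient ascent. The tensor names below are assumptions:

```python
import torch


def policy_loss(mu, price_variations, trf_mu):
    """Negative of V, so that minimizing this loss maximizes the portfolio value.

    mu: (batch_size, num_assets) portfolio weights W_t produced by the policy
    price_variations: (batch_size, num_assets) price variation vectors P_t
    trf_mu: (batch_size, 1) transaction remainder factors mu_t
    """
    # ln(mu_t * (W_t . P_t)) for each timestep, averaged over the batch and negated
    return -torch.mean(torch.log(torch.sum(mu * price_variations * trf_mu, dim=1)))
```

Note that the implementation averages over the batch instead of summing; this only rescales the gradient and does not change the direction of the update.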

### References

If you use any of these architectures or algorithms in your research, you can cite the following references.

#### EIIE Architecture and Policy Gradient Algorithm

[A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem](https://doi.org/10.48550/arXiv.1706.10059)
```
@misc{jiang2017deep,
      title={A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem},
      author={Zhengyao Jiang and Dixing Xu and Jinjun Liang},
      year={2017},
      eprint={1706.10059},
      archivePrefix={arXiv},
      primaryClass={q-fin.CP}
}
```

#### EI3 Architecture

[A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management](https://doi.org/10.1145/3357384.3357961)
```
@inproceedings{shi2018multiscale,
  author = {Shi, Si and Li, Jianjun and Li, Guohui and Pan, Peng},
  title = {A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management},
  year = {2019},
  isbn = {9781450369763},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3357384.3357961},
  doi = {10.1145/3357384.3357961},
  booktitle = {Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  pages = {1613–1622},
  numpages = {10},
  keywords = {portfolio management, reinforcement learning, inception network, convolution neural network},
  location = {Beijing, China},
  series = {CIKM '19}
}
```
Empty file.
@@ -0,0 +1,251 @@
from __future__ import annotations

import copy

import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm

from .architectures import EIIE
from .utils import PVM
from .utils import ReplayBuffer
from .utils import RLDataset


class PolicyGradient:
    """Class implementing a policy gradient algorithm to train portfolio
    optimization agents.

    Note:
        During testing, the agent is optimized through online learning.
        The parameters of the policy are updated repeatedly after a constant
        period of time. To disable this behavior, set the learning rate to 0.

    Attributes:
        train_env: Environment used to train the agent.
        train_policy: Policy used in training.
        test_env: Environment used to test the agent.
        test_policy: Policy after online learning during testing.
    """

    def __init__(
        self,
        env,
        policy=EIIE,
        policy_kwargs=None,
        validation_env=None,
        batch_size=100,
        lr=1e-3,
        optimizer=AdamW,
        device="cpu",
    ):
        """Initializes Policy Gradient for portfolio optimization.

        Args:
            env: Training environment.
            policy: Policy architecture to be used.
            policy_kwargs: Arguments to be used in the policy network.
            validation_env: Validation environment.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
            device: Device where the neural network is run.
        """
        self.policy = policy
        self.policy_kwargs = {} if policy_kwargs is None else policy_kwargs
        self.validation_env = validation_env
        self.batch_size = batch_size
        self.lr = lr
        self.optimizer = optimizer
        self.device = device
        self._setup_train(env, self.policy, self.batch_size, self.lr, self.optimizer)

    def _setup_train(self, env, policy, batch_size, lr, optimizer):
        """Initializes algorithm before training.

        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.train_env = env

        # neural networks
        self.train_policy = policy(**self.policy_kwargs).to(self.device)
        self.train_optimizer = optimizer(self.train_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.train_batch_size = batch_size
        self.train_buffer = ReplayBuffer(capacity=batch_size)
        self.train_pvm = PVM(self.train_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.train_buffer)
        self.train_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def train(self, episodes=100):
        """Training sequence.

        Args:
            episodes: Number of episodes to simulate.
        """
        for i in tqdm(range(1, episodes + 1)):
            obs = self.train_env.reset()  # observation
            self.train_pvm.reset()  # reset portfolio vector memory
            done = False

            while not done:
                # define last_action and action and update portfolio vector memory
                last_action = self.train_pvm.retrieve()
                obs_batch = np.expand_dims(obs, axis=0)
                last_action_batch = np.expand_dims(last_action, axis=0)
                action = self.train_policy(obs_batch, last_action_batch)
                self.train_pvm.add(action)

                # run simulation step
                next_obs, reward, done, info = self.train_env.step(action)

                # add experience to replay buffer
                exp = (obs, last_action, info["price_variation"], info["trf_mu"])
                self.train_buffer.append(exp)

                # update policy networks
                if len(self.train_buffer) == self.train_batch_size:
                    self._gradient_ascent()

                obs = next_obs

            # gradient ascent with episode remaining buffer data
            self._gradient_ascent()

            # validation step
            if self.validation_env:
                self.test(self.validation_env)

    def _setup_test(self, env, policy, batch_size, lr, optimizer):
        """Initializes algorithm before testing.

        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.test_env = env

        # process None arguments
        policy = self.train_policy if policy is None else policy
        lr = self.lr if lr is None else lr
        optimizer = self.optimizer if optimizer is None else optimizer

        # neural networks
        # define policy
        self.test_policy = copy.deepcopy(policy)
        self.test_optimizer = optimizer(self.test_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.test_buffer = ReplayBuffer(capacity=batch_size)
        self.test_pvm = PVM(self.test_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.test_buffer)
        self.test_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def test(
        self, env, policy=None, online_training_period=10, lr=None, optimizer=None
    ):
        """Tests the policy with online learning.

        Args:
            env: Environment to be used in testing.
            policy: Policy architecture to be used. If None, it will use the training
                architecture.
            online_training_period: Period in which online training will occur. It is
                also used as the replay buffer size. To disable online learning, use a
                very big value.
            lr: Policy neural network learning rate. If None, it will use the training
                learning rate.
            optimizer: Optimizer of the neural network. If None, it will use the
                training optimizer.

        Note:
            To disable online learning, set the learning rate to 0 or use a very big
            online training period.
        """
        self._setup_test(env, policy, online_training_period, lr, optimizer)

        obs = self.test_env.reset()  # observation
        self.test_pvm.reset()  # reset portfolio vector memory
        done = False
        steps = 0

        while not done:
            steps += 1
            # define last_action and action and update portfolio vector memory
            last_action = self.test_pvm.retrieve()
            obs_batch = np.expand_dims(obs, axis=0)
            last_action_batch = np.expand_dims(last_action, axis=0)
            action = self.test_policy(obs_batch, last_action_batch)
            self.test_pvm.add(action)

            # run simulation step
            next_obs, reward, done, info = self.test_env.step(action)

            # add experience to replay buffer
            exp = (obs, last_action, info["price_variation"], info["trf_mu"])
            self.test_buffer.append(exp)

            # update policy networks
            if steps % online_training_period == 0:
                self._gradient_ascent(test=True)

            obs = next_obs

    def _gradient_ascent(self, test=False):
        """Performs the gradient ascent step in the policy gradient algorithm.

        Args:
            test: If true, it uses the test dataloader and policy.
        """
        # get batch data from dataloader
        obs, last_actions, price_variations, trf_mu = (
            next(iter(self.test_dataloader))
            if test
            else next(iter(self.train_dataloader))
        )
        obs = obs.to(self.device)
        last_actions = last_actions.to(self.device)
        price_variations = price_variations.to(self.device)
        trf_mu = trf_mu.unsqueeze(1).to(self.device)

        # define policy loss (negative for gradient ascent)
        mu = (
            self.test_policy.mu(obs, last_actions)
            if test
            else self.train_policy.mu(obs, last_actions)
        )
        policy_loss = -torch.mean(
            torch.log(torch.sum(mu * price_variations * trf_mu, dim=1))
        )

        # update policy network
        if test:
            self.test_policy.zero_grad()
            policy_loss.backward()
            self.test_optimizer.step()
        else:
            self.train_policy.zero_grad()
            policy_loss.backward()
            self.train_optimizer.step()