
Commit

Add portfolio optimization environment, architectures and algorithm (#1146)

* Add portfolio optimization env

* Refactor POE

* Add algorithms for portfolio optimization

* Add parameters to EIIE

* Update portfolio optimization example

* Add readme to portfolio optimization agents

* Update readme

* Update portfolio optimization readme

* Format code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add commentary to portfolio optimization example

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
CaioSBC and pre-commit-ci[bot] authored Jan 11, 2024
1 parent a2863a9 commit b800999
Showing 10 changed files with 3,927 additions and 0 deletions.
2,465 changes: 2,465 additions & 0 deletions examples/FinRL_PortfolioOptimizationEnv_Demo.ipynb

Large diffs are not rendered by default.

87 changes: 87 additions & 0 deletions finrl/agents/portfolio_optimization/README.md
@@ -0,0 +1,87 @@
# Portfolio Optimization Agents

This directory contains architectures and algorithms commonly used in portfolio optimization agents.

To instantiate the model, you need an instance of [PortfolioOptimizationEnv](/finrl/meta/env_portfolio_optimization/). In the example below, we use the `DRLAgent` class to instantiate a policy gradient ("pg") model. The dictionary `model_kwargs` sets the parameters of the `PolicyGradient` class, and the dictionary `policy_kwargs` sets the parameters of the chosen architecture.

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE

# set PolicyGradient algorithm arguments
model_kwargs = {
    "lr": 0.01,
    "policy": EIIE,
}

# set EIIE architecture arguments
policy_kwargs = {
    "k_size": 4
}

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```

In the example below, the model is trained for 5 episodes (an episode is defined as one complete run through the period covered by the environment).

```python
DRLAgent.train_model(model, episodes=5)
```

It's important that the architecture and the environment define the same `time_window`. By default, both use a `time_window` of 50 timesteps. For more details about what a time window is, check this [article](https://doi.org/10.5753/bwaif.2023.231144). A sketch of keeping the two values in sync is shown below.
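
The snippet below is a minimal sketch of this, assuming the environment constructor accepts `time_window` and `initial_amount` arguments and that `df_train` is an already preprocessed dataframe; the environment's import path is inferred from the directory linked above, so treat these names as illustrative rather than definitive.

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE

# import path assumed from the environment directory linked above
from finrl.meta.env_portfolio_optimization.env_portfolio_optimization import (
    PortfolioOptimizationEnv,
)

TIME_WINDOW = 50  # single source of truth for the window size

# df_train is a hypothetical preprocessed dataframe with price data
train_env = PortfolioOptimizationEnv(
    df_train,
    initial_amount=100_000,
    time_window=TIME_WINDOW,
)

model_kwargs = {"lr": 0.01, "policy": EIIE}
policy_kwargs = {"time_window": TIME_WINDOW, "k_size": 4}

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```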

### Policy Gradient Algorithm

The class `PolicyGradient` implements the policy gradient algorithm used in the *Jiang et al.* paper. This algorithm is inspired by DDPG (deep deterministic policy gradient), but there are a few differences:
- DDPG is an actor-critic algorithm, so it has both an actor and a critic neural network. The algorithm below, however, has no critic neural network and uses the portfolio value as its value function: the policy is updated to maximize the portfolio value.
- DDPG usually adds a noise term to the action during training to create exploratory behavior. The PG algorithm, on the other hand, takes a full-exploit approach.
- DDPG randomly samples experiences from its replay buffer. The implemented policy gradient, however, samples a sequential batch of experiences in time, so that the variation of the portfolio value over the batch can be calculated and used as the value function (see the sketch after this list).
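
To illustrate that last difference, the sketch below shows a sequential buffer in the spirit of the replay buffer used by this implementation; the class and method names are illustrative, not the repository's API. Experiences are stored in time order and consumed as a single chronological batch.

```python
from collections import deque


class SequentialBuffer:
    """Illustrative buffer that preserves the time order of experiences."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def append(self, experience):
        # experience = (obs, last_action, price_variation, trf_mu)
        self.buffer.append(experience)

    def sample_all(self):
        # unlike DDPG's random sampling, the whole window is returned in
        # chronological order, so the portfolio value variation over the
        # batch can be computed
        return list(self.buffer)
```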

The algorithm was implemented as follows:
1. Initialize the policy network and the replay buffer;
2. For each episode, do the following:
    1. For each period of `batch_size` timesteps, do the following:
        1. For each timestep, define the action to be performed, simulate the timestep and save the experience in the replay buffer.
        2. After `batch_size` timesteps are simulated, sample the replay buffer.
        3. Calculate the value function: $V = \sum\limits_{t=1}^{batch\_size} \ln(\mu_{t}(W_{t} \cdot P_{t}))$, where $W_{t}$ is the action performed at timestep $t$, $P_{t}$ is the price variation vector at timestep $t$ and $\mu_{t}$ is the transaction remainder factor at timestep $t$. Check the *Jiang et al.* paper for more details.
        4. Perform gradient ascent on the policy network (a sketch of this step follows this list).
    2. If, at the end of the episode, there is a sequence of remaining experiences in the replay buffer, perform steps 1 to 4 above with the remaining experiences.
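
The sketch below shows how steps 3 and 4 can be expressed as a PyTorch loss: the value function is negated so that gradient ascent becomes a standard minimization step. The tensor names are illustrative and only mirror the symbols above; they are not the exact variables of the implementation.

```python
import torch


def policy_loss(weights, price_variations, trf_mu):
    """Negative of V = sum_t ln(mu_t * (W_t . P_t)).

    weights: W_t, portfolio weights chosen by the policy (batch x assets).
    price_variations: P_t, price variation vectors (batch x assets).
    trf_mu: mu_t, transaction remainder factors (batch,).
    """
    returns = torch.sum(weights * price_variations, dim=1)  # W_t . P_t
    # using the mean instead of the sum only rescales the gradient
    return -torch.mean(torch.log(trf_mu * returns))
```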

### References

If you use any of these architectures or algorithms in your research, you can cite the following references.

#### EIIE Architecture and Policy Gradient algorithm

[A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem](https://doi.org/10.48550/arXiv.1706.10059)
```
@misc{jiang2017deep,
  title={A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem},
  author={Zhengyao Jiang and Dixing Xu and Jinjun Liang},
  year={2017},
  eprint={1706.10059},
  archivePrefix={arXiv},
  primaryClass={q-fin.CP}
}
```

#### EI3 Architecture

[A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management](https://doi.org/10.1145/3357384.3357961)
```
@inproceedings{shi2018multiscale,
  author = {Shi, Si and Li, Jianjun and Li, Guohui and Pan, Peng},
  title = {A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management},
  year = {2019},
  isbn = {9781450369763},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3357384.3357961},
  doi = {10.1145/3357384.3357961},
  booktitle = {Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  pages = {1613–1622},
  numpages = {10},
  keywords = {portfolio management, reinforcement learning, inception network, convolution neural network},
  location = {Beijing, China},
  series = {CIKM '19}
}
```
Empty file.
251 changes: 251 additions & 0 deletions finrl/agents/portfolio_optimization/algorithms.py
@@ -0,0 +1,251 @@
from __future__ import annotations

import copy

import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm

from .architectures import EIIE
from .utils import PVM
from .utils import ReplayBuffer
from .utils import RLDataset


class PolicyGradient:
    """Class implementing the policy gradient algorithm to train portfolio
    optimization agents.
    Note:
        During testing, the agent is optimized through online learning.
        The parameters of the policy are updated repeatedly after a constant
        period of time. To disable this behavior, set the learning rate to 0.
    Attributes:
        train_env: Environment used to train the agent.
        train_policy: Policy used in training.
        test_env: Environment used to test the agent.
        test_policy: Policy after test online learning.
    """

    def __init__(
        self,
        env,
        policy=EIIE,
        policy_kwargs=None,
        validation_env=None,
        batch_size=100,
        lr=1e-3,
        optimizer=AdamW,
        device="cpu",
    ):
        """Initializes Policy Gradient for portfolio optimization.
        Args:
            env: Training environment.
            policy: Policy architecture to be used.
            policy_kwargs: Arguments to be used in the policy network.
            validation_env: Validation environment.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
            device: Device where the neural network is run.
        """
        self.policy = policy
        self.policy_kwargs = {} if policy_kwargs is None else policy_kwargs
        self.validation_env = validation_env
        self.batch_size = batch_size
        self.lr = lr
        self.optimizer = optimizer
        self.device = device
        self._setup_train(env, self.policy, self.batch_size, self.lr, self.optimizer)

    def _setup_train(self, env, policy, batch_size, lr, optimizer):
        """Initializes the algorithm before training.
        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.train_env = env

        # neural networks
        self.train_policy = policy(**self.policy_kwargs).to(self.device)
        self.train_optimizer = optimizer(self.train_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.train_batch_size = batch_size
        self.train_buffer = ReplayBuffer(capacity=batch_size)
        self.train_pvm = PVM(self.train_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.train_buffer)
        self.train_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def train(self, episodes=100):
        """Training sequence.
        Args:
            episodes: Number of episodes to simulate.
        """
        for i in tqdm(range(1, episodes + 1)):
            obs = self.train_env.reset()  # observation
            self.train_pvm.reset()  # reset portfolio vector memory
            done = False

            while not done:
                # define last_action and action and update portfolio vector memory
                last_action = self.train_pvm.retrieve()
                obs_batch = np.expand_dims(obs, axis=0)
                last_action_batch = np.expand_dims(last_action, axis=0)
                action = self.train_policy(obs_batch, last_action_batch)
                self.train_pvm.add(action)

                # run simulation step
                next_obs, reward, done, info = self.train_env.step(action)

                # add experience to replay buffer
                exp = (obs, last_action, info["price_variation"], info["trf_mu"])
                self.train_buffer.append(exp)

                # update policy networks
                if len(self.train_buffer) == self.train_batch_size:
                    self._gradient_ascent()

                obs = next_obs

            # gradient ascent with episode remaining buffer data
            self._gradient_ascent()

            # validation step
            if self.validation_env:
                self.test(self.validation_env)

    def _setup_test(self, env, policy, batch_size, lr, optimizer):
        """Initializes the algorithm before testing.
        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.test_env = env

        # process None arguments
        policy = self.train_policy if policy is None else policy
        lr = self.lr if lr is None else lr
        optimizer = self.optimizer if optimizer is None else optimizer

        # neural networks
        # define policy
        self.test_policy = copy.deepcopy(policy)
        self.test_optimizer = optimizer(self.test_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.test_buffer = ReplayBuffer(capacity=batch_size)
        self.test_pvm = PVM(self.test_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.test_buffer)
        self.test_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def test(
        self, env, policy=None, online_training_period=10, lr=None, optimizer=None
    ):
        """Tests the policy with online learning.
        Args:
            env: Environment to be used in testing.
            policy: Policy architecture to be used. If None, it will use the training
                architecture.
            online_training_period: Number of timesteps between online training
                updates (it is also used as the size of the sequential batch sampled
                for each update).
            lr: Policy neural network learning rate. If None, it will use the training
                learning rate.
            optimizer: Optimizer of the neural network. If None, it will use the
                training optimizer.
        Note:
            To disable online learning, set the learning rate to 0 or use a very big
            online training period.
        """
        self._setup_test(env, policy, online_training_period, lr, optimizer)

        obs = self.test_env.reset()  # observation
        self.test_pvm.reset()  # reset portfolio vector memory
        done = False
        steps = 0

        while not done:
            steps += 1
            # define last_action and action and update portfolio vector memory
            last_action = self.test_pvm.retrieve()
            obs_batch = np.expand_dims(obs, axis=0)
            last_action_batch = np.expand_dims(last_action, axis=0)
            action = self.test_policy(obs_batch, last_action_batch)
            self.test_pvm.add(action)

            # run simulation step
            next_obs, reward, done, info = self.test_env.step(action)

            # add experience to replay buffer
            exp = (obs, last_action, info["price_variation"], info["trf_mu"])
            self.test_buffer.append(exp)

            # update policy networks
            if steps % online_training_period == 0:
                self._gradient_ascent(test=True)

            obs = next_obs

    def _gradient_ascent(self, test=False):
        """Performs the gradient ascent step in the policy gradient algorithm.
        Args:
            test: If True, it uses the test dataloader and policy.
        """
        # get batch data from dataloader
        obs, last_actions, price_variations, trf_mu = (
            next(iter(self.test_dataloader))
            if test
            else next(iter(self.train_dataloader))
        )
        obs = obs.to(self.device)
        last_actions = last_actions.to(self.device)
        price_variations = price_variations.to(self.device)
        trf_mu = trf_mu.unsqueeze(1).to(self.device)

        # define policy loss (negative for gradient ascent)
        mu = (
            self.test_policy.mu(obs, last_actions)
            if test
            else self.train_policy.mu(obs, last_actions)
        )
        policy_loss = -torch.mean(
            torch.log(torch.sum(mu * price_variations * trf_mu, dim=1))
        )

        # update policy network
        if test:
            self.test_policy.zero_grad()
            policy_loss.backward()
            self.test_optimizer.step()
        else:
            self.train_policy.zero_grad()
            policy_loss.backward()
            self.train_optimizer.step()
