Add portfolio optimization environment, architectures and algorithm (#1146)

* Add portfolio optimization env
* Refactor POE
* Add algorithms for portfolio optimization
* Add parameters to EIIE
* Update portfolio optimization example
* Add readme to portfolio optimization agents
* Update readme
* Update portfolio optimization readme
* Format code
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add commentary to portfolio optimization example

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent a2863a9 · commit b800999

Showing 10 changed files with 3,927 additions and 0 deletions.
examples/FinRL_PortfolioOptimizationEnv_Demo.ipynb: 2,465 additions & 0 deletions (large diffs are not rendered by default).
@@ -0,0 +1,87 @@
# Portfolio Optimization Agents

This directory contains architectures and algorithms commonly used in portfolio optimization agents.

To instantiate the model, it's necessary to have an instance of [PortfolioOptimizationEnv](/finrl/meta/env_portfolio_optimization/). In the example below, we use the `DRLAgent` class to instantiate a policy gradient ("pg") model. With the dictionary `model_kwargs`, we can set the `PolicyGradient` class parameters, and with the dictionary `policy_kwargs`, we can change the parameters of the chosen architecture.

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE

# set PolicyGradient algorithm arguments
model_kwargs = {
    "lr": 0.01,
    "policy": EIIE,
}

# set EIIE architecture arguments
policy_kwargs = {
    "k_size": 4
}

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```

In the example below, the model is trained for 5 episodes (we define an episode as one complete run through the period covered by the environment).

```python
DRLAgent.train_model(model, episodes=5)
```
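
After training, you can also evaluate the model on held-out data using its `test` method (implemented in the `PolicyGradient` class), which keeps updating the policy through online learning. This is a minimal sketch, assuming `model` is the policy gradient model created above and `test_env` is a hypothetical environment built from a test-period dataframe:

```python
test_env = ...  # hypothetical PortfolioOptimizationEnv built from held-out data

# evaluate the policy on the test environment, updating it online
# every 10 timesteps (the default online_training_period)
model.test(test_env, online_training_period=10)
```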

It's important that the architecture and the environment have the same `time_window` defined. By default, both of them use 50 timesteps as `time_window`. For more details about what a time window is, check this [article](https://doi.org/10.5753/bwaif.2023.231144).
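
The sketch below shows one way to keep the two consistent. It is only an illustration: the environment's module path is assumed, `df_train` and the constructor values are hypothetical, and it assumes both `PortfolioOptimizationEnv` and `EIIE` accept a `time_window` argument:

```python
from finrl.agents.portfolio_optimization.models import DRLAgent
from finrl.agents.portfolio_optimization.architectures import EIIE
from finrl.meta.env_portfolio_optimization.env_portfolio_optimization import (
    PortfolioOptimizationEnv,
)

TIME_WINDOW = 50  # must be the same for the environment and the architecture

df_train = ...  # hypothetical preprocessed dataframe with date, tic and price features

train_env = PortfolioOptimizationEnv(
    df_train,
    initial_amount=100_000,
    time_window=TIME_WINDOW,
)

model_kwargs = {"lr": 0.01, "policy": EIIE}
policy_kwargs = {"time_window": TIME_WINDOW}  # match the environment's time window

model = DRLAgent(train_env).get_model("pg", model_kwargs, policy_kwargs)
```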

### Policy Gradient Algorithm

The `PolicyGradient` class implements the policy gradient algorithm used in the *Jiang et al* paper. This algorithm is inspired by DDPG (deep deterministic policy gradient), but there are a couple of differences:
- DDPG is an actor-critic algorithm, so it has an actor and a critic neural network. The algorithm below, however, has no critic neural network and uses the portfolio value as the value function: the policy is updated to maximize the portfolio value.
- DDPG usually adds noise to the action during training to create exploratory behavior. The PG algorithm, on the other hand, takes a full-exploit approach.
- DDPG randomly samples experiences from its replay buffer. The implemented policy gradient, however, samples a sequential batch of experiences in time, so that the variation of the portfolio value over the batch can be calculated and used as the value function (see the sketch after this list).
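
To make the last point concrete, here is a minimal illustrative sketch of sequential (time-ordered) sampling; it is not the `ReplayBuffer` class used by the implementation, just a toy version of the idea:

```python
from collections import deque


class SequentialReplayBuffer:
    """Toy buffer: experiences are kept in insertion (time) order and
    returned as a single chronological batch, never randomly sampled."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self):
        # the whole batch, in chronological order
        return list(self.buffer)

    def __len__(self):
        return len(self.buffer)
```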

The algorithm was implemented as follows:
1. Initialize the policy network and the replay buffer;
2. For each episode, do the following:
    1. For each period of `batch_size` timesteps, do the following:
        1. For each timestep, define an action to be performed, simulate the timestep and save the experience in the replay buffer;
        2. After `batch_size` timesteps are simulated, sample the replay buffer;
        3. Calculate the value function $V = \sum\limits_{t=1}^{batch\_size} \ln(\mu_{t}(W_{t} \cdot P_{t}))$, where $W_{t}$ is the action performed at timestep $t$, $P_{t}$ is the price variation vector at timestep $t$ and $\mu_{t}$ is the transaction remainder factor at timestep $t$. Check the *Jiang et al* paper for more details;
        4. Perform gradient ascent on the policy network (a sketch of this and the previous step appears after this list).
    2. If, at the end of the episode, there is a sequence of remaining experiences in the replay buffer, perform the sampling, value-function and gradient-ascent steps above with them.
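
As a rough sketch of how the value-function and gradient-ascent steps translate into code (mirroring the `_gradient_ascent` method in the `models.py` diff later in this commit), the value function is negated into a loss so that a standard optimizer step performs gradient ascent. The tensor names below are assumptions:

```python
import torch


def policy_loss(mu, price_variations, trf_mu):
    """Negative of V, so that minimizing this loss maximizes the portfolio value.

    mu: (batch_size, num_assets) portfolio weights W_t produced by the policy
    price_variations: (batch_size, num_assets) price variation vectors P_t
    trf_mu: (batch_size, 1) transaction remainder factors mu_t
    """
    # ln(mu_t * (W_t . P_t)) for each timestep, averaged over the batch and negated
    return -torch.mean(torch.log(torch.sum(mu * price_variations * trf_mu, dim=1)))
```

Note that the implementation averages over the batch instead of summing; this only rescales the gradient and does not change the direction of the update.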

### References

If you use any of these architectures or algorithms in your research, you can cite the following references.

#### EIIE Architecture and Policy Gradient Algorithm

[A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem](https://doi.org/10.48550/arXiv.1706.10059)
```
@misc{jiang2017deep,
      title={A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem},
      author={Zhengyao Jiang and Dixing Xu and Jinjun Liang},
      year={2017},
      eprint={1706.10059},
      archivePrefix={arXiv},
      primaryClass={q-fin.CP}
}
```

#### EI3 Architecture

[A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management](https://doi.org/10.1145/3357384.3357961)
```
@inproceedings{shi2018multiscale,
  author = {Shi, Si and Li, Jianjun and Li, Guohui and Pan, Peng},
  title = {A Multi-Scale Temporal Feature Aggregation Convolutional Neural Network for Portfolio Management},
  year = {2019},
  isbn = {9781450369763},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3357384.3357961},
  doi = {10.1145/3357384.3357961},
  booktitle = {Proceedings of the 28th ACM International Conference on Information and Knowledge Management},
  pages = {1613–1622},
  numpages = {10},
  keywords = {portfolio management, reinforcement learning, inception network, convolution neural network},
  location = {Beijing, China},
  series = {CIKM '19}
}
```
Empty file.
@@ -0,0 +1,251 @@
from __future__ import annotations

import copy

import numpy as np
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from tqdm import tqdm

from .architectures import EIIE
from .utils import PVM
from .utils import ReplayBuffer
from .utils import RLDataset


class PolicyGradient:
    """Class implementing a policy gradient algorithm to train portfolio
    optimization agents.

    Note:
        During testing, the agent is optimized through online learning.
        The parameters of the policy are updated repeatedly after a constant
        period of time. To disable this behavior, set the learning rate to 0.

    Attributes:
        train_env: Environment used to train the agent.
        train_policy: Policy used in training.
        test_env: Environment used to test the agent.
        test_policy: Policy after online learning during testing.
    """

    def __init__(
        self,
        env,
        policy=EIIE,
        policy_kwargs=None,
        validation_env=None,
        batch_size=100,
        lr=1e-3,
        optimizer=AdamW,
        device="cpu",
    ):
        """Initializes Policy Gradient for portfolio optimization.

        Args:
            env: Training environment.
            policy: Policy architecture to be used.
            policy_kwargs: Arguments to be used in the policy network.
            validation_env: Validation environment.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
            device: Device where the neural network is run.
        """
        self.policy = policy
        self.policy_kwargs = {} if policy_kwargs is None else policy_kwargs
        self.validation_env = validation_env
        self.batch_size = batch_size
        self.lr = lr
        self.optimizer = optimizer
        self.device = device
        self._setup_train(env, self.policy, self.batch_size, self.lr, self.optimizer)

    def _setup_train(self, env, policy, batch_size, lr, optimizer):
        """Initializes algorithm before training.

        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.train_env = env

        # neural networks
        self.train_policy = policy(**self.policy_kwargs).to(self.device)
        self.train_optimizer = optimizer(self.train_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.train_batch_size = batch_size
        self.train_buffer = ReplayBuffer(capacity=batch_size)
        self.train_pvm = PVM(self.train_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.train_buffer)
        self.train_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def train(self, episodes=100):
        """Training sequence.

        Args:
            episodes: Number of episodes to simulate.
        """
        for i in tqdm(range(1, episodes + 1)):
            obs = self.train_env.reset()  # observation
            self.train_pvm.reset()  # reset portfolio vector memory
            done = False

            while not done:
                # define last_action and action and update portfolio vector memory
                last_action = self.train_pvm.retrieve()
                obs_batch = np.expand_dims(obs, axis=0)
                last_action_batch = np.expand_dims(last_action, axis=0)
                action = self.train_policy(obs_batch, last_action_batch)
                self.train_pvm.add(action)

                # run simulation step
                next_obs, reward, done, info = self.train_env.step(action)

                # add experience to replay buffer
                exp = (obs, last_action, info["price_variation"], info["trf_mu"])
                self.train_buffer.append(exp)

                # update policy networks
                if len(self.train_buffer) == self.train_batch_size:
                    self._gradient_ascent()

                obs = next_obs

            # gradient ascent with episode remaining buffer data
            self._gradient_ascent()

            # validation step
            if self.validation_env:
                self.test(self.validation_env)

    def _setup_test(self, env, policy, batch_size, lr, optimizer):
        """Initializes algorithm before testing.

        Args:
            env: Environment.
            policy: Policy architecture to be used.
            batch_size: Batch size to train the neural network.
            lr: Policy neural network learning rate.
            optimizer: Optimizer of the neural network.
        """
        # environment
        self.test_env = env

        # process None arguments
        policy = self.train_policy if policy is None else policy
        lr = self.lr if lr is None else lr
        optimizer = self.optimizer if optimizer is None else optimizer

        # neural networks
        # define policy
        self.test_policy = copy.deepcopy(policy)
        self.test_optimizer = optimizer(self.test_policy.parameters(), lr=lr)

        # replay buffer and portfolio vector memory
        self.test_buffer = ReplayBuffer(capacity=batch_size)
        self.test_pvm = PVM(self.test_env.episode_length, env.portfolio_size)

        # dataset and dataloader
        dataset = RLDataset(self.test_buffer)
        self.test_dataloader = DataLoader(
            dataset=dataset, batch_size=batch_size, shuffle=False, pin_memory=True
        )

    def test(
        self, env, policy=None, online_training_period=10, lr=None, optimizer=None
    ):
        """Tests the policy with online learning.

        Args:
            env: Environment to be used in testing.
            policy: Policy architecture to be used. If None, it will use the training
                architecture.
            online_training_period: Period in which online training will occur. It is
                also used as the replay buffer size. To disable online learning, use a
                very big value.
            lr: Policy neural network learning rate. If None, it will use the training
                learning rate.
            optimizer: Optimizer of the neural network. If None, it will use the
                training optimizer.

        Note:
            To disable online learning, set the learning rate to 0 or use a very big
            online training period.
        """
        self._setup_test(env, policy, online_training_period, lr, optimizer)

        obs = self.test_env.reset()  # observation
        self.test_pvm.reset()  # reset portfolio vector memory
        done = False
        steps = 0

        while not done:
            steps += 1
            # define last_action and action and update portfolio vector memory
            last_action = self.test_pvm.retrieve()
            obs_batch = np.expand_dims(obs, axis=0)
            last_action_batch = np.expand_dims(last_action, axis=0)
            action = self.test_policy(obs_batch, last_action_batch)
            self.test_pvm.add(action)

            # run simulation step
            next_obs, reward, done, info = self.test_env.step(action)

            # add experience to replay buffer
            exp = (obs, last_action, info["price_variation"], info["trf_mu"])
            self.test_buffer.append(exp)

            # update policy networks
            if steps % online_training_period == 0:
                self._gradient_ascent(test=True)

            obs = next_obs

    def _gradient_ascent(self, test=False):
        """Performs the gradient ascent step in the policy gradient algorithm.

        Args:
            test: If true, it uses the test dataloader and policy.
        """
        # get batch data from dataloader
        obs, last_actions, price_variations, trf_mu = (
            next(iter(self.test_dataloader))
            if test
            else next(iter(self.train_dataloader))
        )
        obs = obs.to(self.device)
        last_actions = last_actions.to(self.device)
        price_variations = price_variations.to(self.device)
        trf_mu = trf_mu.unsqueeze(1).to(self.device)

        # define policy loss (negative for gradient ascent)
        mu = (
            self.test_policy.mu(obs, last_actions)
            if test
            else self.train_policy.mu(obs, last_actions)
        )
        policy_loss = -torch.mean(
            torch.log(torch.sum(mu * price_variations * trf_mu, dim=1))
        )

        # update policy network
        if test:
            self.test_policy.zero_grad()
            policy_loss.backward()
            self.test_optimizer.step()
        else:
            self.train_policy.zero_grad()
            policy_loss.backward()
            self.train_optimizer.step()