Developing a Scalable Portfolio Allocator

Integrating a Continually Learning PPO Model into the Financial Portfolio Allocation Landscape

Reinforcement Learning for Portfolio Optimization

Project Overview

This project implements a Proximal Policy Optimization (PPO) agent to optimize portfolio allocations in a dynamic trading environment. The model operates in a continuous action space and seeks to maximize returns by allocating capital across multiple assets while accounting for transaction costs and the cash balance.

The primary goal is to simulate and train an intelligent agent that learns effective trading strategies, ultimately achieving positive portfolio growth.


Purpose

Why Portfolio Optimization?

In financial markets, portfolio optimization is a challenging task that requires balancing risk and return across multiple assets. Traditional methods, such as mean-variance optimization, rely on static assumptions and historical data, limiting their adaptability in dynamic markets. Reinforcement learning (RL), on the other hand, allows for adaptive strategies that respond to market conditions in real time.

This project demonstrates:

  • How RL can be applied to financial decision-making.
  • The integration of recurrent neural network architectures (GRUs) with RL algorithms.
  • Strategies for handling continuous action spaces in trading environments.

Key Features

1. Custom Trading Environment

A dynamic trading environment simulates real-world market conditions with:

  • Historical Data: Real asset prices, including daily open, close, high, and low prices.
  • Action Space: Continuous allocation weights over multiple stocks and a cash position, constrained to sum to 1.
  • Reward Function: Designed to balance portfolio growth with transaction costs, penalizing negative cash balances and encouraging diversification.
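
A minimal sketch of how such an environment could look with the legacy Gym API is shown below. The class name, observation layout, and cost constant are illustrative assumptions rather than the project's exact implementation, and the negative-cash penalty and diversification terms from the reward description are omitted for brevity.

```python
import numpy as np
import gym
from gym import spaces

class PortfolioEnv(gym.Env):
    """Minimal sketch of the trading environment.

    prices: (T, n_assets) array of daily close prices; names and constants are illustrative."""

    def __init__(self, prices, initial_cash=1_000.0, transaction_cost=0.001):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.n_assets = self.prices.shape[1]
        self.initial_cash = initial_cash
        self.transaction_cost = transaction_cost
        # One weight per asset plus a cash slot; normalized to sum to 1 in step().
        self.action_space = spaces.Box(0.0, 1.0, shape=(self.n_assets + 1,), dtype=np.float32)
        # Observation: current prices plus current weights (a simplification).
        self.observation_space = spaces.Box(-np.inf, np.inf,
                                            shape=(self.n_assets * 2 + 1,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.weights = np.zeros(self.n_assets + 1, dtype=np.float32)
        self.weights[-1] = 1.0                       # start fully in cash
        self.value = self.initial_cash
        return self._obs()

    def step(self, action):
        action = np.clip(action, 0.0, None)
        target = action / (action.sum() + 1e-8)      # enforce the sum-to-1 constraint
        turnover = np.abs(target - self.weights).sum()
        cost = self.transaction_cost * turnover * self.value

        self.t += 1
        price_rel = self.prices[self.t] / self.prices[self.t - 1]
        growth = target[:-1] @ price_rel + target[-1]   # cash earns no return here
        new_value = (self.value - cost) * growth

        reward = (new_value - self.value) / self.value  # relative portfolio change
        self.value, self.weights = new_value, target
        done = self.t >= len(self.prices) - 1
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.concatenate([self.prices[self.t], self.weights]).astype(np.float32)
```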

2. Actor-Critic Architecture

The PPO agent uses an actor-critic framework:

  • Actor Network: Outputs a Dirichlet distribution for portfolio allocations, ensuring non-negative values that sum to 1.
  • Critic Network: Estimates the value function to guide policy updates.
  • GRU Layer: Captures temporal dependencies in market data, allowing the model to learn from sequential trends.
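
A condensed sketch of how this actor-critic could be wired in PyTorch follows; the layer sizes, window length, and asset count are illustrative assumptions rather than the project's exact configuration.

```python
import torch
import torch.nn as nn
from torch.distributions import Dirichlet

class ActorCritic(nn.Module):
    """Sketch: a GRU encoder shared by a Dirichlet actor head and a value (critic) head."""

    def __init__(self, obs_dim, n_assets, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        # Actor outputs Dirichlet concentration parameters: one per asset plus cash.
        self.alpha_head = nn.Sequential(nn.Linear(hidden_dim, n_assets + 1), nn.Softplus())
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, seq_len, obs_dim); the GRU captures temporal structure.
        out, hidden = self.gru(obs_seq, hidden)
        last = out[:, -1]                        # features at the most recent step
        alpha = self.alpha_head(last) + 1e-3     # keep concentrations strictly positive
        dist = Dirichlet(alpha)                  # samples are non-negative and sum to 1
        value = self.value_head(last).squeeze(-1)
        return dist, value, hidden

# Usage: sample an allocation and get its log-probability for the PPO update.
model = ActorCritic(obs_dim=21, n_assets=10)
obs = torch.randn(1, 30, 21)                     # a 30-day window of observations
dist, value, _ = model(obs)
weights = dist.sample()                          # portfolio weights, sum to 1
log_prob = dist.log_prob(weights)
```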

3. Exploration and Exploitation

The model incorporates mechanisms to balance exploration and exploitation:

  • Entropy Regularization: Encourages diverse action sampling.
  • Reward Scaling: Amplifies portfolio changes to make learning signals more discernible.
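
For concreteness, entropy regularization typically enters the objective as a bonus term in the clipped PPO loss, as in the sketch below (the coefficients are assumed values, not the tuned ones). Reward scaling itself is simply a constant multiplier applied to the environment reward before advantages are computed.

```python
import torch

def ppo_loss(dist, old_log_probs, actions, advantages, values, returns,
             clip_eps=0.2, entropy_coef=0.01, value_coef=0.5):
    """Sketch of a clipped PPO objective with an entropy bonus (illustrative coefficients)."""
    log_probs = dist.log_prob(actions)
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy_bonus = dist.entropy().mean()        # encourages diverse allocation sampling
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```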

4. Robust Training Pipeline

  • Generalized Advantage Estimation (GAE): Stabilizes policy updates by reducing the variance of the advantage estimates (see the sketch after this list).
  • Reward Normalization: Ensures that rewards remain within a manageable range.
  • Curriculum Learning: Gradually increases environment complexity.
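
A compact version of the GAE computation referenced above, assuming per-step rewards and done flags from a rollout plus value estimates that include a bootstrap value; the variable names are illustrative. Advantages are normalized at the end, which complements the reward normalization described above.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards, dones: length-T arrays; values: length T+1 (includes the bootstrap value)."""
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        mask = 1.0 - float(dones[t])             # cut the bootstrap at episode ends
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
    returns = advantages + values[:-1]
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages, returns
```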

Challenges and Solutions

1. Sparse Positive Rewards

  • Problem: Positive portfolio changes were rare, leading to slow learning.
  • Solution: Reward shaping was implemented to provide intermediate rewards for beneficial actions, such as reducing volatility or maintaining a balanced allocation (see the sketch below).
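
One way such shaping could be layered on top of the raw portfolio-change reward is sketched here; the specific bonus terms and coefficients are assumptions for illustration, not the project's exact shaping function.

```python
import numpy as np

def shaped_reward(raw_reward, weights, recent_returns,
                  diversity_coef=0.01, vol_coef=0.1):
    """Sketch: add small intermediate signals to the raw portfolio-change reward."""
    # Bonus for maintaining a balanced allocation (entropy of the weight vector).
    diversity_bonus = -np.sum(weights * np.log(weights + 1e-8))
    # Penalty for recent volatility of the portfolio's returns.
    volatility_penalty = np.std(recent_returns)
    return raw_reward + diversity_coef * diversity_bonus - vol_coef * volatility_penalty
```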

2. Exploration Difficulties

  • Problem: The agent often converged to suboptimal strategies due to insufficient exploration.
  • Solution: Concentration parameters in the Dirichlet distribution were adjusted dynamically to enhance exploration.
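The adjustment could be as simple as scaling the actor's concentration output with a schedule that grows over training, as in the sketch below (the schedule and bounds are assumptions): smaller concentrations yield higher-variance, more exploratory Dirichlet samples, while larger ones sharpen the policy around its mean allocation.

```python
def tempered_alpha(raw_alpha, step, total_steps, min_scale=1.0, max_scale=5.0):
    """Sketch: scale Dirichlet concentrations up over training to shift from
    exploration (flatter, higher-variance samples) toward exploitation."""
    progress = min(step / total_steps, 1.0)
    scale = min_scale + (max_scale - min_scale) * progress
    return raw_alpha * scale + 1e-3   # keep concentrations strictly positive
```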

3. Portfolio Value Drift

  • Problem: Portfolio value approached zero due to improper share updates.
  • Solution: Corrected the share-update calculation so that executed allocations match the intended weights without eroding capital (sketched below).
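
Conceptually, the fix amounts to converting target weights into share counts at current prices and re-deriving the portfolio value from those shares, so value only changes through price moves and transaction costs. The sketch below illustrates this; the function and variable names are hypothetical.

```python
import numpy as np

def rebalance(portfolio_value, target_weights, prices,
              transaction_cost=0.001, old_shares=None):
    """Sketch: convert target weights into share counts so that capital is conserved
    (up to transaction costs) instead of drifting toward zero."""
    old_shares = np.zeros_like(prices) if old_shares is None else old_shares
    target_value = portfolio_value * target_weights[:-1]   # last weight is cash
    new_shares = target_value / prices
    cost = transaction_cost * np.abs(new_shares - old_shares) @ prices
    cash = portfolio_value * target_weights[-1] - cost
    new_value = new_shares @ prices + cash                  # equals portfolio_value - cost
    return new_shares, cash, new_value
```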

4. Continuous Action Space

  • Problem: The continuous action space admits infinitely many possible allocations, making it difficult to map states to effective policies.
  • Solution: Adopted a Dirichlet distribution to naturally constrain actions to valid portfolio allocations.

Results

Performance Highlights

  • The agent demonstrated the ability to reduce losses consistently.
  • Incremental improvements in reward structures enabled positive portfolio changes during evaluation.

Visualization

Key metrics and portfolio performance trends were visualized to track agent learning and decision-making over time. Sample outputs included:

  • Portfolio value over episodes.
  • Reward trends and loss metrics.
  • Allocation distributions.

Future Work

1. Enhanced Reward Functions

  • Introduce intrinsic motivation, such as diversity rewards, to encourage exploration.
  • Optimize for risk-adjusted returns, incorporating metrics such as the Sharpe ratio (see the sketch below).
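
As one possibility, a rolling Sharpe ratio over a recent window of portfolio returns could serve as the reward signal; the window handling and zero risk-free rate below are assumptions.

```python
import numpy as np

def rolling_sharpe_reward(returns_window, risk_free=0.0, eps=1e-8):
    """Sketch: reward based on the Sharpe ratio of a recent window of portfolio returns."""
    excess = np.asarray(returns_window) - risk_free
    return excess.mean() / (excess.std() + eps)
```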

2. Integration with Live Data

  • Extend the environment to handle streaming market data for real-time decision-making.

3. Advanced Architectures

  • Experiment with Transformers or LSTMs for improved sequential learning.
  • Implement multi-agent systems to simulate competition in trading.

4. Hyperparameter Optimization

  • Use automated tools like Optuna to fine-tune learning rates, entropy coefficients, and reward scaling factors.
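
A minimal Optuna wiring for that search might look like the sketch below (assuming a recent Optuna version); train_and_evaluate is a hypothetical stand-in for the project's training loop, not a function in the repository.

```python
import optuna

def train_and_evaluate(lr, entropy_coef, reward_scale):
    """Placeholder for the project's training loop; should return the metric to maximize."""
    raise NotImplementedError

def objective(trial):
    # Hypothetical search space over the hyperparameters mentioned above.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    entropy_coef = trial.suggest_float("entropy_coef", 1e-4, 1e-1, log=True)
    reward_scale = trial.suggest_float("reward_scale", 1.0, 100.0)
    return train_and_evaluate(lr, entropy_coef, reward_scale)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```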

Conclusion

This project showcases the potential of reinforcement learning in portfolio optimization, addressing complex challenges in continuous action spaces and dynamic environments. By leveraging advanced RL techniques and thoughtful design, the agent serves as a foundation for future innovations in financial decision-making.


Repository and Resources

Code

The complete implementation can be found here. The repository includes:

  • Environment setup.
  • PPO agent implementation.
  • Training and evaluation scripts.

Dependencies

  • Python
  • PyTorch
  • NumPy
  • Pandas
  • Gym

For detailed setup instructions, refer to the README.md file in the repository.