This project explores how Reinforcement Learning (RL)—specifically, Deep Q-Learning—can be used to develop algorithmic trading strategies. We simulate a trading agent that learns to make buy, sell, and hold decisions based on historical price data and technical indicators.
Techniques Used:
- Feature Engineering (RSI, EMA, MACD, OBV, etc.)
- Custom RL Environment (OpenAI Gym style)
- Policy Gradient Agent (A2C from stable-baselines3)
- Performance Evaluation via backtesting
Algorithmic trading uses automated systems to make trading decisions based on quantitative signals. In this project:
- We focus on equity trading, using the S&P 500 (^GSPC) index.
- Trades are simulated using hourly price data over 2 years.
- The goal is to maximize cumulative returns while learning from historical data.
Reinforcement Learning (RL) is a branch of machine learning where an agent learns to make decisions by interacting with an environment and receiving rewards or penalties.
- Agent: The trader
- Environment: The stock market (historical data)
- Action Space: Buy, Sell, Hold
- State: Technical indicators and price history
- Reward: Profit/loss after each action
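Below is a minimal sketch of how these pieces map onto a Gym-style environment. It is illustrative only; the class name, attributes, and reward logic here are assumptions, and the project's actual environment (notebook 03) differs in detail.

```python
import gym
import numpy as np


class TradingEnv(gym.Env):
    """Toy trading environment: one row of features per time step."""

    def __init__(self, features, prices):
        super().__init__()
        self.features = features          # state: indicators + price history (2D array)
        self.prices = prices
        self.action_space = gym.spaces.Discrete(3)   # 0 = Hold, 1 = Buy, 2 = Sell
        self.observation_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(features.shape[1],), dtype=np.float32
        )
        self.t = 0
        self.position = 0                 # 0 = flat, 1 = long

    def reset(self):
        self.t, self.position = 0, 0
        return self.features[self.t].astype(np.float32)

    def step(self, action):
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        if action == 1:                   # Buy
            self.position = 1
        elif action == 2:                 # Sell
            self.position = 0
        reward = self.position * price_change   # profit/loss after the action
        self.t += 1
        done = self.t >= len(self.prices) - 1
        return self.features[self.t].astype(np.float32), reward, done, {}
```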
Deep Q-Learning is an RL algorithm that approximates the Q-value function with a deep neural network. It helps the agent answer:
“What is the expected future reward if I take this action from this state?”
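In standard notation (a textbook formulation, not tied to this project's code), the optimal Q-value satisfies the Bellman equation:

```math
Q^{*}(s, a) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \,\right]
```

Deep Q-Learning trains a network $Q_\theta(s, a)$ to approximate $Q^{*}$ by minimizing the gap between its predictions and this bootstrapped target.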
Although this project uses A2C (Advantage Actor Critic) for simplicity, the architecture and environment are compatible with DQN or PPO as well.
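As a rough sketch of that interchangeability (assuming a Gym-style environment object named `env` with the discrete Buy/Sell/Hold action space built in notebook 03), swapping algorithms in stable-baselines3 is mostly a one-line change:

```python
from stable_baselines3 import A2C, DQN, PPO

# The same custom trading environment can back any of these algorithms;
# DQN additionally requires a discrete action space, which Buy/Sell/Hold satisfies.
model = A2C("MlpPolicy", env, verbose=1)    # used in this project
# model = PPO("MlpPolicy", env, verbose=1)  # drop-in on-policy alternative
# model = DQN("MlpPolicy", env, verbose=1)  # value-based alternative

model.learn(total_timesteps=100_000)
model.save("a2c_trading_agent")
```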
Technical indicators help quantify trends, volatility, and momentum in price data.
| Indicator | Type | What it Shows |
|---|---|---|
| EMA (Exponential Moving Average) | Trend | Smoothed average of prices over 7, 14, 50, 200 steps |
| MACD (Moving Average Convergence Divergence) | Trend/Momentum | Difference between fast and slow EMAs to detect reversals |
| RSI (Relative Strength Index) | Momentum | Measures overbought/oversold conditions (0–100) |
| OBV (On-Balance Volume) | Volume | Cumulative volume indicator to show crowd interest |
| Bollinger Bands (BB) | Volatility | Upper/lower bands around a moving average showing price extremes |
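As a sketch of how such indicators can be computed with the `ta` library (column names like `Close` and `Volume` assume a standard single-level yfinance OHLCV DataFrame; the exact windows used in the notebook may differ):

```python
import ta

# df is an OHLCV DataFrame with 'Close' and 'Volume' columns
df["ema_14"] = ta.trend.EMAIndicator(close=df["Close"], window=14).ema_indicator()
df["macd_diff"] = ta.trend.MACD(close=df["Close"]).macd_diff()
df["rsi_14"] = ta.momentum.RSIIndicator(close=df["Close"], window=14).rsi()
df["obv"] = ta.volume.OnBalanceVolumeIndicator(
    close=df["Close"], volume=df["Volume"]
).on_balance_volume()

bb = ta.volatility.BollingerBands(close=df["Close"], window=20)
df["bb_high"], df["bb_low"] = bb.bollinger_hband(), bb.bollinger_lband()
```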
| Section | Purpose |
|---|---|
| `01_data_collection` | Downloads 2 years of S&P 500 data from Yahoo Finance |
| `02_feature_engineering` | Adds indicators using the `ta` library |
| `03_rl_environment` | Custom `gym.Env` simulating a trading environment |
| `04_agent_training` | Trains a policy-based RL agent (A2C) |
| `05_evaluation_and_backtesting` | Simulates agent trading and plots performance |
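A minimal version of that evaluation loop might look like the following (a sketch; `model` and `env` are the trained agent and Gym-style environment assumed in the earlier snippets):

```python
obs = env.reset()
done = False
cumulative_reward = 0.0
equity_curve = []

while not done:
    action, _ = model.predict(obs, deterministic=True)  # greedy policy at test time
    obs, reward, done, info = env.step(action)
    cumulative_reward += reward
    equity_curve.append(cumulative_reward)

print(f"Cumulative reward over the backtest: {cumulative_reward:.2f}")
```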
- `stable-baselines3` for RL algorithms (A2C, PPO, DQN)
- `yfinance` for historical price data
- `ta` for computing technical indicators
- `gym` for environment simulation
- Google Colab for easy execution
Run the `rl_trading` notebooks in Google Colab for zero setup.
Install dependencies in each notebook:
`!pip install yfinance ta stable-baselines3[extra] --quiet`

Example for ^GSPC (S&P 500):
- Start Datetime: 2023-05-30 13:30:00+00:00 (UTC)
- End Datetime: 2025-05-27 17:30:00+00:00 (UTC)
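The data collection step can be reproduced with a call along these lines (a sketch; the notebook's exact arguments may differ):

```python
import yfinance as yf

# Hourly S&P 500 data; Yahoo Finance limits 1h bars to roughly the last 730 days
df = yf.download("^GSPC", period="2y", interval="1h")
print(df.index.min(), df.index.max())
```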
The cumulative reward plot shows promising learning progress, though there is room for improvement. The agent begins with modest losses, which is expected as it explores the trading environment. By around step 500 it starts developing an effective strategy, pushing cumulative rewards into positive territory. The strongest gains occur between steps 1000 and 2000, where rewards climb steadily to roughly 2500, showing that the agent can identify and exploit profitable trading opportunities. Some volatility emerges in the later stages (steps 2000 to 3000), but the overall trend remains positive, suggesting the core strategy is sound; the minor pullbacks likely reflect normal market fluctuations or temporary exploration rather than systemic failures. With additional training and modest adjustments, such as refining the reward function or rebalancing exploration, the agent could smooth out these fluctuations while maintaining its profitable trajectory. Overall, the learning process is successful, and the strategy, while not perfect, shows clear potential for consistent profitability.
- Implementing additional RL algorithms (e.g., PPO, DQN).
- Adding more features or using alternative state representations (e.g., candlestick patterns as images for CNNs).
- More rigorous hyperparameter optimization.
- Portfolio optimization for multiple assets.
- Live trading integration.
This project is licensed under the MIT License. See the LICENSE file for details.
This project is for educational purposes only and should not be considered financial advice. Trading financial markets involves substantial risk of loss.

