Deep Reinforcement Learning for Self-Evolving AI
Building self-learning systems
By Kuriko IWAI

Table of Contents
Introduction
Deep Reinforcement Learning

Introduction
Deep Reinforcement Learning (DRL) is a key component of modern AI, combining the principles of reinforcement learning with the power of deep neural networks.
In this article, I’ll explore its core concepts and advantages, with a performance comparison between a DRL agent and a traditional method.
Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) is a machine learning method that combines the concept of Reinforcement Learning (RL) and Deep Learning (DL).
◼ Reinforcement Learning Part
Reinforcement Learning is a trial-and-error learning process where an agent learns to make sequential decisions by interacting with an environment, receiving rewards based on its actions.
The agent's primary objective here is to learn a policy that maximizes its cumulative reward over time by identifying the optimal action for each situation (state).
▫ Key Components
Traditional RL (and hence DRL) is built on the following key components:
Agent: A learner (deep neural network) that makes decisions
Environment: The world the agent interacts with
State: A snapshot of the environment at a particular moment
Action: A choice the agent makes based on the current state.
Policy: The agent's strategy and rules learned for choosing an action in a given state.
Reward: A feedback signal from the environment indicating the immediate consequence of the agent's action in that state. This can be positive or negative.
For example, let us consider a self-driving car: its controller (the agent) interprets the road image (state) and, guided by its learned policy, chooses to steer left (action). A smooth drive then yields a reward of 3.
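To make these components concrete, here is a minimal sketch of the interaction loop, using a toy environment and a random placeholder policy invented for illustration (the class names and reward values are assumptions, not part of any real library):

```python
import random

class ToyEnvironment:
    """Hypothetical environment: the state is an integer position; the goal is position 3."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1 if action == 1 else -1     # action 1 moves right, action 0 moves left
        reward = 3.0 if self.state == 3 else -0.1  # reward signal from the environment
        return self.state, reward

class RandomAgent:
    """Placeholder policy; a DRL agent would learn the state -> action mapping instead."""
    def select_action(self, state):
        return random.choice([0, 1])

env, agent = ToyEnvironment(), RandomAgent()
state = env.state
for t in range(5):
    action = agent.select_action(state)  # policy maps the current state to an action
    state, reward = env.step(action)     # environment returns the next state and a reward
    print(f"t={t} action={action} state={state} reward={reward}")
```

A learning agent would use the observed rewards to improve its policy; here the policy stays random, which is exactly the gap DRL fills.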
The approach of RL is fundamentally different from other machine learning paradigms:
Supervised Learning: Relies on predefined input-output pairs.
Unsupervised Learning: Seeks inherent structures in unlabeled data.
RL agent (and DRL agent): Continuously learns through direct interaction and feedback from the environment.
◼ Deep Learning Part
The "Deep" in DRL implies the integration of Deep Learning techniques.
In DRL, the agent is implemented as a multi-layered neural network, enabling it to process and learn from highly complex and high-dimensional states, such as raw pixel data from images or intricate sensor readings.
These "deep" neural networks excel at automatically learning patterns from raw data without explicit feature engineering.
◼ Addressing the Limitations of Traditional Reinforcement Learning
Traditional RL faces significant limitations when applied to complex, real-world scenarios. DRL emerges as a solution to overcome these challenges:
The Curse of Dimensionality & Handling Complex States:
Traditional RL: Struggles with complex and high-dimensional states. As the number of states in an environment grows, the amount of memory and computation required to store and update value functions increases exponentially, quickly becoming unfeasible (explained in more detail later).
DRL Solution: Utilizes DRL agents backed by deep neural networks, which can map high-dimensional states directly to the value estimates or actions needed for decision-making.
Inability to Handle Raw Sensory Data & End-to-End Learning:
Traditional RL: Requires predefined, low-dimensional representations of the environment. It cannot directly process raw sensory data (e.g., pixel data from a camera, audio signals) without manual feature extraction or state discretization.
DRL Solution: Learns directly from raw sensory data without the need for manual feature engineering. The deep neural networks automatically extract relevant features and representations from this raw input, enabling the agent to learn complex behaviors from the ground up. This end-to-end approach simplifies the design process and allows for more nuanced learning.
Limited Generalization:
Traditional RL: Struggles to generalize its learning to unseen states, often requiring explicit learning for every possible scenario.
DRL Solution: DRL agents leverage the generalization capabilities of deep neural networks. They can apply learned knowledge to unseen states, making them more adaptable to real-world complexity.
The diagram below illustrates the key differences between DRL and traditional RL, highlighting how deep neural networks enhance the agent's ability to perceive and act in sophisticated environments:

Kernel Labs | Kuriko IWAI | kuriko-iwai.com
Figure A. Comparison of DRL and traditional RL (Created by Kuriko IWAI)
In essence, DRL's use of deep neural networks as the agent allows it to process high-dimensional inputs, learn complex mappings, and generalize effectively, addressing the core limitations of the traditional RL.
◼ Use Cases
DRL has revolutionized various domains due to its ability to handle complexity and learn autonomously. Its use cases include:
Gaming: Achieving superhuman performance in strategic games (Go, Chess) and video games (Atari, StarCraft II, Dota 2) by learning complex strategies through self-play.
Robotics: Enabling robots to perform intricate tasks like grasping, manipulation, locomotion, and navigation in diverse and dynamic environments.
Autonomous Vehicles: Developing decision-making systems for self-driving cars, including path planning, collision avoidance, and adaptive control in varied traffic conditions.
Finance: Optimizing algorithmic trading strategies, portfolio management, risk assessment, and fraud detection by analyzing vast, real-time market data.
Healthcare: Personalizing treatment plans, assisting in medical diagnosis, optimizing drug discovery processes, and enhancing robotic surgical procedures.
Energy Management: Optimizing energy consumption in large data centers and smart grids by predicting usage patterns and managing power distribution.
The Core of Optimization Problems in Reinforcement Learning
Across all the different types of DRL algorithms (and RL in general), the optimization problem is to find an optimal policy π∗ that maximizes the expected sum of discounted rewards by repeatedly taking a “good“ action in each state.
Mathematically, it is defined with these states S and actions A as conditions:

π∗ = argmax_π E[ Σ_{k=0}^∞ γ^k R_t+k+1 | S_t = s, A_t = π(S_t) ]
where:
π∗: The optimal policy that maximizes the expected cumulative discounted reward.
E[⋅]: The expected value (the average value over many possible discounted rewards).
γ^k: The discount factor γ raised to the power of k.
R_t+k+1: The reward received at time step t+k+1.
S_t = s: Under the condition where the state at time t is s.
A_t = π(S_t): Under the condition where the action at time t is determined by the policy π applied to state S_t.
Let us take a look at the key components.
The discount factor γ controls the discount rate of rewards: a reward received k steps in the future is discounted by a factor of γ^k. Since the discount factor satisfies 0 ≤ γ < 1, to make the discounted reward large, the algorithm is strongly incentivized to accrue positive rewards as soon as possible and postpone negative rewards as long as possible.
This is a similar concept to the interest rate in economic applications where a dollar today is worth more than a dollar tomorrow.
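To illustrate with arbitrary numbers, the discounted sum can be computed directly (both γ and the reward sequence here are invented for the example):

```python
# Discounted return with illustrative values (gamma and rewards are arbitrary).
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# sum of gamma^k * R_{t+k+1} over k
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(round(discounted_return, 3))  # → 1.891
```

Note how the early reward of 1.0 counts in full while the late penalty of -1.0 is shrunk to -0.729, which is exactly the incentive described above.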
A policy π denotes any function that can map states to actions:

π : S → A

When the algorithm executes a policy π, it takes an action a such that:

a = π(s)

Given that the policy dictates the actions, the expected sum of the discounted rewards is determined by the state and the policy:

V^π(s) = E[ Σ_{k=0}^∞ γ^k R_t+k+1 | S_t = s, π ]

where V^π(s) is a value function that represents the expected total of the discounted rewards.
Using the value function, the optimization problem is redefined:

π∗ = argmax_π V^π(s)
The mathematical process of the continuous iteration from time step zero to t looks like:

Figure B. Mathematical application of DRL (Created by Kuriko IWAI)
▫ Transition Probability and Bellman Optimality Equation
The Bellman Optimality Equation is a recursive relationship in RL that characterizes the optimal value function for an MDP.
It states that the optimal value of a state (or state-action pair) under an optimal policy is equal to the immediate reward plus the discounted optimal value of the next state:

V∗(s) = R(s) + γ max_a Σ_s' P_sa(s') V∗(s')

where:
P_sa(s'): The transition probability of moving from state s to state s' when taking action a.
s': The next state following the state s.
In other words, P_sa(s') answers the question:
“If I'm in state s and I choose to take action a, what is the probability that I will end up in state s'?”
Leveraging this concept, V∗(s') in the equation represents the expected long-term return (e.g., total accumulated reward) that can be achieved starting from state s' and then following an optimal policy from that point onwards.
◼ Fundamental Iteration Approaches for Solving MDPs
There are two efficient algorithms for solving the optimization problem of DRL: 1) Policy Iteration and 2) Value Iteration.
▫ 1) Policy Iteration
Policy iteration is a method that directly optimizes the policy to maximize the expected discounted rewards. The algorithm first:
Initializes a policy π randomly, and then
Repeats until convergence, alternating policy evaluation (computing V(s) under the current policy) with greedy policy improvement:

π(s) := argmax_a Σ_s' P_sa(s') V(s')
where:
π(s): The policy evaluated at state s (i.e., the action chosen in state s).
P_sa(s'): The transition probability.
V(s'): The value function (or state-value function) of state s′.
The initial reward R(s) and discount factor γ in the formula (2) are omitted from the objective function formula (3) because both are constant with respect to an action a.
Key characteristics of Policy Iteration include:
Explicitly maintains a policy: It directly works with and updates the policy.
Guaranteed to converge: Each policy improvement step guarantees a strictly better or equal policy, and since there's a finite number of policies, it will converge to the optimal one.
Policy evaluation can be computationally expensive: Solving the system of linear equations for value evaluation can be slow, especially for large state spaces. However, it often converges in fewer iterations of the overall policy iteration loop than the value iteration.
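As an illustrative sketch (not the article's code), here is tabular Policy Iteration on a hypothetical two-state, two-action MDP; the transition probabilities P_sa(s') and rewards R(s) are invented for the example:

```python
import numpy as np

# Hypothetical MDP: P[a][s][s'] = transition probability, R[s] = per-state reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.6, 0.4]]])  # action 1
R = np.array([0.0, 1.0])
gamma = 0.9

policy = np.zeros(2, dtype=int)  # start from an arbitrary policy
while True:
    # policy evaluation: solve the linear system V = R + gamma * P_pi @ V
    P_pi = np.array([P[policy[s], s] for s in range(2)])
    V = np.linalg.solve(np.eye(2) - gamma * P_pi, R)
    # policy improvement: act greedily w.r.t. the evaluated value function
    new_policy = np.argmax(P[:, range(2), :] @ V, axis=0)
    if np.array_equal(new_policy, policy):
        break  # policy is stable -> optimal
    policy = new_policy

print(policy, V)
```

With these numbers, the loop settles on steering toward the rewarding state 1 (action 1 in state 0, action 0 in state 1); note that, as stated above, the improvement step only compares Σ P_sa(s')V(s') since R(s) and γ are constant with respect to a.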
▫ 2) Value Iteration
Value iteration, on the other hand, optimizes the value function first and then computes the optimal policy. Until convergence, the algorithm leverages the formula (2) to find the optimal value function:

V(s) := R(s) + γ max_a Σ_s' P_sa(s') V(s')

This update is applied repeatedly for all states until the value function converges.
Once V∗(s) is obtained, the optimal policy π∗(s) is derived by choosing the action a that maximizes the expected return for each state s:

π∗(s) = argmax_a Σ_s' P_sa(s') V∗(s')
Key characteristics of Value Iteration include:
Directly computes optimal value function: It focuses on finding V∗(s) first, and then the policy is a byproduct.
Combines evaluation and improvement: Each iteration of Value Iteration implicitly performs a one-step policy evaluation and improvement simultaneously by taking the max over actions a.
Simpler to implement: The single update rule is generally easier to code.
May require more iterations: It often takes more iterations to converge to the optimal value function compared to Policy Iteration, but each iteration is computationally cheaper since it doesn't involve solving a system of linear equations.
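For comparison, here is a tabular Value Iteration sketch on a hypothetical two-state, two-action MDP (the transition probabilities and rewards are invented for illustration):

```python
import numpy as np

# Hypothetical MDP: P[a][s][s'] = transition probability, R[s] = per-state reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.6, 0.4]]])  # action 1
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(2)
while True:
    # single update rule: V(s) <- R(s) + gamma * max_a sum_s' P_sa(s') V(s')
    V_new = R + gamma * np.max(P[:, range(2), :] @ V, axis=0)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break  # value function converged
    V = V_new

# the optimal policy is a byproduct: act greedily w.r.t. V*
policy = np.argmax(P[:, range(2), :] @ V, axis=0)
print(policy, V)
```

Each sweep is a cheap vectorized update with no linear system to solve, though many sweeps are needed before the values stop changing, matching the trade-off described above.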
▫ Comparison of Policy Iteration and Value Iteration
The table below shows a basic comparison of Policy Iteration and Value Iteration:

Figure C. Comparison of Policy Iteration and Value Iteration (Created by Kuriko IWAI)
The choice between Policy Iteration and Value Iteration often depends on the complexity of the MDP.
For small, tabular MDPs, Policy Iteration can feel more intuitive because it directly manages and improves the policy.
However, when dealing with larger tabular MDPs, Value Iteration often becomes the preferred method. Its simpler per-iteration cost makes it more efficient in these expanded scenarios.
In DRL, where state and action spaces are typically continuous or extremely high-dimensional, classical Policy Iteration or Value Iteration are not applied in their pure, tabular forms.
Instead, DRL adapts their fundamental principles of iterative improvement for values or policies by integrating deep neural networks into the system. As we discussed in the previous section, these networks enable the algorithms to handle the immense complexity of real-world problems.
The Learning Process: How DRL Agents Interact and Learn
With the optimization problem and core mathematical approaches defined, the DRL agent (a deep neural network) implements an algorithm to solve it, interacting with the environment in a DRL system.
As we saw in the previous section, this interaction forms a continuous learning loop until the convergence.
The DRL agent's primary function is to learn an optimal policy—a strategy that dictates the best action to take in any given state to maximize cumulative reward.
This learning process is iterative and experience-driven. Specifically, at each time step, the agent:
◼ Step 1. Observe the state of environment
The agent receives a representation of the current situation (the state) from the environment, as a numerical vector or a high-dimensional input like an image, and feeds it to the deep neural network.
◼ Step 2. Select an action
Utilizing its deep neural network, the agent processes the observed state and, based on its current policy (which is encoded within the network's weights), chooses an action.
Exploration-exploitation strategies are employed to balance trying new actions (exploration) with leveraging known good actions (exploitation).
◼ Step 3. Receive a reward or penalty
After executing an action, the environment provides feedback in the form of a scalar reward signal.
A positive reward indicates a desirable outcome, while a negative reward (penalty) indicates an undesirable one.
This reward signal is the primary driver of the learning process, informing the agent whether its chosen action moved it closer to or further from the desired objective.
◼ Step 4. Update the policy
The received reward, the action taken, and the transition from the previous state to the new state together form an experience tuple.
This experience is used to update the parameters (weights and biases) of the deep neural network.
◼ Tackling Exploration-Exploitation Dilemma
Exploration-Exploitation dilemma refers to the fundamental challenge an autonomous agent faces in deciding whether to:
Exploit its current knowledge to choose actions that are believed to yield the highest immediate reward.
Explore the environment by trying new, potentially suboptimal actions to gather more information, which could lead to discovering better long-term strategies and higher cumulative rewards.
This dilemma is crucial because an agent needs to learn about its environment (exploration) to find optimal policies, but also needs to leverage what it has learned (exploitation) to maximize its performance.
An overemphasis on exploitation might lead the agent to get stuck in a sub-optimal local optimum, while excessive exploration can lead to inefficient learning and poor performance by trying too many ineffective actions.
To tackle this challenge, here are three major exploration-exploitation strategies used in DRL, along with their mathematical formulations:
◼ 1. Epsilon-Greedy Strategy
The epsilon-greedy (ϵ-greedy) strategy is one of the simplest and most widely used methods.
It balances exploration and exploitation by selecting a random action with a small probability ϵ (exploration) and choosing the action with the highest estimated Q-value (greedy action) with probability 1−ϵ (exploitation).
Mathematical Formulation
Let Q(s, a) be the estimated value (e.g., Q-value) of taking action a in state s.
This value represents the expected long-term reward if the agent takes action a in state s and then follows an optimal policy afterward.
The action selection rule for ϵ-greedy is to choose the action a such that:

P(a|s) = 1 − ϵ + ϵ / |A(s)|   if a = argmax_a' Q(s, a')
P(a|s) = ϵ / |A(s)|           otherwise
Where:
P(a|s) : The probability of choosing action a given state s.
ϵ: The exploration probability (a value between 0 and 1).
A(s): The set of all possible actions in state s.
|A(s)|: The number of all possible actions in state s.
Exploitation Part (1 - ϵ)
a = argmax_a' Q(s, a') denotes the action that has the maximum estimated Q-value in state s.
In such case, with the probability 1−ϵ, the agent chooses the "greedy" action.
The greedy action is the one that currently has the highest estimated Q-value for the current state s.
This is the exploitation part, where the agent leverages its current knowledge to maximize immediate reward.
Exploration Part (ϵ)
With probability ϵ, the agent chooses an action randomly from all available actions in state s. This includes the greedy action itself.
Since there are |A(s)| possible actions in the state s, each action (including the greedy one) has a probability of 1 / |A(s)| of being chosen if a random action is selected.
So, the probability of any non-greedy action being chosen is simply ϵ / |A(s)|, which is defined in the second line of the formula.
Decaying ϵ
A common practice in the ϵ-greedy strategy is to use a decaying ϵ, where ϵ starts high to encourage exploration early in training and gradually decreases over time to favor exploitation as the agent learns more about the optimal policy. For example,

ϵ_t = ϵ_0 · e^(−λt)

where λ is a decay rate.
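A minimal sketch of ϵ-greedy selection with an exponentially decaying ϵ; the Q-values and schedule parameters below are illustrative assumptions:

```python
import math
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick uniformly at random, otherwise pick the greedy action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))          # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation

epsilon_start, decay = 1.0, 0.01   # hypothetical schedule parameters
q_values = [0.2, 0.5, 0.1]         # illustrative Q(s, a) estimates for one state

for t in [0, 100, 500]:
    epsilon_t = epsilon_start * math.exp(-decay * t)  # eps_t = eps_0 * e^(-lambda * t)
    action = epsilon_greedy(q_values, epsilon_t)
    print(f"t={t} epsilon={epsilon_t:.3f} action={action}")
```

Early on (ϵ ≈ 1) the choice is essentially random; by t = 500 (ϵ ≈ 0.007) the agent almost always picks the greedy action, index 1.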
Applicable DRL Algorithms:
DQN (Deep Q-Network)
Double DQN (DDQN)
C51
QR-DQN
HER (Hindsight Experience Replay) (often combined with algorithms like DQN)
◼ 2. Upper Confidence Bound (UCB)
UCB is an optimistic exploration strategy that prioritizes actions based on both their estimated value and the uncertainty around that estimate.
It operates on the principle of optimism in the face of uncertainty, meaning it prefers actions that have been less explored but have the potential for high rewards.
Mathematical Formulation
For each action a in a given state s, the UCB value is calculated as:

UCB(s, a) = Q(s, a) + c · √( ln N_t / N(s, a) )
Where:
Q(s, a): The current estimated value (e.g., Q-value) of taking action a in state s.
N_t: The total number of times any action has been taken up to time t (or the number of times state s has been visited, or a similar global count).
N(s, a): The number of times action a has been taken in state s.
c>0: An exploration parameter that controls the degree of exploration. A higher c encourages more exploration.
The second term acts as an exploration bonus.
Actions that have been taken fewer times (small N(s, a)) will have a larger bonus, encouraging exploration of less-visited state-action pairs.
And as an action is explored more, N(s, a) increases, and its exploration bonus decreases.
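A short sketch of the UCB computation; the Q-estimates, visit counts, and c = 2.0 are invented for illustration:

```python
import math

def ucb_value(q, n_total, n_action, c=2.0):
    """Q(s,a) + c * sqrt(ln(N_t) / N(s,a)); untried actions get an infinite bonus."""
    if n_action == 0:
        return float("inf")
    return q + c * math.sqrt(math.log(n_total) / n_action)

# Illustrative values: action B has a lower estimate but was tried far less often.
q_est = {"A": 0.6, "B": 0.4}
counts = {"A": 90, "B": 10}
n_total = sum(counts.values())

scores = {a: ucb_value(q_est[a], n_total, counts[a]) for a in q_est}
best = max(scores, key=scores.get)
print(scores, best)
```

Here the rarely tried action B wins despite its lower estimate, showing how the bonus term implements optimism in the face of uncertainty.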
Applicable DRL Algorithms:
- AlphaZero: Heavily relies on Monte Carlo Tree Search (MCTS), which uses a variant of UCB to guide its node selection, balancing exploration of less-visited nodes with exploitation of promising ones.
◼ 3. Thompson Sampling
Thompson Sampling is a Bayesian approach to the exploration-exploitation dilemma.
Instead of directly calculating an exploration bonus, it maintains a probability distribution over the value of each action.
At each step, it samples a value for each action from its respective posterior distribution and then chooses the action with the highest sampled value.
Mathematical Formulation:
Let us consider a case where a prior distribution over the true mean reward of each action is Bernoulli.
For each action a, define the parameters of a uniform prior:

α_a = 1, β_a = 1

where α_a and β_a count the number of observed successes and failures for action a, respectively.
At each time step t:
1. Sample from Posterior: For each action a, draw a sample θ_a from its posterior Beta distribution:

θ_a ∼ Beta(α_a, β_a)
The sampled θ_a represents a plausible true reward probability for action a.
2. Select Action: Choose the optimal action a∗ that maximizes the sampled value:

a∗ = argmax_a θ_a
3. Execute Action and Observe Reward: Execute the optimal action in the environment and observe the reward R ∈ {0, 1}.
4. Update Parameters: Update the parameters for the chosen action based on the observed reward:

α_a∗ ← α_a∗ + R,  β_a∗ ← β_a∗ + (1 − R)
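The four steps can be sketched on a toy two-armed Bernoulli bandit; the true success probabilities below are invented and hidden from the agent:

```python
import random

random.seed(0)
true_probs = [0.3, 0.7]  # hypothetical true reward probabilities (unknown to the agent)
alpha = [1, 1]           # Beta posterior: success counts (uniform prior)
beta = [1, 1]            # Beta posterior: failure counts (uniform prior)

for t in range(500):
    # 1. sample theta_a ~ Beta(alpha_a, beta_a) for each action
    samples = [random.betavariate(alpha[a], beta[a]) for a in range(2)]
    # 2. select the action with the highest sampled value
    a_star = max(range(2), key=lambda a: samples[a])
    # 3. execute and observe reward R in {0, 1}
    reward = 1 if random.random() < true_probs[a_star] else 0
    # 4. update the posterior of the chosen action
    if reward == 1:
        alpha[a_star] += 1
    else:
        beta[a_star] += 1

print(alpha, beta)  # the better arm accumulates most of the pulls
```

As the posteriors sharpen, samples for the better arm dominate, so exploration fades naturally without any explicit ϵ schedule.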
Applicable DRL Algorithms:
SAC (Soft Actor-Critic): Aligns conceptually with the probabilistic sampling nature of Thompson Sampling.
World Models: Learning an accurate model of the environment often involves quantifying the uncertainty in the model's predictions.
This continuous cycle of observation, action, reward, and policy update allows the DRL agent to progressively learn complex behaviors and solve the optimization problem in dynamic environments without explicit programming for every possible scenario.
In the next section, I’ll detail the major categories of DRL algorithms.
Major Categories of Deep Reinforcement Learning Algorithms
D/RL algorithms are the specific sets of rules and computational methods that agents use to learn how to make optimal decisions in an environment.
These algorithms define:
How the agent perceives its environment: How it takes in observations or states.
How it selects actions: Based on its current understanding and goals.
How it updates its knowledge: Using the rewards (or penalties) it receives from the environment to improve its future decision-making.
Different algorithms have different strengths and weaknesses, making them suitable for various types of problems and environments.
The diagram below shows the categories of these algorithms. At the highest level, RL algorithms fall into two main paradigms: Model-Free RL and Model-Based RL. These are then further categorized by how data is acquired: Online RL (policy-based), Off-Policy RL (value-based), and Offline RL.

Figure D. Categorizing D/RL algorithms (Created by Kuriko IWAI)
◼ Model-Free RL
Model-Free RL algorithms learn optimal policies or value functions directly from interactions with the environment, without explicitly building a model of the environment's dynamics.
The category is further split into:
▫ Online RL (Policy Optimization)
These methods directly optimize a parameterized policy to maximize expected returns.
Examples include Policy Gradient, A2C/A3C, PPO (Proximal Policy Optimization), and TRPO (Trust Region Policy Optimization).
Best when:
Continuous or High-Dimensional Action Spaces: Essential when the agent’s actions are not discrete choices (like pressing a button) but rather continuous values (like steering angles).
Learning Stochastic (Random) Policies: When the agent needs to take probabilistic behavior (e.g., sometimes turn left, sometimes turn right, with certain probabilities), rather than always performing the same action deterministically, this method’s inherent randomness is beneficial.
Stable Training: Compared to some pure value-based methods, these methods tend to be more stable during training because they directly optimize the policy, incorporating mechanisms to constrain policy updates.
▫ Off-Policy RL (Q-Learning)
These algorithms learn an action-value function (Q-function) that estimates the expected return of taking a certain action in a given state.
Prominent examples include DQN (Deep Q-Network), C51, QR-DQN (Quantile Regression DQN), and HER (Hindsight Experience Replay).
Best when:
Discrete State and Action Spaces: Works well when the number of states and actions is relatively small and can be easily enumerated.
Model-Free Learning: No mathematical model of the environment's dynamics or reward structure is available at hand.
Off-Policy Learning: Learns the optimal policy efficiently by leveraging all experiences, even those from exploratory actions.
▫ Offline RL
This approach utilizes a fixed, pre-collected dataset of transitions (states, actions, rewards, next states) to learn a policy.
Crucially, the agent does not interact with the environment during the learning process. It learns entirely from data that was generated by a different (often unknown) policy.
Best when:
Interacting with the real environment is expensive, dangerous, or impractical. This is often the case in robotics, healthcare, or autonomous driving, where trial-and-error in the real world could lead to damage or harm.
Large datasets of past interactions are readily available. For instance, logs from previous system operations, user interactions, or simulations.
Safety and stability during deployment are paramount. Since the policy is learned from static data, there's less risk of exploring dangerous or suboptimal actions during training in a live environment.
Reproducibility is important. The fixed dataset ensures that experiments can be replicated precisely.
◼ Model-Based RL
Model-Based RL algorithms, on the other hand, attempt to learn or are given a model of the environment's dynamics (a transition function P(s′ | s, a) and a reward function R(s, a, s′)). This model is then used for planning or to improve policy learning.
This category is subdivided into:
▫ Given Environment Dynamics
In this scenario, the environment is already known or provided, allowing the agent to use it for planning and decision-making.
Best when: A highly accurate model of the environment is available or can be precisely defined, such as games with fixed rules like chess or Go.
This enables robust planning and can lead to very efficient learning and strong performance.
AlphaZero is a notable example, which leverages a planning algorithm (Monte Carlo Tree Search) with a learned value and policy network.
▫ Unknown Environment Dynamics
This approach focuses on trying to learn an unknown model of the environment's dynamics.
Best when: The environment dynamics are unknown but can be learned from data.
The approach is highly sample-efficient as it can generate synthetic experiences for training, reducing the need for costly real-world interactions. This makes it useful for safety-critical applications where simulating risks is crucial, such as designing and testing aircraft engines.
Algorithm examples include World Models, I2A (Imagination-Augmented Agents), MBMF (Model-Based Model-Free), and MBVE (Model-Based Value Expansion).
Simulation
I’ll demonstrate the performance of a foundational DRL algorithm, DQN, against a traditional rule-based controller for comparison, using the CartPole-v1 environment from Gymnasium for its simplicity.
◼ Setting Up the Environment
First, set up the environment.
import gymnasium as gym

env_name = "CartPole-v1"
env = gym.make(env_name)

state_size = env.observation_space.shape[0]
action_size = env.action_space.n
◼ Initializing the DQN Agent
Initialize the DQNAgent class with the state size, action size, discount factor for rewards, batch size, and loss function (MSE):
import torch.nn as nn

class DQNAgent:
    def __init__(
        self,
        state_size,
        action_size,
        gamma=0.99,
        batch_size=64
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma  # discount factor for future rewards
        self.batch_size = batch_size  # training mini-batch size
        self.criterion = nn.MSELoss()  # loss function (MSE)
◼ Selecting an Action
This step involves implementing the Q-network's forward pass and the epsilon-greedy strategy.
First, define the Policy class as a neural network. Then, within the DQNAgent class, add the select_action function to choose an action with the maximum Q-value for exploitation, or a random action for exploration.
import torch
import torch.nn as nn
import random

# define policy network
class Policy(nn.Module):
    def __init__(self, state_size, action_size):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(state_size, 128)  # first fully connected layer
        self.relu = nn.ReLU()  # ReLU activation function
        self.fc2 = nn.Linear(128, 128)  # second fully connected layer
        self.fc3 = nn.Linear(128, action_size)  # output layer: q-values for each action

    def forward(self, x):  # forward pass (must be named `forward` so nn.Module can invoke it)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)


# update dqn agent
class DQNAgent:
    def __init__(
        self,
        state_size,
        action_size,
        gamma=0.99,
        batch_size=64,
        epsilon_start=1.0,
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma  # discount factor for future rewards
        self.batch_size = batch_size  # training mini-batch size
        self.criterion = nn.MSELoss()

        # adding
        self.epsilon = epsilon_start  # initial exploration rate
        self.policy_net = Policy(state_size, action_size)  # adding policy net

    def select_action(self, state):  # epsilon-greedy to select an action
        if random.random() < self.epsilon:  # choose a random action (exploration) with prob epsilon
            return random.randrange(self.action_size)
        else:  # choose an action with max. q-value (exploitation)
            state = torch.FloatTensor(state).unsqueeze(0)  # convert state to pytorch tensor
            with torch.no_grad():
                q_values = self.policy_net(state)
            return torch.argmax(q_values).item()
◼ Taking an action based on the state
agent = DQNAgent(state_size, action_size)

# state from the environment
state, _ = env.reset()

# select an action based on the state
action = agent.select_action(state)

# receive next state, reward
next_state, reward, done, truncated, _ = env.step(action)
◼ Adding memory
Although this is optional, I added memory to the DQN agent to store experiences (state, action, reward, next_state, done) and allow sampling mini-batches randomly.
This randomness breaks correlations in the training data and stabilizes the learning process.
import torch
import torch.nn as nn
import random
from collections import deque

# defines memory
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        experience = (state, action, reward, next_state, done)
        self.buffer.append(experience)  # add an experience tuple to the buffer

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)  # randomly sample a batch of experiences from the buffer

    def __len__(self):
        return len(self.buffer)


class DQNAgent:
    def __init__(
        self,
        state_size,
        action_size,
        gamma=0.99,
        batch_size=64,
        epsilon_start=1.0,
        replay_buffer_capacity=10000,
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.batch_size = batch_size
        self.criterion = nn.MSELoss()
        self.epsilon = epsilon_start
        self.policy_net = Policy(state_size, action_size)

        # added
        self.memory = ReplayBuffer(replay_buffer_capacity)

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        else:
            state = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                q_values = self.policy_net(state)
            return torch.argmax(q_values).item()
◼ Learning from the environment
Lastly, define the learn function and relevant variables in the DQNAgent class.
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import random


class DQNAgent:
    def __init__(
        self,
        state_size,
        action_size,
        gamma=0.99,
        batch_size=64,
        epsilon_start=1.0,
        replay_buffer_capacity=10000,
        learning_rate=0.001,
        epsilon_end=0.01,
        epsilon_decay=0.995,
    ):
        self.state_size = state_size
        self.action_size = action_size
        self.gamma = gamma
        self.batch_size = batch_size
        self.criterion = nn.MSELoss()
        self.epsilon = epsilon_start
        self.policy_net = Policy(state_size, action_size)
        self.memory = ReplayBuffer(replay_buffer_capacity)

        # adding target net for learning (updating q vals),
        # initialized with the policy net's weights
        self.target_net = Policy(state_size, action_size)
        self.target_net.load_state_dict(self.policy_net.state_dict())

        # adding optimizer to update the policy
        self.optimizer = optim.Adam(self.policy_net.parameters(), lr=learning_rate)

        # adding epsilon decay
        self.epsilon_end = epsilon_end  # minimum exploration rate
        self.epsilon_decay = epsilon_decay  # epsilon decay rate

    def select_action(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.action_size)
        else:
            state = torch.FloatTensor(state).unsqueeze(0)
            with torch.no_grad():
                q_values = self.policy_net(state)
            return torch.argmax(q_values).item()

    def learn(self):
        if len(self.memory) < self.batch_size:
            return

        # sample a batch of experiences
        experiences = self.memory.sample(self.batch_size)
        states, actions, rewards, next_states, dones = zip(*experiences)

        # convert to PyTorch tensors
        states = torch.FloatTensor(np.array(states))
        actions = torch.LongTensor(actions).unsqueeze(1)
        rewards = torch.FloatTensor(rewards).unsqueeze(1)
        next_states = torch.FloatTensor(np.array(next_states))
        dones = torch.FloatTensor(dones).unsqueeze(1)

        # compute q-values for the current states
        current_q_values = self.policy_net(states).gather(1, actions)

        # compute max q-value for the next states using the target network
        next_q_values = self.target_net(next_states).detach().max(1)[0].unsqueeze(1)

        # compute the target q-values: R + gamma * max(Q(s', a'));
        # for terminal transitions (done = 1), the target reduces to the reward alone
        target_q_values = rewards + self.gamma * next_q_values * (1 - dones)

        # compute loss between current q-values and target q-values
        loss = self.criterion(current_q_values, target_q_values)

        # optimize the policy network, updating weights
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # decay epsilon (exploration rate)
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
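The heart of `learn` is the temporal-difference target. A worked example with hypothetical numbers (the reward, discount, and next-state Q-value are chosen purely for illustration) shows how the `(1 - done)` factor switches between bootstrapping and a terminal update:

```python
gamma = 0.99        # discount factor, as in the agent's default
reward = 1.0        # CartPole grants +1 per surviving step
max_next_q = 50.0   # hypothetical max Q-value of the next state

# non-terminal transition (done = 0): bootstrap from the next state
target = reward + gamma * max_next_q * (1 - 0)
print(target)  # → 50.5

# terminal transition (done = 1): no future reward, the target is the reward alone
terminal_target = reward + gamma * max_next_q * (1 - 1)
print(terminal_target)  # → 1.0
```

Multiplying by `(1 - done)` applies this switch element-wise across the whole batch, which is why no branching is needed inside `learn`.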
◼ Training DRL
To combine all the components, I first initialize the environment and the DQN agent, run the training episodes, and finally close the environment.
# train the drl
import gymnasium as gym

# start an env
env_name = "CartPole-v1"
env = gym.make(env_name)
state_size = env.observation_space.shape[0]
action_size = env.action_space.n

# initiate dqn agent
agent = DQNAgent(state_size, action_size)

# training episodes
scores = []
num_episodes = 500
update_target_every = 10
for episode in range(1, num_episodes + 1):
    state, info = env.reset()
    score = 0
    done = False
    truncated = False

    while not done and not truncated:
        # select an action
        action = agent.select_action(state)

        # execute the action
        next_state, reward, done, truncated, info = env.step(action)

        # store the experience in the memory
        agent.memory.push(state, action, reward, next_state, done)

        # update state, rewards
        state = next_state
        score += reward

        # learn from a batch of experiences
        agent.learn()

    scores.append(score)

    # periodically sync the target network with the policy network
    if episode % update_target_every == 0:
        agent.target_net.load_state_dict(agent.policy_net.state_dict())

# close the env
env.close()
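CartPole is typically declared "solved" when the average score over a trailing window of episodes clears a threshold; the results below use 195. A small helper, written here as an illustrative sketch rather than part of the training code, makes that check explicit on the collected `scores` list:

```python
def is_solved(scores, threshold=195.0, window=100):
    # solved when the mean score over the last `window` episodes
    # reaches the threshold
    if len(scores) < window:
        return False
    return sum(scores[-window:]) / window >= threshold

print(is_solved([200.0] * 100))  # → True
print(is_solved([100.0] * 100))  # → False
print(is_solved([500.0] * 50))   # → False (not enough episodes yet)
```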
◼ Rule-Based Controller
For the rule-based controller, set up a simulation where state[2] and state[3] indicate the pole angle and angular velocity, respectively. If the pole is leaning right, push right; if it is leaning left, push left; otherwise, use the angular velocity to decide, for better performance.
def rule_based_action(state):
    if state[2] > 0.01:
        return 1  # push right
    elif state[2] < -0.01:
        return 0  # push left
    else:
        if state[3] > 0:
            return 1
        else:
            return 0
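As a quick sanity check, the controller can be run on a few hand-picked states. The function is repeated here so the snippet stands alone; the state layout follows CartPole's observation vector (cart position, cart velocity, pole angle, pole angular velocity):

```python
def rule_based_action(state):
    if state[2] > 0.01:
        return 1  # pole leaning right -> push right
    elif state[2] < -0.01:
        return 0  # pole leaning left -> push left
    else:
        # near upright: push in the direction the pole is falling
        return 1 if state[3] > 0 else 0

print(rule_based_action([0.0, 0.0, 0.05, 0.0]))   # → 1 (leaning right)
print(rule_based_action([0.0, 0.0, -0.05, 0.0]))  # → 0 (leaning left)
print(rule_based_action([0.0, 0.0, 0.0, 0.3]))    # → 1 (upright, falling right)
```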
Then, similarly to the DQN agent, set up the same environment and run the rule-based controller.
import gymnasium as gym

env = gym.make(env_name)
scores = []
num_episodes = 500

for episode in range(1, num_episodes + 1):
    state, info = env.reset()
    score = 0
    done = False
    truncated = False

    while not done and not truncated:
        action = rule_based_action(state)  # agent chooses action based on simple rules
        state, reward, done, truncated, info = env.step(action)
        score += reward

    scores.append(score)

env.close()
◼ Results
The DQN agent achieves higher and more consistent performance, successfully solving the CartPole environment. This indicates that it has learned a more robust and generalized strategy for balancing the pole than the fixed rules.

Figure E. Comparison of the cumulative rewards of the DQN agent and rule-based controller (Created by Kuriko IWAI)
Rule-Based Controller (Red line):
Initial Performance: The rule-based controller starts with a relatively high score (around 100), indicating it has some inherent understanding of how to balance the pole.
Lack of Learning: Throughout the 500 episodes, its performance fluctuates but generally stays within a certain range, mostly between 100 and 200, hence it does not show an upward trend over time. This is characteristic of a rule-based system; its logic is fixed and does not adapt or learn from experience. It performs as well as its hand-coded rules allow, but no better.
Below the Threshold: Critically, the rule-based controller fails to consistently cross and maintain the "CartPole Solved Threshold" of 195. While it touches or slightly exceeds it at times, it doesn't demonstrate a sustained ability to keep the pole balanced for the required duration.
DQN Agent (Blue Line):
Initial Performance (Exploration Phase): The DQN agent starts with very low scores (near 0) in the initial episodes without any prior knowledge. It must explore the environment to understand its dynamics and reward structure. This initial phase is where exploration (e.g., via ϵ-greedy) is most active.
Learning and Improvement: Around episode 50, the DQN agent's performance rapidly increases, demonstrating the learning capabilities of the deep reinforcement learning algorithm. It quickly surpasses the rule-based controller and begins to regularly cross the solved threshold.
Solving the Environment: The most significant observation is that the DQN agent consistently and significantly surpasses the "CartPole Solved Threshold" (195) for extended periods. For example, it reaches scores well over 300 multiple times, indicating it has learned a highly effective policy for balancing the pole.
Fluctuations and Generalization: While there are still fluctuations (e.g., dips around episodes 200-250), the overall trend is a much higher average performance than the rule-based method, and it consistently recovers to high scores. These fluctuations can be due to various factors like exploration, changes in experience replay buffer, or network updates.
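How long the exploration phase lasts is set by the decay schedule (`epsilon_start=1.0`, `epsilon_decay=0.995`, `epsilon_end=0.01` in the agent above). Since epsilon is multiplied by the decay factor on every `learn()` call, a short sketch shows it reaches its floor after roughly 900 updates:

```python
epsilon, epsilon_end, epsilon_decay = 1.0, 0.01, 0.995

steps = 0
while epsilon > epsilon_end:
    # same decay rule as in DQNAgent.learn()
    epsilon = max(epsilon_end, epsilon * epsilon_decay)
    steps += 1

print(steps)  # → 919 decay steps until epsilon hits its minimum
```

Because `learn()` runs once per environment step, the agent is still exploring heavily during its first few dozen episodes, which matches the slow start seen in the plot.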
Overall Summary:
Although the DQN agent has an initial learning curve, its ability to learn and adapt from experience ultimately allows it to outperform and consistently solve the CartPole problem.
Wrapping Up
In this experiment, we ran a simulation of the DQN agent and the rule-based controller in the CartPole-v1 environment.
By comparing metrics like the average number of steps the pole remains balanced (i.e., the score) over many episodes, we saw the DRL agent's superior performance, owing to its ability to learn more nuanced and adaptive control policies than the fixed logic of the traditional method allows.
Overall, DRL represents a powerful paradigm for building intelligent systems that can learn complex behaviors from interaction.
By combining the goal-driven learning of reinforcement learning with the function approximation capabilities of deep learning, DRL has provided solutions across many domains.
Related Books for Further Understanding
These books cover a wide range of theory and practice, from fundamentals to the PhD level.

Linear Algebra Done Right

Foundations of Machine Learning, second edition (Adaptive Computation and Machine Learning series)

Designing Machine Learning Systems: An Iterative Process for Production-Ready Applications

Machine Learning Design Patterns: Solutions to Common Challenges in Data Preparation, Model Building, and MLOps
Share What You Learned
Kuriko IWAI, "Deep Reinforcement Learning for Self-Evolving AI" in Kernel Labs
https://kuriko-iwai.com/deep-reinforcement-learning
Looking for Solutions?
- Deploying ML Systems 👉 Book a briefing session
- Hiring an ML Engineer 👉 Drop an email
- Learn by Doing 👉 Enroll in the AI Engineering Masterclass
Written by Kuriko IWAI. All images, unless otherwise noted, are by the author. All experimentations on this blog utilize synthetic or licensed data.