Reinforcement Learning

Reinforcement learning is a type of machine learning in which an agent learns how to behave in an environment by taking actions and observing the rewards it receives. The agent’s goal is to learn a policy that maximizes the expected cumulative reward.

State

The state $s$ is a representation of the environment at a given time.

Action

The action $a$ is a decision made by the agent to transition from one state to another.

Reward

The reward $R$ is a scalar value that the agent receives after taking an action in a state.

Return

The return $G$ is the sum of rewards $R_n$ an agent receives after taking a sequence of actions. The return is often discounted to give more importance to immediate rewards; the discount is comparable to the time value of money in finance (an interest rate). The discount factor $\gamma$ is a value between 0 and 1.

$$ G = R_{1} + \gamma R_{2} + \gamma^2 R_{3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{1+k} $$
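
A quick sketch of this definition in code, using made-up rewards and $\gamma = 0.9$:

```python
# Discounted return G for a finite reward sequence (illustrative numbers only).
def discounted_return(rewards, gamma=0.9):
    # G = R_1 + gamma * R_2 + gamma^2 * R_3 + ...
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards 0, 0, 0, 100 with gamma = 0.9:
# G = 0 + 0.9*0 + 0.81*0 + 0.729*100 ~ 72.9
print(discounted_return([0, 0, 0, 100]))
```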

Policy

The policy $\pi$ is a function that maps states to actions.

State Action Value Function

The state-action value function $Q(s, a)$ is the expected return an agent receives after taking an action $a$ in a state $s$ and following the optimal policy afterwards.

The best possible return from state $s$ is therefore $\max_{a} Q(s, a)$. The best possible action to take in state $s$ is the action $a$ that maximizes $Q(s, a)$.
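
As a small illustration with a hypothetical tabular $Q$ (the values are arbitrary), the best action is simply an argmax over the row for state $s$:

```python
import numpy as np

# Hypothetical tabular Q: rows are states, columns are actions (arbitrary values).
Q = np.array([
    [1.0, 2.5, 0.3],   # Q(s=0, a) for a in {0, 1, 2}
    [0.1, 0.0, 4.2],   # Q(s=1, a)
])

s = 1
best_return = Q[s].max()        # max_a Q(s, a)
best_action = Q[s].argmax()     # argmax_a Q(s, a)
print(best_action, best_return)  # 2 4.2
```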

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. The Markov property means that the next state and reward depend only on the current state and action, not on the full history.

Bellman Equation

Notation:

  • $s$: Current state
  • $R(s)$: Reward of current state
  • $a$: Current action
  • $s'$: Next state after taking action $a$
  • $a’$: Action to take in state $s'$

The Bellman equation is a recursive equation that decomposes the value of a state-action pair into the immediate reward $R(s)$ and the discounted value of the best next state-action pair $Q(s', a')$.

$$ Q(s, a) = R(s) + \gamma \max_{a'} Q(s', a') $$
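
A one-step numeric sketch of this backup, with made-up values for $R(s)$, $\gamma$, and $Q(s', \cdot)$:

```python
# Bellman backup for a single transition (all values are illustrative).
gamma = 0.9
R_s = 2.0                 # immediate reward R(s)
Q_next = [1.0, 3.0]       # hypothetical Q(s', a') for each action a'

target = R_s + gamma * max(Q_next)   # Q(s, a) = R(s) + gamma * max_a' Q(s', a')
print(target)                        # 2.0 + 0.9 * 3.0 ~ 4.7
```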

Deep Q-Network

Deep Q-Network (DQN) is a reinforcement learning algorithm that combines Q-learning with deep neural networks to approximate the state-action value function $Q(s, a)$. It uses experience replay and target networks to stabilize the training process.

Deep reinforcement learning uses a neural network to approximate the state-action value function $Q(s, a)$. Given a state-action pair $(s, a)$, the network outputs the expected return $Q(s, a)$. With this formulation, the network would need one forward pass per possible action to determine the best action in a given state. This can be improved by having the network take only the state as input and output the values of all possible actions at once, then picking the action with the highest value.
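
A minimal PyTorch sketch of this improved architecture; the state dimension, number of actions, and layer sizes are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

state_dim, num_actions = 8, 4   # assumed sizes for illustration

# The network takes only the state as input and outputs one Q-value per action,
# so a single forward pass scores every possible action.
q_network = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),
)

state = torch.randn(1, state_dim)      # dummy state
q_values = q_network(state)            # shape (1, num_actions): Q(s, a) for all a
best_action = q_values.argmax(dim=1)   # greedy action
```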

Training Set

The training set can be generated using the Bellman equation. The neural network's input $x$ is the tuple $(s, a)$ (state and action), and the target output $y$ is the expected return. The target can be calculated with the Bellman equation, because its right-hand side only requires the stored experience and the current estimate of $Q$.

$$ Q(s, a) = R(s) + \gamma \, E\left[ \max_{a'} Q(s', a') \right] $$

The algorithm will get better at choosing the best action over time, because it will learn the expected return for each state-action pair.
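
A sketch of how the targets could be built for a batch of experiences, reusing the `q_network` from the sketch above (the batch tensors are assumed to come from the replay buffer described next):

```python
import torch

def build_targets(rewards, next_states, q_network, gamma=0.99):
    """y = R(s) + gamma * max_a' Q(s', a'), with Q(s', a') given by the current network."""
    with torch.no_grad():                                        # targets are treated as constants
        max_next_q = q_network(next_states).max(dim=1).values    # max_a' Q(s', a') per experience
    return rewards + gamma * max_next_q
```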

Experience Replay

Experience replay is a technique that stores the agent’s experiences (state, action, reward, next state) in a replay buffer. During training, the agent samples a batch of experiences from the replay buffer to update the Q-network.
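
A minimal replay buffer sketch; the capacity and the exact tuple layout are illustrative choices:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped automatically

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniformly sample a batch of past experiences for one training update.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```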

Learning Algorithm

The learning algorithm works as follows (a code sketch follows the list):

  1. Initialize the Q-network randomly as an initial guess of $Q(s, a)$.
  2. Repeat:
    1. Take actions to get experiences $(s, a, R(s), s')$.
    2. Store experiences in replay buffer.
  3. Train:
    1. Create a training set from experiences using
      $x = (s, a)$ and $y = R(s) + \gamma \max_{a'} Q(s', a')$
      where $Q(s', a')$ is computed by the current neural network (initially a random guess).
    2. Train the new Q-network ($Q_\text{new}$) using the training set.
    3. Set $Q = Q_\text{new}$ to update the Q-network.
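
A sketch of this loop under several assumptions: the environment is a hypothetical stand-in whose `reset()` returns a state and whose `step(action)` returns `(next_state, reward)`; `q_network`, `ReplayBuffer`, `build_targets`, and `select_action` (the $\epsilon$-greedy rule sketched in the next section) come from the other snippets; and, for brevity, the network is trained in place rather than via a separate $Q_\text{new}$.

```python
import numpy as np
import torch

class DummyEnv:
    """Stand-in environment (not part of the original notes): random states and rewards."""
    def reset(self):
        return np.random.randn(8).astype(np.float32)          # matches state_dim = 8
    def step(self, action):
        return np.random.randn(8).astype(np.float32), float(np.random.rand())

env = DummyEnv()
buffer = ReplayBuffer()
optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

state = env.reset()
for step in range(50_000):
    # 2.1 / 2.2: act, observe, and store the experience (s, a, R(s), s').
    action = select_action(state, q_network)
    next_state, reward = env.step(action)
    buffer.add(state, action, reward, next_state)
    state = next_state

    # 3: periodically train on a mini-batch of stored experiences.
    if len(buffer) >= 1_000 and step % 4 == 0:
        batch = buffer.sample(64)
        states, actions, rewards, next_states = zip(*batch)
        states = torch.as_tensor(np.stack(states), dtype=torch.float32)
        next_states = torch.as_tensor(np.stack(next_states), dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)

        # 3.1: targets y = R(s) + gamma * max_a' Q(s', a')
        y = build_targets(rewards, next_states, q_network)
        # 3.2: fit Q(s, a) of the taken actions to the targets.
        q_pred = q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = loss_fn(q_pred, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```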

Epsilon-Greedy Policy

How should the agent choose actions (required in step 2.1) while it is still learning?

The following options exist:

  • Choose $a$ that maximizes $Q(s, a)$
  • Choose $a$ that maximizes $Q(s, a)$ with probability $1 - \epsilon$ and a random action with probability $\epsilon$

The second option is called $\epsilon$-greedy policy. The point of choosing a random action (exploration step) is to explore the environment and learn more about it.

For example, it could be that the Q-values were initialized such that the estimated return of some action is always very low, which could lead to the agent never exploring this action and never learning its true value.

This is also called the exploration-exploitation trade-off.
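
A sketch of the second option in code; `epsilon` is fixed here for simplicity (in practice it is often started high and decayed over time), and `num_actions` matches the network sketch above:

```python
import random
import torch

def select_action(state, q_network, num_actions=4, epsilon=0.1):
    if random.random() < epsilon:
        # Exploration: take a random action with probability epsilon.
        return random.randrange(num_actions)
    # Exploitation: take the action with the highest estimated Q-value.
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```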

Mini-Batch and Soft Update

The training set is usually divided into mini-batches to train the neural network, so each gradient step uses only a subset of the stored experiences. Instead of replacing $Q$ with $Q_\text{new}$ outright (step 3.3 above), the Q-network is updated towards $Q_\text{new}$ with a soft update, which stabilizes the training process.
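
A common form of the soft update blends the newly trained parameters into the current ones; $\tau$ is a small hyperparameter (e.g. 0.01):

$$ W \leftarrow \tau \, W_\text{new} + (1 - \tau) \, W $$

A minimal sketch in PyTorch, assuming `q_network` and a freshly trained `q_new` with identical architectures:

```python
import torch

def soft_update(q_network, q_new, tau=0.01):
    # Move each parameter of Q a small step toward the corresponding parameter of Q_new.
    with torch.no_grad():
        for p, p_new in zip(q_network.parameters(), q_new.parameters()):
            p.mul_(1.0 - tau).add_(tau * p_new)
```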