### Core Concepts in Reinforcement Learning By Example

Authors: Hannah Peterson and George Williams

You may recall from the previous post in this series on Reinforcement Learning (RL) that a Markov Decision Process (MDP) is a mathematical framework for modeling RL problems. To ground the mathematical theory of an MDP in reality, we identified each of its elements in the context of OpenAI Gym’s MountainCar game. Now that we understand them individually, let’s explore how they combine to form equations, continuing to refer to MountainCar when applicable. The field of RL consists of using different algorithms to approximate optimal solutions to the equations we derive here.

Disclaimer: All of the math involved in MDPs can easily get overwhelming, so I don’t go through step-by-step derivations for every equation. Additionally, there’s not one correct way to write these equations—they have multiple representations invoking a variety of notions, from which I picked those that I found to be the most intuitive. If you’re interested in lower-level explanation, I suggest taking a look at the resources I link to in the References section

### OpenAI Gym Basics

First, let’s explore some basic commands in Gym to help us see how an MDP is realized within the MountainCar game. These are the building blocks for the code we will be writing in the next post for getting the car agent to learn an intelligent decision-making process.

The command `env = gym.make("MountainCar-v0")` creates an instance of the MountainCar game environment. Recall that in this game, a state (i.e. a snapshot of the environment) is defined by the car’s horizontal position and velocity, of which there are theoretically infinite combinations. This is why when we access the environment’s `observation_space` (i.e. all its possible states) attribute, we find it is represented by a 2D box. In contrast, we see its `action_space` consists of 3 discrete values, corresponding to the 3 actions available to the car agent in a given state: accelerate to the left, don’t accelerate, or accelerate to the right, encoded as actions 0, 1 and 2, respectively. Actions are taken to transition between states, creating a series of states known in RL as an episode or trajectory.

To begin an episode, we invoke the game’s `reset()` function, which yields a vector with a randomly generated position and a velocity of 0, representing the episode’s starting state. Calling `env.action_space.sample()` generates a random action (0, 1, or 2) from the action space. We take an action by inputting it into the `step()` function, which returns a `[state, reward, done, info]` array. Here, `state` , `reward` , and `done` correspond to the new current state, the reward received, and a boolean of whether the agent has reached its goal as a result of taking the action (`info` is not relevant in this game—it’s just an empty dictionary). We can see that taking the random action leads to a new state with a slightly different position and non-zero velocity from the starting state.

Having familiarized ourselves with the basics of OpenAI Gym, we can now ask the all-important question of How does our car determine which actions to take? Above, we used the random `env.action_space.sample()` function, which samples from a uniform distribution across all actions—that is, each of our 3 actions has a 0.33 probability of being selected. With this naive decision-making approach, we can imagine the extremely high number of trials it would take for the car to reach the flag at the top of the hill. We’ll now go over the math involved in teaching our agent a more intelligent way of choosing between actions.

### Deciding Which Actions to Take

#### The Policy Function

A natural approach to this problem is to think, well, the flag is to the right of the car, so the car should just accelerate to the right to reach it; however, the game is structured such that the car is unable to overcome the force of gravity by only accelerating to the right—rather, it has to build up its momentum by moving left and right between the two slopes, alternating its acceleration. To determine when to switch between these actions, the cart must take into account its current position and velocity, i.e. its state. In RL, the function that decides which action should be taken in a particular state s is called a policy, π:

This function represents a distribution, taking a state as input and outputting a probability for each possible action, which is sampled from during gameplay to determine which action to take. As we discussed above, the policy for `env.action_space.sample()` is a uniform distribution across all 3 actions, one that does not take into account an input state s. Let’s take a look at what an optimal policy for MountainCar that does take state into account might look like, bearing in mind that this is a vastly simplified example: The agent navigates between states to reach the flag by taking actions outputted by its policy. The flag is at position x = 0.5, and states have a reward of -1 if the agent is at x < 0.5 and a reward of 0 otherwise. Note, this idealized example does not take into account stochasticity of the environment (which we’ll account for later).

Ideally, such a policy would know with 100% certainty the best action to take in any state, that is, it would output a probability of 1 for this action and 0 for the other two. Recall that the cart’s starting position is randomized for each episode, so there’s no one optimal route for the policy to learn. Additionally, we see that the agent refers to the policy at each state to determine which action to take; that is, it’s not as if the policy just takes in the episode’s starting state and outputs the optimal series of actions for the agent to follow through to the end—rather, it outputs one action per state. This is because, while a pre-planned approach could potentially work for a simple game like MountainCar, MDPs are also used to model dynamic environments, in which it’s imperative for the agent to make decisions based on real-time information about its state.

In reality, it’s not feasible to learn policies as decisive as the one above, rather, we use RL algorithms to arrive at decent approximations. In sum, the fundamental goal of RL is to find a policy that maximizes the reward the agent can expect to receive over the course of an episode by acting on it. There are two important things to note here:

1. That reward is the metric of interest
2. That we must rely on expectation

To address the first, you may notice that the policy equation we defined above doesn’t include a variable for reward. Indeed, the policy still needs some means of assigning rewards to actions so it can output high probabilities for “rewarding” actions and vice versa, which requires us to first find a way of mapping rewards to states, since actions are just transitions between states. Let’s discuss how to go about assigning rewards to states, for which we must take into account the second point—the notion of expectation.

#### Determining Expected Reward

The inclusion of the word expect in the fundamental goal of RL implies that:

1. The agent has some awareness of the future
2. The future is associated with uncertainty

Regarding the first point, our agent needs to map rewards to states in a way that isn’t based solely on how rewarding a given state is immediately upon visiting it but also on its potential for being a part of an overall rewarding trajectory of many states; that is, the agent’s calculation of reward needs to be structured such that it is still able to recognize states that may be penalizing in the short term but beneficial in the long term as rewarding. With this in mind, it makes sense to define the “rewardingness” of a particular state using a sum of rewards of future states.

But, as the second point addresses, the future is uncertain. Recall that MDPs are used to model environments that are stochastic in nature, meaning that instructing an agent to take a particular action doesn’t lead to it actually be carried out 100% of the time. Further, environments are often dynamic in ways that are extremely difficult to model directly, such as games played against a human opponent. Given these factors, it makes sense to weigh expected rewards less heavily the further in the future they occur because we are less certain we will actually receive them.

#### The State-Value Function

With these two points in mind, we can introduce the state-value function, which maps values to states in MDPs. The value of a state s is defined as the total reward an agent can expect to receive by starting in s and following a trajectory of successive states from s:

Here, the first term in the expectation operator is the immediate reward the agent expects to receive by being in s and the remaining terms are the rewards for successive states, which we see are weighted by γ. γ is a constant between 0 and 1 called the discount factor because it gets exponentiated to weigh future expected rewards less heavily in the sum of total expected rewards attributed to s, accounting for uncertainty of the future in this way. The closer γ is to 1, the more future reward is considered when attributing value to the present state and vice versa. We can group the discounted terms together, arriving at the following representation of the state-value function:

Additionally, it is the policy function π that determines the trajectory of states the agent takes, so π can be viewed as another parameter of the function:

This is what is known as the Bellman equation for the state-value function. A Bellman equation is a type of optimization equation that specifies that the “value” of a decision problem at a given point in time depends on the value generated from an initial choice and the value of resulting choices, and, hence, is recursive in nature.

#### The Action-Value Function

Given we now have a means of mapping a value to a given state, we can use the fact that an action is a transition between states to map values to state-action pairs as well. Using such a mapping, the agent in any state will have a way to estimate the value of taking each action available to it so it can choose the one with the highest chance of contributing to a maximum overall reward. This is achieved by the action-value function, which takes in a state s and action a and outputs the expected reward (a scalar value) for taking a from s and following policy π to take actions thereafter:

You may notice it is also a Bellman equation, and that it looks very similar to the state-value function. How are these two equations related? The value of a state s can be determined by taking the sum of the action-values for all actions, where each action-value term is weighted by the probability of taking the respective action from s, which, as we know, is determined by the policy. Thus, the state-value function can be re-written as:

In MountainCar, for example, say the car is following some policy π that outputs probabilities of 0.3, 0.1 and 0.6 and its action-value function outputs values of 1.2, 0.4 and 2.8 for the three actions given its current state s. Then the value of s is 0.3(1.2) + 0.1(0.4) + 0.6(2.8) = 2.08.

We can also define the action-value function in terms of the state-value function. Calculating the action-value of a state s and some action must take into account the stochastic nature of the environment, that is, that even if the agent chooses to take a, random interaction with the environment may cause it to take a different action. You may recall from the first blog that this randomness is encoded in an MDP as a matrix P where an entry P_ss’ is the probability of ending up in state s’ by choosing to take a from s. We can visualize this using an example in MountainCar. In this case, let’s assume a policy that takes in state s and outputs probabilities of 0.8, 0.06 and 0.14 for actions 0, 1 and 2, respectively. The agent draws from this distribution and chooses action 0 (accelerate to the left), as expected given it has the largest probability. Despite this choice, the probability P_ss’ that the agent actually takes this action is less than certain, equaling 0.9 for example, with the remaining probability being distributed among the other two actions:

This uncertainty in taking actions means that to calculate the expected reward (i.e. the action-value) for taking any action a in state s, the immediate expected reward for a and s must be added to the discounted sum of expected rewards (i.e. state-values) of all possible successor states, weighted by their probabilities of occurring:

We can substitute the new form of the state-value function we derived above into this formula to arrive at the following definition of the action-value function, which we see is the same as the initial equation with the expectation operator applied:

#### The Optimization Problem

The action-value function gives us a way of determining the value of any action a from any state s in a trajectory determined by policy π—the higher the output of the action-value function, the better that action is for accumulating a high reward over the course of the trajectory. Thus, the optimal policy will know which actions result in the highest action-values with 100% certainty and, as we discussed earlier with the first MountainCar example, will output a probability of 1 for those actions and 0 for the others:

With this, we arrive at the Bellman optimality equation for the action-value function:

This is what Reinforcement Learning aims to solve through a wide variety of approaches. In the next blog of this series, we’ll walk through the theory behind one popular approach called “Q-Learning” to arrive at an estimate of this optimal action-value function using MountainCar.