Core Concepts in Reinforcement Learning By Example

Authors: Hannah Peterson and George Williams

 


You may recall from the previous post in this series on Reinforcement Learning (RL) that a Markov Decision Process (MDP) is a mathematical framework for modeling RL problems. To ground the mathematical theory of an MDP in reality, we identified each of its elements in the context of OpenAI Gym's MountainCar game. Now that we understand them individually, let's explore how they combine to form equations, continuing to refer to MountainCar when applicable. The field of RL consists of using different algorithms to approximate optimal solutions to the equations we derive here.

Disclaimer: All of the math involved in MDPs can easily get overwhelming, so I don't go through step-by-step derivations for every equation. Additionally, there's no one correct way to write these equations: they have multiple representations using a variety of notations, from which I picked those I found most intuitive. If you're interested in a more detailed explanation, I suggest taking a look at the resources I link to in the References section.

OpenAI Gym Basics

First, let's explore some basic commands in Gym to help us see how an MDP is realized within the MountainCar game. These are the building blocks for the code we will write in the next post to get the car agent to learn an intelligent decision-making process.

The command env = gym.make("MountainCar-v0") creates an instance of the MountainCar game environment. Recall that in this game, a state (i.e. a snapshot of the environment) is defined by the car's horizontal position and velocity, of which there are theoretically infinite combinations. This is why, when we access the environment's observation_space attribute (i.e. all its possible states), we find it is represented by a 2D box. In contrast, its action_space consists of 3 discrete values, corresponding to the 3 actions available to the car agent in a given state: accelerate to the left, don't accelerate, or accelerate to the right, encoded as actions 0, 1 and 2, respectively. Actions are taken to transition between states, creating a series of states known in RL as an episode or trajectory.
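
As a minimal sketch of how this looks in code (assuming the classic Gym API, where gym.make returns the environment directly), we can create the environment and inspect its spaces:

```python
import gym

# Create an instance of the MountainCar game environment.
env = gym.make("MountainCar-v0")

# The observation space is a 2D box: [position, velocity].
print(env.observation_space)       # e.g. Box(2,)
print(env.observation_space.low)   # lower bounds, roughly [-1.2, -0.07]
print(env.observation_space.high)  # upper bounds, roughly [0.6, 0.07]

# The action space holds 3 discrete actions:
# 0 = accelerate left, 1 = don't accelerate, 2 = accelerate right.
print(env.action_space)            # Discrete(3)
```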

A random starting state in the MountainCar game environment.

To begin an episode, we invoke the game's reset() function, which yields a vector with a randomly generated position and a velocity of 0, representing the episode's starting state. Calling env.action_space.sample() generates a random action (0, 1, or 2) from the action space. We take an action by passing it to the step() function, which returns a [state, reward, done, info] array. Here, state, reward, and done correspond to the new current state, the reward received, and a boolean indicating whether the episode has ended (e.g. the agent has reached its goal) as a result of taking the action (info is not relevant in this game; it's just an empty dictionary). We can see that taking the random action leads to a new state with a slightly different position and a non-zero velocity compared to the starting state.
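
Here is a minimal sketch of one random step, again assuming the classic Gym API in which step() returns a four-element result:

```python
# Begin an episode: reset() yields the starting state
# [position, velocity], with a random position and zero velocity.
state = env.reset()
print(state)

# Sample a random action (0, 1, or 2) and take it.
action = env.action_space.sample()
state, reward, done, info = env.step(action)

print(state)   # new position and a now non-zero velocity
print(reward)  # -1 until the flag is reached
print(done)    # True once the episode has ended
print(info)    # empty dictionary in this game
```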

Having familiarized ourselves with the basics of OpenAI Gym, we can now ask the all-important question: How does our car determine which actions to take? Above, we used the random env.action_space.sample() function, which samples from a uniform distribution across all actions; that is, each of our 3 actions has a 0.33 probability of being selected. With this naive decision-making approach, we can imagine the extremely high number of trials it would take for the car to reach the flag at the top of the hill. We'll now go over the math involved in teaching our agent a more intelligent way of choosing between actions.

Deciding Which Actions to Take

The Policy Function

A natural approach to this problem is to think: well, the flag is to the right of the car, so the car should just accelerate to the right to reach it. However, the game is structured such that the car is unable to overcome the force of gravity by only accelerating to the right; rather, it has to build up its momentum by moving left and right between the two slopes, alternating its acceleration. To determine when to switch between these actions, the car must take into account its current position and velocity, i.e. its state. In RL, the function that decides which action a should be taken in a particular state s is called a policy, π.
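
In standard notation, with A_t and S_t denoting the action taken and the state occupied at time step t, the policy can be written as:

$$\pi(a \mid s) = \mathbb{P}\left[A_t = a \mid S_t = s\right]$$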

This function represents a distribution: it takes a state as input and outputs a probability for each possible action, and these probabilities are sampled from during gameplay to determine which action to take. As we discussed above, the policy for env.action_space.sample() is a uniform distribution across all 3 actions, one that does not take into account an input state s. Let's take a look at what an optimal policy for MountainCar that does take state into account might look like, bearing in mind that this is a vastly simplified example:

The agent navigates between states to reach the flag by taking actions outputted by its policy. The flag is at position x = 0.5, and states have a reward of -1 if the agent is at x < 0.5 and a reward of 0 otherwise. Note, this idealized example does not take into account the stochasticity of the environment (which we'll account for later).

Ideally, such a policy would know with 100% certainty the best action to take in any state; that is, it would output a probability of 1 for this action and 0 for the other two. Recall that the car's starting position is randomized for each episode, so there's no one optimal route for the policy to learn. Additionally, we see that the agent refers to the policy at each state to determine which action to take; that is, it's not as if the policy just takes in the episode's starting state and outputs the optimal series of actions for the agent to follow through to the end. Rather, it outputs one action per state. This is because, while a pre-planned approach could potentially work for a simple game like MountainCar, MDPs are also used to model dynamic environments, in which it's imperative for the agent to make decisions based on real-time information about its state.

In reality, it's not feasible to learn policies as decisive as the one above; rather, we use RL algorithms to arrive at decent approximations. In sum, the fundamental goal of RL is to find a policy that maximizes the reward the agent can expect to receive over the course of an episode by acting on it. There are two important things to note here:

  1. That reward is the metric of interest
  2. That we must rely on expectation

To address the first, you may notice that the policy equation we defined above doesn't include a variable for reward. Indeed, the policy still needs some means of assigning rewards to actions so it can output high probabilities for "rewarding" actions and vice versa, which requires us to first find a way of mapping rewards to states, since actions are just transitions between states. Let's discuss how to go about assigning rewards to states, for which we must take into account the second point: the notion of expectation.

Determining Expected Reward

The inclusion of the word expect in the fundamental goal of RL implies that:

  1. The agent has some awareness of the future
  2. The future is associated with uncertainty

Regarding the first point, our agent needs to map rewards to states in a way that isn't based solely on how rewarding a given state is immediately upon visiting it, but also on its potential for being part of an overall rewarding trajectory of many states. That is, the agent's calculation of reward needs to be structured such that it can still recognize as rewarding those states that may be penalizing in the short term but beneficial in the long term. With this in mind, it makes sense to define the "rewardingness" of a particular state using a sum of the rewards of future states.

But, as the second point addresses, the future is uncertain. Recall that MDPs are used to model environments that are stochastic in nature, meaning that instructing an agent to take a particular action doesn't mean it will actually be carried out 100% of the time. Further, environments are often dynamic in ways that are extremely difficult to model directly, such as games played against a human opponent. Given these factors, it makes sense to weigh expected rewards less heavily the further in the future they occur, because we are less certain we will actually receive them.

The State-Value Function

With these two points in mind, we can introduce the state-value function, which maps values to states in MDPs. The value of a state s is defined as the total reward an agent can expect to receive by starting in s and following a trajectory of successive states from s.
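
In standard notation, with R_{t+1}, R_{t+2}, and so on denoting the rewards received at successive time steps, this total expected reward can be written as:

$$v(s) = \mathbb{E}\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s\right]$$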

Here, the first term inside the expectation operator is the immediate reward the agent expects to receive by being in s, and the remaining terms are the rewards for successive states, which we see are weighted by γ. γ is a constant between 0 and 1 called the discount factor because it gets exponentiated to weigh future expected rewards less heavily in the sum of total expected rewards attributed to s, accounting for the uncertainty of the future in this way. The closer γ is to 1, the more future reward is considered when attributing value to the present state, and vice versa. We can group the discounted terms together, arriving at a more compact, recursive representation of the state-value function.
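
In the same notation, grouping the discounted future rewards into the value of the successor state S_{t+1} gives:

$$v(s) = \mathbb{E}\left[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s\right]$$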

Additionally, it is the policy function π that determines the trajectory of states the agent takes, so π can be viewed as another parameter of the function.
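
Adding π as a subscript to make this dependence explicit, the state-value function becomes:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$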

This is what is known as the Bellman equation for the state-value function. A Bellman equation is a type of optimization equation specifying that the "value" of a decision problem at a given point in time depends on the value generated from an initial choice and the value of the resulting choices, and hence is recursive in nature.

The Action-Value Function

Now that we have a means of mapping a value to a given state, we can use the fact that an action is a transition between states to map values to state-action pairs as well. Using such a mapping, the agent in any state will have a way to estimate the value of taking each action available to it, so it can choose the one with the highest chance of contributing to a maximum overall reward. This is achieved by the action-value function, which takes in a state s and action a and outputs the expected reward (a scalar value) for taking a from s and following policy π to take actions thereafter.
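
In the same notation as before, one standard way to write the action-value function is:

$$q_\pi(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\, A_t = a\right]$$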

You may notice that it is also a Bellman equation, and that it looks very similar to the state-value function. How are these two equations related? The value of a state s can be determined by taking the sum of the action-values of all actions available from s, where each action-value term is weighted by the probability of taking the respective action a from s, which, as we know, is determined by the policy. Thus, the state-value function can be re-written as a probability-weighted sum of action-values.
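
In symbols, with the sum running over all available actions:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$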

In MountainCar, for example, say the car is following some policy π that outputs probabilities of 0.3, 0.1 and 0.6, and its action-value function outputs values of 1.2, 0.4 and 2.8 for the three actions given its current state s. Then the value of s is 0.3(1.2) + 0.1(0.4) + 0.6(2.8) = 2.08.
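
As a quick sanity check of this arithmetic, here is a tiny Python sketch using the hypothetical numbers from the example:

```python
# Hypothetical policy probabilities and action-values from the example.
probs = [0.3, 0.1, 0.6]
q_values = [1.2, 0.4, 2.8]

# The state value is the probability-weighted sum of the action-values.
v = sum(p * q for p, q in zip(probs, q_values))
print(v)  # approximately 2.08
```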

We can also define the action-value function in terms of the state-value function. Calculating the action-value of a state s and some action a must take into account the stochastic nature of the environment, that is, that even if the agent chooses to take a, random interaction with the environment may cause it to take a different action. You may recall from the first blog that this randomness is encoded in an MDP as a matrix P, where an entry P_ss' is the probability of ending up in state s' by choosing to take a from s. We can visualize this using an example in MountainCar. In this case, let's assume a policy that takes in state s and outputs probabilities of 0.8, 0.06 and 0.14 for actions 0, 1 and 2, respectively. The agent draws from this distribution and chooses action 0 (accelerate to the left), as expected given it has the largest probability. Despite this choice, the probability P_ss' that the agent actually takes this action is less than certain, equaling 0.9 for example, with the remaining probability being distributed among the other two actions.

This uncertainty in taking actions means that to calculate the expected reward (i.e. the action-value) of taking any action a in state s, the immediate expected reward for a and s must be added to the discounted sum of the expected rewards (i.e. state-values) of all possible successor states, weighted by their probabilities of occurring.
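
Writing R_s^a for the expected immediate reward of taking a in s (a notational shorthand used here) and summing over the possible successor states s', this gives:

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a\, v_\pi(s')$$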

We can substitute the new form of the state-value function we derived above into this formula to arrive at the following definition of the action-value function, which is the same as the initial equation with the expectation operator expanded.
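
With the same notation, the substitution gives:

$$q_\pi(s, a) = R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$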

The Optimization Problem

The action-value function gives us a way of determining the value of any action a from any state s in a trajectory determined by policy π: the higher the output of the action-value function, the better that action is for accumulating a high reward over the course of the trajectory. Thus, the optimal policy will know which actions result in the highest action-values with 100% certainty and, as we discussed earlier with the first MountainCar example, will output a probability of 1 for those actions and 0 for the others.
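
One common way to write such a deterministic optimal policy, with q_* denoting the optimal action-value function, is:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$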

With this, we arrive at the Bellman optimality equation for the action-value function.
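
In the notation used above, it is commonly written as:

$$q_*(s, a) = R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \max_{a'} q_*(s', a')$$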

This is what Reinforcement Learning aims to solve through a wide variety of approaches. In the next blog of this series, we'll walk through the theory behind one popular approach called "Q-Learning" to arrive at an estimate of this optimal action-value function using MountainCar.

References
