Reinforcement Learning, Part 4: Policies and Learning Algorithms

This fourth post continues from the third, where we covered the policies and learning algorithms used in reinforcement learning and looked at the policy gradient method. Here we'll go over the remaining algorithms that reside within the agent: mainly the value function-based algorithms and the actor-critic method. Along the way, we'll cover why we use neural networks to represent functions and why you may have to set up two neural networks in a powerful family of methods called actor-critic. You can also look at this video, which explains all of this in even greater detail.

Value Function Based Learning

Figure 5: Value-Function-Based Grid World

Let’s move on to value function-based learning algorithms. For these, let’s start with an example using the popular grid world as our environment. In this environment there are two discrete state variables: the X grid location and the Y grid location. There is only one state in grid world that has a positive reward; the rest have negative rewards. The idea is that we want our agent to collect the most reward, which means getting to that positive reward in the fewest moves possible. The agent can only move one square at a time, either up, down, left, or right, and to us humans it’s easy to see exactly which way to go to get to the reward. However, we must keep in mind that the agent knows nothing about the environment. It just knows that it can take one of four actions, and it gets its location and reward back from the environment after it takes an action.

With a value function-based agent, a function would take in the state and one of the possible actions from that state, and output the value of taking that action. The value is the expected sum of discounted rewards from that state onward. In this way, the policy would simply be to check the value of every possible action and choose the one with the highest value. We can think of this function as a critic, since it’s looking at the possible actions and criticising the agent’s choices. Since there are a finite number of states and actions in a grid world, we can use a lookup table to represent this function. This is called the Q-table, and it holds a value for every state and action pairing.

How does the agent learn these values? Well, at first, we can initialise it to all zeroes, so all actions look the same to the agent. This is where the exploration rate, epsilon, comes in and it allows the agent to take a random action. After it takes that action, it gets to a new state and collects the reward from the environment. The agent uses that reward, that new information, to update the value of the action that it just took. It does that using the famous Bellman equation.

Figure 6: Bellman Equation

The Bellman equation allows the agent to solve the Q-table over time by breaking up the whole problem into multiple simpler steps. Rather than solving the true value of the state/action pair in one step, through dynamic programming, the agent will update the state/action pair each time it’s visited.

After the agent has taken an action, it receives a reward. Value is more than the instant reward from an action; it’s the maximum expected return into the future. Therefore, the value of the state/action pair is the reward that the agent just received, plus how much reward the agent expects to collect going forward. And we discount the future rewards by gamma so that, as we talked about, the agent doesn’t rely too much on rewards far in the future. This is now the new value of the state/action pair, s, a. And so, we compare this value to what was in the Q-table to get the error, or how far off the agent’s prediction was. The error is multiplied by a learning rate and the resulting delta value is added to the old estimate.
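The update described in words above is the standard Q-learning form of the Bellman equation, with learning rate α and discount factor γ:

```latex
Q_{\text{new}}(s,a) = Q(s,a) + \alpha \Big[ \underbrace{r + \gamma \max_{a'} Q(s',a')}_{\text{new estimate of value}} - \underbrace{Q(s,a)}_{\text{old estimate}} \Big]
```

The bracketed term is the error between the new estimate (the received reward plus the discounted value of the best action from the next state s′) and the old table entry; multiplying it by α gives the delta added to the old estimate.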

When the agent finds itself back in the same state a second time, it will start from this updated value and tweak it again when it chooses the same action. It’ll keep doing this repeatedly until the true value of every state/action pair is known well enough to exploit the optimal path.
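The whole loop described above can be sketched in a few lines. This is a minimal illustration, not the exact environment from the figure: the grid size (4×4), the goal location, the reward values, and the hyperparameters are all assumptions chosen for the example.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
GOAL = (3, 3)                           # the one state with a positive reward
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

# Q-table initialised to all zeroes: one value per (state, action) pair.
Q = {((x, y), a): 0.0 for x in range(4) for y in range(4) for a in ACTIONS}

def step(state, action):
    """Apply an action; the agent stays put if it would leave the grid."""
    dx, dy = MOVES[action]
    nx = min(max(state[0] + dx, 0), 3)
    ny = min(max(state[1] + dy, 0), 3)
    reward = 10.0 if (nx, ny) == GOAL else -1.0
    return (nx, ny), reward

def choose_action(state):
    """Epsilon-greedy policy: usually exploit, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

random.seed(0)
for episode in range(500):
    state = (0, 0)
    while state != GOAL:
        action = choose_action(state)
        next_state, reward = step(state, action)
        # Bellman update: nudge the old estimate toward the new target.
        target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (target - Q[(state, action)])
        state = next_state
```

After enough episodes, following the greedy policy (always taking the highest-valued action) from the start square reaches the goal along a short path.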


Figure 7: The Actor-Critic

Let’s extend this idea to an inverted pendulum where there are still two states, angular position and angular velocity, except now the states are continuous. Value functions can handle continuous state spaces, just not with a lookup table; hence, we’re going to need a neural network. The idea is the same: we input the state observations and an action, and the neural network returns a value.

You can see that this setup doesn’t work well for continuous action spaces, because how could you possibly try every one of infinitely many actions and find the maximum value? Even for a large discrete action space this becomes computationally expensive. For now, let’s just say that the action space is discrete and the agent can choose one of 5 possible torque values.

When you feed the network the observed state and an action it’ll return a value, and our policy would again be to check the value for every possible action and take the one with the highest value. Just like with grid world, our neural network would be initially set to junk values and the learning algorithm would use a version of the Bellman equation to determine what the new value should be and update the weights and biases in the network accordingly. And once the agent has explored enough of the state space, then it’s going to have a good approximation of the value function, and can select the optimal action, given any state.
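Here is a sketch of what that policy looks like in code. The network is a tiny, untrained stand-in (random weights, one hidden layer), and the five torque values are an assumption for the example; in practice, the weights would be trained with the Bellman update just described.

```python
import numpy as np

rng = np.random.default_rng(42)
TORQUES = [-2.0, -1.0, 0.0, 1.0, 2.0]   # 5 discrete torque choices (assumed)

# Network weights (these would normally be learned, not random).
W1 = rng.normal(size=(3, 16))           # input: [angle, angular velocity, torque]
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1))
b2 = np.zeros(1)

def q_value(state, torque):
    """Feed the (state, action) pair through the network; return its value."""
    x = np.array([state[0], state[1], torque])
    h = np.tanh(x @ W1 + b1)            # hidden layer
    return (h @ W2 + b2).item()         # scalar value estimate

def greedy_action(state):
    """The policy: check the value of every action, take the highest."""
    return max(TORQUES, key=lambda u: q_value(state, u))

state = (0.1, -0.5)                     # angle (rad), angular velocity (rad/s)
best = greedy_action(state)
```

Note that `greedy_action` has to evaluate the network once per candidate action, which is exactly why this scheme breaks down when the action space is continuous.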

This brings us to a merging of the two techniques into a class of algorithms called actor-critic. The actor is a network that tries to take what it thinks is the best action given the current state, just like we had with the policy function method, and the critic is a second network that tries to estimate the value of the state and the action that the actor took, like we had with the value-based methods. This works for continuous action spaces because the critic only needs to look at a single action, the one the actor actually took, rather than trying to find the best action by evaluating all of them.

This is how it works. The actor chooses an action in the same way that a policy function algorithm would, and it’s applied to the environment. The critic estimates the value of that state and action pair and then uses the reward from the environment to determine how accurate its prediction was. The error is the difference between the new estimated value of the previous state and the old value of the previous state from the critic network, where the new estimate is based on the received reward plus the discounted value of the current state. The critic can use this error to judge whether things went better or worse than it expected.

The critic uses this error to update itself, in the same way a value function would, so that it makes a better prediction the next time it’s in this state. The actor also updates itself using the response from the critic (and, if available, the error term) so that it can adjust its probability of taking that action again in the future. That is, the policy now ascends the reward slope in the direction that the critic recommends, rather than using the rewards directly.

Both the actor and the critic are neural networks that are trying to learn the optimal behaviour. The actor learns the right actions using feedback from the critic to know which actions are good and which are bad, and the critic learns the value function from the received rewards so that it can properly criticise the actions the actor takes. Therefore, you might have to set up two neural networks in your agent; each one plays a very specific role.
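To make the update step concrete, here is a minimal sketch of one actor-critic learning step. It is deliberately simplified: a softmax actor over 5 discrete actions and a linear critic stand in for the two neural networks, and the learning rates, discount factor, and example transition are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, GAMMA = 5, 0.99
ALPHA_ACTOR, ALPHA_CRITIC = 0.01, 0.1

theta = np.zeros((2, N_ACTIONS))   # actor parameters: state -> action preferences
w = np.zeros(2)                    # critic parameters: state -> value estimate

def policy(state):
    """Actor: softmax over the action preferences for this state."""
    prefs = state @ theta
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def td_update(state, action, reward, next_state):
    """One actor-critic step, driven by the critic's error."""
    # Error = (reward + discounted value of the new state) minus the
    # critic's old value estimate for the previous state.
    td_error = reward + GAMMA * (w @ next_state) - (w @ state)
    # Critic update: move its value estimate toward the new target.
    w[:] = w + ALPHA_CRITIC * td_error * state
    # Actor update: ascend in the direction the critic recommends,
    # via the policy-gradient term  td_error * grad(log pi(a|s)).
    probs = policy(state)
    grad_log = -np.outer(state, probs)
    grad_log[:, action] += state
    theta[:, :] = theta + ALPHA_ACTOR * td_error * grad_log
    return td_error

state = np.array([0.1, -0.5])                      # example observation
action = int(rng.choice(N_ACTIONS, p=policy(state)))
next_state = np.array([0.08, -0.3])                # assumed next observation
delta = td_update(state, action, reward=-0.1, next_state=next_state)
```

A negative `delta` means things went worse than the critic expected, so the actor lowers the probability of that action in that state; a positive one raises it.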

With actor-critic methods, the agent takes advantage of the best parts of policy and value function algorithms. Actor-critics can handle both continuous state and action spaces and speed up learning when the returned reward has high variance. In the next post we will go through an example of using an actor-critic-based algorithm to get a bipedal robot to walk. In the meantime, you can take a look at this video, which explains what we just learnt in more detail.
