Reinforcement Learning, Part 3: Policies and Learning Algorithms (Part 1)

This third post introduces the algorithms that reside within the agent. In the previous post, we focused mainly on setting up the environment: what the environment includes and how rewards from the environment are used to guide the agent’s behaviour. In this post, we’re going to focus on setting up the agent, specifically the last three steps in the RL workflow. We’ll cover why we use neural networks to represent functions, and why you may have to set up two neural networks in a powerful family of methods called actor-critic. This post introduces a few topics at a high level that will give you a general understanding of policies and learning algorithms. You can also watch this video, which explains these topics in even greater detail.

Figure 1: RL Workflow

In this post we will address two main questions. One, why use neural networks to represent functions as opposed to something else, like tables or transfer functions? And two, why you may have to set up two neural networks and how they complement each other in a powerful family of methods called actor-critic.

In the last post, we ended by explaining how a policy is a function that takes in state observations and outputs actions. We also briefly introduced why tables and specifically crafted functions aren’t a great solution in many cases: tables become impractical when the state and action spaces get large, and it’s difficult to craft the right function structure for complex environments. Neural networks, however, can address both problems.
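To make the scale problem concrete, here is a rough back-of-the-envelope comparison in Python. The observation dimensions, discretization resolution, and layer sizes below are made-up numbers purely for illustration; they are not taken from the post.

```python
# Illustrative comparison: a lookup table needs one entry per state-action
# pair, so it grows exponentially with the number of observation dimensions,
# while a neural network's parameter count grows only with its layer sizes.

num_obs_dims = 10          # hypothetical number of observation dimensions
bins_per_dim = 100         # hypothetical discretization resolution per dimension
num_actions = 3

table_entries = (bins_per_dim ** num_obs_dims) * num_actions
print(f"Table entries: {table_entries:.2e}")        # ~3e+20 -- impractical

# A modest fully connected network mapping 10 inputs to 3 outputs:
layer_sizes = [10, 64, 64, 3]
weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
biases = sum(layer_sizes[1:])
print(f"Network parameters: {weights + biases}")    # about 5,000 -- tractable
```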

Figure 2: Tables and Functions

A neural network is a group of nodes, or artificial neurons, that are connected in a way that allows them to be a universal function approximator. This means that given the right combination of nodes and connections, we can set up the network to mimic any input and output relationship. This is good for us because we can use the hundreds of pixel values in a robotic vision system as the input into this function, and the output could be the actuator commands that drive the robot’s arms and legs. Even though the function might be extremely complex, we know that there is a neural network of some kind that can achieve it.

Figure 3: Neural Network

On the left are the input nodes, one for each input to the function, and on the right are the output nodes. In between are columns of nodes called hidden layers. This network has 2 inputs, 2 outputs, and 2 hidden layers of 3 nodes each. With a fully connected network, there is an arrow, or a weighted connection, from each input node to each node in the next layer, and then from those nodes to the layer after that, and so on until the output nodes. The value of a node is the sum of each of its inputs multiplied by its respective weighting factor, plus a bias, and that result is passed through an activation function that gives the network its nonlinearity. We can perform this calculation for every node in a layer and write the weighted sums out in a compact matrix form as a system of linear equations.
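As a concrete sketch of that calculation, here is a minimal NumPy forward pass through the network in Figure 3 (two inputs, two hidden layers of three nodes, two outputs). The tanh activation and the random weights are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights and biases for the network in Figure 3:
# 2 inputs -> 3 hidden -> 3 hidden -> 2 outputs.
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((3, 3)), np.zeros(3)
W3, b3 = rng.standard_normal((2, 3)), np.zeros(2)

def forward(x):
    # Each layer is a weighted sum plus bias (the matrix form mentioned
    # above), followed by a nonlinear activation -- tanh is assumed here.
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return W3 @ h2 + b3            # raw outputs, e.g. actuator commands

observation = np.array([0.5, -1.2])   # two observed states
print(forward(observation))           # two output values
```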

We want to find a function that can take in many observations and transform them into a set of actions that will control some nonlinear environment. And since the structure of this function is often too complex for us to solve for directly, we want to approximate it with a neural network that learns the function over time. It’s tempting to think that we can just drop in any network, let a reinforcement learning algorithm loose to find the right combination of weights and biases, and call it done. Unfortunately, that’s not the case.

We must make a few choices about our neural network ahead of time to make sure it’s complex enough to approximate the function we’re looking for, but not so complex as to make training impossible or impractically slow. For example, as we’ve already seen, we need to choose an activation function, the number of hidden layers, and the number of neurons in each layer. But beyond that, we also have control over the internal structure of the network. Should it be fully connected, or should the connections skip layers as in a residual neural network? Should they loop back on themselves to create internal memory, as in recurrent neural networks? Should groups of neurons work together, as in a convolutional neural network? And so on.

We have a lot of choices, but as with other control techniques there isn’t one right approach. A lot of the time, it comes down to starting with a network structure that has already worked for the type of problem you’re trying to solve and tweaking it from there.
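As one example of such a starting point, here is a minimal sketch of a fully connected network written with PyTorch (an assumption of convenience; any deep learning framework would do), showing the kinds of choices just discussed: the number of hidden layers, the number of neurons per layer, and the activation function.

```python
import torch.nn as nn

# A fully connected starting point. The number of hidden layers, the width
# of each layer, and the activation function are all design choices made
# before training begins; the sizes here are hypothetical.
policy_network = nn.Sequential(
    nn.Linear(4, 64),   # 4 observations in (hypothetical)
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, 2),   # 2 actions out (hypothetical)
)
```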

Policy Gradient Methods

The structure looks straightforward, so now the question is, how do we approach training this type of network? To get a general feel for how this is done, let’s look at the Atari game Breakout.

If you’re not familiar, Breakout is this game where you are trying to eliminate bricks using a paddle to direct a bouncing ball. This game only has three actions: move the paddle left, right, or not at all. Additionally, there is a near-continuous state space that includes the position of the paddle, the position and velocity of the ball, and the location of the remaining bricks.

Figure 4: Policy Gradient Method

For this example, we’ll say the observation is a screen capture of the game, feeding in one frame at a time, and therefore there is one input into our neural net for every pixel. Now, with the network set, there are many approaches to training it. Policy gradient methods can work with a stochastic policy, which means that rather than making a deterministic choice (take a left or take a right), the policy outputs the probability of taking a left and the probability of taking a right. A stochastic policy takes care of the exploration/exploitation trade-off because exploring is built into the probabilities. Now, when it learns, the agent just needs to update the probabilities. Is taking a left a better option than taking a right? Push the probability of taking a left in this state a little bit higher. Over time, the agent will nudge these probabilities around in the direction that produces the most reward.
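Here is a minimal sketch of what a stochastic policy’s output stage could look like: the softmax conversion, the NumPy implementation, and the example numbers are illustrative assumptions, not the mechanism of any particular toolbox.

```python
import numpy as np

ACTIONS = ["left", "right", "stay"]   # Breakout's three actions

def softmax(logits):
    # Convert raw network outputs into probabilities that sum to 1.
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def sample_action(logits, rng=np.random.default_rng()):
    # A stochastic policy samples an action instead of always picking the
    # highest-scoring one, so exploration is built into the probabilities.
    probs = softmax(logits)
    return rng.choice(ACTIONS, p=probs), probs

action, probs = sample_action(np.array([0.2, 1.1, -0.3]))
print(action, probs)
```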

How does it know whether the actions were good or not? The idea is this: execute the current policy, collect reward along the way, and update the network to increase the probabilities of actions that led to more reward. What if the paddle went left, missing the ball and causing a negative reward? Then change the neural network to increase the probability of moving the paddle right the next time the agent is in that state. Essentially, the algorithm takes the derivative of the reward with respect to each weight and bias and adjusts them in the direction that increases reward. In this way, the learning algorithm moves the weights and biases of the network to ascend the reward slope, which is why the term gradient appears in the name.
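The sketch below shows that idea in its simplest form: a REINFORCE-style update for a linear softmax policy. The linear policy, the learning rate, and all names here are illustrative assumptions. Actions followed by positive reward have their probabilities nudged up; actions followed by negative reward are nudged down.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, num_actions = 4, 3
theta = rng.standard_normal((num_actions, obs_dim)) * 0.01   # policy weights
learning_rate = 0.01

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_update(observation, action, total_return):
    """Nudge the policy so the taken action becomes more (or less) likely,
    in proportion to the return that followed it."""
    global theta
    probs = softmax(theta @ observation)
    # Gradient of log pi(action | observation) for a linear softmax policy.
    grad_log_pi = -np.outer(probs, observation)
    grad_log_pi[action] += observation
    # Gradient *ascent* on reward: step in the direction that increases it.
    theta += learning_rate * total_return * grad_log_pi

# One illustrative update: the paddle moved left (action 0) and the episode
# ended with a negative return, so the probability of "left" in this state
# goes down slightly.
obs = np.array([0.1, -0.4, 0.8, 0.0])
reinforce_update(obs, action=0, total_return=-1.0)
```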

One of the downsides of policy gradient methods is that the naive approach of just following the direction of steepest ascent can converge on a local maximum rather than the global one. They can also converge slowly because of their sensitivity to noisy measurements, which happens, for example, when it takes a lot of sequential actions to receive a reward and the resulting cumulative reward has high variance between episodes. Imagine having to string together hundreds or thousands of sequential actions before the agent ever receives a single reward. You can see how time-consuming it would be to train an agent with such an extremely sparse reward system.

In the next post we will go through value function based learning and also introduce actor-critic methods. In the meantime, you can take a look at this video, which explains what we just learnt in more detail.
