Reinforcement Learning, Part 2: Understanding the Environment and Rewards

In this post, we build on our basic understanding of reinforcement learning (RL) by exploring the workflow. We cover what an environment is and some of the benefits of training within a simulated environment. We then look at how a reward function incentivises the learning process so that the agent achieves its goal. You can also watch this video, which explains these topics in greater detail.

To get started with a reinforcement learning project, you should understand the RL workflow: how each part of the process contributes to solving the problem, and which decisions you’ll have to make along the way.

Figure 1: Reinforcement Learning Workflow

Environment

First, we need an environment where our agent can learn, so we need to choose what should exist within that environment. Then we need to think about what we ultimately want our agent to do and craft a reward function that will incentivise the agent to do just that. We need to choose a way to represent the policy: how we want to structure the parameters and logic that make up the decision-making part of the agent. Once this is set up, we choose a training algorithm and get to work finding the optimal policy. Finally, we need to exploit the policy by deploying it onto an agent in the field and verifying the results. To put this workflow into perspective, let’s think about each of these steps in the context of two examples: balancing an inverted pendulum and getting a robot to walk.

The environment is everything that exists outside the agent. Practically speaking, it’s where the agent sends actions and it’s what generates rewards and observations.
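To make this interface concrete, here is a minimal sketch in Python. The corridor world, the class name `CorridorEnvironment`, and the `reset`/`step` method names are assumptions chosen for illustration; they are not taken from this post or from any particular toolbox.

```python
class CorridorEnvironment:
    """Hypothetical toy environment: a corridor with cells 0..length-1.

    The environment owns the dynamics and the reward. The agent only
    sends actions and receives observations and rewards.
    """

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (-1 = left, +1 = right).

        Returns a tuple (observation, reward, done).
        """
        # Dynamics: move one cell, but stay inside the corridor.
        self.position = min(max(self.position + action, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one agent-environment loop and return the total reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(observation):
    """A trivial fixed policy that ignores the observation."""
    return 1


print(run_episode(CorridorEnvironment(), always_right))  # 1.0
```

Note that the policy never inspects the environment’s internals; it only sees what `step` returns.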

Figure 2: RL Environment

This can be a bit confusing at first because, in controls, we think of the environment as everything outside of the controller and the plant: things like road imperfections, wind gusts, and other disturbances that impact the system you’re trying to control. But in reinforcement learning, the environment is everything outside the controller, which includes the plant dynamics as well. For the walking robot example, most of the robot is part of the environment. The agent is just the bit of software that generates the actions and updates the policy through learning. It’s the brain of the robot, so to speak.

This distinction is important because, with reinforcement learning, the agent does not need to know anything about the environment at all. This is called model-free RL, and it’s powerful because you can basically drop an RL-equipped agent into any system and, assuming you’ve given the policy access to the observations, actions, and enough internal states, the agent will learn how to collect the most reward on its own. This means the agent does not initially need to know anything about our walking robot. It will still figure out how to collect rewards without knowing, for example, how the joints move, how strong the actuators are, or how long the appendages are.

Without any understanding of the environment, an agent needs to explore all areas of the state space to fill out its value function, which means that it’ll spend some time exploring low-reward areas during the learning process. However, as designers, we often know that some parts of the state space are not worth exploring, and by providing a model of the environment, or of part of the environment, we give the agent this knowledge. This is model-based RL. For example, take an agent that is trying to determine the fastest route to a destination. Should it go right or left at this point? Without a model, the agent would have to explore the entire map to know what the best action is. With a model, the agent could evaluate going right without having to take that action physically. It could then figure out that going right results in a dead end, and so it would go left.
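The left-or-right example can be sketched as a lookahead in a model. The road map below and the names `ROAD_MODEL`, `steps_to_goal`, and `choose_action` are hypothetical, and breadth-first search is just one simple way to "imagine" outcomes without acting.

```python
from collections import deque

# Hypothetical road map used as the agent's model: each location lists
# the locations reachable from it. Going right leads to a dead end.
ROAD_MODEL = {
    "junction": {"left": "bridge", "right": "alley"},
    "bridge": {"forward": "destination"},
    "alley": {},  # dead end
    "destination": {},
}


def steps_to_goal(model, start, goal):
    """Breadth-first search inside the model.

    Returns the number of steps from start to goal, or None if the goal
    cannot be reached from start.
    """
    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        location, steps = queue.popleft()
        if location == goal:
            return steps
        for next_location in model[location].values():
            if next_location not in visited:
                visited.add(next_location)
                queue.append((next_location, steps + 1))
    return None


def choose_action(model, location, goal):
    """Pick the action whose simulated outcome reaches the goal soonest."""
    best_action, best_steps = None, None
    for action, next_location in model[location].items():
        steps = steps_to_goal(model, next_location, goal)
        if steps is not None and (best_steps is None or steps < best_steps):
            best_action, best_steps = action, steps
    return best_action


print(choose_action(ROAD_MODEL, "junction", "destination"))  # left
```

The agent rules out the dead end purely by querying the model; no physical step to the right is ever taken.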

Model-based RL is very powerful, but model-free RL is so popular right now because people hope to use it to solve problems where developing a model is difficult, such as controlling a car or a robot from pixel observations. And since model-free RL is the more general case, we’re going to focus on it for the rest of this series.

We know the agent learns by interacting with the environment, so we need a way for the agent to do that. This might be a physical environment or a simulation. For example, for the inverted pendulum, we may let the agent learn how to balance by running it with a physical pendulum setup. This might be a good solution since it’s probably hard for the hardware to damage itself or others. With the walking robot, however, this might not be such a good idea. As you can imagine, if the agent treats the robot and the world as a black box that it knows nothing about, then it’s going to do a lot of falling and flailing before it even learns how to move its legs, let alone how to walk. Not only could this damage the hardware, but it would also be extremely time consuming to have to pick the robot up each time. Not ideal.

An attractive alternative is to train your agent within a high-fidelity model of the environment and simulate the learning experience. And there are a lot of benefits to doing this.

The first benefit addresses sample inefficiency. Learning is a process that requires lots of samples: lots of trials, errors, and corrections, often in the millions or tens of millions. With a simulation, you can run the learning process faster than real time, and you can also spin up lots of simulations and run them all in parallel.
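A rough back-of-the-envelope calculation shows why this matters. All the numbers below (10 million samples, a 0.05 s sample time, a 10x simulation speed-up, 16 parallel simulations) are assumptions for illustration, and the function assumes ideal, linear scaling, which real setups rarely achieve.

```python
def training_hours(num_samples, sample_time_s, speedup=1.0, num_parallel=1):
    """Estimate wall-clock hours needed to collect num_samples samples.

    speedup is how much faster than real time one simulation runs;
    num_parallel is how many simulations run at once. Ideal scaling is
    assumed, so treat the result as an optimistic lower bound.
    """
    simulated_seconds = num_samples * sample_time_s
    wall_clock_seconds = simulated_seconds / (speedup * num_parallel)
    return wall_clock_seconds / 3600.0


# Hypothetical numbers: hardware in real time vs. a parallel simulation.
on_hardware = training_hours(10_000_000, 0.05)
in_simulation = training_hours(10_000_000, 0.05, speedup=10.0, num_parallel=16)

print(f"Real time on hardware: {on_hardware:.1f} hours")
print(f"Parallel simulation:   {in_simulation:.1f} hours")
```

Under these assumptions, days of continuous hardware time shrink to under an hour of simulation.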

The other benefit of a model of the environment is that you can simulate conditions that are hard to test for in the real world. For example, with the walking robot, you could simulate walking on a low-friction surface like ice, which would help the robot learn to stay upright on a wider range of surfaces.

The good news about needing a simulation is that, for control problems, we usually already have a good model of the system and environment, since we typically need one for traditional control design. If you already have a model built in MATLAB or Simulink, you can replace your existing controller with an RL agent, add a reward function to the environment, and start the learning process.

One of the difficulties here is figuring out how much of the environment to model: what to include and what to leave out. However, this is the same question you face when modelling a plant for controller design, so you can use the same intuition about your system to build an RL environment model.

One approach is to start training on a simple model, find the combination of hyperparameters that lets training succeed, and then add more complexity to the model later. Hyperparameters are the knobs we can turn on the training algorithms, setting things like the learning rate and the sample time; we will cover them in more detail in a future post.

Reward Signal

With the environment set, the next step is to think about what you want your agent to do and how you’ll reward it for doing so. The reward function plays a role like that of the cost function in a linear quadratic regulator (LQR), in which we trade off performance against effort.
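For reference, the standard infinite-horizon, continuous-time LQR cost has the form below. The symbols are the usual textbook ones rather than notation from this post: x is the state, u is the control input, and Q and R are weighting matrices chosen by the designer.

```latex
J = \int_{0}^{\infty} \left( x^{\top} Q \, x + u^{\top} R \, u \right) dt
```

The first term penalises state error (performance) and the second penalises actuator use (effort). A controller minimises this cost, whereas an RL agent maximises its reward, so a reward built on the same idea would carry the opposite sign.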

Figure 3: RL Reward Signal

However, unlike LQR, where the cost function is quadratic, in RL there’s really no restriction on how you create a reward function. We can have sparse rewards, rewards at every time step, or rewards that only come at the very end of an episode after long periods of time. They can be calculated from a nonlinear function or calculated using thousands of parameters. It depends on what it takes to train your agent effectively.

Do you want to get an inverted pendulum to stand upright? Then maybe give more rewards to your agent as the angle from vertical gets smaller. Want to take controller effort into account? Then subtract rewards as actuator use increases. Want to encourage a robot to walk across the floor? Then give the agent a reward when it reaches some state that’s far away.
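As a minimal sketch of the pendulum ideas above, the function below rewards small angles and subtracts a penalty for actuator use. The cosine term, the squared-torque penalty, and the `effort_weight` value are illustrative assumptions, not a recommended design.

```python
import math


def pendulum_reward(angle_rad, torque, effort_weight=0.1):
    """Hypothetical reward: larger when upright, smaller with more torque.

    angle_rad is the angle from vertical in radians. effort_weight is an
    illustrative tuning knob, not a recommended value.
    """
    upright_term = math.cos(angle_rad)  # +1 when upright, -1 when hanging down
    effort_penalty = effort_weight * torque ** 2
    return upright_term - effort_penalty


print(pendulum_reward(0.0, 0.0))   # upright, no effort: 1.0
print(pendulum_reward(0.5, 0.0))   # tilted, no effort
print(pendulum_reward(0.0, 2.0))   # upright, but using a lot of torque
```

Changing `effort_weight` shifts the balance between standing upright and saving actuator effort, much like tuning the weights in an LQR cost.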

Making a reward function is easy. It can be any function you can think of. Making a good reward function, on the other hand, is hard. And unfortunately, there isn’t a straightforward way to craft your reward to guarantee your agent will converge on the solution you want. This boils down to two main reasons.

Often the goal you want to incentivise comes only after a long sequence of actions; this is the sparse reward problem. Your agent will therefore stumble around for long periods of time without receiving any rewards. This would be the case for the walking robot if it received a reward only after successfully walking 10 meters. The chance that your agent will randomly stumble on the action sequence that produces the sparse reward is very small. Imagine the luck needed to generate all the correct motor commands to keep a robot upright and walking, rather than just flopping around on the ground!
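To get a feel for how small that chance can be, consider a deliberately crude model: the agent picks uniformly among a few discrete actions at each step, and exactly one action sequence earns the reward. Both assumptions are simplifications made for this sketch (a real robot has continuous actions and many successful gaits), and the numbers are hypothetical.

```python
def random_success_probability(num_actions, sequence_length):
    """Probability of hitting one specific action sequence by chance.

    Assumes a uniformly random choice among num_actions at every step
    and exactly one rewarded sequence of the given length.
    """
    return (1.0 / num_actions) ** sequence_length


# Hypothetical numbers: 4 possible motor commands, 50 time steps.
probability = random_success_probability(4, 50)
print(f"Chance of a lucky episode: {probability:.2e}")
```

Even in this tiny example, the agent would essentially never see the reward by luck alone.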

This sparse reward problem can be mitigated by shaping the reward: providing smaller intermediate rewards that coax the agent along the right path. But reward shaping comes with its own set of problems, and this is the second reason crafting a reward function is difficult. If you give an optimisation algorithm a shortcut, it’ll take it! Shortcuts are hidden within reward functions, and more so when you start shaping them. This causes your agent to converge on a solution that is optimal for the reward function as written, but not the behaviour you intended.

An easy example to think about is giving an intermediate reward if the body of the robot travels 1 meter from its current spot. The optimal solution might not be to walk that 1 meter, but rather to fall ungracefully toward the reward. To the learning algorithm, walking and falling both provide the same reward, but, to the designer, one result is preferred over the other.
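The sketch below shows why the learning algorithm cannot tell the two outcomes apart. The function name `shaped_reward` and the two trajectory dictionaries are hypothetical; the point is only that the reward looks at distance travelled and nothing else.

```python
def shaped_reward(start_x, end_x, milestone_m=1.0):
    """Intermediate reward: 1.0 once the body has moved milestone_m meters.

    The reward only checks how far the body moved, not how it moved.
    """
    return 1.0 if abs(end_x - start_x) >= milestone_m else 0.0


# Two hypothetical outcomes that both move the body 1 meter forward.
walking = {"start_x": 0.0, "end_x": 1.0, "body_height_m": 0.8}
falling = {"start_x": 0.0, "end_x": 1.0, "body_height_m": 0.1}

walking_reward = shaped_reward(walking["start_x"], walking["end_x"])
falling_reward = shaped_reward(falling["start_x"], falling["end_x"])

print(walking_reward, falling_reward)  # 1.0 1.0
```

Because body height never enters the calculation, falling forward is just as good as walking as far as this reward is concerned.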

Policy

Now that we have the environment, which provides the rewards, we’re ready to start work on the agent itself. The agent consists of the policy and the learning algorithm, and these two things are intimately intertwined. Many learning algorithms require a specific policy structure, and choosing an algorithm depends on the nature of the environment.

Figure 4: RL Policy

The policy is a function that takes in state observations and outputs actions, so really, any function with that input and output relationship can work. In that way of thinking, we could use a simple table to represent policies.

Tables are exactly what you would expect. They’re an array of numbers where you use an input as a lookup address and output the corresponding value. For example, a Q-function can be stored as a table that maps each state-action pair to a value. So, given a state, S, the policy would be to look up the value of every possible action from that state and choose the action with the highest value. Training an agent with a Q-function would then consist of learning, over time, the value of each action in each state.
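A table-based policy of this kind can be sketched in a few lines. The states, actions, and values in `Q_TABLE` are made up for illustration; in practice the values would be learned during training.

```python
# Hypothetical Q-table: Q_TABLE[state][action] -> estimated value.
Q_TABLE = {
    "leaning_left": {"push_left": 0.6, "push_right": -0.4},
    "upright": {"push_left": 0.1, "push_right": 0.2},
    "leaning_right": {"push_left": -0.5, "push_right": 0.7},
}


def greedy_policy(q_table, state):
    """Look up every action's value for this state and pick the highest."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


print(greedy_policy(Q_TABLE, "leaning_left"))   # push_left
print(greedy_policy(Q_TABLE, "leaning_right"))  # push_right
```

The policy itself is just a lookup followed by a maximum; all of the learning effort goes into filling in the table.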

This type of representation falls apart when the number of state-action pairs gets large or becomes infinite. This is the so-called curse of dimensionality. Imagine our inverted pendulum. The state of the pendulum can be any angle from -π to π, and the action you can take can be any motor torque from the negative limit to the positive limit. Trying to capture all of that in a table is not feasible. Now, we could represent the continuous nature of the state-action space with a continuous function. But setting up such a function so that we could learn the right parameters would require us to know its structure ahead of time, which might be difficult for systems that are nonlinear or have many degrees of freedom.
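Even if we discretise the continuous ranges, the table grows very quickly. The sketch below counts table entries for a given grid resolution; the bin counts and the 12-state, 6-action robot are hypothetical numbers chosen only to show the growth.

```python
def table_entries(bins_per_state_dim, num_state_dims,
                  bins_per_action_dim, num_action_dims):
    """Number of entries in a table covering every state-action pair."""
    num_states = bins_per_state_dim ** num_state_dims
    num_actions = bins_per_action_dim ** num_action_dims
    return num_states * num_actions


# Pendulum as described above: one angle, one torque, 100 bins each.
pendulum_entries = table_entries(100, 1, 100, 1)

# Hypothetical walking robot: 12 state dimensions, 6 actuators.
robot_entries = table_entries(100, 12, 100, 6)

print(pendulum_entries)        # 10000
print(f"{robot_entries:.0e}")  # 1e+36
```

Every added dimension multiplies the size of the table, which is why a lookup table stops being practical well before the problem becomes truly continuous.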

Now that we have a better understanding of what the environment and rewards are, the next step is to learn more about policies and the different learning algorithms. In the next post, we will look at the algorithms that reside within an agent and cover why we use neural networks to represent functions. In the meantime, you can take a look at this video, which explains what we just covered in more detail.
