Soft Actor-Critic part 1: intuition and theoretical aspect
How to teach robustness to a deep reinforcement learning agent using the maximum entropy principle. In this essay, I cover the building blocks of the SAC algorithm and the relevant nuts and bolts of the Maximum Entropy RL framework.
Director of the Bachelor of Software Engineering program at Université Laval
Acknowledgments
A big thank you to my angel Karen Cadet for her support and precious insight into the English language.
Soft Actor-Critic
Soft Actor-Critic (SAC) is an off-policy algorithm based on the Maximum Entropy Reinforcement Learning framework.
The main idea behind Maximum Entropy RL is to frame the decision-making problem as a probabilistic graphical model from top to bottom and then solve it using tools borrowed from that same field. Under the Maximum Entropy RL framework, a learning agent seeks to maximize both the return and the entropy simultaneously.
This approach benefits Deep Reinforcement Learning algorithms by giving them the capacity to consider and learn many alternate paths leading to an optimal goal, as well as the capacity to act optimally despite adverse circumstances.
SAC is an off-policy algorithm, which means it has the ability to train on samples coming from a different policy.
What is particular though is that, contrary to other off-policy algorithms, it is stable. This means that the algorithm is much less picky in terms of hyperparameter tuning.
SAC is currently the state-of-the-art Deep Reinforcement Learning algorithm, together with Twin Delayed Deep Deterministic policy gradient (TD3).
The learning curve of the Maximum Entropy RL framework is quite steep due to how much it rethinks the RL problem. Diving in depth into the Maximum Entropy RL theory is definitely required in order to understand how SAC works.
Tackling the applied part was arguably the hardest project I have done so far, both in terms of component implementation and in terms of evil silent bugs.
Nevertheless, it was worth every headache and nightmare I had in the process once it started working.
Since there are a lot of different notations across papers, I've decided to follow (for the most part) the convention established by Sutton & Barto in their book Reinforcement Learning: An Introduction.
An overview of the Maximum Entropy framework
Goal:
Solving decision-making problems in a way that defines optimality as a context-dependent and multi-solution concept.
The Maximum Entropy framework is an approach to solving decision-making problems that uses the formalism and tools of the field of Probabilistic Graphical Models.
The difference between Classical RL and Maximum Entropy RL is not obvious at first because, in both cases, it's all about dealing with probabilities, random variables, expectation maximization and so on.
As we will see, the two are fundamentally different.
Another way of framing the decision-making problem
The classical approach used in RL is to formalize the decision-making problem using a Probabilistic Graphical Model augmented with a reward input and then seek to maximize the cumulative reward using tools borrowed from Stochastic Optimization (in the broad sense).
The Maximum Entropy RL approach, on the other hand, formalizes the decision-making problem as a Probabilistic Graphical Model and then solves a learning and inference problem using Probabilistic Graphical Model tools.
While the former can use a Probabilistic Graphical Model to describe the RL problem, the latter formalizes the complete RL problem as a Probabilistic Graphical Model.
In other words, the Maximum Entropy RL framework formalizes and solves the entirety of the RL problem using probability theory. This approach to tackling decision-making problems is not new in the literature and goes by many names: Maximum Entropy RL, KL-divergence control, stochastic optimal control. In this essay, we will use the name Maximum Entropy RL.
How does it make the RL problem different (an intuition):
Consider an environment with a continuous action space.
The Classical RL approach would specify the agent policy \pi as a unimodal probability distribution whose center lies at the maximal Q-value and indicates the optimal action for a given state.
In contrast, the Maximum Entropy RL approach would specify the policy as a multimodal distribution whose mode centers lie at local Q-value maxima, each indicating a good alternative action for a given state.
Why does it matter?
Because, in any given real-life task, there is in general more than one way to do things, an RL agent should be able to handle the following scenarios:
Suppose the optimal way is simply impractical at a given time, meaning there is no choice but to fall back to a less optimal way. Does the agent know how to handle non-optimal alternative ways of doing things?
Suppose there is more than one optimal way of doing things, all leading to the same end result. How does the agent choose between them?
Suppose now that there exist multiple equally optimal but different end results. How does the agent proceed?
What about the case where there are many ways of doing things and only one optimal way, but we want the agent to relax its expectation regarding the end result; in other words, we don't care whether the end result is optimal or near optimal. Will the agent be able to make good use of that relaxed requirement of near optimality?
Those are all legitimate scenarios that an agent should be required to handle successfully in order to become effective, skillful, agile, nimble, resilient and capable of handling adverse conditions.
The problem with Classical RL is that it converges (in expectation) to a deterministic policy \pi.
This is one of the key takeaways proved in the Deterministic Policy Gradient (DPG) paper (see appendix C in the supplementary material). They show that, for a wide range of stochastic policies, the policy gradient converges to the deterministic gradient as the policy variance goes to zero. The same idea applies to value-based algorithms.
This means that the algorithm will optimize for one and only one way of doing things.
Once it starts believing that a path is more promising than the others, it will start to optimize for that believed-best path and it will discard all the alternate ones.
Even if the algorithm is forced to explore using some trick, those tricks only promote believed-unpromising paths for consideration; the result is still an algorithm that learns to optimize for one and only one way of doing things.
On the other hand, Maximum Entropy RL optimizes for multiple alternate ways of doing things, which leads to algorithms that exhibit the following properties:
effective exploration
transfer learning capabilities out of the box
robustness and adaptability
Nuts and bolts (key components related to SAC)
The MaxEnt policy \pi:
Recall the Classical RL stochastic policy definition \pi(\mathbf{a}_t | \mathbf{s}_t), which is modelled either as a categorical or a Gaussian distribution, with its mean value representing the best action \mathbf{a}_t given state \mathbf{s}_t at timestep t.
How this distribution is defined is an arbitrary choice left to the algorithm designer.
On the other hand, Maximum Entropy RL defines the distribution explicitly, either in terms of the advantage A^\pi as

\pi_{MaxEnt}(\mathbf{a}_t | \mathbf{s}_t) \propto \exp\Big( \frac{1}{\alpha} A^\pi(\mathbf{s}_t, \mathbf{a}_t) \Big)

or in terms of the Q-function Q^\pi as

\pi_{MaxEnt}(\mathbf{a}_t | \mathbf{s}_t) \propto \exp\Big( \frac{1}{\alpha} Q^\pi(\mathbf{s}_t, \mathbf{a}_t) \Big)

with \alpha being the temperature and \propto the symbol of proportionality.
We can observe that it's analogous to the Boltzmann distribution (a.k.a. the Gibbs distribution), with the advantage playing the role of the negative energy found in the Boltzmann distribution.
Equivalently, it gives us a probability measure of an RL agent taking action \mathbf{a}_t in a given state \mathbf{s}_t as a function of that state-action pair's energy A(\mathbf{s}_t, \mathbf{a}_t) (the advantage).
As the temperature \alpha decreases and approaches zero, the policy behaves like a standard greedy policy.
This hyperparameter \alpha controls the stochasticity of the policy and becomes very useful later on during training to adjust the trade-off between exploration and exploitation.
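To make the effect of the temperature concrete, here is a small NumPy sketch of my own (not from the SAC paper); the Q-values are made up for illustration:

```python
import numpy as np

def boltzmann_policy(q_values, alpha):
    """Return action probabilities proportional to exp(Q / alpha)."""
    logits = q_values / alpha
    logits -= logits.max()              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

q = np.array([1.0, 0.9, 0.2, -0.5])     # hypothetical Q-values for 4 actions

print(boltzmann_policy(q, alpha=10.0))  # high temperature -> near uniform (more exploration)
print(boltzmann_policy(q, alpha=0.05))  # low temperature  -> near greedy (more exploitation)
```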
The MaxEnt objective J(\pi):
The RL objective derived from the Maximum Entropy RL framework is similar to the Classical RL one, with the exception of an added entropy term \mathcal{H} weighted by the temperature \alpha:

J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi} \Big[ r(\mathbf{s}_t, \mathbf{a}_t) + \alpha \, \mathcal{H}\big( \pi( \cdot | \mathbf{s}_t) \big) \Big]
Key idea:
This objective seeks to maximize the expected return and the action entropy at the same time.
Moving parts
The return: Same as in Classical RL
The entropy term: Can be viewed as a regularizer, a uniform prior over the policy, or a way to tackle the exploration/exploitation trade-off.
The temperature \alpha: Controls the trade-off between exploration and exploitation.
\mathcal{H} tells us how wide the distribution from which the random variables are sampled is.
A wide distribution will produce high-entropy random variables.
A narrow distribution will produce low-entropy random variables.
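As a quick illustration of my own (not from the post), the differential entropy of a Gaussian grows with its standard deviation, so a wider Gaussian produces higher-entropy samples:

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian: 0.5 * log(2 * pi * e * sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma**2)

print(gaussian_entropy(2.0))   # wide distribution   -> higher entropy (~ 2.11 nats)
print(gaussian_entropy(0.1))   # narrow distribution -> lower entropy  (~ -0.88 nats)
```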
The soft value functions Q^\pi and V^\pi
Under the Maximum Entropy framework, both value functions are redefined to handle the added entropy term.
First, we need to rewrite the MaxEnt objective by expanding the entropy term and using the Q-function definition, such that

J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim \rho_\pi} \Big[ r(\mathbf{s}_t, \mathbf{a}_t) - \alpha \log \pi(\mathbf{a}_t | \mathbf{s}_t) \Big]

This leads us to the definition of both value functions. The state-action value function is defined as

Q_{soft}^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_{t+1} \sim p} \Big[ r_{t+1} + \gamma \, V_{soft}^\pi(\mathbf{s}_{t+1}) \Big]

and the state value function is defined as

V_{soft}^\pi(\mathbf{s}_t) = \mathbb{E}_{\mathbf{a}_t \sim \pi} \Big[ Q_{soft}^\pi(\mathbf{s}_t, \mathbf{a}_t) - \alpha \log \pi(\mathbf{a}_t | \mathbf{s}_t) \Big]

or alternatively

V_{soft}(\mathbf{s}_t) = \alpha \log \int_{\mathcal{A}} \exp\Big( \frac{1}{\alpha} Q_{soft}(\mathbf{s}_t, \mathbf{a}') \Big) d\mathbf{a}'

We can also rewrite the Bellman equation in terms of Q_{soft}^\pi and V_{soft}^\pi

Q_{soft}^\pi(\mathbf{s}_t, \mathbf{a}_t) = \mathbb{E}_{\mathbf{s}_{t+1} \sim p} \Big[ r_{t+1} + \gamma \, \mathbb{E}_{\mathbf{a}_{t+1} \sim \pi} \big[ Q_{soft}^\pi(\mathbf{s}_{t+1}, \mathbf{a}_{t+1}) - \alpha \log \pi(\mathbf{a}_{t+1} | \mathbf{s}_{t+1}) \big] \Big]

with p being the transition dynamic.
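To ground these definitions, here is a tiny NumPy example of my own (made-up numbers, a discrete action set) that performs one soft Bellman backup: it computes V_{soft}(\mathbf{s}_{t+1}) from the entropy-augmented expectation and then backs it up into Q_{soft}(\mathbf{s}_t, \mathbf{a}_t):

```python
import numpy as np

alpha, gamma = 0.5, 0.99
r_next = 1.0                                   # reward observed after (s_t, a_t)
q_next = np.array([2.0, 1.0, -0.5])            # soft Q-values of the 3 actions in s_{t+1}

# MaxEnt policy in s_{t+1}: pi(a|s) proportional to exp(Q(s, a) / alpha)
pi_next = np.exp(q_next / alpha)
pi_next /= pi_next.sum()

# V_soft(s_{t+1}) = E_{a~pi}[ Q(s_{t+1}, a) - alpha * log pi(a|s_{t+1}) ]
v_next = np.sum(pi_next * (q_next - alpha * np.log(pi_next)))

# Q_soft(s_t, a_t) = E[ r_{t+1} + gamma * V_soft(s_{t+1}) ] (single deterministic transition here)
q_backup = r_next + gamma * v_next
print(v_next, q_backup)
```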
Without diving too deep into the Maximum Entropy framework, it's valuable to point out that Q_{soft}^\pi doesn't work like a Classical RL reward-to-go Q^\pi, and it doesn't have the same properties either. For those who are familiar with Probabilistic Graphical Models, it's a backward message.
The Maximum Entropy RL Q-function general definition is

Q_{soft}(\mathbf{s}_t, \mathbf{a}_t) = \log P\big( \mathcal{O}_{t:T} | \mathbf{s}_t, \mathbf{a}_t \big)

with \mathcal{O} being an optimality random variable. This definition can be interpreted as
the probability (on a logarithmic scale) of being optimal from timestep t until the end of the trajectory,
given that we are in state \mathbf{s}_t and we take action \mathbf{a}_t.
Looking back at the initial formulation of the policy \pi_{MaxEnt} in its ratio form

\pi_{MaxEnt}(\mathbf{a}_t | \mathbf{s}_t) = \frac{\exp\big( \frac{1}{\alpha} Q_{soft}(\mathbf{s}_t, \mathbf{a}_t) \big)}{\exp\big( \frac{1}{\alpha} V_{soft}(\mathbf{s}_t) \big)}

we can appreciate how it uses both value functions in order to evaluate the quality of an action \mathbf{a}_t with regard to every other legal action \mathbf{a}'_t available in that state \mathbf{s}_t. This ratio is equivalent to the formulation of a conditional probability, with the lowest possible value being close to 0.0 for a very bad action outcome. In effect, this probability-like formulation means that
it gives a measure of the quality of an ongoing trajectory, as per the definition of a measure in measure theory.
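A quick numerical sanity check (my own illustration, with a discrete action set and made-up soft Q-values) shows that this ratio behaves like a proper conditional probability: each value lies in [0, 1] and they sum to one over the actions:

```python
import numpy as np

alpha = 0.5
q_soft = np.array([2.0, 1.5, -1.0, -3.0])

# Soft state value as the soft maximum over actions: alpha * log sum_a' exp(Q(s, a') / alpha)
v_soft = alpha * np.log(np.sum(np.exp(q_soft / alpha)))

# Ratio form of the MaxEnt policy: exp((Q - V) / alpha)
pi = np.exp((q_soft - v_soft) / alpha)

print(pi)        # every entry lies in [0, 1]; the very bad actions get probabilities near 0
print(pi.sum())  # sums to 1.0, i.e. a valid conditional probability distribution
```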
Having this in mind leads us to an interesting question:
Is taking the value-function Q_{classical}^\pi alone a reliable way of assessing the quality of a trajectory?
It's common to say in Classical RL/DRL that the reward-to-go Q_{classical}^\pi is a measure of performance. Nevertheless, in the sense of measure theory, it is not, since it does not satisfy the null empty set property \mu(\emptyset)=0. As a matter of fact,
we have no guarantee that the environment dynamic was designed to produce a null return only for an empty trajectory.
Take the case of an environment where the reward signal upper bound is 0 by design. In that case, 0 would be the highest possible return, but it certainly doesn't mean that it's a non-rewarding trajectory or that the trajectory was initialized in a terminal state.
For now, this is all we need to understand how the Soft Actor-Critic algorithm works.
More on the subject of Maximum Entropy RL in a future post.
Building blocks of Soft Actor-Critic
The algorithm seeks to maximize the maximum entropy objective J(\pi_{MaxEnt}) by doing soft policy iteration, which is similar to regular policy iteration (more on this in the algorithm anatomy section 1.4).
To do so, the algorithm has to learn simultaneously the soft Q-function Q_\theta^\pi and the Maximum Entropy policy \pi_{MaxEnt}.
Because it learns both a value function and a policy at the same time, Soft Actor-Critic (SAC for short) is considered a value-based Actor-Critic algorithm. This also means that it can be trained using off-policy samples.
Off-policy learning capability is a very valuable and coveted ability: it means that the algorithm can learn from samples generated by a different policy distribution than the current one. Those samples could come from the same but older policy \pi_{older} (in other words, samples generated earlier), or they could come from a totally different policy \pi' producing them elsewhere.
The key benefit is that it can speed up training by reducing the overhead of having to produce new samples at every gradient step.
It's particularly useful in environments where producing new samples is a long process, like in real-life robotics.
Learning the soft Q-function
Recall that we talked earlier about Q_{soft}^\pi being a Probabilistic Graphical Model backward message (section 1.2.1.0.1). In order to be able to compute that value efficiently, we need to approximate it.
We can do this by representing it as a parametrized function Q_{\theta}^\pi(\mathbf{s}_t, \mathbf{a}_t) with parameters \theta. We then learn \theta by minimizing the squared soft Bellman residual error

J_Q(\theta) = \mathbb{E}_{(\mathbf{s}_t, \mathbf{a}_t) \sim D} \Big[ \frac{1}{2} \Big( Q_\theta(\mathbf{s}_t, \mathbf{a}_t) - \widehat{Q}_{soft}^\pi(\mathbf{s}_t, \mathbf{a}_t) \Big)^2 \Big]

with \widehat{Q}_{soft}^\pi being the target.
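As a rough PyTorch-style sketch of my own (not the post's code; q_net and v_target_net are hypothetical modules, and the target value network V_{\bar{\psi}} is introduced further down), the residual could be computed like this:

```python
import torch
import torch.nn.functional as F

def soft_q_loss(q_net, v_target_net, batch, gamma=0.99):
    """Squared soft Bellman residual for one Q-network.

    Assumed interfaces: q_net(s, a) -> Q-value, v_target_net(s) -> soft state value.
    batch = (s, a, r, s_next, done) float tensors sampled from the replay buffer.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Soft Q target: r + gamma * V_target(s'), zeroed on terminal transitions
        q_hat = r + gamma * (1.0 - done) * v_target_net(s_next)
    return F.mse_loss(q_net(s, a), q_hat)
```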
In theory, the value V_{soft}^\pi(\mathbf{s}_t) can be estimated directly using equation 6. However, representing V_{soft}^\pi explicitly has the added benefit of helping stabilize learning. We can represent it as a parametrized function V_{\psi}^\pi(\mathbf{s}_t) with parameters \psi. We then learn \psi by minimizing the squared residual error

J_V(\psi) = \mathbb{E}_{\mathbf{s}_t \sim D} \Big[ \frac{1}{2} \Big( V_\psi(\mathbf{s}_t) - \widehat{V}_{soft}^\pi(\mathbf{s}_t) \Big)^2 \Big]

with \widehat{V}_{soft}^\pi being the target.
We now need a way to represent and learn \pi_{MaxEnt}(\mathbf{a}_t | \mathbf{s}_t).
Deriving the objective J(\pi_{MaxEnt}) of SAC:
Let's first rewrite the Maximum Entropy RL objective for a single timestep in terms of Q_\theta^\pi

J(\pi) = \mathbb{E}_{\mathbf{s}_t \sim D} \bigg[ D_{KL}\Big( \pi(\cdot | \mathbf{s}_t) \,\Big\Vert\, \frac{\exp\big( \frac{1}{\alpha} Q_\theta^\pi(\mathbf{s}_t, \cdot) \big)}{C} \Big) \bigg]

with the constant C being the partition function that is used to normalize the distribution. We can then learn this objective by minimizing the expected KL-divergence directly, using this update rule

\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\Big( \pi'(\cdot | \mathbf{s}_t) \,\Big\Vert\, \frac{\exp\big( \frac{1}{\alpha} Q_\theta^{\pi_{old}}(\mathbf{s}_t, \cdot) \big)}{C} \Big)

with \Pi being a family of policies.
Note:
The authors of the SAC paper have demonstrated that the constant C can be omitted since it does not contribute to the gradient of J(\pi_{MaxEnt}) (see appendix B).
KL-divergence (a.k.a. relative entropy)
It tells us how different two distributions are:
0 \leq D_{KL} \big(q \Vert p\big)
D_{KL} \big(q \Vert p\big) = 0 \ \Longrightarrow \ q and p are identical (almost everywhere)
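A tiny NumPy example of my own (discrete distributions, made-up numbers) illustrates those properties, plus the fact that the KL-divergence is not symmetric:

```python
import numpy as np

def kl_divergence(q, p):
    """D_KL(q || p) for discrete distributions over the same support (no zero entries)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
peaked  = np.array([0.70, 0.10, 0.10, 0.10])

print(kl_divergence(uniform, uniform))   # 0.0 -> the two distributions are identical
print(kl_divergence(peaked, uniform))    # > 0 -> the two distributions differ
print(kl_divergence(uniform, peaked))    # a different positive value: D_KL is not symmetric
```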
Learning the policy \pi_{MaxEnt}
Concretely, we are going to approximate the policy \pi_{SAC} by representing it as a parametrized Gaussian distribution \pi_\phi(\mathbf{a}_t | \mathbf{s}_t) with parameters \phi and a learnable mean and covariance.
We cannot directly differentiate through a probability distribution, but using the reparameterization trick we can remodel the policy so that it exposes its parameters differently: in our case, the mean and covariance of a Gaussian distribution. Using this trick, we can express the action \mathbf{a}_t as

\mathbf{a}_t = f_\phi(\epsilon_t; \mathbf{s}_t)

where \epsilon_t \sim \mathcal{N}(0, I) is a noise vector sampled from a fixed spherical Gaussian, and define the policy \pi_\phi implicitly in terms of f_\phi.
We then rewrite equation 11 as

J_\pi(\phi) = \mathbb{E}_{\mathbf{s}_t \sim D, \, \epsilon_t \sim \mathcal{N}} \Big[ \alpha \log \pi_\phi\big( f_\phi(\epsilon_t; \mathbf{s}_t) | \mathbf{s}_t \big) - Q_\theta\big( \mathbf{s}_t, f_\phi(\epsilon_t; \mathbf{s}_t) \big) \Big]
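Here is a minimal PyTorch-style sketch of my own of such a reparameterized Gaussian policy (hypothetical module and method names; the tanh squashing discussed further down is omitted for brevity):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Toy reparameterized Gaussian policy: a_t = f_phi(eps; s_t) = mu(s_t) + sigma(s_t) * eps."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def sample(self, state):
        h = self.body(state)
        mu, std = self.mu(h), self.log_std(h).clamp(-20, 2).exp()
        dist = torch.distributions.Normal(mu, std)
        action = dist.rsample()                      # reparameterized sample: gradients flow through mu and std
        log_prob = dist.log_prob(action).sum(-1)
        return action, log_prob

# Policy loss sketch: minimize E[ alpha * log pi(a|s) - Q(s, a) ] with a = f_phi(eps; s)
# actions, log_probs = policy.sample(states)
# policy_loss = (alpha * log_probs - q_net(states, actions)).mean()
```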
Algorithm anatomy:
In order to be effective while tackling large continuous domains, the SAC algorithm uses an approximation of the soft policy iteration algorithm:
It uses function approximators for the soft Q-function Q_{soft}^\pi and the policy \pi_{SAC}.
It alternates between soft policy evaluation and soft policy improvement instead of running each one to convergence separately, like in Classical RL policy iteration.
Training soft function approximator: implementation details
The algorithm will learn
a parametrized soft state-value function V_\psi^\pi(\mathbf{s}_t);
two parametrized soft Q-functions Q_\theta^\pi(\mathbf{s}_t, \mathbf{a}_t);
and a maximum entropy policy \pi_\phi(\mathbf{a}_t | \mathbf{s}_t);
with V_\psi^\pi, Q_\theta^\pi and \pi_\phi modelled as neural networks with parameters \psi, \theta and \phi.
Note that \pi_\phi will be reparameterized as a Gaussian distribution with a learnable mean and covariance.
The algorithm also uses a replay buffer D to collect and accumulate samples (\mathbf{s}_t, \mathbf{a}_t, \mathbf{s}_{t+1}, r_{t+1}, done_{t+1}).
One of the key benefits of sampling tuples randomly from a replay buffer is that it breaks temporal correlation, which helps learning.
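A minimal Python sketch of my own of such a buffer (far simpler than a production implementation) could look like this:

```python
import random

class ReplayBuffer:
    """Minimal FIFO replay buffer storing (s, a, s_next, r, done) tuples."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []
        self.position = 0

    def push(self, s, a, s_next, r, done):
        if len(self.storage) < self.capacity:
            self.storage.append(None)
        self.storage[self.position] = (s, a, s_next, r, done)
        self.position = (self.position + 1) % self.capacity   # overwrite the oldest samples first

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between consecutive samples
        return random.sample(self.storage, batch_size)
```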
Why learn the soft state-value function V_\psi^\pi?
As we discussed earlier, there is no theoretical requirement for learning V_\psi(\mathbf{s}_t), since it can be recovered directly from equation 6 using Q_\theta(\mathbf{s}_t, \mathbf{a}_t) and \pi_\phi(\mathbf{a}_t | \mathbf{s}_t).
In practice, however, it can stabilize training.
Why learn the two soft Q-functions Q_\theta^\pi?
The policy improvement step is known to produce a positive bias that degrades the performance of value-based methods. The clipped double-Q trick helps reduce the impact of this problem: the algorithm learns Q_{\theta 1}^\pi and Q_{\theta 2}^\pi separately, then takes the minimum of the two when training V_\psi^\pi.
In practice, the SAC authors found that it significantly speeds up training, especially on harder tasks.
Detail regarding the Soft state value function
As we explained earlier, we can train the soft state value function by least-squares regression

J_V(\psi) = \mathbb{E}_{\mathbf{s}_t \sim D} \bigg[ \frac{1}{2} \Big( V_\psi(\mathbf{s}_t) - \mathbb{E}_{\mathbf{a}_t \sim \pi_\phi} \big[ \min\big( Q_{\theta 1}(\mathbf{s}_t, \mathbf{a}_t), Q_{\theta 2}(\mathbf{s}_t, \mathbf{a}_t) \big) - \alpha \log \pi_\phi(\mathbf{a}_t | \mathbf{s}_t) \big] \Big)^2 \bigg]

Observe that the state \mathbf{s}_t is sampled from the replay buffer, but not the action \mathbf{a}_t, which is sampled from the current policy \pi. It's a critical detail that is easy to miss. Also, this is where we make use of our two learned soft Q-functions.
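A PyTorch-style sketch of my own of that target computation (hypothetical q_net1, q_net2, v_net and policy objects) makes both details explicit:

```python
import torch
import torch.nn.functional as F

def soft_v_loss(v_net, q_net1, q_net2, policy, states, alpha=0.2):
    """Soft state-value loss: states come from the replay buffer,
    actions are re-sampled from the *current* policy."""
    actions, log_probs = policy.sample(states)        # fresh actions, not the buffered ones
    with torch.no_grad():
        q_min = torch.min(q_net1(states, actions), q_net2(states, actions))   # clipped double-Q
        v_target = q_min - alpha * log_probs
    return F.mse_loss(v_net(states), v_target)
```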
Detail regarding the Soft Q-function
Again, we can train the soft Q-function by least-squares regression, this time with the target

\widehat{Q}_{soft}^\pi(\mathbf{s}_t, \mathbf{a}_t) = r_{t+1} + \gamma \, \mathbb{E}_{\mathbf{s}_{t+1} \sim p} \big[ V_{\bar{\psi}}^\pi(\mathbf{s}_{t+1}) \big]
The learning target is computed with a copy of the main V_\psi^\pi(\mathbf{s}_t) network whose parameters are noted \bar{\psi}.
The network weights of V_\psi^\pi are copied to V_{\bar{\psi}}^\pi in a controlled manner, using an exponential moving average adjusted by the target smoothing coefficient hyperparameter \tau.
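In code, this update (\bar{\psi} \leftarrow \tau \psi + (1 - \tau) \bar{\psi}, as in the SAC paper) is often implemented as a small helper; here is a sketch of my own, assuming both networks are torch modules:

```python
def polyak_update(v_net, v_target_net, tau=0.005):
    """Exponential moving average of the value network weights into the target network."""
    for param, target_param in zip(v_net.parameters(), v_target_net.parameters()):
        # target <- tau * main + (1 - tau) * target
        target_param.data.mul_(1.0 - tau).add_(tau * param.data)
```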
Detail regarding the Soft Actor-Critic policy
Policy \pi_\phi is trained using equation 13.
As we explained earlier, the policy \pi is modelled using a Gaussian distribution. It's important to consider that Gaussian distributions are unbounded, contrary to our policy, which needs to produce actions inside bounds reflecting the environment action space. Those bounds are enforced by applying a squashing function, the hyperbolic tangent \tanh.
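Here is a sketch of my own of that squashing step (hypothetical helper; dist is the unbounded Gaussian and raw_action one of its reparameterized samples), including the log-probability correction that follows from the change of variables (appendix C of the SAC paper):

```python
import torch

def squash_action(dist, raw_action):
    """Apply tanh squashing to a raw Gaussian sample and correct its log-probability."""
    action = torch.tanh(raw_action)                            # bounded in (-1, 1)
    log_prob = dist.log_prob(raw_action).sum(-1)
    # Change-of-variables correction: subtract sum_i log(1 - tanh(u_i)^2); epsilon avoids log(0)
    log_prob -= torch.log(1.0 - action.pow(2) + 1e-6).sum(-1)
    return action, log_prob
```

The bounded action in (-1, 1) can then be rescaled to the environment's actual action range.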
Next step
This concludes the theoretical part. Next step … the fun part: implementation and experimentation.
Come back soon for the sequel: Soft Actor-Critic (Part 2 - In Practice).
Cite my post as:
@article{lcoupal2020SoftActorCriticPart1,
author = {Coupal, Luc},
journal = {redleader962.github.io/blog},
title = {{Soft Actor-Critic part 1: intuition and theoretical aspect}},
year = {2020},
url = {https://redleader962.github.io/blog/2020/SAC-part-1-distillarized},
keywords = {Deep reinforcement learning,Reinforcement learning,Maximum Entropy,Soft Actor-Critic}
}