multipying negated gradients by actions for the loss in actor nn of DDPG The Next CEO of Stack Overflow2019 Community Moderator ElectionHow to design two different neural nets for actor and critic RL?Policy Gradients vs Value function, when implemented via DQNDeep RL: Proximal policy optimization gradient calculationHow does action get selected in a Policy Gradient Method?Time horizon T in policy gradients (actor-critic)Policy Gradient Methods - ScoreFunction & Log(policy)Policy Gradients - gradient Log probabilities favor less likely actions?RL's policy gradient (REINFORCE) pipeline clarificationNatural actor-critic with Q function approximationReinforcement learning - generating a matrix of continuous values with varying size for test data generation
Is micro rebar a better way to reinforce concrete than rebar?
Is it possible to search for a directory/file combination?
What's the best way to handle refactoring a big file?
How to place nodes around a circle from some initial angle?
Where do students learn to solve polynomial equations these days?
Why am I allowed to create multiple unique pointers from a single object?
Is "for causing autism in X" grammatical?
Circle x^2 + y^2 = n! doesn't hit any lattice points for any n except for 0, 1, 2 and 6 or does it?
How does Madhvacharya interpret Bhagavad Gita sloka 18.66?
WOW air has ceased operation, can I get my tickets refunded?
Plot of histogram similar to output from @risk
Would this house-rule that treats advantage as a +1 to the roll instead (and disadvantage as -1) and allows them to stack be balanced?
Why, when going from special to general relativity, do we just replace partial derivatives with covariant derivatives?
Would a completely good Muggle be able to use a wand?
Display a text message if the shortcode is not found?
Why is the US ranked as #45 in Press Freedom ratings, despite its extremely permissive free speech laws?
Return the Closest Prime Number
The exact meaning of 'Mom made me a sandwich'
How did the Bene Gesserit know how to make a Kwisatz Haderach?
Is it possible to replace duplicates of a character with one character using tr
Why don't programming languages automatically manage the synchronous/asynchronous problem?
What connection does MS Office have to Netscape Navigator?
Why do we use the plural of movies in this phrase "We went to the movies last night."?
Received an invoice from my ex-employer billing me for training; how to handle?
multipying negated gradients by actions for the loss in actor nn of DDPG
The Next CEO of Stack Overflow2019 Community Moderator ElectionHow to design two different neural nets for actor and critic RL?Policy Gradients vs Value function, when implemented via DQNDeep RL: Proximal policy optimization gradient calculationHow does action get selected in a Policy Gradient Method?Time horizon T in policy gradients (actor-critic)Policy Gradient Methods - ScoreFunction & Log(policy)Policy Gradients - gradient Log probabilities favor less likely actions?RL's policy gradient (REINFORCE) pipeline clarificationNatural actor-critic with Q function approximationReinforcement learning - generating a matrix of continuous values with varying size for test data generation
$begingroup$
In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py
# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)
The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:
# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)
Next we get to train_n a K.function type function (also in class Actor):
self.train_fn = K.function(
inputs=[self.model.input, action_gradients, K.learning_phase()],
outputs=[],
updates=updates_op)
of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).
The following are the arguments when train_fn is called:
"""
inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
(note: test mode = 0)
outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""
So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.

I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?
Edit: I missed a comment on line 55
# Define loss function using action value (Q value) gradients
So now I know that the action_gradients input layer receive the action_gradients from the critic.
Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d
But still, why is the loss calculated as -action_gradients * actions ?
policy-gradients actor-critic
$endgroup$
add a comment |
$begingroup$
In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py
# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)
The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:
# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)
Next we get to train_n a K.function type function (also in class Actor):
self.train_fn = K.function(
inputs=[self.model.input, action_gradients, K.learning_phase()],
outputs=[],
updates=updates_op)
of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).
The following are the arguments when train_fn is called:
"""
inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
(note: test mode = 0)
outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""
So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.

I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?
Edit: I missed a comment on line 55
# Define loss function using action value (Q value) gradients
So now I know that the action_gradients input layer receive the action_gradients from the critic.
Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d
But still, why is the loss calculated as -action_gradients * actions ?
policy-gradients actor-critic
$endgroup$
add a comment |
$begingroup$
In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py
# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)
The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:
# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)
Next we get to train_n a K.function type function (also in class Actor):
self.train_fn = K.function(
inputs=[self.model.input, action_gradients, K.learning_phase()],
outputs=[],
updates=updates_op)
of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).
The following are the arguments when train_fn is called:
"""
inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
(note: test mode = 0)
outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""
So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.

I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?
Edit: I missed a comment on line 55
# Define loss function using action value (Q value) gradients
So now I know that the action_gradients input layer receive the action_gradients from the critic.
Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d
But still, why is the loss calculated as -action_gradients * actions ?
policy-gradients actor-critic
$endgroup$
In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py
# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)
The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:
# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)
Next we get to train_n a K.function type function (also in class Actor):
self.train_fn = K.function(
inputs=[self.model.input, action_gradients, K.learning_phase()],
outputs=[],
updates=updates_op)
of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).
The following are the arguments when train_fn is called:
"""
inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
(note: test mode = 0)
outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""
So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.

I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?
Edit: I missed a comment on line 55
# Define loss function using action value (Q value) gradients
So now I know that the action_gradients input layer receive the action_gradients from the critic.
Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d
But still, why is the loss calculated as -action_gradients * actions ?
policy-gradients actor-critic
policy-gradients actor-critic
edited Mar 24 at 15:07
flexitarian33
asked Mar 24 at 14:42
flexitarian33flexitarian33
457
457
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389
Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.
Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.
Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.
As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:
L
2
(
p
,
l
)
1
2
(
p
−
l
)
2
Where
p
is a prediction from our neural net and
l
is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.
And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
w
. If we used vanilla SGD, we’d do:
w
t
+
1
w
t
−
α
∇
w
L
2
Okay, but there’s a catch here, our loss function was in terms of
p
, not
w
. However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
p
(
x
,
w
)
for which we know how to compute
∇
w
p
. Since we know that, we can use the chain rule of differentiation, which says:
∇
w
L
2
∇
w
p
∇
p
L
2
Where since w is a vector,
∇
w
p
will refer to its (transposed) Jacobian matrix.
In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute
∇
w
L
2
for you, and it will do it by decomposing it slightly differently than we did above.
Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.
Back to Actor Critic
Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.
Let’s ask our selves: What do we really want actor critic to do?
Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).
Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!
Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:
arg
max
x
f
(
x
)
arg
min
x
−
f
(
x
)
Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.
From that, we now have an insight: what if we think of
−
Q
(
s
,
a
)
as a loss function, where the actions
a
are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.
To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!
Indeed, if -Q is our loss function, with actions
a
acting as our “predictions” from our neural-net actor model
μ
lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:
∇
w
−
Q
∇
w
μ
∇
a
−
Q
And there it is!
(You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)
So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?
Okay, so now you’re probably asking yourself:
Why didn’t the paper just say “use autograd on -Q”???
After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.
The reason is if you’re not careful, running pure autograd will burn you!
In DDPG, recall that we’re training two neural networks: Q and
μ
. If you simply feed
μ
into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
μ
! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!
To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!
You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.
So instead, you simply have to make sure you apply the chain rule through Q to
μ
, without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.
There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.
** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47897%2fmultipying-negated-gradients-by-actions-for-the-loss-in-actor-nn-of-ddpg%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389
Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.
Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.
Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.
As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:
L
2
(
p
,
l
)
1
2
(
p
−
l
)
2
Where
p
is a prediction from our neural net and
l
is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.
And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
w
. If we used vanilla SGD, we’d do:
w
t
+
1
w
t
−
α
∇
w
L
2
Okay, but there’s a catch here, our loss function was in terms of
p
, not
w
. However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
p
(
x
,
w
)
for which we know how to compute
∇
w
p
. Since we know that, we can use the chain rule of differentiation, which says:
∇
w
L
2
∇
w
p
∇
p
L
2
Where since w is a vector,
∇
w
p
will refer to its (transposed) Jacobian matrix.
In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute
∇
w
L
2
for you, and it will do it by decomposing it slightly differently than we did above.
Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.
Back to Actor Critic
Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.
Let’s ask our selves: What do we really want actor critic to do?
Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).
Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!
Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:
arg
max
x
f
(
x
)
arg
min
x
−
f
(
x
)
Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.
From that, we now have an insight: what if we think of
−
Q
(
s
,
a
)
as a loss function, where the actions
a
are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.
To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!
Indeed, if -Q is our loss function, with actions
a
acting as our “predictions” from our neural-net actor model
μ
lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:
∇
w
−
Q
∇
w
μ
∇
a
−
Q
And there it is!
(You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)
So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?
Okay, so now you’re probably asking yourself:
Why didn’t the paper just say “use autograd on -Q”???
After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.
The reason is if you’re not careful, running pure autograd will burn you!
In DDPG, recall that we’re training two neural networks: Q and
μ
. If you simply feed
μ
into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
μ
! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!
To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!
You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.
So instead, you simply have to make sure you apply the chain rule through Q to
μ
, without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.
There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.
** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.
$endgroup$
add a comment |
$begingroup$
This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389
Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.
Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.
Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.
As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:
L
2
(
p
,
l
)
1
2
(
p
−
l
)
2
Where
p
is a prediction from our neural net and
l
is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.
And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
w
. If we used vanilla SGD, we’d do:
w
t
+
1
w
t
−
α
∇
w
L
2
Okay, but there’s a catch here, our loss function was in terms of
p
, not
w
. However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
p
(
x
,
w
)
for which we know how to compute
∇
w
p
. Since we know that, we can use the chain rule of differentiation, which says:
∇
w
L
2
∇
w
p
∇
p
L
2
Where since w is a vector,
∇
w
p
will refer to its (transposed) Jacobian matrix.
In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute
∇
w
L
2
for you, and it will do it by decomposing it slightly differently than we did above.
Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.
Back to Actor Critic
Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.
Let’s ask our selves: What do we really want actor critic to do?
Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).
Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!
Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:
arg
max
x
f
(
x
)
arg
min
x
−
f
(
x
)
Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.
From that, we now have an insight: what if we think of
−
Q
(
s
,
a
)
as a loss function, where the actions
a
are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.
To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!
Indeed, if -Q is our loss function, with actions
a
acting as our “predictions” from our neural-net actor model
μ
lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:
∇
w
−
Q
∇
w
μ
∇
a
−
Q
And there it is!
(You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)
So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?
Okay, so now you’re probably asking yourself:
Why didn’t the paper just say “use autograd on -Q”???
After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.
The reason is if you’re not careful, running pure autograd will burn you!
In DDPG, recall that we’re training two neural networks: Q and
μ
. If you simply feed
μ
into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
μ
! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!
To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!
You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.
So instead, you simply have to make sure you apply the chain rule through Q to
μ
, without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.
There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.
** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.
$endgroup$
add a comment |
$begingroup$
This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389
Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.
Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.
Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.
As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:
L
2
(
p
,
l
)
1
2
(
p
−
l
)
2
Where
p
is a prediction from our neural net and
l
is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.
And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
w
. If we used vanilla SGD, we’d do:
w
t
+
1
w
t
−
α
∇
w
L
2
Okay, but there’s a catch here, our loss function was in terms of
p
, not
w
. However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
p
(
x
,
w
)
for which we know how to compute
∇
w
p
. Since we know that, we can use the chain rule of differentiation, which says:
∇
w
L
2
∇
w
p
∇
p
L
2
Where since w is a vector,
∇
w
p
will refer to its (transposed) Jacobian matrix.
In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute
∇
w
L
2
for you, and it will do it by decomposing it slightly differently than we did above.
Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.
Back to Actor Critic
Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.
Let’s ask our selves: What do we really want actor critic to do?
Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).
Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!
Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:
arg
max
x
f
(
x
)
arg
min
x
−
f
(
x
)
Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.
From that, we now have an insight: what if we think of
−
Q
(
s
,
a
)
as a loss function, where the actions
a
are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.
To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!
Indeed, if -Q is our loss function, with actions
a
acting as our “predictions” from our neural-net actor model
μ
lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:
∇
w
−
Q
∇
w
μ
∇
a
−
Q
And there it is!
(You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)
So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?
Okay, so now you’re probably asking yourself:
Why didn’t the paper just say “use autograd on -Q”???
After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.
The reason is if you’re not careful, running pure autograd will burn you!
In DDPG, recall that we’re training two neural networks: Q and
μ
. If you simply feed
μ
into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
μ
! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!
To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!
You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.
So instead, you simply have to make sure you apply the chain rule through Q to
μ
, without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.
There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.
** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.
$endgroup$
This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389
Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.
Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.
Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.
As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:
L
2
(
p
,
l
)
1
2
(
p
−
l
)
2
Where
p
is a prediction from our neural net and
l
is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.
And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
w
. If we used vanilla SGD, we’d do:
w
t
+
1
w
t
−
α
∇
w
L
2
Okay, but there’s a catch here, our loss function was in terms of
p
, not
w
. However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
p
(
x
,
w
)
for which we know how to compute
∇
w
p
. Since we know that, we can use the chain rule of differentiation, which says:
∇
w
L
2
∇
w
p
∇
p
L
2
Where since w is a vector,
∇
w
p
will refer to its (transposed) Jacobian matrix.
In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute
∇
w
L
2
for you, and it will do it by decomposing it slightly differently than we did above.
Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.
Back to Actor Critic
Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.
Let’s ask our selves: What do we really want actor critic to do?
Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).
Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!
Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:
arg
max
x
f
(
x
)
arg
min
x
−
f
(
x
)
Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.
From that, we now have an insight: what if we think of
−
Q
(
s
,
a
)
as a loss function, where the actions
a
are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.
To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!
Indeed, if -Q is our loss function, with actions
a
acting as our “predictions” from our neural-net actor model
μ
lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:
∇
w
−
Q
∇
w
μ
∇
a
−
Q
And there it is!
(You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)
So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?
Okay, so now you’re probably asking yourself:
Why didn’t the paper just say “use autograd on -Q”???
After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.
The reason is if you’re not careful, running pure autograd will burn you!
In DDPG, recall that we’re training two neural networks: Q and
μ
. If you simply feed
μ
into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
μ
! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!
To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!
You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.
So instead, you simply have to make sure you apply the chain rule through Q to
μ
, without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.
There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.
** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.
answered Mar 24 at 15:43
flexitarian33flexitarian33
457
457
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47897%2fmultipying-negated-gradients-by-actions-for-the-loss-in-actor-nn-of-ddpg%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown