multipying negated gradients by actions for the loss in actor nn of DDPG The Next CEO of Stack Overflow2019 Community Moderator ElectionHow to design two different neural nets for actor and critic RL?Policy Gradients vs Value function, when implemented via DQNDeep RL: Proximal policy optimization gradient calculationHow does action get selected in a Policy Gradient Method?Time horizon T in policy gradients (actor-critic)Policy Gradient Methods - ScoreFunction & Log(policy)Policy Gradients - gradient Log probabilities favor less likely actions?RL's policy gradient (REINFORCE) pipeline clarificationNatural actor-critic with Q function approximationReinforcement learning - generating a matrix of continuous values with varying size for test data generation

Is micro rebar a better way to reinforce concrete than rebar?

Is it possible to search for a directory/file combination?

What's the best way to handle refactoring a big file?

How to place nodes around a circle from some initial angle?

Where do students learn to solve polynomial equations these days?

Why am I allowed to create multiple unique pointers from a single object?

Is "for causing autism in X" grammatical?

Circle x^2 + y^2 = n! doesn't hit any lattice points for any n except for 0, 1, 2 and 6 or does it?

How does Madhvacharya interpret Bhagavad Gita sloka 18.66?

WOW air has ceased operation, can I get my tickets refunded?

Plot of histogram similar to output from @risk

Would this house-rule that treats advantage as a +1 to the roll instead (and disadvantage as -1) and allows them to stack be balanced?

Why, when going from special to general relativity, do we just replace partial derivatives with covariant derivatives?

Would a completely good Muggle be able to use a wand?

Display a text message if the shortcode is not found?

Why is the US ranked as #45 in Press Freedom ratings, despite its extremely permissive free speech laws?

Return the Closest Prime Number

The exact meaning of 'Mom made me a sandwich'

How did the Bene Gesserit know how to make a Kwisatz Haderach?

Is it possible to replace duplicates of a character with one character using tr

Why don't programming languages automatically manage the synchronous/asynchronous problem?

What connection does MS Office have to Netscape Navigator?

Why do we use the plural of movies in this phrase "We went to the movies last night."?

Received an invoice from my ex-employer billing me for training; how to handle?



multipying negated gradients by actions for the loss in actor nn of DDPG



The Next CEO of Stack Overflow
2019 Community Moderator ElectionHow to design two different neural nets for actor and critic RL?Policy Gradients vs Value function, when implemented via DQNDeep RL: Proximal policy optimization gradient calculationHow does action get selected in a Policy Gradient Method?Time horizon T in policy gradients (actor-critic)Policy Gradient Methods - ScoreFunction & Log(policy)Policy Gradients - gradient Log probabilities favor less likely actions?RL's policy gradient (REINFORCE) pipeline clarificationNatural actor-critic with Q function approximationReinforcement learning - generating a matrix of continuous values with varying size for test data generation










0












$begingroup$


In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py



# Define loss function using action value (Q value) gradients
action_gradients = layers.Input(shape=(self.action_size,))
loss = K.mean(-action_gradients * actions)


The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:



# Define optimizer and training function
optimizer = optimizers.Adam(lr=self.lr)
updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)


Next we get to train_n a K.function type function (also in class Actor):



self.train_fn = K.function(
inputs=[self.model.input, action_gradients, K.learning_phase()],
outputs=[],
updates=updates_op)


of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).



The following are the arguments when train_fn is called:



 """
inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
(note: test mode = 0)
outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
"""


So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.



explanation of DPG



I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?



Edit: I missed a comment on line 55



# Define loss function using action value (Q value) gradients



So now I know that the action_gradients input layer receive the action_gradients from the critic.

Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d



But still, why is the loss calculated as -action_gradients * actions ?










share|improve this question











$endgroup$
















    0












    $begingroup$


    In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py



    # Define loss function using action value (Q value) gradients
    action_gradients = layers.Input(shape=(self.action_size,))
    loss = K.mean(-action_gradients * actions)


    The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:



    # Define optimizer and training function
    optimizer = optimizers.Adam(lr=self.lr)
    updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)


    Next we get to train_n a K.function type function (also in class Actor):



    self.train_fn = K.function(
    inputs=[self.model.input, action_gradients, K.learning_phase()],
    outputs=[],
    updates=updates_op)


    of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).



    The following are the arguments when train_fn is called:



     """
    inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
    (note: test mode = 0)
    outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
    update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
    """


    So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
    would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.



    explanation of DPG



    I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?



    Edit: I missed a comment on line 55



    # Define loss function using action value (Q value) gradients



    So now I know that the action_gradients input layer receive the action_gradients from the critic.

    Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d



    But still, why is the loss calculated as -action_gradients * actions ?










    share|improve this question











    $endgroup$














      0












      0








      0





      $begingroup$


      In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py



      # Define loss function using action value (Q value) gradients
      action_gradients = layers.Input(shape=(self.action_size,))
      loss = K.mean(-action_gradients * actions)


      The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:



      # Define optimizer and training function
      optimizer = optimizers.Adam(lr=self.lr)
      updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)


      Next we get to train_n a K.function type function (also in class Actor):



      self.train_fn = K.function(
      inputs=[self.model.input, action_gradients, K.learning_phase()],
      outputs=[],
      updates=updates_op)


      of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).



      The following are the arguments when train_fn is called:



       """
      inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
      (note: test mode = 0)
      outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
      update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
      """


      So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
      would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.



      explanation of DPG



      I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?



      Edit: I missed a comment on line 55



      # Define loss function using action value (Q value) gradients



      So now I know that the action_gradients input layer receive the action_gradients from the critic.

      Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d



      But still, why is the loss calculated as -action_gradients * actions ?










      share|improve this question











      $endgroup$




      In this Udacity project code that I have been combing through line by line to understand the implementation, I have stumbled on a part in class Actor where this appears on line 55 here: https://github.com/nyck33/autonomous_quadcopter/blob/master/actorSolution.py



      # Define loss function using action value (Q value) gradients
      action_gradients = layers.Input(shape=(self.action_size,))
      loss = K.mean(-action_gradients * actions)


      The above snippet seems to be creating an Input layer for action gradients to calculate loss for the Adam optimizer (in the following snippet) that gets used in the optimizer but where and how does anything get passed to this action_gradients layer? (https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) get_updates function for optimizer where the above loss is used:



      # Define optimizer and training function
      optimizer = optimizers.Adam(lr=self.lr)
      updates_op = optimizer.get_updates(params=self.model.trainable_weights, loss=loss)


      Next we get to train_n a K.function type function (also in class Actor):



      self.train_fn = K.function(
      inputs=[self.model.input, action_gradients, K.learning_phase()],
      outputs=[],
      updates=updates_op)


      of which action_gradients are the (gradient of the Q-value w.r.t. actions from the local critic network, not target critic network).



      The following are the arguments when train_fn is called:



       """
      inputs = [states, action_gradients from critic, K.learning_phase = 1 (training mode)],
      (note: test mode = 0)
      outputs = [] are blank because the gradients are given to us from critic and don't need to be calculated using a loss function for predicted and target actions.
      update_op = Adam optimizer(lr=0.001) using the action gradients from critic to update actor_local weights
      """


      So now I began to think that the formula for deterministic policy gradient here: https://pemami4911.github.io/blog/2016/08/21/ddpg-rl.html
      would be realized once the gradient of the Q-values w.r.t. actions were passed from the critic to the actor.



      explanation of DPG



      I think that the input layer for action_gradients mentioned at the top is trying to find the gradient of the output of actor w.r.t. to parameters so that it can do the multiplication pictured in the photo. However, to reiterate, how is anything passed to this layer and why is the loss calculated this way?



      Edit: I missed a comment on line 55



      # Define loss function using action value (Q value) gradients



      So now I know that the action_gradients input layer receive the action_gradients from the critic.

      Apparently it is a trick used by some implementations such as Openai Baselines: https://stats.stackexchange.com/questions/258472/computing-the-actor-gradient-update-in-the-deep-deterministic-policy-gradient-d



      But still, why is the loss calculated as -action_gradients * actions ?







      policy-gradients actor-critic






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Mar 24 at 15:07







      flexitarian33

















      asked Mar 24 at 14:42









      flexitarian33flexitarian33

      457




      457




















          1 Answer
          1






          active

          oldest

          votes


















          0












          $begingroup$

          This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389



          Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.



          Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.



          Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.



          As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:



          L
          2
          (
          p
          ,
          l



          )



          1
          2
          (
          p

          l
          )
          2



          Where
          p
          is a prediction from our neural net and
          l
          is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.



          And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
          w
          . If we used vanilla SGD, we’d do:



          w
          t
          +



          1



          w
          t

          α

          w
          L
          2



          Okay, but there’s a catch here, our loss function was in terms of
          p
          , not
          w
          . However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
          p
          (
          x
          ,
          w
          )
          for which we know how to compute

          w
          p
          . Since we know that, we can use the chain rule of differentiation, which says:




          w
          L



          2




          w
          p

          p
          L
          2



          Where since w is a vector,

          w
          p
          will refer to its (transposed) Jacobian matrix.



          In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute

          w
          L
          2
          for you, and it will do it by decomposing it slightly differently than we did above.



          Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.



          Back to Actor Critic



          Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.



          Let’s ask our selves: What do we really want actor critic to do?



          Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).



          Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!



          Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:



          arg
          max
          x
          f
          (
          x



          )



          arg
          min
          x

          f
          (
          x
          )



          Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.



          From that, we now have an insight: what if we think of

          Q
          (
          s
          ,
          a
          )
          as a loss function, where the actions
          a
          are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.



          To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!



          Indeed, if -Q is our loss function, with actions
          a
          acting as our “predictions” from our neural-net actor model
          μ
          lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:




          w



          Q




          w
          μ

          a

          Q



          And there it is!



          (You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)



          So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?



          Okay, so now you’re probably asking yourself:



          Why didn’t the paper just say “use autograd on -Q”???



          After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.



          The reason is if you’re not careful, running pure autograd will burn you!



          In DDPG, recall that we’re training two neural networks: Q and
          μ
          . If you simply feed
          μ
          into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
          μ
          ! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!



          To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!



          You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.



          So instead, you simply have to make sure you apply the chain rule through Q to
          μ
          , without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.



          There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.



          ** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.






          share|improve this answer









          $endgroup$













            Your Answer





            StackExchange.ifUsing("editor", function ()
            return StackExchange.using("mathjaxEditing", function ()
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            );
            );
            , "mathjax-editing");

            StackExchange.ready(function()
            var channelOptions =
            tags: "".split(" "),
            id: "557"
            ;
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function()
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled)
            StackExchange.using("snippets", function()
            createEditor();
            );

            else
            createEditor();

            );

            function createEditor()
            StackExchange.prepareEditor(
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader:
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            ,
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            );



            );













            draft saved

            draft discarded


















            StackExchange.ready(
            function ()
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47897%2fmultipying-negated-gradients-by-actions-for-the-loss-in-actor-nn-of-ddpg%23new-answer', 'question_page');

            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0












            $begingroup$

            This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389



            Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.



            Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.



            Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.



            As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:



            L
            2
            (
            p
            ,
            l



            )



            1
            2
            (
            p

            l
            )
            2



            Where
            p
            is a prediction from our neural net and
            l
            is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.



            And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
            w
            . If we used vanilla SGD, we’d do:



            w
            t
            +



            1



            w
            t

            α

            w
            L
            2



            Okay, but there’s a catch here, our loss function was in terms of
            p
            , not
            w
            . However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
            p
            (
            x
            ,
            w
            )
            for which we know how to compute

            w
            p
            . Since we know that, we can use the chain rule of differentiation, which says:




            w
            L



            2




            w
            p

            p
            L
            2



            Where since w is a vector,

            w
            p
            will refer to its (transposed) Jacobian matrix.



            In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute

            w
            L
            2
            for you, and it will do it by decomposing it slightly differently than we did above.



            Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.



            Back to Actor Critic



            Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.



            Let’s ask our selves: What do we really want actor critic to do?



            Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).



            Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!



            Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:



            arg
            max
            x
            f
            (
            x



            )



            arg
            min
            x

            f
            (
            x
            )



            Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.



            From that, we now have an insight: what if we think of

            Q
            (
            s
            ,
            a
            )
            as a loss function, where the actions
            a
            are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.



            To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!



            Indeed, if -Q is our loss function, with actions
            a
            acting as our “predictions” from our neural-net actor model
            μ
            lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:




            w



            Q




            w
            μ

            a

            Q



            And there it is!



            (You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)



            So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?



            Okay, so now you’re probably asking yourself:



            Why didn’t the paper just say “use autograd on -Q”???



            After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.



            The reason is if you’re not careful, running pure autograd will burn you!



            In DDPG, recall that we’re training two neural networks: Q and
            μ
            . If you simply feed
            μ
            into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
            μ
            ! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!



            To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!



            You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.



            So instead, you simply have to make sure you apply the chain rule through Q to
            μ
            , without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.



            There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.



            ** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.






            share|improve this answer









            $endgroup$

















              0












              $begingroup$

              This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389



              Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.



              Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.



              Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.



              As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:



              L
              2
              (
              p
              ,
              l



              )



              1
              2
              (
              p

              l
              )
              2



              Where
              p
              is a prediction from our neural net and
              l
              is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.



              And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
              w
              . If we used vanilla SGD, we’d do:



              w
              t
              +



              1



              w
              t

              α

              w
              L
              2



              Okay, but there’s a catch here, our loss function was in terms of
              p
              , not
              w
              . However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
              p
              (
              x
              ,
              w
              )
              for which we know how to compute

              w
              p
              . Since we know that, we can use the chain rule of differentiation, which says:




              w
              L



              2




              w
              p

              p
              L
              2



              Where since w is a vector,

              w
              p
              will refer to its (transposed) Jacobian matrix.



              In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute

              w
              L
              2
              for you, and it will do it by decomposing it slightly differently than we did above.



              Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.



              Back to Actor Critic



              Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.



              Let’s ask our selves: What do we really want actor critic to do?



              Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).



              Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!



              Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:



              arg
              max
              x
              f
              (
              x



              )



              arg
              min
              x

              f
              (
              x
              )



              Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.



              From that, we now have an insight: what if we think of

              Q
              (
              s
              ,
              a
              )
              as a loss function, where the actions
              a
              are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.



              To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!



              Indeed, if -Q is our loss function, with actions
              a
              acting as our “predictions” from our neural-net actor model
              μ
              lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:




              w



              Q




              w
              μ

              a

              Q



              And there it is!



              (You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)



              So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?



              Okay, so now you’re probably asking yourself:



              Why didn’t the paper just say “use autograd on -Q”???



              After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.



              The reason is if you’re not careful, running pure autograd will burn you!



              In DDPG, recall that we’re training two neural networks: Q and
              μ
              . If you simply feed
              μ
              into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
              μ
              ! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!



              To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!



              You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.



              So instead, you simply have to make sure you apply the chain rule through Q to
              μ
              , without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.



              There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.



              ** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.






              share|improve this answer









              $endgroup$















                0












                0








                0





                $begingroup$

                This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389



                Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.



                Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.



                Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.



                As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:



                L
                2
                (
                p
                ,
                l



                )



                1
                2
                (
                p

                l
                )
                2



                Where
                p
                is a prediction from our neural net and
                l
                is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.



                And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
                w
                . If we used vanilla SGD, we’d do:



                w
                t
                +



                1



                w
                t

                α

                w
                L
                2



                Okay, but there’s a catch here, our loss function was in terms of
                p
                , not
                w
                . However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
                p
                (
                x
                ,
                w
                )
                for which we know how to compute

                w
                p
                . Since we know that, we can use the chain rule of differentiation, which says:




                w
                L



                2




                w
                p

                p
                L
                2



                Where since w is a vector,

                w
                p
                will refer to its (transposed) Jacobian matrix.



                In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute

                w
                L
                2
                for you, and it will do it by decomposing it slightly differently than we did above.



                Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.



                Back to Actor Critic



                Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.



                Let’s ask our selves: What do we really want actor critic to do?



                Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).



                Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!



                Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:



                arg
                max
                x
                f
                (
                x



                )



                arg
                min
                x

                f
                (
                x
                )



                Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.



                From that, we now have an insight: what if we think of

                Q
                (
                s
                ,
                a
                )
                as a loss function, where the actions
                a
                are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.



                To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!



                Indeed, if -Q is our loss function, with actions
                a
                acting as our “predictions” from our neural-net actor model
                μ
                lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:




                w



                Q




                w
                μ

                a

                Q



                And there it is!



                (You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)



                So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?



                Okay, so now you’re probably asking yourself:



                Why didn’t the paper just say “use autograd on -Q”???



                After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.



                The reason is if you’re not careful, running pure autograd will burn you!



                In DDPG, recall that we’re training two neural networks: Q and
                μ
                . If you simply feed
                μ
                into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
                μ
                ! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!



                To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!



                You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.



                So instead, you simply have to make sure you apply the chain rule through Q to
                μ
                , without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.



                There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.



                ** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.






                share|improve this answer









                $endgroup$



                This answer is not mine but a reply I received on Quora here: https://www.quora.com/Why-is-the-loss-for-DDPG-Actor-the-product-of-gradients-of-Q-values-actions/answer/James-MacGlashan?filter=&nsrc=2&snid3=4129828389



                Hopefully it helps someone rather than me just deleting the question and I am sorry for my lack of math jax knowledge.



                Let’s forget about DDPG, Q-functions, actors, and actions for a minute. I know, know, blasphemy, but bear with me for a moment and we’ll get back to that.



                Let’s just talk about loss functions. We’re all good with loss functions, right? They’re the backbone of our standard supervised deep learning literature. That is, in supervised learning, what we do is we define a loss function that defines how bad a prediction is from what it should be.



                As a grounding example, let’s revisit the L2 loss we use for a regression problem. We define this loss as:



                L
                2
                (
                p
                ,
                l



                )



                1
                2
                (
                p

                l
                )
                2



                Where
                p
                is a prediction from our neural net and
                l
                is the “label:” the value the prediction should have been. If we can tune our neural net parameters so that this loss is always zero, we win! In other words, we want to minimize this loss function.



                And how do we minimize it? The standard way these days is to use some flavor of stochastic gradient descent (SGD). And to do that, what we want to do is differentiate our loss in terms of our neural-net parameters
                w
                . If we used vanilla SGD, we’d do:



                w
                t
                +



                1



                w
                t

                α

                w
                L
                2



                Okay, but there’s a catch here, our loss function was in terms of
                p
                , not
                w
                . However, we describe our neural net predictions as a function of some input and the neural-net parameters w:
                p
                (
                x
                ,
                w
                )
                for which we know how to compute

                w
                p
                . Since we know that, we can use the chain rule of differentiation, which says:




                w
                L



                2




                w
                p

                p
                L
                2



                Where since w is a vector,

                w
                p
                will refer to its (transposed) Jacobian matrix.



                In our modern era of deep learning libraries, you typically don’t do that yourself. Instead, your library will use the autograd/backpropagation algorithm to compute

                w
                L
                2
                for you, and it will do it by decomposing it slightly differently than we did above.



                Despite that fact that modern autograd libraries will do things slightly differently, it will be important to see that we can express the gradient of our loss in terms of neural net parameters in that way.



                Back to Actor Critic



                Okay, now that we’ve refreshed our memory about loss functions and what applying the chain rule to a loss function looks like, let’s get back to Actor Critic and DDPG.



                Let’s ask our selves: What do we really want actor critic to do?



                Suppose someone handed you the actual Q-values for some policy. How can you improve that policy? Simple: by making a new policy that selects actions that maximize the Q-values you were handed. This is the foundation of RL. It’s why in Q-learning you set your policy to greedily select the action with the max Q-value (and maybe add some noise for exploration reasons).



                Okay, lets assume I have a Q-function: how do I find a policy that maximizes the Q-function if the actions are continuous?? I can’t do what Q-learning does by simply evaluating each action and choosing the max, because there are an infinite number of actions to evaluate!



                Wait a second though. We know how to maximize a function over continuous values. We’ve been doing it for ages in supervised learning! Note that maximizing a function is the same as minimizing the negative value of that function. That is:



                arg
                max
                x
                f
                (
                x



                )



                arg
                min
                x

                f
                (
                x
                )



                Okay cool, so if I want to maximize some function, I can just use regular SGD on the negative value of that function.



                From that, we now have an insight: what if we think of

                Q
                (
                s
                ,
                a
                )
                as a loss function, where the actions
                a
                are “predictions” from some neural net: the actor. In that case, we can use good ol’ SGD to train our actor to choose actions that maximize the Q-values.



                To do so of course requires that we have a representation of Q that we can differentiate. But if we simultaneously train a neural net to estimate Q, and we believe its estimates are good**, then we can easily use it as the “loss” function for our actor!



                Indeed, if -Q is our loss function, with actions
                a
                acting as our “predictions” from our neural-net actor model
                μ
                lets substitute that back into chain rule expression for loss functions that we wrote earlier and with which we’re familiar:




                w



                Q




                w
                μ

                a

                Q



                And there it is!



                (You may note that the DDPG paper reverses the order of the multiplication from the above. The order really just depends on what convention the paper takes the gradients and Jacobian matrices to mean. If you go back to the original deterministic policy gradients paper, you’ll see they write it out as we did above.)



                So why doesn’t DDPG just say to use autograd on -Q like we do with loss functions ?



                Okay, so now you’re probably asking yourself:



                Why didn’t the paper just say “use autograd on -Q”???



                After all, in supervised learning papers we never write out the chain rule in that way, we just assume practitioners have an autograd library and only write the forward version of the loss.



                The reason is if you’re not careful, running pure autograd will burn you!



                In DDPG, recall that we’re training two neural networks: Q and
                μ
                . If you simply feed
                μ
                into Q and run autograd, it’s going to produce gradients on the parameters of Q in addition to the parameters on
                μ
                ! But when optimizing the actor, we don’t want to change the Q function! In fact, if we did, it would be really easy to make any actor maximize Q by making Q always output huge values!



                To run autograd through Q would be like letting a supervised learning algorithm change the values of labels, which clearly is not right!



                You can’t simply “block gradients” on Q either, because then no gradients will flow back onto the actor.



                So instead, you simply have to make sure you apply the chain rule through Q to
                μ
                , without adjusting any of the parameters of Q. One way to do that is to compute the gradients of each manually and multiply them, as written in the paper.



                There are other ways to do it in autograd libraries too, but now you’re starting to get really specific about which library you’re using. So for an academic paper, it’s better to simply write out precisely what gradient should be computed for the actor, and then let the practitioner decide the best way to compute it in their library of choice.



                ** This assumption that a neural net’s estimates of the Q-function are good is really important. Because it’s an estimate, it will have errors, and a limitation of the DDPG algorithm is that your actor will exploit whatever errors exist in your neural net’s estimate of Q. Consequently, finding ways to ensure the Q-estimate is good is a very important area of work.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Mar 24 at 15:43









                flexitarian33flexitarian33

                457




                457



























                    draft saved

                    draft discarded
















































                    Thanks for contributing an answer to Data Science Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid


                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.

                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function ()
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f47897%2fmultipying-negated-gradients-by-actions-for-the-loss-in-actor-nn-of-ddpg%23new-answer', 'question_page');

                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Is flight data recorder erased after every flight?When are black boxes used?What protects the location beacon (pinger) of a flight data recorder?Is there anywhere I can pick up raw flight data recorder information?Who legally owns the Flight Data Recorder?Constructing flight recorder dataWhy are FDRs and CVRs still two separate physical devices?What are the data elements shown on the GE235 flight data recorder (FDR) plot?Are CVR and FDR reset after every flight?What is the format of data stored by a Flight Data Recorder?How much data is stored in the flight data recorder per hour in a typical flight of an A380?Is a smart flight data recorder possible?

                    Which is better: GPT or RelGAN for text generation?2019 Community Moderator ElectionWhat is the difference between TextGAN and LM for text generation?GANs (generative adversarial networks) possible for text as well?Generator loss not decreasing- text to image synthesisChoosing a right algorithm for template-based text generationHow should I format input and output for text generation with LSTMsGumbel Softmax vs Vanilla Softmax for GAN trainingWhich neural network to choose for classification from text/speech?NLP text autoencoder that generates text in poetic meterWhat is the interpretation of the expectation notation in the GAN formulation?What is the difference between TextGAN and LM for text generation?How to prepare the data for text generation task

                    Is there a general name for the setup in which payoffs are not known exactly but players try to influence each other's perception of the payoffs?Osborne, Nash equilibria and the correctness of beliefsIs there a name for this family of games (Binomial games?)?Perfect Bayesian EquilibriumCalculating mixed strategy equilibrium in battle of sexesPure Strategy SPNEIs there a commitment mechanism which allows players to achieve pareto optimal solutions?Extensive Form GamesAn $n$-player prisoner's dilemma where a coalition of 2 players is better off defectingTit-For-Stat Strategy Best RepliesPotential solutions of the $n$-player Prisoner's Dilemma