In calculating policy gradients, wouldn't longer trajectories have more weight according to the policy gradient formula?


In Sergey Levine's lecture on policy gradients (Berkeley Deep RL course), he shows that the policy gradient can be estimated with the formula

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t},a_{i,t})\right)$$



In this formula, wouldn't longer trajectories get more weight (in finite-horizon settings), since the middle term, the sum of $\nabla_\theta \log \pi_\theta$ terms, contains more summands for longer trajectories? Why would it work like that?



The specific example I have in mind is Pac-Man: longer trajectories would contribute more to the gradient. Should it work like that?
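For concreteness, here is a minimal NumPy sketch of the estimator above; the trajectory format, the `reinforce_gradient` name, and the numbers are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def reinforce_gradient(trajectories):
    """Monte Carlo policy-gradient estimate:
    (1/N) * sum_i [ (sum_t grad log pi(a_t|s_t)) * (sum_t r_t) ]."""
    total = 0.0
    for traj in trajectories:
        grad_log_probs = np.array([g for g, _ in traj])  # shape (T, dim_theta)
        rewards = np.array([r for _, r in traj])         # shape (T,)
        # A longer trajectory contributes more grad-log-prob terms to the inner sum,
        # but their signs can differ, so more terms does not imply a larger sum.
        total = total + grad_log_probs.sum(axis=0) * rewards.sum()
    return total / len(trajectories)

# A short trajectory and a longer one, with a 1-D parameter theta.
short_traj = [(np.array([0.5]), 1.0), (np.array([0.4]), 1.0)]
long_traj = [(np.array([0.5]), 1.0), (np.array([-0.6]), 0.0),
             (np.array([0.4]), 1.0), (np.array([-0.3]), 0.0)]
print(reinforce_gradient([short_traj, long_traj]))  # the long trajectory's terms largely cancel
```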










      reinforcement-learning policy-gradients






asked Mar 19 at 3:50 by liyuan (edited Mar 19 at 4:46)
          1 Answer






> wouldn't longer trajectories get more weight?




Not necessarily. Each per-step gradient $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ can be positive or negative (think of a 1-D analogy), so a larger number of gradient terms can still sum to a smaller weight, which makes sense: a consistent short trajectory is more informative (carries more weight) than an inconsistent long trajectory whose sign-alternating policy gradients largely cancel.




> Why would it work like that?




If we compare two consistent trajectories, where most per-step gradients point in the same direction, the formula again makes sense: a long consistent trajectory contains more useful information (more steps that confirm each other) than a short one. As a real-life analogy, compare how informative a successful week is versus a successful year for learning your own policy.
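To make this concrete, here is a tiny numeric sketch; the per-step gradient values are made up purely for illustration.

```python
import numpy as np

# Hypothetical 1-D per-step gradients of log pi along three trajectories.
short_consistent = np.array([0.5, 0.6, 0.4])                    # 3 steps, same sign
long_alternating = np.array([0.5, -0.6, 0.4, -0.5, 0.6, -0.4])  # 6 steps, signs cancel
long_consistent = np.array([0.5, 0.6, 0.4, 0.5, 0.6, 0.4])      # 6 steps, same sign

print(short_consistent.sum())  # ~1.5: sizeable weight despite few steps
print(long_alternating.sum())  # ~0.0: near-zero weight despite many steps
print(long_consistent.sum())   # ~3.0: the extra consistent steps do add weight
```

So a trajectory's weight is driven by how consistently its per-step gradients point in one direction, not by its length alone.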






answered Mar 19 at 8:19 by Esmailian (edited Mar 19 at 19:12)