In calculating policy gradients, wouldn't longer trajectories have more weight according to the policy gradient formula?
In Sergey Levine's lecture on policy gradients (Berkeley Deep RL course), he shows that the policy gradient can be evaluated according to the formula
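(presumably the standard Monte Carlo REINFORCE estimator from that lecture, reconstructed here for reference):
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left(\sum_{t=1}^{T}\nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t},a_{i,t})\right)$$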
In this formula, wouldn't longer trajectories get more weight (in finite-horizon settings), since the middle term, the sum over $\log \pi$, involves more terms? Why should it work like that?
The specific example I have in mind is Pac-Man: longer trajectories would contribute more to the gradient. Should it work like that?
reinforcement-learning policy-gradients
asked Mar 19 at 3:50 by liyuan · edited Mar 19 at 4:46
1 Answer
wouldn't longer trajectories get more weight?
Not necessarily. Each per-step gradient term $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ can be negative or positive (in a 1-D analogy), so a larger number of gradient terms can still sum to a smaller weight, which makes sense: a consistent short trajectory is more informative (carries more weight) than an inconsistent long trajectory whose sign-alternating terms largely cancel.
Why would it work like that?
If we compare two consistent trajectories, where most gradients point in the same direction, the formula again makes sense: a long consistent trajectory contains more useful information (more steps that confirm each other) than a short one. As a real-life analogy, compare how much a successful week tells you about your policy versus a successful year.
answered Mar 19 at 8:19 by Esmailian · edited Mar 19 at 19:12
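To make the cancellation argument concrete, here is a minimal sketch. It assumes a hypothetical one-parameter Bernoulli policy and two hand-picked trajectories (neither comes from the question or the lecture); it only illustrates that a long trajectory with sign-alternating per-step score terms can contribute less to the gradient estimate than a short trajectory whose terms all agree.

```python
# Minimal sketch of the cancellation argument above. Assumptions (not from the
# question or the lecture): a one-parameter Bernoulli policy
# pi_theta(a=1) = sigmoid(theta), and two hand-picked trajectories.
import math

def score(theta, action):
    """d/dtheta of log pi_theta(action) for a Bernoulli(sigmoid(theta)) policy.
    It equals (1 - p) for action 1 and -p for action 0, so the two possible
    per-step gradients have opposite signs."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return (1.0 - p) if action == 1 else -p

theta = 0.0  # p = 0.5, so each per-step score is +0.5 or -0.5

# Long but inconsistent trajectory: actions alternate, so the scores cancel.
long_traj = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
# Short but consistent trajectory: every step pushes theta the same way.
short_traj = [1, 1, 1]

sum_long = sum(score(theta, a) for a in long_traj)
sum_short = sum(score(theta, a) for a in short_traj)

print(f"long, inconsistent trajectory: sum of scores = {sum_long:+.2f}")   # +0.00
print(f"short, consistent trajectory:  sum of scores = {sum_short:+.2f}")  # +1.50
# In the policy-gradient estimator, each trajectory's total reward is
# multiplied by this sum, so a longer trajectory does not automatically
# get more weight.
```

The long trajectory's scores sum to zero while the short one's sum to 1.5, so trajectory length by itself does not determine how much weight a trajectory carries in the estimator.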