Why a Random Reward in One-step Dynamics MDP?

I am reading the 2018 book by Sutton & Barto on Reinforcement Learning, and I am wondering about the benefit of defining the one-step dynamics of an MDP as
$$
p(s', r \mid s, a) = \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)
$$
where $S_t$ is the state and $A_t$ the action at time $t$, and $R_{t+1}$ is the reward received after taking that action.

This formulation would be useful if we were to allow different rewards when transitioning from $s$ to $s'$ by taking an action $a$, but this does not make sense to me. I am used to the definition based on $p(s' \mid s,a)$ and $r(s,a,s')$, both of which can of course be derived from the one-step dynamics above.
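
To make that derivation concrete, here is a minimal Python sketch. It assumes a made-up tabular joint distribution for one fixed $(s,a)$ pair; the state names, rewards, and probabilities in `dynamics` are hypothetical and not taken from the book. It marginalizes over $r$ to get $p(s' \mid s,a)$ and takes the conditional mean to get $r(s,a,s')$.

```python
from collections import defaultdict

# Hypothetical one-step dynamics for a single fixed (s, a) pair:
# keys are (next_state, reward), values are p(s', r | s, a).
dynamics = {
    ("s1", 0.0): 0.25,
    ("s1", 1.0): 0.25,
    ("s2", 5.0): 0.5,
}

def transition_probs(dynamics):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    probs = defaultdict(float)
    for (next_state, _reward), prob in dynamics.items():
        probs[next_state] += prob
    return dict(probs)

def expected_rewards(dynamics):
    """r(s, a, s') = sum over r of r * p(s', r | s, a) / p(s' | s, a)."""
    probs = transition_probs(dynamics)
    totals = defaultdict(float)
    for (next_state, reward), prob in dynamics.items():
        totals[next_state] += reward * prob
    return {state: totals[state] / probs[state] for state in probs}

p_next = transition_probs(dynamics)   # marginal transition probabilities
r_exp = expected_rewards(dynamics)    # expected reward per (s, a, s')
```

In this toy table the transition to `"s1"` carries two possible rewards, which is exactly the case the joint form allows and the pair $p(s' \mid s,a)$, $r(s,a,s')$ summarizes by its mean.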

Clearly, I am missing something. Any enlightenment would be really helpful. Thanks!

asked Mar 16 at 21:59 by RLSelfStudy; edited Mar 24 at 5:31 by Esmailian

Could you explain why allowing "different rewards when transitioning from $s$ to $s'$ by taking an action $a$" does not make sense to you? It makes sense to me, but I cannot explain it unless you give more details about what seems wrong with the idea.
– Neil Slater
Mar 16 at 22:39

My understanding is that, given a starting state and a target state reachable by applying action $a$, there is only a single reward. If we have multiple rewards, then we are allowing the Markov chain model (thought of as a graph) to be a multigraph, where we can go from $s$ to $s'$ (with $a$) over one edge with reward $r$ and over another with reward $r'$. I thought this was not the right model, but again, I might be wrong.
– RLSelfStudy
Mar 16 at 22:46

machine-learning reinforcement-learning


2 Answers

In general, $R_{t+1}$ is a random variable with conditional probability distribution $\Pr(R_{t+1}=r \mid S_t=s, A_t=a)$. So it can potentially take on a different value each time action $a$ is taken in state $s$.

Some problems don't require any randomness in their reward function. Using the expected reward $r(s,a,s')$ is simpler in this case, since we don't have to worry about the reward's distribution. However, some problems do require randomness in their reward function. Consider the classic multi-armed bandit problem, for example. The payoff from a machine isn't generally deterministic.

As the basis for RL, we want the MDP to be as general as possible. We model reward in MDPs as a random variable because it gives us that generality, and because it is useful to do so.
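
As a rough illustration of the bandit case, here is a small Python sketch. It assumes a single-state MDP with one arm that pays 1 with probability 0.3; the function name `pull_arm` and the payoff probability are invented for this example. The transition is always $s \to s$ under the same action, yet the reward differs from pull to pull, and only its average settles near the expected reward.

```python
import random

def pull_arm(rng, win_prob=0.3):
    """One bandit step: the state never changes, but the payoff is random."""
    next_state = "s"  # single-state MDP, so s' == s on every pull
    reward = 1.0 if rng.random() < win_prob else 0.0
    return next_state, reward

rng = random.Random(0)  # fixed seed so the run is repeatable
rewards = [pull_arm(rng)[1] for _ in range(10_000)]

distinct_rewards = set(rewards)            # values seen on the same (s, a, s')
mean_reward = sum(rewards) / len(rewards)  # sample estimate of r(s, a, s')
```

A deterministic $r(s,a,s')$ could only record the mean here, while $p(s',r \mid s,a)$ keeps the whole payoff distribution.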

answered Mar 17 at 0:39 by Philip Raeisghasem

A state is just an observation of the environment. In many cases, we cannot capture all the variables needed to fully describe the environment (or doing so would cost too much time or space). Imagine you are designing a robot: you can't, and don't need to, define a state that covers the wind direction, the density of the atmosphere, and so on.

So, although you are in the same state (where "the same" just means that the variables you care about have the same values, not that all the dynamics of the environment do), you are not in exactly the same environment.

So we can say that the reward for going from one particular state to another may differ, because the state is not the environment, and the environment can never be exactly the same, since time keeps flowing.
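
The robot example can be sketched in Python as follows. Everything here is a hypothetical toy: the state and action names, the hidden `wind` variable, and the reward formula are assumptions made only for illustration. The observed $(s, a, s')$ triple is identical on every step, but the reward varies because it depends on a variable the state does not include.

```python
import random

def step(rng):
    """Same observed state and action every call; only the hidden wind varies."""
    observed_state, action = "at_start", "move_forward"
    wind = rng.choice([-1.0, 0.0, 1.0])  # hidden variable, not part of the state
    next_state = "at_goal"               # the observed transition is always the same
    reward = 10.0 - 2.0 * abs(wind)      # energy cost depends on the hidden wind
    return observed_state, action, next_state, reward

rng = random.Random(0)  # fixed seed so the run is repeatable
transitions = [step(rng) for _ in range(1_000)]

observed = {(s, a, s_next) for s, a, s_next, _ in transitions}
rewards = {r for *_, r in transitions}
```

From the agent's point of view, `observed` contains a single transition while `rewards` contains more than one value, so the reward has to be modeled as random given $(s, a, s')$.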

answered Mar 24 at 2:50 by 苏东远

Very good explanation!
– Esmailian
Mar 24 at 5:26
