RL: Collecting States (training data) in real-life. Must use fixed timestep?

I am using a reinforcement learning agent to play a 3D game, but I have trouble collecting the "current state, next state" pairs.

To decide which action to perform, the network must do a forward pass.

The forward pass takes some time $t$, and in the meantime the game may already have advanced by 10 frames or more (the amount varies).

The situation gets worse if I run, say, 100 games at once on the same computer.

I can't pause the game at each frame to do forward propagation. In any case, pausing wouldn't be possible if I were training, say, a real-life robot to walk.

Question:

Should I stick to a 'fixed timestep' approach and only ask the network for an action every 0.1 seconds? While it computes the next action, I could treat the most recent action as still in effect for all the skipped frames. Is that a good idea?

If that's the only option, should I avoid at all costs situations where forward propagation takes longer than the fixed timestep (more than 0.1 seconds in my case)? If so, it would be better to choose, say, 0.2 seconds just to be safe.

This seems quite unreliable. Is there a better way to do it?

Is there a paper that explores the alternatives? (I guess such a paper would be about training real-life robots.)

asked Mar 17 at 4:26 by Kari (edited 2 days ago)

This question has an open bounty worth +50 reputation from Kari, ending in 6 days (2019-03-26 04:59:07Z).

The question is widely applicable to a large audience. A detailed canonical answer is required to address all the concerns.

If possible, please provide a paper that explores the timing techniques, and/or help outline the possible techniques for collecting such data.

reinforcement-learning


1 Answer

Your "fixed timestep" idea is actually very similar to a common technique called frame skipping. Instead of waiting a fixed amount of time, agents wait a fixed number of frames $k$ before choosing a new action. In the meantime, they repeat their most recently chosen action.

Frame skipping is supported by the Arcade Learning Environment for Atari 2600 games, and it was used in the foundational DQN paper. Common values of $k$ are 3, 4, and 5. The value chosen depends on the game being played, since different games have important events happening at different time resolutions. In these papers, frame skipping let the agent play roughly $k$ times more games in about the same runtime, because stepping the emulator is much cheaper than selecting an action. So this is definitely a valid technique to try.
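To make the idea concrete, here is a minimal sketch of frame skipping written as a wrapper around an environment. The `CountingEnv` class is a hypothetical toy stand-in for a game (its state is just a frame counter), and the `reset`/`step` interface is an assumption for illustration, not the API of any particular library.

```python
class CountingEnv:
    """Toy stand-in for a game: the state is a frame counter,
    and each frame pays reward 1 when the action is 1."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.frame = 0

    def reset(self):
        self.frame = 0
        return self.frame

    def step(self, action):
        self.frame += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.frame >= self.episode_length
        return self.frame, reward, done


class FrameSkip:
    """Repeat each chosen action for up to k frames and sum the rewards."""

    def __init__(self, env, k=4):
        assert k >= 1
        self.env = env
        self.k = k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode ends
                break
        return obs, total_reward, done


env = FrameSkip(CountingEnv(episode_length=10), k=4)
state = env.reset()
transitions = []
done = False
while not done:
    action = 1  # a real agent would run its forward pass here
    next_state, reward, done = env.step(action)
    transitions.append((state, action, reward, next_state, done))
    state = next_state

print(transitions)
```

Each stored transition spans $k$ frames, so the "current state, next state" pair is well defined even though the agent only looks at every $k$-th frame.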

I actually think this would generally be less of a concern in robotics applications. Forward propagation usually happens much more quickly than real-world event timescales. As an example, researchers at Stanford famously applied RL to fly small autonomous helicopters, a task that requires considerable precision.
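For a setting where the environment cannot be paused, one possible way to log transitions on a fixed decision interval is sketched below. This is only an illustration under stated assumptions: time is simulated as a frame counter, `env_step` and `policy` are hypothetical callables, and the computation delay is modeled as a fixed `latency_frames` during which the previous action keeps running.

```python
def collect_fixed_timestep(env_step, policy, total_frames,
                           frames_per_decision, latency_frames):
    """Log (state, action, reward_sum, next_state) at fixed decision ticks.

    The action computed from the state seen at a tick only takes effect
    latency_frames later; until then the previous action keeps running.
    """
    assert 0 <= latency_frames < frames_per_decision
    transitions = []
    state = 0            # toy state: a frame index
    current_action = 0   # default action before the first decision lands
    pending = None       # (frame at which the new action lands, action)
    tick_state, tick_action, reward_sum = None, None, 0.0

    for frame in range(total_frames):
        if frame % frames_per_decision == 0:
            # Close the previous interval and open a new one.
            if tick_state is not None:
                transitions.append((tick_state, tick_action, reward_sum, state))
            tick_state, tick_action, reward_sum = state, policy(state), 0.0
            pending = (frame + latency_frames, tick_action)
        if pending is not None and frame >= pending[0]:
            current_action = pending[1]
            pending = None
        state, reward = env_step(state, current_action)
        reward_sum += reward
    return transitions


def toy_env_step(state, action):
    # Hypothetical dynamics: the state counts frames, reward equals the action.
    return state + 1, float(action)


logged = collect_fixed_timestep(
    env_step=toy_env_step,
    policy=lambda s: 1,
    total_frames=9,
    frames_per_decision=4,
    latency_frames=1,
)
print(logged)
```

In this toy run the first interval collects less reward than the second, because the default action is still running for one frame while the first decision is being computed. As long as the latency stays below the decision interval, the logged pairs remain evenly spaced.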

Finally, if your forward propagation really is taking too long, you should consider a faster architecture. One approach would be just to make your neural net smaller. You might consider policy distillation for compressing a large, trained network into a smaller one. Also, make sure you're not using some ridiculously slow activation function like sigmoid or tanh. ReLU is the common choice if you don't need a bounded output for a given neuron. If you do, I recommend softsign.
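Before changing the architecture, it may help to measure whether the forward pass actually fits inside the decision interval. The sketch below times a single dense layer written in plain Python as a stand-in for a real network. The layer size, the `safety_factor`, and the helper names are assumptions chosen for illustration; `softsign` is the standard $x / (1 + |x|)$.

```python
import random
import time


def relu(x):
    return x if x > 0.0 else 0.0


def softsign(x):
    # Bounded in (-1, 1) like tanh, but needs no exponential.
    return x / (1.0 + abs(x))


def forward(weights, inputs, activation):
    """One dense layer: activation(W @ inputs), written out in plain Python."""
    return [activation(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]


def worst_latency(fn, repeats=20):
    """Return the slowest of several timed calls, in seconds."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        worst = max(worst, time.perf_counter() - start)
    return worst


rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
weights = [[rng.uniform(-1.0, 1.0) for _ in range(64)] for _ in range(64)]

timestep = 0.1       # seconds between decisions, as in the question
safety_factor = 2.0  # hypothetical margin for load spikes

latency = worst_latency(lambda: forward(weights, inputs, relu))
fits_budget = latency * safety_factor < timestep
print(f"worst forward pass: {latency:.6f} s, fits budget: {fits_budget}")
```

Using the worst observed latency rather than the mean is a deliberate choice here, since a single slow forward pass is what breaks a fixed timestep.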

If your time bottleneck is actually in action selection, due to a large action space and a value network, you should seriously consider switching to a policy-based method (e.g. actor-critic). This would help because sampling from a distribution over actions can be much faster than the $\max$ operation involved in value-based methods. You can read more about this in Section 13.7 of Sutton and Barto's RL book.
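The difference in action-selection cost can be illustrated with a toy comparison. In this sketch, `toy_q` and `toy_policy` are hypothetical stand-ins for a value network and a Gaussian policy head, and the action space is assumed to be a 1-D range discretized into 2001 candidates.

```python
import random


def greedy_action(q_function, state, actions):
    """Value-based selection: evaluate Q for every candidate, keep the best."""
    return max(actions, key=lambda a: q_function(state, a))


def sample_gaussian_action(policy, state, rng):
    """Policy-based selection: one policy evaluation, then one random draw."""
    mean, std = policy(state)
    return rng.gauss(mean, std)


q_calls = {"count": 0}


def toy_q(state, action):
    # Hypothetical Q-function that peaks where the action equals the state.
    q_calls["count"] += 1
    return -(action - state) ** 2


def toy_policy(state):
    # Hypothetical policy head: mean tracks the state, fixed exploration noise.
    return state, 0.05


actions = [i / 1000 for i in range(-1000, 1001)]  # discretized action space
state = 0.3

best = greedy_action(toy_q, state, actions)
sampled = sample_gaussian_action(toy_policy, state, random.Random(0))
print(best, q_calls["count"], sampled)
```

The greedy choice needs one Q evaluation per candidate action, while the policy is evaluated once no matter how finely the action space is resolved.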

answered 6 hours ago by Philip Raeisghasem (edited 6 hours ago)

On a related note, frame skipping can be preferable even if you don't have time constraints. The paper "Frame Skip Is a Powerful Parameter for Learning to Play Atari" shows that a larger frame skip helps in learning strategies over long time scales.
– Philip Raeisghasem, 6 hours ago
