


RL: Collecting States (training data) in real-life. Must use fixed timestep?



























$begingroup$


I am using a reinforcement learning agent to play a 3D game, but I'm having trouble collecting the "current state, next state" pairs.

To decide which action to perform, the network must do a forward pass.

The forward pass takes some time $t$, and in the meantime the game may already have advanced by 10 frames or more (the number varies).

The situation gets worse if I run, say, 100 games at once on the same computer.

I can't pause the game at each frame to do the forward pass. Pausing wouldn't be possible anyway if I were training, say, a real-life robot to walk.

Question:

Should I stick to a 'fixed timestep' approach and only ask the network for an action every 0.1 seconds? While it computes the next action, I could pretend the network keeps outputting the most recent action for all the skipped frames. Is that a good idea?

If that's the only option, should I avoid at all costs situations where the forward pass takes longer than the fixed timestep (more than 0.1 seconds in my case)? If so, it would be better to choose, say, 0.2 seconds just to be safe.

This seems quite unreliable. Is there a better way to do it?

Is there a paper that explores the alternatives? (I guess such a paper would be about real-life robot training.)





















$endgroup$





This question has an open bounty worth +50 reputation from Kari, ending at 2019-03-26 04:59:07Z (in 6 days).

The question is widely applicable to a large audience. A detailed canonical answer is required to address all the concerns.

If possible, please provide a paper that explores the timing techniques, and/or help outline the possible techniques for collecting such data.



























      reinforcement-learning














edited 2 days ago

asked Mar 17 at 4:26 by Kari




























          1 Answer












          $begingroup$

Your "fixed timestep" idea is very similar to a common technique called frame skipping. Instead of waiting a fixed amount of time, the agent waits a fixed number of frames $k$ before choosing a new action. In the meantime, it repeats its most recently chosen action.
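The repeat-the-last-action scheme can be sketched as a thin wrapper around the game. This is only an illustration under assumptions: `CountingEnv` is a made-up toy environment, and the `reset`/`step` interface returning `(state, reward, done)` is a hypothetical one, not the API of any particular library.

```python
class CountingEnv:
    """Toy stand-in for a game: the state is a frame counter, reward is 1 per frame."""

    def __init__(self, episode_length=10):
        self.episode_length = episode_length
        self.frame = 0

    def reset(self):
        self.frame = 0
        return self.frame

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.episode_length
        return self.frame, 1.0, done


class FrameSkip:
    """Repeat each chosen action for k frames and sum the rewards collected."""

    def __init__(self, env, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.env = env
        self.k = k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.k):
            state, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode ends
                break
        return state, total_reward, done


env = FrameSkip(CountingEnv(episode_length=10), k=4)
state = env.reset()
transitions = []
done = False
while not done:
    action = 0  # a real agent would run its forward pass here
    next_state, reward, done = env.step(action)
    transitions.append((state, action, reward, next_state, done))
    state = next_state
```

Each stored transition then spans $k$ frames, so the "current and next state" pairs are well defined even though the agent never sees the frames in between.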



Frame skipping was included as part of the Arcade Learning Environment for the Atari 2600. It was also used in the foundational DQN paper. Common values of $k$ are 3, 4, and 5. The value chosen depended on the game being played, since important events happen at different time resolutions in different games. In these papers, frame skipping let the agent play roughly $k$ times more games in about the same run time, because stepping the emulator is much cheaper than a forward pass. So this is definitely a valid technique to try.
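A back-of-the-envelope cost model shows where a figure like "roughly $k$ times" can come from. The symbols $c_{\text{env}}$ (cost of advancing the game by one frame) and $c_{\text{net}}$ (cost of one forward pass) are assumptions introduced here for illustration, not notation from the papers. Without skipping, every frame costs $c_{\text{env}} + c_{\text{net}}$; with skipping, the forward pass is paid only once per $k$ frames, so the speedup per frame is

$$
S(k) = \frac{c_{\text{env}} + c_{\text{net}}}{c_{\text{env}} + c_{\text{net}}/k},
\qquad
S(k) \to k \quad \text{as} \quad \frac{c_{\text{env}}}{c_{\text{net}}} \to 0 .
$$

For example, if the forward pass is ten times as expensive as a frame ($c_{\text{net}} = 10\,c_{\text{env}}$) and $k = 4$, then $S(4) = 11 / 3.5 \approx 3.1$. The speedup only approaches $k$ when the forward pass dominates the cost.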



I actually think this would generally be less of a concern in a robotics application. Forward propagation usually happens much faster than the timescales of real-world events. As an example, Stanford has famously applied RL to flying small helicopters, which requires considerable precision.
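One way to check whether the forward pass really is fast enough for a given timestep is to measure it. The sketch below is an assumption-laden illustration: `dummy_policy` is a hypothetical stand-in for the network, and the `safety` fraction of 0.5 is an arbitrary margin, not a recommended value.

```python
import time


def measure_latencies(policy, observation, n_trials=50):
    """Time n_trials calls of policy(observation); returns seconds per call."""
    latencies = []
    for _ in range(n_trials):
        start = time.perf_counter()
        policy(observation)
        latencies.append(time.perf_counter() - start)
    return latencies


def fits_budget(latencies, period, safety=0.5):
    """True if the slowest observed call uses at most `safety` of the period."""
    return max(latencies) <= safety * period


def dummy_policy(observation):
    # Hypothetical stand-in for a network forward pass.
    return sum(observation) > 0


latencies = measure_latencies(dummy_policy, [0.1, -0.2, 0.3])
print("worst case: %.6f s" % max(latencies))
print("fits a 0.1 s timestep:", fits_budget(latencies, period=0.1))
```

Measuring under the real load (for example with all 100 games running) matters, since the worst case is what decides whether a fixed timestep is safe.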



Finally, if your forward propagation really is taking too long, you should consider a faster architecture. One approach is simply to make your neural net smaller. You might consider policy distillation for compressing a large, trained network into a smaller one. Also, check your activation functions: sigmoid and tanh cost more to evaluate than ReLU because they need an exponential, although the activation is usually not the dominant cost of a forward pass. ReLU is the common choice if you don't need a bounded output for a given neuron. If you do, I recommend softsign.
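As a rough sketch of the distillation idea, the small (student) network can be trained to match a softened version of the large (teacher) network's action preferences. The loss below is one common form, a KL divergence between a temperature-scaled softmax of the teacher's Q-values and the student's softmax output. The function names and the temperature value are assumptions for illustration, and the code only computes the loss, not the training step.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_q, student_logits, temperature=0.5):
    """KL between the softened teacher distribution and the student distribution."""
    target = softmax(teacher_q, temperature)
    prediction = softmax(student_logits)
    return kl_divergence(target, prediction)


teacher_q = [1.0, 2.0, 0.5]
matching_student = [2.0, 4.0, 1.0]   # equals teacher_q / temperature
untrained_student = [0.0, 0.0, 0.0]  # uniform over actions

print(distillation_loss(teacher_q, matching_student))
print(distillation_loss(teacher_q, untrained_student))
```

The loss is zero when the student reproduces the softened teacher distribution and positive otherwise, so minimizing it over states visited by the teacher pulls the small network toward the large one's behavior.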



If your time bottleneck is actually in action selection, because you have a large action space and are using a value network, you should seriously consider switching to a policy-based method (e.g. actor-critic). This would help because sampling from a distribution over actions can be much faster than the $\max$ operation involved in value-based methods. You can read more about this in Section 13.7 of Sutton and Barto's RL book.
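The difference in cost can be seen in a toy comparison. Everything here is hypothetical: `toy_q_value` and `toy_policy_head` stand in for real networks, and the action space is a one-dimensional grid chosen only to make the counting obvious.

```python
import random


def greedy_action(q_value, state, candidate_actions):
    """Value-based selection: score every candidate action and keep the best."""
    return max(candidate_actions, key=lambda a: q_value(state, a))


def sample_gaussian_action(policy_head, state, rng):
    """Policy-based selection: one call gives mean and std, then one random draw."""
    mean, std = policy_head(state)
    return rng.gauss(mean, std)


calls = {"q": 0}  # counts how often the value function is evaluated


def toy_q_value(state, action):
    calls["q"] += 1
    return -(action - state) ** 2


def toy_policy_head(state):
    return state, 0.1


candidates = [i / 100 for i in range(-100, 101)]  # 201 discretized actions
best = greedy_action(toy_q_value, 0.25, candidates)
sampled = sample_gaussian_action(toy_policy_head, 0.25, random.Random(0))

print("greedy action:", best, "after", calls["q"], "value evaluations")
print("sampled action:", sampled, "after 1 policy evaluation")
```

The greedy choice needs one value evaluation per candidate action, while the sampled choice needs a single evaluation of the policy regardless of how finely the action space is resolved.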

















          $endgroup$












            $begingroup$
            On a related note, frame skipping can be preferable even when you have no time constraints. The paper "Frame Skip Is a Powerful Parameter for Learning to Play Atari" shows that a larger frame skip helps in learning strategies over long time scales.
            $endgroup$
            – Philip Raeisghasem
            6 hours ago


















edited 6 hours ago

answered 6 hours ago by Philip Raeisghasem






