Why is Distributional DQN faster than vanilla DQN?


Recently I learned about the distributional approach to RL, which is a fascinating breakthrough algorithm.

I have 2 questions:

1. What makes it perform so much better during runtime than DQN? My understanding is that during runtime we still have to select the action with the largest expected value. But to compute these expected values, we now have to look at the distributions of all possible actions at $x_{t+1}$ and then select the action with the highest expected value. This would actually mean extra work during runtime.

2. What explains its faster convergence compared with vanilla DQN? As I understand it, the policy hasn't changed: we still select the best action from state $x_{t+1}$, then use that action's distribution to bootstrap (adjust) the distribution of our current state's best action.

Where does the distributional part come into play and make the network smarter about selecting actions? (Currently we still always select the distribution of the action with the highest expected value as the "target distribution".)







reinforcement-learning dqn






edited Jul 28 '18 at 19:32 by Stephen Rauch
asked Jun 19 '18 at 1:01 by Kari











  • After learning more about this topic, I realised that my statement "This would actually mean extra work during runtime" was not correct. During runtime we only need the distribution at $x_t$; it's only the training stage that "peeks into" $x_{t+1}$.
    – Kari, Jul 28 '18 at 21:18

  • Still curious about the second question though!
    – Kari, Jul 28 '18 at 21:19
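
To make the point in the comment above concrete, here is a minimal sketch of greedy action selection at runtime. It assumes a categorical representation in which the network outputs, for each action, probabilities over a fixed set of return values (atoms). The names `expected_values` and `greedy_action`, the atom values, and the probabilities are all made up for illustration. Only the distributions at the current state $x_t$ are used; nothing about $x_{t+1}$ is evaluated.

```python
def expected_values(probs, atoms):
    """Return Q(x, a) = sum_i z_i * p_i(x, a) for every action a."""
    return [sum(z * p for z, p in zip(atoms, p_a)) for p_a in probs]


def greedy_action(probs, atoms):
    """Pick the action whose return distribution has the largest mean."""
    q = expected_values(probs, atoms)
    return max(range(len(q)), key=q.__getitem__)


# Hypothetical support of 5 atoms and predicted distributions for two actions.
atoms = [-2.0, -1.0, 0.0, 1.0, 2.0]
probs = [
    [0.1, 0.2, 0.4, 0.2, 0.1],  # action 0: symmetric around 0
    [0.0, 0.1, 0.2, 0.3, 0.4],  # action 1: mass shifted towards high returns
]
best = greedy_action(probs, atoms)
```

Compared with DQN, the only extra step at runtime in this sketch is the weighted sum that turns each distribution into its mean.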


























1 Answer
This is meant to be a comment, but I can't comment since I have insufficient reputation.



As for the second question: intuitively, a scalar value estimate for an action may be highly inaccurate and noisy early in training, so learning a distribution over returns instead gives a more accurate picture. I'd recommend https://flyyufelix.github.io/2017/10/24/distributional-bellman.html, which explains the intuition for using a distribution.
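
A small illustration of what a scalar estimate discards, using made-up numbers: the two return distributions below have the same expected value, so a scalar Q-value cannot tell them apart, while the distribution still carries the difference. The helper names and the atom values are hypothetical.

```python
def mean(atoms, probs):
    """Expected return of a categorical distribution over `atoms`."""
    return sum(z * p for z, p in zip(atoms, probs))


def variance(atoms, probs):
    """Spread of the returns around the expected return."""
    mu = mean(atoms, probs)
    return sum(p * (z - mu) ** 2 for z, p in zip(atoms, probs))


atoms = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical return values
certain = [0.0, 0.0, 1.0, 0.0, 0.0]   # always returns 0
gamble = [0.5, 0.0, 0.0, 0.0, 0.5]    # returns -2 or +2 with equal chance

# Both have the same mean, so their scalar Q-values coincide...
same_q = mean(atoms, certain) == mean(atoms, gamble)
# ...but the distributions differ, which the variance makes visible.
spread = (variance(atoms, certain), variance(atoms, gamble))
```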



In terms of convergence, there is actually no guarantee of convergence. The paper explains that for distributional DQN to be guaranteed to converge, the distributional Bellman operator must be a gamma-contraction, which holds if you measure the distance between the distributions using the Wasserstein distance. However, it would be impractical to minimize that distance directly, so distributional DQN uses a cross-entropy loss instead, for which you can compute gradients and perform backpropagation.
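
The cross-entropy step can be sketched as follows, assuming the categorical variant of distributional DQN (often called C51), where each action's distribution lives on a fixed, evenly spaced support of atoms. The target is built by shifting the next state's best-action distribution through r + gamma * z, projecting it back onto the support, and then treating it as a constant in the cross-entropy loss. The function names, the support, and the numbers are illustrative assumptions, not taken from the answer.

```python
import math


def project_target(next_probs, reward, gamma, atoms, done=False):
    """Project the shifted distribution r + gamma * z back onto the fixed atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta = (v_max - v_min) / (len(atoms) - 1)
    target = [0.0] * len(atoms)
    for z, p in zip(atoms, next_probs):
        # Bellman update applied to each atom, clipped to the support.
        tz = reward if done else reward + gamma * z
        tz = min(max(tz, v_min), v_max)
        b = (tz - v_min) / delta          # fractional index on the support
        lo, hi = math.floor(b), math.ceil(b)
        if lo == hi:                      # lands exactly on an atom
            target[lo] += p
        else:                             # split mass between the two neighbours
            target[lo] += p * (hi - b)
            target[hi] += p * (b - lo)
    return target


def cross_entropy(target, predicted, eps=1e-12):
    """Loss -sum_i m_i * log p_i(x_t, a_t); the target is treated as a constant."""
    return -sum(m * math.log(p + eps) for m, p in zip(target, predicted))


atoms = [-2.0, -1.0, 0.0, 1.0, 2.0]       # hypothetical support
next_probs = [0.0, 0.1, 0.2, 0.3, 0.4]    # best action's distribution at x_{t+1}
target = project_target(next_probs, reward=0.5, gamma=0.5, atoms=atoms)
current = [0.2] * 5                       # untrained prediction at x_t
loss = cross_entropy(target, current)
```

The action used for `next_probs` is still chosen by its expected value, as in the question; what changes is that the whole distribution, not just its mean, is the regression target.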



You may be interested in "Distributional Reinforcement Learning with Quantile Regression" (https://arxiv.org/pdf/1710.10044.pdf), which aims to improve the original distributional DQN algorithm.







edited Apr 11 at 7:41
answered Apr 9 at 6:03 by user355843


























