

Newton's method and the vanishing gradient problem


I read an article on the vanishing gradient problem which states that the problem can be mitigated by using a ReLU-based activation function.

What I do not understand is this: if a ReLU-based activation function solves the problem, why are there so many research papers suggesting Newton's-method-based optimization algorithms for deep learning instead of gradient descent? While reading those papers I had the strong impression that the vanishing gradient problem was the core reason for such suggestions, but now I am confused about whether Newton's method is really needed, given that gradient descent can apparently be modified to fix the problems faced during training.
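To make my understanding of the ReLU argument concrete, here is a small numpy sketch I put together (my own illustration, not from the article); it only compares per-layer derivative factors, which I assume is the relevant mechanism:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # peaks at 0.25 when x = 0

    def relu_grad(x):
        return 1.0 if x > 0 else 0.0  # exactly 1 for any positive input

    # Backpropagation multiplies one activation derivative per layer.
    # Even in the best case for the sigmoid (pre-activation at 0), each
    # factor is 0.25, so the product shrinks geometrically with depth;
    # for ReLU units that stay active, each factor is 1.
    for depth in [5, 10, 20, 50]:
        sig_chain = sigmoid_grad(0.0) ** depth
        relu_chain = relu_grad(1.0) ** depth
        print(f"depth={depth:3d}  sigmoid chain={sig_chain:.3e}  ReLU chain={relu_chain:.1f}")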










Tags: machine-learning, neural-network, optimization






asked Mar 20 at 14:41 by Aman, edited Mar 20 at 20:50 by Esmailian




1 Answer






"Why are there so many research papers suggesting the use of Newton's-method-based optimization algorithms for deep learning instead of gradient descent?"




Newton's method has a faster convergence rate than gradient descent, and this is the main reason it may be suggested as a replacement for gradient descent.
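As a rough sketch of the convergence-rate point (a toy example of my own, not from any particular paper), consider an ill-conditioned quadratic: Newton's method rescales the step by the inverse Hessian and reaches the minimum of a quadratic in one step, while plain gradient descent is throttled by the largest curvature direction.

    import numpy as np

    # Toy objective f(w) = 0.5 * w^T A w with an ill-conditioned Hessian A.
    A = np.diag([1.0, 100.0])

    def grad(w):
        return A @ w

    # Newton step: w <- w - H^{-1} grad(w). Exact for a quadratic.
    w_newton = np.array([10.0, 1.0])
    w_newton = w_newton - np.linalg.solve(A, grad(w_newton))

    # Gradient descent: step size must stay below 2 / lambda_max = 0.02.
    w_gd = np.array([10.0, 1.0])
    steps = 0
    while np.linalg.norm(grad(w_gd)) > 1e-8 and steps < 10_000:
        w_gd = w_gd - 0.015 * grad(w_gd)
        steps += 1

    print("Newton after 1 step :", w_newton)              # [0. 0.]
    print(f"Gradient descent needed {steps} steps:", w_gd)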




"Is Newton's method really needed if gradient descent can be modified to rectify all the problems faced during machine learning?"




Whether the vanishing gradient problem appears depends on the choice of activation function and on the depth of the network. Newton's method and gradient descent would both face the problem for an activation like the sigmoid, since in the flat extremes of the sigmoid both the first- and second-order derivatives are small and vanish exponentially with depth. In other words, the problem is solved for both methods by the choice of activation function.
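A quick way to see the depth dependence (a toy sketch of my own, with the simplifying assumption that every layer's pre-activation sits at x = 4, well inside the sigmoid's flat region): backpropagation picks up one factor of sigma'(4), roughly 0.018, per layer, so the gradient reaching the early layers decays geometrically, and the second-order terms a Newton-type method would use are built from the same near-zero factors.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    # Assume every layer's pre-activation sits in the flat region of the sigmoid.
    x_saturated = 4.0
    per_layer = d_sigmoid(x_saturated)   # roughly 0.018

    # Backprop multiplies one such factor per layer, so the gradient signal
    # reaching the early layers shrinks geometrically with depth.
    for depth in [2, 5, 10, 20]:
        print(f"depth={depth:2d}  gradient factor={per_layer ** depth:.3e}")

    # Differentiating this chain a second time (as a Newton-type method would)
    # only produces more products of these same tiny factors, so the curvature
    # information vanishes as well.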



As a side note, the first- and second-order derivatives of the sigmoid go to zero at essentially the same rate; plotting the sigmoid and its derivatives and zooming into the extremes makes this visible.
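Numerically, using the closed forms sigma'(x) = sigma(x)(1 - sigma(x)) and sigma''(x) = sigma'(x)(1 - 2*sigma(x)) (a small check of my own): in the flat region both derivatives decay roughly like exp(-x) and their ratio tends to -1, i.e. they vanish at essentially the same rate.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for x in [2.0, 5.0, 10.0, 20.0]:
        s = sigmoid(x)
        d1 = s * (1.0 - s)           # first derivative of the sigmoid
        d2 = d1 * (1.0 - 2.0 * s)    # second derivative of the sigmoid
        print(f"x={x:5.1f}  sigma'={d1:.3e}  sigma''={d2:.3e}  ratio={d2 / d1:+.4f}")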



Historical note: Newton's method predates the vanishing gradient problem (which was encountered after the introduction of backpropagation in the 1960s) by centuries.






answered Mar 20 at 15:21 by Esmailian, edited Mar 20 at 16:17











Comments:

• Aman (Mar 20 at 15:35): In this paper, section 2.1 says that scale invariance is important because it eliminates the need to tune the learning rate; why, then, is the scalability of gradient descent preferred over the scale invariance of Newton's method?

• Esmailian (Mar 20 at 15:51): @Aman removed the controversial parts.

• Aman (Mar 20 at 15:53): Which is worse in terms of efficiency: a vanishing gradient with Newton's method or with gradient descent? I'm new to machine learning, so I would appreciate any guidance on this. Thank you.















