Gradient Descent Convergence


I'm a double major in Math and CS interested in Machine Learning. I'm currently taking the popular Coursera course by Prof. Andrew Ng. He explains Gradient Descent, but I can't help noticing a few things. With my math background, I know that if I want to find the global min/max of a function, I must first find all of its critical points. The course talks about the convergence of GD, but is it really guaranteed to converge to the global minimum? How do I know it won't get stuck at a saddle point? Wouldn't it be safer to do a second-derivative test? If my function is differentiable, it seems reasonable that GD converges to a local minimum, but not necessarily to the global one. I have tried looking for a better explanation, but everyone seems to take it for granted without questioning it. Can someone point me in the right direction?
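
To make my worry concrete, here is a minimal sketch (my own toy example, not something from the course) of plain gradient descent on a simple non-convex function. Which critical point the iterates approach depends entirely on the starting point, and only one of the two minima is global:

    import numpy as np

    # f(x) = x^4 - 3x^2 + x has two local minima (near x = -1.30 and x = 1.13)
    # and a local maximum in between; only the left minimum is global.
    def f(x):
        return x**4 - 3 * x**2 + x

    def grad_f(x):
        return 4 * x**3 - 6 * x + 1

    def gradient_descent(x0, lr=0.01, steps=1000):
        x = x0
        for _ in range(steps):
            x = x - lr * grad_f(x)
        return x

    # Two different starting points end up at two different minima.
    for x0 in (-2.0, 2.0):
        x_final = gradient_descent(x0)
        print(f"start = {x0:+.1f}  ->  x = {x_final:+.4f}, f(x) = {f(x_final):+.4f}")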










Tags: machine-learning, regression, gradient-descent






Asked Mar 26 at 3:24 by blade




















2 Answers


















Answer from Shamit Verma (answered Mar 26 at 4:36):
Gradient descent does not always converge to the global minimum. It is guaranteed to do so only when the function is convex and the learning rate is appropriate.

For most real-life problems the function has local minima, so we need to run training multiple times; one reason for doing so is to avoid getting stuck in a bad local minimum.
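
A rough sketch of that multiple-runs idea (the toy objective, learning rate, and number of restarts below are illustrative assumptions, not a prescription): run plain gradient descent from several random starting points and keep the run with the lowest objective value.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative non-convex objective with many local minima.
    def f(x):
        return 0.1 * x**2 + np.sin(3 * x)

    def grad_f(x):
        return 0.2 * x + 3 * np.cos(3 * x)

    def gradient_descent(x0, lr=0.05, steps=1000):
        x = x0
        for _ in range(steps):
            x = x - lr * grad_f(x)
        return x

    # Restart from several random points and keep the best result found.
    candidates = [gradient_descent(x0) for x0 in rng.uniform(-5.0, 5.0, size=10)]
    best = min(candidates, key=f)
    print(f"best x = {best:.4f}, f(x) = {f(best):.4f}")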






Answer from Tuyen (answered Apr 9 at 3:31):
If you use a variant called Backtracking Gradient Descent, then convergence to a single local minimum can be proven in most cases, for most functions, including all Morse functions. Under the same assumptions, you can also prove convergence for the backtracking versions of Momentum and NAG. More details, the cited paper, and a link to the source code on GitHub can be found in my answer at this link:

Link
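
For illustration, here is a minimal sketch of gradient descent with an Armijo-style backtracking line search (a generic textbook formulation; the precise conditions and update rules analysed in the cited paper may differ):

    import numpy as np

    def backtracking_gd(f, grad_f, x0, lr0=1.0, beta=0.5, c=1e-4, steps=200):
        """Gradient descent with an Armijo backtracking line search."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            g = grad_f(x)
            lr = lr0
            # Shrink the step until the sufficient-decrease (Armijo) condition holds:
            # f(x - lr * g) <= f(x) - c * lr * ||g||^2.
            while f(x - lr * g) > f(x) - c * lr * np.dot(g, g):
                lr *= beta
            x = x - lr * g
        return x

    # Example: a smooth non-convex function with minima at (1, 0) and (-1, 0)
    # and a saddle point at the origin.
    def f(x):
        return (x[0]**2 - 1.0)**2 + x[1]**2

    def grad_f(x):
        return np.array([4.0 * x[0] * (x[0]**2 - 1.0), 2.0 * x[1]])

    print(backtracking_gd(f, grad_f, x0=[0.5, 2.0]))  # ends near (1, 0)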





