Gradient descent with infinite gradient value














I am given a function $f(x)$ whose partial derivative is $\frac{\partial f(x)}{\partial x_i}=\frac{f^2(x_1,...,x_i+\pi/2,...,x_n)-f^2(x_1,...,x_i-\pi/2,...,x_n)}{f(x)}$. When $f(x)\to 0$, $\frac{\partial f(x)}{\partial x_i}$ could be infinitely large. (The numerator $f^2(x_1,...,x_i+\pi/2,...,x_n)-f^2(x_1,...,x_i-\pi/2,...,x_n)$ is always non-zero.)

I have very little experience dealing with this situation in gradient descent. In my code, $f(x)$ is defined on a continuous domain, but to simulate a real-world process it is sampled to be discrete and returns values uniformly distributed over $[0,1]$. Assume the discrete $f(x)$ takes $N$ distinct values, and that at the beginning there is a training set of size $M$ ($M$ is very large): $\{x_i, f(x_i)=\frac{k_i}{N}\}_{i=1..M}$ $(k_i \in \{1, 2, ..., N\})$.

I found that setting $1/f(x)$ to some value like $0.01$ when $f(x)=0$ reaches the optimum easily, though slightly more slowly than the ideal process, while setting it to a much smaller value like $0.00001$ lets the points with $f(x)=0$ have a great impact on the process, which then fails to form a descent curve.

Is replacing infinitely large values with large but finite values a correct method? Or are there better ways to deal with the infinite gradient problem?

Thanks in advance!
















      machine-learning optimization gradient-descent
















asked Mar 30 at 10:05 by raycosine




















1 Answer


















"Is the method of replacing infinitely large values with some large but finite values correct?"

Yes. For example, the same problem occurs with the logarithm in the cross-entropy loss function, i.e. $p_i \text{log}(p'_i)$ when $p'_i \rightarrow 0$. It is avoided by replacing $\text{log}(x)$ with $\hat{\text{log}}(x) = \text{log}(x+\epsilon)$ for some small $\epsilon$.

Similarly, you are changing $f(x)$ in the denominator to $\hat{f}(x) = \max(\epsilon, f(x))$.

However, I would suggest $\hat{f}(x) = f(x) + \epsilon$ instead of a cut-off threshold. This way, the difference between two values $f(x_1) < f(x_2) < \epsilon$ is not ignored, unlike with the max cut-off.
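The two ways of stabilizing the denominator discussed in this answer can be compared with a minimal sketch. The helper names (`clamped_denominator`, `shifted_denominator`, `stabilized_gradient`), the numerator value `0.1`, and the sample values of $f(x)$ are hypothetical and chosen only for illustration; they are not taken from the question's code.

```python
def clamped_denominator(f_val, eps):
    """Cut-off variant: replace f(x) with max(eps, f(x))."""
    return max(eps, f_val)


def shifted_denominator(f_val, eps):
    """Additive variant: replace f(x) with f(x) + eps."""
    return f_val + eps


def stabilized_gradient(numerator, f_val, stabilizer, eps=1e-2):
    """Divide a (hypothetical) gradient numerator by a stabilized f(x)."""
    return numerator / stabilizer(f_val, eps)


# Hypothetical inputs: a fixed non-zero numerator and a few values of f(x),
# three of which lie below eps = 0.01.
numerator = 0.1
eps = 1e-2
f_values = [0.0, 0.002, 0.005, 0.5]

clamped = [stabilized_gradient(numerator, f, clamped_denominator, eps) for f in f_values]
shifted = [stabilized_gradient(numerator, f, shifted_denominator, eps) for f in f_values]

for f, c, s in zip(f_values, clamped, shifted):
    print(f"f(x)={f:<6} clamped={c:.4f} shifted={s:.4f}")
```

In both variants the magnitude of the result is at most `abs(numerator) / eps` when $f(x) \ge 0$, so a smaller $\epsilon$ permits proportionally larger steps from the points where $f(x)$ is near zero. With the cut-off, every $f(x)$ below $\epsilon$ yields the same value, whereas the additive form keeps those points ordered.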








Thank you Esmailian!
– raycosine, Mar 30 at 10:22



















answered Mar 30 at 10:21 by Esmailian, edited Mar 30 at 10:45













