Square Root Regularization and High Loss


I am testing out square root regularization (explained below) in a PyTorch implementation of a neural network. Square root regularization, henceforth L1/2, is just like L2 regularization, except that instead of squaring the weights I take the square root of their absolute value. To implement it, I penalize the loss in PyTorch as follows:



for p in model.parameters():
    # "lambda" is a reserved word in Python, so the penalty factor is named lam here
    loss = loss + lam * torch.sqrt(p.abs()).sum()


p.abs() is the element-wise absolute value of p, i.e. the weights, torch.sqrt() is the element-wise square root, and .sum() sums the result over the individual weights. lam is the penalization factor (it cannot be called lambda, since that is a reserved word in Python).
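For reference, here is a minimal, self-contained sketch (a toy nn.Linear model and an illustrative lam, not the actual network) that computes the L2, L1 and L1/2 penalties over all parameters in the same way:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # toy stand-in for the real network
lam = 0.001                # illustrative penalty factor

l2_pen = sum((p ** 2).sum() for p in model.parameters())
l1_pen = sum(p.abs().sum() for p in model.parameters())
lhalf_pen = sum(torch.sqrt(p.abs()).sum() for p in model.parameters())

print(lam * l2_pen.item(), lam * l1_pen.item(), lam * lhalf_pen.item())

For weights with $|w| < 1$, which is the typical case after initialization, $\sqrt{|w|} > |w| > w^2$, so for the same lam the square-root penalty is by far the largest of the three sums.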

With no regularization, the loss settles around 0.4. With lambda=100 and L2 or L1 regularization, the loss settles somewhere around 0.45; interestingly, with lambda=0.001 the final loss is around 0.44. But if I use L1/2 with the same lambda (100), it settles around 5000! This does not make sense to me. If the regularization is such an overwhelming factor in the loss, then SGD should drive the absolute values of the weights down until the regularization penalty is balanced against the actual classification term from the cross-entropy loss I am using. I know this is not what happens, because at the end of training the train and validation accuracies are about the same as those of the original network (without regularization). I also know that the regularization is indeed being applied in all three cases, as the losses in the early epochs differ significantly.

Is this to be expected, or is there some error in my code or in PyTorch's SGD?



One more note: if I use L1/2 with a small lambda like 0.001, the loss comes down to around 0.5 and then becomes NaN around epoch 70. For lambda=0.01, it reaches ~1.0 and then becomes NaN around the same epoch. For lambda=0.1, the loss reaches about 5, with no NaN anymore for this lambda or any higher value (over 120 epochs in total). For lambda=1.0, the loss settles at ~50, as expected from the above: apparently the weights settle at a point where the sum of the square roots of their absolute values is ~50, regardless of lambda...
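Presumably the NaNs come from the gradient of $\sqrt{|w|}$, which is $\pm\frac{1}{2\sqrt{|w|}}$ and blows up as a weight approaches zero; when a weight hits exactly zero, PyTorch's autograd typically backpropagates NaN for this expression (an infinite factor from sqrt multiplied by the zero subgradient of abs). A common numerical guard, sketched below with an illustrative eps, is to add a small constant inside the square root:

eps = 1e-8  # illustrative stabilizer; keeps the gradient finite when a weight reaches zero
for p in model.parameters():
    loss = loss + lam * torch.sqrt(p.abs() + eps).sum()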










      neural-network loss-function pytorch regularization






edited Apr 9 at 11:10 by Esmailian

asked Apr 9 at 0:29 by user2268997




















          1 Answer
The problem with using an $L_p$ norm with $p < 1$ for regularization is the gradient.



The regularization term is used to push the parameters towards zero. For this to work, as a parameter gets closer to zero, the gradient of the regularization term, i.e. its contribution to the parameter update, should decrease as well, or at least remain constant.



However, for $p < 1$ the gradient grows without bound as the parameter approaches zero, which causes radical changes in the parameters and leads to your observations. Here is a plot of the root ($r$), absolute ($a$), and squared ($s$) regularization terms in one dimension:



[Plot of the root ($r$), absolute ($a$), and squared ($s$) penalty curves in one dimension]



As you can see, for root regularization the gradient explodes to infinity as we approach zero, which conflicts with our intention of bringing the parameters close to zero.



For the sake of completeness, here is what a 1D parameter update with root regularization looks like:
$$\theta_{n+1} = \theta_n - \alpha \frac{d L(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} = \theta_n - \alpha \left(\frac{d l(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} \pm \lambda \frac{1}{\color{red}{2\sqrt{|\theta_n|}}}\right)$$
where $L(\theta) = l(\theta) + \lambda \sqrt{|\theta|}$ is the regularized loss and $l(\theta)$ the unregularized one.



You can spot the trouble: the closer $\theta_n$ gets to zero, the larger the red regularization term becomes, so the less the parameter is actually able to settle at zero!
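To make this concrete, here is a small sketch (illustrative values, using autograd directly) that evaluates the gradient of $\lambda\sqrt{|\theta|}$ as $\theta$ shrinks:

import torch

lam = 0.001
for val in [1.0, 1e-2, 1e-4, 1e-8]:
    theta = torch.tensor(val, requires_grad=True)
    penalty = lam * torch.sqrt(theta.abs())
    penalty.backward()
    # for theta > 0 the gradient is lam / (2 * sqrt(theta)), so it grows as theta shrinks
    print(f"theta={val:.0e}  grad={theta.grad.item():.3e}")

The printed gradient grows by a factor of 10 every time $\theta$ shrinks by a factor of 100, exactly the opposite of what we want from a term whose job is to let parameters settle at zero.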






answered Apr 9 at 10:55 by Esmailian (edited Apr 9 at 12:11)
Beautifully explained. Thanks! – user2268997, Apr 9 at 17:32










