Square Root Regularization and High Loss


I am testing out square root regularization (explained below) in a PyTorch implementation of a neural network. Square root regularization, henceforth L1/2, is just like L2 regularization, but instead of squaring the weights I take the square root of their absolute value. To implement it I add the penalty to the loss in PyTorch like this:



# lambd is the regularization strength ("lambda" itself is a reserved word in Python)
for p in model.parameters():
    loss = loss + lambd * torch.sqrt(p.abs()).sum()


p.abs() is the element-wise absolute value of p, i.e. the weights, torch.sqrt() is the element-wise square root, and .sum() adds up the result over the individual weights. lambd (written this way because lambda is a reserved word in Python) is the penalization factor.
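For context, here is a minimal, self-contained sketch of how this penalty fits into a single training step; the model, data, and value of lambd below are placeholders rather than my actual setup:

    import torch
    import torch.nn as nn

    # Placeholder model, data, and hyperparameters, for illustration only.
    model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    lambd = 0.001  # regularization strength

    x = torch.randn(32, 784)          # dummy batch of inputs
    y = torch.randint(0, 10, (32,))   # dummy class labels

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    for p in model.parameters():
        # L1/2 penalty: sum of the square roots of the absolute weight values
        loss = loss + lambd * torch.sqrt(p.abs()).sum()
    loss.backward()
    optimizer.step()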

With no regularization, the loss settles around 0.4. With lambda=100 the loss settles somewhere around 0.45 if I use L2 or L1 regularization; interestingly, with lambda=0.001 the final value of the loss is around 0.44. Now if I use L1/2 with that same lambda, it settles around 5000! This just does not make sense to me. If the regularization is such an overwhelming factor in the loss, then SGD ought to bring the absolute values of the weights down until the penalty from the regularization is balanced against the actual classification term from the cross-entropy loss I'm using. I know this is not what is happening, because at the end of training the train and validation accuracies are about the same as those of the original network (without regularization). I also know that the regularization is indeed taking effect in all three cases, since the losses in the early epochs differ significantly.
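For a sense of scale, here is a quick check (again with a placeholder architecture, not my actual network) of how large the raw sum of sqrt(|w|) already is at initialization:

    import torch
    import torch.nn as nn

    # Placeholder architecture; the real network will give a different number.
    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

    with torch.no_grad():
        penalty = sum(torch.sqrt(p.abs()).sum() for p in model.parameters())
    n_params = sum(p.numel() for p in model.parameters())

    print(f"{n_params} parameters, sum of sqrt(|w|) at init: {penalty.item():.1f}")

For this toy model the raw sum is already in the tens of thousands; for a larger network it can be in the millions, so even lambda=0.001 can dominate a ~0.4 cross-entropy term if the weights never actually shrink.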

Is this to be expected, or is there some error in my code or in PyTorch's SGD?



One more note: if I use L1/2 with a small lambda like 0.001, the loss comes down to around 0.5 and then becomes NaN around epoch 70. For lambda=0.01 it reaches ~1.0 and then becomes NaN around the same epoch. For lambda=0.1 the loss reaches 5, with no NaN anymore for this lambda or any higher value (over 120 epochs total). For lambda=1.0 the loss settles at ~50, as expected: apparently the weights settle at a point where the sum of the square roots of their absolute values is ~50, regardless of lambda...










Tags: neural-network, loss-function, pytorch, regularization






asked Apr 9 at 0:29 by user2268997, edited Apr 9 at 11:10 by Esmailian




















          1 Answer

The problem with using an $L_p$ norm with $p < 1$ for regularization is the gradient.



The regularization term is used to push the parameters toward zero. For this to work, as a parameter gets closer to zero, the gradient of the regularization term, i.e. its contribution to the parameter update, should decrease as well, or at least remain constant.



However, for $p < 1$ the gradient grows without bound as a parameter approaches zero, which causes drastic changes in the parameters and leads to your observations. Here is a plot of the root ($r$), absolute ($a$), and squared ($s$) regularization terms in one dimension:



[Figure: the root, absolute, and squared regularization terms plotted in one dimension]



As you can see, for root regularization the gradient explodes to infinity as we approach zero, which conflicts with our intention of bringing the parameters close to zero.



For the sake of completeness, here is what a 1D parameter update with root regularization looks like:
$$\theta_{n+1} = \theta_n - \alpha \frac{d L(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} = \theta_n - \alpha \left(\frac{d l(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} \pm \lambda \frac{1}{\color{red}{2\sqrt{|\theta_n|}}}\right)$$
where $L(\theta) = l(\theta) + \lambda \sqrt{|\theta|}$ and $l(\theta)$ is the unregularized data loss.



You can spot the trouble: the closer we get to zero, the larger the regularization term's contribution to the update becomes, so instead of settling at zero the parameter overshoots.
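To see this numerically, here is a small sketch (not from the original post) that uses autograd to compare the gradients of the square-root, absolute, and squared penalties for a single weight as it approaches zero:

    import torch

    # Gradient of each penalty for a single weight w, computed via autograd.
    for w0 in [1.0, 0.1, 0.01, 0.001]:
        w = torch.tensor(w0, requires_grad=True)
        grads = []
        for penalty in (w.abs().sqrt(), w.abs(), w ** 2):
            g, = torch.autograd.grad(penalty, w)
            grads.append(g.item())
        print(f"w={w0:<6} d sqrt(|w|)/dw = {grads[0]:7.2f}   "
              f"d |w|/dw = {grads[1]:.2f}   d w^2/dw = {grads[2]:.4f}")

At w = 0.001 the square-root penalty already contributes a gradient of about 15.8, while the absolute and squared penalties contribute 1 and 0.002 respectively; with a typical learning rate such an update flips the sign of a small weight and overshoots, which is consistent with the NaNs and the large settled loss described in the question.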






answered Apr 9 at 10:55 by Esmailian, edited Apr 9 at 12:11
Comment: Beautifully explained. Thanks! – user2268997, Apr 9 at 17:32











