Square Root Regularization and High Loss
I am testing out square root regularization (explained below) in a PyTorch implementation of a neural network. Square root regularization, henceforth l1/2, is just like l2 regularization, except that instead of squaring the weights, I take the square root of their absolute value. To implement it, I penalize the loss in PyTorch like this:
# the factor is named reg_lambda because "lambda" is a reserved word in Python
for p in model.parameters():
    loss += reg_lambda * torch.sqrt(p.abs()).sum()
Here p.abs() is the absolute value of p, i.e. the weights, torch.sqrt() is the square root, and .sum() sums the result over the individual weights. reg_lambda is the penalization factor; I write it as lambda in the text below, but it cannot be named lambda in the code, since lambda is a reserved word in Python.
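For context, here is a minimal, self-contained sketch of how this penalty slots into one training step; the model, optimizer, and criterion are generic placeholders, not my exact setup:

import torch

def training_step(model, optimizer, criterion, inputs, targets, reg_lambda):
    # One SGD step with the l1/2 (square-root) penalty added to the data loss.
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)  # e.g. torch.nn.CrossEntropyLoss()
    for p in model.parameters():
        # sqrt(|w|) has an unbounded derivative as w -> 0, so this term
        # dominates the gradient for small weights.
        loss = loss + reg_lambda * torch.sqrt(p.abs()).sum()
    loss.backward()
    optimizer.step()
    return loss.item()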
With no regularization, the loss settles around 0.4. With lambda=100, the loss settles somewhere around 0.45 if I use l2 or l1 regularization. Interestingly enough, with lambda=0.001 the final value of the loss is around 0.44. Now if I use l1/2 with the same lambda, it settles around 5000! This just does not make sense to me. If the regularization is such an overwhelming factor in the loss, then SGD ought to bring the absolute value of the weights down until the penalty from the regularization is balanced against the actual classification term from the cross-entropy loss I'm using. I know this is not happening, because at the end of training the train and validation accuracies are about the same as those of the original network (without regularization). I also know that the regularization is indeed active in all three cases, as the losses in the early epochs differ significantly.
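A quick way to sanity-check the magnitudes (a sketch added for illustration, not from my training code) is to measure the penalty term on its own. Even a freshly initialized layer carries a large sqrt-penalty, because sqrt(|w|) is much larger than |w| when |w| << 1:

import torch

def sqrt_penalty(model, reg_lambda):
    # Sum of sqrt(|w|) over all parameters, scaled by the penalization factor.
    with torch.no_grad():
        total = sum(torch.sqrt(p.abs()).sum().item() for p in model.parameters())
    return reg_lambda * total

layer = torch.nn.Linear(784, 100)  # ~78k weights, typical |w| ~ 0.02 at init
print(sqrt_penalty(layer, reg_lambda=0.001))  # already on the order of 10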
Is this to be expected, or is there some error in my code or in PyTorch's SGD?
One more note: if I use l1/2 with a small lambda like 0.001, the loss comes down to around 0.5 and then becomes NaN around epoch 70. For lambda=0.01, it reaches ~1.0 and then becomes NaN around the same epoch. For lambda=0.1, the loss becomes ~5 but no longer goes NaN, for this lambda or any higher value (in 120 epochs total). For lambda=1.0, the loss settles at ~50. That is as expected: apparently the weights settle at a point where the sum of their square roots equals ~50, regardless of lambda...
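As a side note on where the NaNs could plausibly come from, here is a tiny check I added for illustration: in PyTorch, the gradient of sqrt(|p|) at exactly p = 0 evaluates to NaN.

import torch

p = torch.zeros(1, requires_grad=True)
torch.sqrt(p.abs()).sum().backward()
print(p.grad)  # tensor([nan]): 1/(2*sqrt(0)) is inf, times abs'(0) = 0, gives nan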
neural-network loss-function pytorch regularization
asked Apr 9 at 0:29 by user2268997, edited Apr 9 at 11:10 by Esmailian
1 Answer
The problem with using the $L_p$ norm with $p < 1$ for regularization is the gradient.
The regularization term is there to push the parameters toward zero. For this to work, as a parameter approaches zero, the gradient of the regularization term, i.e. its contribution to the parameter update, should decrease as well, or at least remain constant.
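To make this concrete, here is the one-dimensional gradient of a generic $|\theta|^p$ penalty (a standard calculation, added for illustration):

$$\frac{d}{d\theta}\,\lambda|\theta|^p = \lambda p\,|\theta|^{p-1}\operatorname{sign}(\theta) \xrightarrow[\;\theta \to 0\;]{} \begin{cases} 0 & p = 2 \ \text{(l2)}\\ \pm\lambda & p = 1 \ \text{(l1)}\\ \pm\infty & p = \tfrac12 \ \text{(l1/2)} \end{cases}$$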
However, for $p < 1$ the gradient grows without bound as a parameter approaches zero, which causes radical changes in the parameters, leading to your observations. [Figure omitted: a one-dimensional plot of the root ($r$), absolute ($a$), and squared ($s$) regularization terms.] As the plot shows, for root regularization the gradient explodes to infinity as we approach zero, which conflicts with our intention of bringing the parameters close to zero.
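A minimal matplotlib sketch (my reconstruction, not the original figure) that regenerates such a plot:

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-2, 2, 1001)
plt.plot(theta, np.sqrt(np.abs(theta)), label="r: sqrt(|theta|)  (l1/2)")
plt.plot(theta, np.abs(theta), label="a: |theta|  (l1)")
plt.plot(theta, theta**2, label="s: theta^2  (l2)")
plt.xlabel("theta")
plt.ylabel("penalty")
plt.legend()
plt.show()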
For the sake of completeness, here is what a 1D parameter update with root regularization looks like:
$$\theta_{n+1} = \theta_n - \alpha \frac{d L(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} = \theta_n - \alpha \left(\frac{d l(\theta)}{d \theta}\Bigr|_{\theta=\theta_n} \pm \lambda \frac{1}{\color{red}{2\sqrt{|\theta_n|}}}\right)$$
where $L(\theta) = l(\theta) + \lambda \sqrt{|\theta|}$.
You can spot the trouble: the closer a parameter gets to zero, the larger the update that kicks it away from zero!
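A quick numeric sketch (added for illustration) of a single parameter driven only by the root penalty makes the bouncing visible:

import math

theta, alpha, lam = 0.5, 0.1, 1.0
for step in range(6):
    # Gradient of lam * sqrt(|theta|): grows as |theta| shrinks.
    grad = lam * math.copysign(1.0, theta) / (2.0 * math.sqrt(abs(theta)))
    theta -= alpha * grad
    print(f"step {step}: theta = {theta:+.4f}, |grad| = {abs(grad):.2f}")
# The update size increases near zero, so theta overshoots past zero
# instead of settling there.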
answered Apr 9 at 10:55, edited Apr 9 at 12:11, by Esmailian
– user2268997 (Apr 9 at 17:32): Beautifully explained. Thanks!