Effects of L2 loss and smooth L1 loss
Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when to use each of them?
loss-function
asked Apr 3 at 4:29 by HOANG GIANG · edited Apr 3 at 11:33 by bradS
1 Answer
First, Huber loss only works in one dimension, as it requires $$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$ at the intersection of its two pieces, which only holds in one dimension; the $L_2$ and $L_1$ norms are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with "$L_2$ loss", since "$L_2$" suggests a multi-dimensional input while "squared" does not.
Huber loss is the same as squared loss for differences smaller than $\delta$, and the same as absolute loss for differences larger than $\delta$, i.e.
$$\begin{align*}
L_\delta(y_n, f_\theta(\boldsymbol{x}_n))
=\left\{
\begin{matrix}
\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2 & \left|y_n - f_\theta(\boldsymbol{x}_n)\right| \leq \delta,\\
\delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2 & \text{otherwise,}
\end{matrix}
\right.
\end{align*}$$
where $y_n$ is the target of data point $n$, and $f_\theta(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_\delta$ has nothing to do with the $L_p$ norm, despite the similar notation.
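As a sanity check, this definition is easy to transcribe directly. A minimal NumPy sketch (the function name and vectorized form are my own, not from any particular library):

    import numpy as np

    def huber_loss(y, y_pred, delta=1.0):
        """Elementwise Huber loss L_delta for one-dimensional targets."""
        diff = y - y_pred
        quadratic = 0.5 * diff ** 2                       # |diff| <= delta branch
        linear = delta * np.abs(diff) - 0.5 * delta ** 2  # |diff| >  delta branch
        return np.where(np.abs(diff) <= delta, quadratic, linear)

With delta=1 this is exactly the smooth $L_1$ loss from the question.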
Because of this definition, for large differences caused by outliers, the gradient of the loss function stays constant at $\pm\delta$ times the prediction's gradient, the same as for absolute loss, i.e.
$$\frac{\partial\, \delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right|}{\partial \theta_i} = \pm \delta\, \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
compared to squared loss, where the gradient grows with the difference, i.e.
$$\frac{\partial\, \frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2}{\partial \theta_i} = -\left(y_n - f_\theta(\boldsymbol{x}_n)\right)\frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
which leads to large contributions from outliers when we update a parameter based solely on squared loss:
$$\begin{align*}
\theta'_i &= \theta_i + \lambda \sum_n \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_\theta(\boldsymbol{x}_n)\right) \\
&= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) + \lambda\sum_{n \in \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large})
\end{align*}$$
It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily in the input space $\boldsymbol{x}_n$ alone, as we usually visualize them in unsupervised tasks. For example, on the linear trend $y = 2x$, none of $(x, y)=(1, 2)$, $(5, 10)$, $(10, 20)$ is an outlier, but $(1, 10)$ is: it produces the large difference $10 - 2 = 8$ when the model predicts $f_\theta(1)=2$.
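To make the bounded-gradient point concrete, here is a small numeric check on those four points (the model $f_\theta(x) = 2x$ and the gradient helpers are my own illustration, not part of the original argument):

    import numpy as np

    def squared_grad(y, y_pred):
        # derivative of 0.5*(y - f)^2 w.r.t. the prediction f
        return -(y - y_pred)                # grows linearly with the residual

    def huber_grad(y, y_pred, delta=1.0):
        # derivative of Huber loss w.r.t. f: linear inside, clipped at +/- delta outside
        diff = y - y_pred
        return np.where(np.abs(diff) <= delta, -diff, -delta * np.sign(diff))

    y_true = np.array([2.0, 10.0, 20.0, 10.0])  # last point is the outlier (1, 10)
    y_pred = np.array([2.0, 10.0, 20.0, 2.0])   # f(x) = 2x predicts 2 at x = 1
    print(squared_grad(y_true, y_pred))  # [ 0.  0.  0. -8.] -- outlier dominates
    print(huber_grad(y_true, y_pred))    # [ 0.  0.  0. -1.] -- clipped at delta

The outlier's pull on the update is eight times larger under squared loss, while Huber loss caps it at $\delta$.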
When to use each of them?
Bearing in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable first choice when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or other losses) and later experiment with Huber loss for different values of $\delta$.
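For that experimentation, off-the-shelf implementations are convenient; for example, PyTorch's torch.nn.HuberLoss exposes $\delta$ directly. A minimal sketch (the $\delta$ values scanned here are arbitrary, chosen only for illustration):

    import torch

    y_true = torch.tensor([2.0, 10.0, 20.0, 10.0])
    y_pred = torch.tensor([2.0, 10.0, 20.0, 2.0])

    # Try a few thresholds and compare the resulting mean losses.
    for delta in (0.5, 1.0, 5.0):
        loss = torch.nn.HuberLoss(delta=delta)(y_pred, y_true)
        print(f"delta={delta}: mean Huber loss = {loss.item():.3f}")

Note that torch.nn.SmoothL1Loss is a scaled variant (it divides the Huber loss by its beta parameter), so HuberLoss matches the formula above more directly; scikit-learn's HuberRegressor plays the analogous role for linear models, with the threshold named epsilon.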
answered Apr 3 at 7:38 by Esmailian · edited Apr 3 at 9:03