Effects of L2 loss and smooth L1 loss
Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when to use each of them?
Tags: loss-function
asked Apr 3 at 4:29 by HOANG GIANG · edited Apr 3 at 11:33 by bradS
1 Answer
First, Huber loss only works in one dimension, as it requires $$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$ at the intersection of the two functions, which only holds in one dimension. The $L_2$ and $L_1$ norms are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with $L_2$ loss, since "$L_2$" presumes a multi-dimensional input, whereas "squared" does not.
Huber loss is the same as squared loss for differences smaller than $\delta$, and the same as absolute loss for differences larger than $\delta$, i.e.
$$\begin{align*}
L_\delta(y_n, f_\theta(\boldsymbol{x}_n))
=\left\{
\begin{matrix}
\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2, & \left|y_n - f_\theta(\boldsymbol{x}_n)\right| \leq \delta,\\
\delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2, & \text{otherwise,}
\end{matrix}
\right.
\end{align*}$$
where $y_n$ is the target of data point $n$, and $f_\theta(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_\delta$ has nothing to do with the $L_p$ norm, despite the similar notation.
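The piecewise definition above can be sketched directly in NumPy (a minimal illustration of my own, not from the original post; the function name and toy values are hypothetical):

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    residual = np.abs(y - y_pred)
    quadratic = 0.5 * residual ** 2                  # squared-loss branch
    linear = delta * residual - 0.5 * delta ** 2     # absolute-loss branch
    return np.where(residual <= delta, quadratic, linear)

# residual 0.5 falls in the quadratic branch (value 0.125);
# residual 3.0 falls in the linear branch (value 1*3 - 0.5 = 2.5)
losses = huber_loss(np.array([0.0, 0.0]), np.array([0.5, 3.0]))
```

The two branches meet at $|{\rm residual}| = \delta$, where both evaluate to $\frac{1}{2}\delta^2$, which is what makes the loss smooth.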
Because of this definition, for large differences caused by outliers, the gradient of the loss remains constant in magnitude, the same as for absolute loss, i.e.
$$\frac{\partial L_\delta}{\partial \theta_i} = \pm \delta\, \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
compared to squared loss, where the gradient grows with the difference, i.e.
$$\frac{\partial}{\partial \theta_i}\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2 = -\left(y_n - f_\theta(\boldsymbol{x}_n)\right)\frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
which leads to large contributions from outliers when we update a parameter based solely on squared loss:
$$\begin{align*}
\theta'_i &=\theta_i + \lambda \sum_{n} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_\theta(\boldsymbol{x}_n)\right) \\
&= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) + \lambda\sum_{n \in \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large}).
\end{align*}$$
It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily in the input space $\boldsymbol{x}_n$ alone, as we usually visualize them in unsupervised tasks. For example, on a linear trend, none of $(x, y)=(1, 2)$, $(5, 10)$, $(10, 20)$ are outliers, but $(1, 10)$ is: it produces the large difference $(10 - 2)$ when the model predicts $f_\theta(1)=2$.
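To make the contrast concrete, here is a small sketch using the $(1, 10)$ outlier example above (variable names are mine), comparing the two gradients with respect to the prediction:

```python
import numpy as np

delta = 1.0
y, f = 10.0, 2.0        # the outlier (x, y) = (1, 10), with prediction f_theta(1) = 2
residual = y - f        # 8.0

# Gradient of each loss with respect to the prediction f:
grad_squared = -residual                        # scales with the residual: -8.0
grad_huber = -np.clip(residual, -delta, delta)  # capped at +/- delta:      -1.0
```

Clipping the residual at $\pm\delta$ covers both branches at once: inside the quadratic region the clip is a no-op and the gradient equals $-({\rm residual})$, outside it the gradient saturates at $\mp\delta$.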
When to use each of them?
Keeping in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable first choice when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or another loss), and later experiment with Huber loss for different values of $\delta$.
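As a sketch of such an experiment (toy data and a brute-force grid search, both hypothetical), one can watch how the Huber-optimal constant prediction moves from a robust, median-like value toward the mean as $\delta$ grows:

```python
import numpy as np

def huber(residual, delta):
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * a - 0.5 * delta ** 2)

def best_constant(y, delta):
    """Constant prediction minimizing the total Huber loss (grid search)."""
    grid = np.linspace(0.0, 10.0, 10001)
    totals = huber(y[:, None] - grid[None, :], delta).sum(axis=0)
    return grid[totals.argmin()]

y = np.array([2.0, 2.1, 1.9, 2.0, 10.0])    # one outlier at 10
for delta in (0.1, 1.0, 100.0):
    # small delta -> near the median (~2); large delta -> the mean (3.6)
    print(delta, best_constant(y, delta))
```

With $\delta = 100$ every residual sits in the quadratic branch, so the loss behaves like squared loss and the minimizer is the mean; with $\delta = 0.1$ it behaves almost everywhere like absolute loss and the outlier's pull is bounded.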
answered Apr 3 at 7:38 by Esmailian · edited Apr 3 at 9:03