
Effects of L2 loss and smooth L1 loss





Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when to use each of them?










      loss-function






asked Apr 3 at 4:29 by HOANG GIANG, edited Apr 3 at 11:33 by bradS




















          1 Answer


















First, Huber loss only works in one dimension, because it requires
$$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$
at the intersection of the two pieces, which only holds in one dimension. The norms $L_2$ and $L_1$ are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with "$L_2$ loss", since "$L_2$" presumes a multi-dimensional input, whereas "squared" does not.



Huber loss is the same as squared loss for differences smaller than $\delta$, and the same as absolute loss for differences larger than $\delta$, i.e.
$$\begin{align*}
L_\delta(y_n, f_\theta(\boldsymbol{x}_n))
=\left\{
\begin{matrix}
\frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2, & \left|y_n - f_\theta(\boldsymbol{x}_n)\right| \leq \delta,\\
\delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2, & \text{otherwise,}
\end{matrix}
\right.
\end{align*}$$



where $y_n$ is the target of data point $n$ and $f_\theta(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_\delta$ has nothing to do with the $L_p$ norm, despite the similar notation.
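
As a sanity check, here is a minimal NumPy sketch of this definition (the function name and the toy residuals are hypothetical, chosen only for illustration): for residuals inside $[-\delta, \delta]$ it matches $\frac{1}{2}r^2$, and outside it grows only linearly.

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Elementwise Huber loss: quadratic for small residuals, linear for large ones."""
    r = y - y_pred
    quadratic = 0.5 * r**2                        # squared-loss branch, |r| <= delta
    linear = delta * np.abs(r) - 0.5 * delta**2   # absolute-loss branch, |r| > delta
    return np.where(np.abs(r) <= delta, quadratic, linear)

residuals = np.array([0.1, 0.5, 2.0, 10.0])
print(huber_loss(residuals, np.zeros_like(residuals)))  # [0.005 0.125 1.5   9.5 ]
print(0.5 * residuals**2)                               # [0.005 0.125 2.   50.  ]
```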



Because of this definition, for large differences caused by outliers, the gradient of the loss function remains constant ($\pm\delta$ times the prediction gradient), the same as for absolute loss, i.e.
$$\frac{\partial\, \delta\left|y_n - f_\theta(\boldsymbol{x}_n)\right|}{\partial \theta_i} = \pm \delta\, \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$
compared to squared loss, where the gradient grows with the difference, i.e.
$$\frac{\partial\, \frac{1}{2}\left(y_n - f_\theta(\boldsymbol{x}_n)\right)^2}{\partial \theta_i} = -\left(y_n - f_\theta(\boldsymbol{x}_n)\right)\frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i},$$



which leads to large contributions from outliers when we update a parameter based solely on squared loss:
$$\begin{align*}
\theta'_i &= \theta_i + \lambda \sum_n \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_\theta(\boldsymbol{x}_n)\right) \\
&= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) + \lambda\sum_{n \in \text{outliers}} \frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large})
\end{align*}$$
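
To make this concrete, here is a small sketch (the residual values are made up) of the factor multiplying $\frac{\partial f_\theta(\boldsymbol{x}_n)}{\partial \theta_i}$ in the two gradients above, ignoring the common sign: for squared loss it grows without bound with the residual, while for Huber loss it saturates at $\delta$.

```python
import numpy as np

residuals = np.array([0.2, 1.0, 5.0, 50.0])   # y_n - f_theta(x_n)
delta = 1.0

squared_factor = residuals                          # unbounded: outliers dominate the update
huber_factor = np.clip(residuals, -delta, delta)    # bounded by delta, as with absolute loss

print(squared_factor)  # [ 0.2  1.   5.  50. ]
print(huber_factor)    # [ 0.2  1.   1.   1. ]
```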



It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily in the input space $\boldsymbol{x}_n$ alone, as we usually visualize them in unsupervised tasks. For example, on a linear trend, none of $(x, y)=(1, 2)$, $(5, 10)$, $(10, 20)$ is an outlier, but $(1, 10)$ is: it produces the large difference $(10 - 2)$ when the model expects (predicts) $f_\theta(1)=2$.
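
A minimal gradient-descent sketch of this effect, using toy data matching the example above with $(1, 10)$ as the outlier (the single-slope model $f_\theta(x)=\theta x$, learning rate and step count are assumptions for illustration):

```python
import numpy as np

# Linear trend y = 2x plus one outlier at (1, 10).
x = np.array([1.0, 5.0, 10.0, 1.0])
y = np.array([2.0, 10.0, 20.0, 10.0])

def fit_slope(dloss_dr, lr=1e-3, steps=5000):
    """Fit f_theta(x) = theta * x by gradient descent, given dL/dr as a function of the residual r."""
    theta = 0.0
    for _ in range(steps):
        r = y - theta * x
        grad = -np.sum(dloss_dr(r) * x)   # chain rule: dr/dtheta = -x
        theta -= lr * grad
    return theta

theta_squared = fit_slope(lambda r: r)                    # dL/dr of 0.5 * r^2
theta_huber = fit_slope(lambda r: np.clip(r, -1.0, 1.0))  # dL/dr of Huber loss with delta = 1
print(theta_squared, theta_huber)   # roughly 2.06 vs 2.01: Huber is pulled less by the outlier
```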




          When to use each of them?




Keeping in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable first choice when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or other losses), and after a while experiment with Huber loss for different values of $\delta$.
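
In practice you rarely need to hand-code either loss. As a sketch, assuming PyTorch is available, `torch.nn.MSELoss` and `torch.nn.HuberLoss` (which with `delta=1` coincides with smooth $L_1$ loss) make it easy to switch between the two and experiment with $\delta$:

```python
import torch
import torch.nn as nn

pred = torch.tensor([2.0, 10.0, 20.0, 2.0])
target = torch.tensor([2.0, 10.0, 20.0, 10.0])   # last target is an outlier

mse = nn.MSELoss()(pred, target)
huber = nn.HuberLoss(delta=1.0)(pred, target)    # with delta=1 this is smooth L1 loss

print(mse.item(), huber.item())   # 16.0 vs 1.875: the outlier dominates the squared loss
```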






answered Apr 3 at 7:38 by Esmailian, edited Apr 3 at 9:03